CSE 40771 - Distributed Systems - Fall 2024

View the Project on GitHub

A4 - Naming and Robustness

Please review the general instructions for preparing assignments.

This assignment will lay down a few more key capabilities needed to run in a distributed system. First, you will modify the client and server to discover each other via an online naming service. Second, you will improve the robustness of the client to deal with common failures cleanly.

Part 1: Discovery via a Name Server

To this point, your server has listened on a manually-selected port number in a fixed location. This is ok for testing purposes, but becomes a problem when running multiple servers across many machines. The client needs a better way of locating the server that it wants to access.

We will address this problem by making use of a simple name server that we have running here at Notre Dame: take a look at catalog.cse.nd.edu:9097. Here is how the name server works:

Various services running at ND (and around the world) periodically register themselves with the name server by sending a UDP packet to catalog.cse.nd.edu:9097, typically once every five minutes. The UDP packet contains a JSON document that describes the essential properties of the service: name, port, location, memory, disk, etc. Some of the services are quite simple, while others are very complex.

The name server publishes the set of known services via a web page. You can browse an HTML representation of the web page manually, or you can access a JSON representation programmatically, like this:

curl http://catalog.cse.nd.edu:9097/query.json | json_pp

The name server periodically discards records from services that have not sent an update in the last 15 minutes. This is a garbage collection measure to ensure that records don’t accumulate forever. So, servers must periodically refresh their state, and clients must accept the fact that any data in the name server is necessarily “stale”.

For this assignment, you must modify your client and server to use the name server as follows:

{
"type" : "spreadsheet",
"owner" : "YOURNETID",
"port" : 1234,
"project" : "YOURNETID-a4-test5",
"width" : 120,
"height" : 16
}

Modify your client and server to make use of a project name on the command line. Your server should be started like this:

python SpreadSheetServer.py YOURNETID-a4

And then start your client using the same project name:

python TestPerf.py YOURNETID-a4

Your client and server will now be able to find each other, no matter where they are located.

Take Care: To avoid collisions between students, please make sure to use a project name that contains your netid. (Honor system.)

Part 2: Client Robustness

To this point, your client has done the straightforward thing of connecting, sending a request, and waiting for a result. However, there are a variety of things that may go wrong in this sequence: the network could be interrupted, the server could crash, or (even worse) the server might get stuck and not send any response, causing the client to wait forever.

Modify SpreadSheetClient so that if any of these undesired events happen:

then SpreadSheetClient should print a short message, take a pause, reconnect, and try again. Note that the pause is important: if the client action fails quickly, you shouldn’t flood the network with rapid retries. A good policy is exponential backoff: wait one second after the first failure, then two seconds, four seconds, etc. until success is achieved. From the caller’s perspective, these retries should be completely invisible. It should look like the single function call just took a little longer before completing correctly.

(Now you see why the operations must be idempotent!)

Once you have this working, test it by starting, stopping, and killing the client and server in arbitrary combinations. Whatever happens, you should observe that the client always (after a brief delay) succeeds in completing its operations, and never returns a failure to the caller. Then move on to the next step:

Testing and Measurement

Repeat the tests and measurements from the prior assignment:

However, this time, kill and restart the server in the middle of each test. The client program should briefly pause, reconnect, and then keep going without losing any results.

Turning In

Please review the general instructions for submitting assignments.

Turn in all of your source code, along with a lab report titled REPORT that describes the following in detail: