CSE 40771 - Distributed Systems - Spring 2023
This assignment will lay down a few more key capabilities needed to run in a distributed system. First, you will modify the client and server to discover each other via an online naming service. Second, you will improve the robustness of the client to deal with common failures cleanly.
To this point, your server has listened on a manually-selected port number in a fixed location. This is ok for testing purposes, but becomes a problem when running multiple servers across many machines. The client needs a better way of locating the server that it wants to access.
We will address this problem by making use of a simple name server that we have running here at Notre Dame: take a look at catalog.cse.nd.edu:9097. Here is how the name server works:
Various services running at ND (and around the world) periodically register themselves with the name server by sending a UDP packet to catalog.cse.nd.edu:9097, typically once every five minutes. The UDP packet contains a JSON document that describes the essential properties of the service: name, port, location, memory, disk, etc. Some of the services are quite simple, while others are very complex.
The name server publishes the set of known services via a web page. You can browse an HTML representation of the web page manually, or you can access a JSON representation programmatically, like this:
curl http://catalog.cse.nd.edu:9097/query.json | json_pp
The name server periodically discards records from services that have not sent an update in the last 15 minutes. This is a garbage collection measure to ensure that records don’t accumulate forever. So, servers must periodically refresh their state, and clients must accept the fact that any data in the name server is necessarily “stale”.
For this assignment, you must modify your client and server to use the name server as follows:
The server should periodically send a UDP packet containing a JSON document that describes itself to catalog.cse.nd.edu:9097. The type field should be hashtable, the port field should be the port the server is listening on, the owner field should be your netid, and the project field should be a personalized server name that you pass in on the command line. (Note that the name field is automatically filled in with the host name by the name server.) As an example:
{
"type" : "hashtable",
"owner" : "YOURNETID",
"port" : 1234,
"project" : "YOURNETID-a4-test5"
}
The client should query the name server for an entry with type=="hashtable" and project equal to your server name. Then, connect to the indicated host and port number.
Modify your client and server to make use of a project name on the command line. Your server should be started like this:
python HashTableServer.py YOURNETID-a4
And then start your client using the same project name:
python TestPerf.py YOURNETID-a4
Your client and server will now be able to find each other, no matter where they are located.
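The server-side registration described above could be sketched roughly as follows. This is only an illustration, not the required implementation: the function names build_registration and send_registration are hypothetical, and how often the server calls them is left to you (anything comfortably inside the 15-minute discard window would keep the record alive).

```python
import json
import socket

# Name server address taken from the assignment text.
CATALOG_HOST = "catalog.cse.nd.edu"
CATALOG_PORT = 9097


def build_registration(port, owner, project):
    """Return the registration document as UTF-8 encoded JSON bytes.

    Hypothetical helper: the field names follow the example in the
    assignment; the function itself is an illustration only.
    """
    record = {
        "type": "hashtable",
        "owner": owner,
        "port": port,
        "project": project,
    }
    return json.dumps(record).encode("utf-8")


def send_registration(port, owner, project,
                      catalog=(CATALOG_HOST, CATALOG_PORT)):
    """Send one registration datagram to the name server.

    UDP gives no delivery guarantee, so a lost packet is simply
    covered by the next periodic update.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(build_registration(port, owner, project), catalog)
```

A server built this way would call send_registration once at startup and then again on a timer from its main loop, so that registration never blocks request handling for long.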
Take Care: To avoid collisions between students, please make sure to use a project name that contains your netid. (Honor system.)
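The client-side lookup could be sketched along these lines. It assumes that query.json returns a JSON array of objects and that each object carries the host in its name field and the port in its port field; the helper names fetch_catalog and find_servers are hypothetical. Because catalog data may be stale, the sketch returns every matching entry and leaves it to the caller to try them in turn.

```python
import json
import urllib.request

# Query URL taken from the curl example in the assignment text.
CATALOG_URL = "http://catalog.cse.nd.edu:9097/query.json"


def fetch_catalog(url=CATALOG_URL, timeout=10):
    """Download and parse the name server's JSON listing.

    Assumption: the response body is a JSON array of service records.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def find_servers(entries, project):
    """Return (host, port) pairs for hashtable servers in this project."""
    matches = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        if entry.get("type") != "hashtable":
            continue
        if entry.get("project") != project:
            continue
        host = entry.get("name")
        port = entry.get("port")
        if host is None or port is None:
            continue  # incomplete record, skip it
        try:
            matches.append((host, int(port)))
        except (TypeError, ValueError):
            continue  # port was not numeric, skip it
    return matches
```

A client might call find_servers(fetch_catalog(), "YOURNETID-a4") and attempt to connect to each returned pair until one succeeds, repeating the lookup if none of them answers.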
To this point, your client has done the straightforward thing of connecting, sending a request, and waiting for a result. However, there are a variety of things that may go wrong in this sequence: the network could be interrupted, the server could crash, or (even worse) the server might get stuck and not send any response, causing the client to wait forever.
Modify HashTableClient so that if any of these undesired events happens, HashTableClient prints a short message, pauses, reconnects, and tries again. Note that the pause is important: if the client action fails quickly, you shouldn’t flood the network with rapid retries. A good policy is exponential backoff: wait one second after the first failure, then two seconds, four seconds, etc. until success is achieved. (After a success, the backoff resets: the next failure should again wait one second, then two seconds, and so on.) From the caller’s perspective, these retries should be completely invisible. It should look like the single function call just took a little longer before completing correctly.
(Now you see why the operations must be idempotent!)
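One possible shape for the retry policy described above is sketched here. The helper name call_with_retry and its parameters are hypothetical, and the max_delay cap is an added assumption that the assignment does not ask for. The sketch also assumes that every failure surfaces as an OSError: connection errors already do, and a stuck server does too if the socket has a timeout set with settimeout, since socket.timeout is a subclass of OSError.

```python
import time


def call_with_retry(operation, reconnect,
                    first_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Run operation() until it succeeds, backing off between attempts.

    operation: callable that performs one request and returns its result.
    reconnect: callable that re-establishes the connection.
    The delay starts at first_delay and doubles after every failure.
    It is local to this call, so the next call starts at first_delay again.
    """
    delay = first_delay
    need_reconnect = False
    while True:
        try:
            if need_reconnect:
                reconnect()
            return operation()
        except OSError as error:
            print(f"request failed ({error}); retrying in {delay:g}s")
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap is an assumption
            need_reconnect = True
```

Each public method of HashTableClient could wrap its request in a helper like this, so the caller only ever sees a return value. Passing sleep as a parameter makes the backoff schedule easy to check in a test without actually waiting.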
Once you have this working, test it by starting, stopping, and killing the client and server in arbitrary combinations. Whatever happens, you should observe that the client always (after a brief delay) succeeds in completing its operations, and never returns a failure to the caller. Then move on to the next step:
Repeat the tests and measurements from the prior assignment. However, this time, kill and restart the server in the middle of each test. The client program should briefly pause, reconnect, and then keep going without losing any results.
Please review the general instructions for submitting assignments.
Turn in all of your source code, along with a lab report titled REPORT that describes the following in detail:
TestPerf
TestOutliers