A4 - Naming and Robustness

Overview

This assignment will lay down a few more key capabilities needed to run in a distributed system. First, you will modify the client and server to discover each other via an online naming service. Second, you will improve the robustness of the client to deal with common failures cleanly.

Part 1: Discovery via a Name Server

To this point, your server has listened on a manually-selected port number in a fixed location. This is ok for testing purposes, but becomes a problem when running multiple servers across many machines. The client needs a better way of locating the server that it wants to access.

We will address this problem by making use of a simple name server that we have running here at Notre Dame: take a look at catalog.cse.nd.edu:9097. Here is how the name server works:

Various services running at ND (and around the world) periodically register themselves with the name server by sending a UDP packet to catalog.cse.nd.edu:9097, typically once every five minutes. The UDP packet contains a JSON document that describes the essential properties of the service: name, port, location, memory, disk, etc. Some of the services are quite simple, while others are very complex.

The name server publishes the set of known services via a web page. You can browse an HTML representation of the web page manually, or you can access a JSON representation programmatically, like this:

curl http://catalog.cse.nd.edu:9097/query.json | json_pp

The name server periodically discards records from services that have not sent an update in the last 15 minutes. This is a garbage collection measure to ensure that records don’t accumulate forever. So, servers must periodically refresh their state, and clients must accept the fact that any data in the name server is necessarily “stale”.

For this assignment, you must modify your client and server to use the name server as follows:

First, the server should listen on any available TCP port (indicated by port zero) and then determine what port was actually allocated. Then, at startup and once per minute thereafter, it should send an update describing itself to the name server. This is done by sending a UDP message containing a text JSON document to catalog.cse.nd.edu:9097. The type field should be hashtable, the port field should be the port the server is listening on, the owner field should be your netid, and the project field should be a personalized server name that you pass in on the command line. (Note that the name field is automatically filled in with the host name by the name server.) As an example:

{
"type" : "hashtable",
"owner" : "YOURNETID",
"port" : 1234,
"project" : "YOURNETID-a4-test5"
}

Second, your client should locate the desired server by name. To do so, make an HTTP connection to the catalog server, download the JSON representation of the service information, and find the entry that has type=="hashtable" and project equal to your server name. Then, connect to the indicated host and port number.

Modify your client and server to make use of a project name on the command line. Your server should be started like this:

python HashTableServer.py YOURNETID-a4

And then start your client using the same project name:

python TestPerf.py YOURNETID-a4

Your client and server will now be able to find each other, no matter where they are located.

Take Care: To avoid collisions between students, please make sure to use a project name that contains your netid. (Honor system.)

Part 2: Client Robustness

To this point, your client has done the straightforward thing of connecting, sending a request, and waiting for a result. However, there are a variety of things that may go wrong in this sequence: the network could be interrupted, the server could crash, or (even worse) the server might get stuck and not send any response, causing the client to wait forever.

Modify HashTableClient so that if any of these undesired events happen:

Failure to lookup the name in the catalog.
Failure to connect to the server.
Failure to send a request.
Failure to read a response within five seconds.

then HashTableClient should print a short message, take a pause, reconnect, and try again. Note that the pause is important: if the client action fails quickly, you shouldn’t flood the network with rapid retries. A good policy is exponential backoff: wait one second after the first failure, then two seconds, four seconds, etc. until success is achieved. (The next failure should wait one second, two seconds.) From the caller’s perspective, these retries should be completely invisible. It should look like the single function call just took a little longer before completing correctly.

(Now you see why the operations must be idempotent!)

Once you have this working, test it by starting, stopping, and killing the client and server in arbitrary combinations. Whatever happens, you should observe that the client always (after a brief delay) succeeds in completing its operations, and never returns a failure to the caller. Then move on to the next step:

Testing and Measurement

Repeat the tests and measurements from the prior assignment:

TestBasics should perform basic functionality tests.
TestPerf should measure the throughput of each operation.
TestOutliers should measure the fastest and slowest of each operation.

However, this time, kill and restart the server in the middle of each test. The client program should briefly pause, reconnect, and then keep going without losing any results.

Turning In

Please review the general instructions for submitting assignments.

Turn in all of your source code, along with a lab report titled REPORT that describes the following in detail:

A summary of the throughput (ops/time) and latency (time/op) observed for each kind of RPC from TestPerf
A summary of the fastest and slowest operations observed in TestOutliers
A discussion of the significance of your results, noting how they compare to the results from the prior assignment.