View on GitHub

CSE 40771 - Distributed Systems

A5 - Replication

Overview

In this assignment, you will deploy multiple instances of the hashtable server in order to form a high-capacity, resilient cluster. The server should remain unchanged (or nearly so) from the previous assignment, and most of the work will happen in the client. As before, you will perform a performance evaluation to see the benefits and costs of converting the system into a cluster.

Clustering Technique

The objective of clustering servers together is to increase the throughput, capacity, and reliability of the service by distributing the keys and values across multiple servers. In this setup, each server will function as its own independent hashtable, storing some fraction of the data, with no awareness of the other servers. The clients are responsible for deciding which server(s) will store a given key-value pair in a consistent way.

Let us suppose that there are N servers, numbered 0 through N-1. When inserting an item into the distributed hash table, the client will compute a hash function H(s) on the key string, yielding an index I, which may be arbitrarily large.

I = H(key)

To map the index number to a specific server S, simply take the modulus by the total number of servers:

S = H(key) % N

And now S is a number from 0 to N-1, indicating the primary server on which a given key-value is stored.

If we have a good hash function, and the keys are reasonably distributed, then the key-value pairs will be more or less evenly distributed across physical servers, in a way that is consistently computable. This will provide N times the capacity and throughput of a single server, however, it leaves the data vulnerable to server failure. If one server crashes permanently, then 1/N of the total dataset will be lost.

To add redundancy, we can store multiple copies of each key-value pair on K different servers, indicating by S, (S+1)%N, …, (S+K) % N. This will ensure that there are K different copies of each key-value pair. Of course, this now means that any modification of a key-pair requires interacting with K different servers, while reading a key-value pair only requires interacting with one. However, a scan operation will have to interact with all of the servers in order to consider all of the data.

Client Reliability

In the assignments so far, we have not considered the case of what happens when a server is crashed or unreachable: the HashTableClient simply throws an exception and the program stops. Now with multiple servers and copies of data available, the client can take some alternate actions.

When attempting to read a key-value, the client should attempt to access (any) one of the replicas. If that access fails, then the client (ClusterClient) should attempt to access each of the other replicas in turn until it succeeds. If none of the replicas are available, the client should wait five seconds and then start over, until it suceeds.

When attempting to write a key-value, the client should attempt to write each replica in turn. If any one replica cannot be written, then the client should wait five seconds and then try that replica again, repeating until it succeeds. Keep going until all replicas are written.

Under this reliability scheme, the client program should never crash. If a server becomes unavailable, the client should be able to continue to read all values, and even write those not affected by the server crash. Once the necessary servers restart, the client should continue operation.

Requirements

Use your previous assignment as a starting point for this one. The server should remain unchanged. Your client programs should be modified as follows.

First, create a new class called ClusterClient to implement the clustering technique described above. ClusterClient should be designed to work correctly for any values of K and N such that 1 <= K <= N. This class should keep track of the various servers, implement the clustering method, and provide the same methods (insert, lookup, remove, scan) as HashTableClient. To actually connect and communicate with a given server, ClusterClient should just HashTableClient as needed, since the communication with an individual server will not have changed materially.

Then, modify your TestXXX programs to invoke ClusterClient instead of HashTableClient. Change the command line arguments so that each program is invoked like this:

python TestBasics.py <name> <N> <K>

Where name is a base name for your server cluster, N is the number of servers, and K is the number of copies of each data item. The client should put name and N together to determine the names of the server in the cluster. For example, if you run TestBasics.py mytable 3 2, then ClusterClient should look for servers named mytable-0, mytable-1, and mytable-2 and make two copies of each data item.

For your initial functional testing, it will be sufficient to run several servers on the same machine. Since each server will need its own persistent storage (table.txn and table.ckpt), be sure to start each server in its own fresh, empty directory. You can just leave the servers running continuously while you test and debug your client.

We suggest that you start off with N=3 and K=1 and get that debugged and working. Then try increasing K, and ensure that each data item is recorded on multiple nodes. After that, try a few different combinations of N and K. Whenever N changes, you will need to stop your servers, delete their state, and start a new set.

Once you are satisfied that the system works correctly, then you are ready to test for performance.

Performance Evaluation

When you are ready for performance evaluation, you should run each server in the cluster on a different studentXX machine. (Again, be sure that they each run in their own directory.) Evaluate the performance of your clustering technique in the following configurations:

Use the same technique as the previous assignment: run multiple instances of TestPerf simultaneously until the performance “levels out” in some way. Record the performance of each individual client, as well as the sum of the throughputs, and arrange the results nicely in a table. Repeat this process for each of the the three configurations.

What to Turn In

Please review the general instructions for submitting assignments.

Turn in all of your source code, along with a lab report titled REPORT that describes the following in detail: