A3 - Persistence

Overview

Please review the general instructions for preparing assignments.

In this assignment, you will convert your (previously) in-memory spreadsheet into a persistent service by making use of the checkpointing-and-logging approach discussed in class. Again, you will measure the performance of the operations, and evaluate the cost of persistent storage against memory.

Use your prior assignment submission as the starting point, and continue with the same general code organization as before. It’s OK to add additional files and classes to the project, as long as the server main program is still called SpreadSheetServer.py and the test programs are called TestXYZ.py

Checkpoint and Transaction Log

An online storage service must write data to a persistent location (such as disk or non-volatile memory) so that it does not lose information when the server software or hardware crashes. In a production system, a server would ordinarily pass data to a database server to store it safely, but that’s just passing the buck – the database server now has the same problem of storing data persistently. Let’s solve it ourselves directly.

There are two very simple approaches that don’t scale very well. One approach would be to just write out the entire state of the spreadsheet to a file whenever any element of the table changes. That could work (if we are careful to commit atomically) but would become very expensive as the spreadsheet grows. An alternate approach would be to write the contents of each key and value into a separate file, and just update each individual file as it changes, atomically committing the file on each change. Again, this could work, but would be very expensive if the table grows to millions of entries. Each file in a filesystem requires an inode, a directory entry, and at least one data block (typically 4KB), which would become very inefficient at large scales.

A common solution is to strike a balance between the two approaches by using two data structures stored in separate files: a checkpoint and a transaction log.

A checkpoint file (sheet.ckpt) is a direct representation of the state of the server at some (current or previous) point in time. In this case, it is a direct dump of the state of the hash table: all the keys and their corresponding values. (For this project, the exact representation of the checkpoint file is up to your discretion.)
A transaction log (sheet.log) captures the sequential list of updates applied to a server, since the last checkpoint file was written. The log is structured as a sequence of events that completely and accurately describe each individual change. It is common for each log entry to include additional metadata – the time of change, the user making the change, the previous value, etc – so that it is possible to “audit” prior events and reconstruct the history of the data store. (For this project, the exact representation of the transaction log is up to your discretion.)

Together, the checkpoint and transaction log describe the current state of the spreadsheet. To obtain the current state of the system, the server need only load the checkpoint file in its entirety, and then read the transaction log, “playing back” each entry as a series of modifications to the table in memory. Once all entries are played back, the most up-to-date state of the spreadsheet is present in memory.

Requirements

For this assignment, you will modify your server to store data persistently using a checkpoint file and transaction log. Please call your checkpoint file sheet.ckpt and your transaction log sheet.log, and proceed as follows:

The server should maintain a spreadsheet in memory, as before. For each operation that does not modify the table, it can simply service the request from memory. But, for each modification made to the table, the server must add an entry to the transaction log before modifying the table in memory. (Don’t forget to flush and fsync to ensure that data is written to disk.)
If the transaction log ever becomes too large, then the server must compact the log by writing out a new complete checkpoint file, and deleting the old checkpoint and transaction log. This compaction must be done atomically so that, if interrupted, no data is lost.
On startup, the server should read the checkpoint file into memory, then open the transaction log and “play back” all entries in the log, modifying the spreadsheet in memory as it goes. Once all entries are ready, the server is “caught up” and can continue normal operations. (And, if no checkpoint and transaction log are present, it should assume that this is a “fresh start” and create files as appropriate.)

Invocation

From the end user’s perspective, there should be no change in how the server and client are deployed: both clients and servers are run in the same way as before. (In fact, the client should not change at all!)

To stop the server, simply use kill or Control-C to stop the process. If you have designed your checkpoint and transaction log correctly, then you should be able to simply restart the server, and observe that the state of the spreadsheet is recovered. This mode of operation is known as “crash only” design: the server has no notion of a “clean shutdown” and is always using/testing its capability of recovering from a crash.

For this assignment, your server should compact when the transaction log has more than 100 entries. That’s artificially small, but will make it easy to test and observe that compaction is working.

Testing and Measurement

Continue to use the two test programs from the prior assignment:

TestBasics.py should exercise all five RPC operations (insert, lookup, size, remove, query) to test correctness, as before.
TestPerf.py should time the throughput of all five operations in four stages, as before.

And create a third test program:

TestOutliers.py should perform a large number of insert and remove pairs, sufficient to force the server to compress the transaction log multiple times. Time each individual operation and keep track of the fastest and slowest instance of each.

What to Turn In

Please review the general instructions for submitting assignments.

Turn in all of your source code, along with a lab report titled REPORT that describes the following in detail:

An example of the format of your checkpoint file and transaction log. Shows the contents of each after you have performed a few inserts and removes.
A summary of the throughput (ops/time) and latency (time/op) observed for each kind of RPC from TestPerf
A summary of the fastest and slowest operations observed in TestOutliers
A discussion of the significance of your results, noting how they compare to the results from the prior assignment.