Reading List for Graduate Students

The following is a selection of "classic" papers that describe the foundational principles of distributed systems, along with a selection of well-known systems that make use of these principles. Papers marked (REQ) are required weekly readings for all graduate students. The remainder of the paper are deeper references for concepts discussed in class, and are strongly suggested for any student that intends to specialize in distributed systems.

Graduate students will form reading groups of 2-4 students which will meet weekly to digest each paper. First, each student should read the paper once individually, and take a few notes according to the rubric below. Then, gather with your group later in the week to discuss the paper, work out an example, and write up a single group summary.

Summaries are due at 10PM on Mondays and should be submitted as a single PDF document via the "Assignments" tab in Sakai. All group members must contribute to the discussion and writing of the document, and should be named as authors on the submission. One student should submit on behalf of the whole group. All authors will receive the same grade.

A good summary should be one to two pages of clearly written, well organized text. If the paper presents some opinion or advice (e.g. "A Note on Distributed Computing") then you should summarize the arguments made by the paper in your own words, and offer some reflection on the completeness or applicability of the advice. If the paper is presenting a system or an algorithm, then you should address the following questions:

What is the problem that this paper is proposing to solve?
What is the key observation or insight into the problem?
What is the overall structure of the system or algorithmic approach presented as a solution?
How is the solution evaluated, and what are the results?

Advice and Lessons Learned

(REQ) Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall. "A Note on Distributed Computing", Technical Report. PDF
Butler Lampson, "Hints for computer system design", Proceedings of SOSP, 1983. DOI
Saltzer, Reed, and Clark, "End-to-End Arguments in System Design", ACM Transactions on Computer Systems, 1984.
Eric A. Brewer, Lessons from Giant-Scale Services, IEEE Internet Computing. Vol. 5, No. 4. pp. 46-55. July/August 2001. DOI
(REQ) McSherry, Frank, Michael Isard, and Derek G. Murray. "Scalability! But at what COST?" 15th Workshop on Hot Topics in Operating Systems (HotOS XV), 2015. PDF

Remote Procedure Call

Birrel and Nelson, "Implementing Remote Procedure Calls", ACM Transactions on Computer Systems, 1984. PDF
Andrew Tannenbaum and Robert van Renesse, "A Critique of the Remote Procedure Call Paradigm", (unclear: possibly a technical report?), PDF
(REQ) Philip Bogle and Barbara Liskov, "Reducing Cross Domain Call Overhead Using Batched Futures", Proceedings of OOPSALA 1994, DOI
Soumagne, Jerome, et al. "Mercury: Enabling remote procedure call for high-performance computing." 2013 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2013. DOI

Robustness and Correctness

Butler Lampson and Howard Sturgis, "Crash Recovery in a Distributed Data Storage System", Technical Report, Xerox PARC, 1979. PDF
Richard Schlichting and Fred Schneider, "Fail-Stop Processors: an approach to designing fault tolerant computing systems", ACM Transactions on Computer Systems 1;3, 1983. DOI
P. J. Leu, "Concurrent robust checkpointing and recovery in distributed systems", International Conference on Data Engineering, 1988. DOI
Gerard J. Holzmann, "The Model Checker SPIN", IEEE Transactions on Software Engineering, 23:5, 1997. DOI

Time and Ordering

Leslie Lamport, "Time, Clocks, and Ordering of Events in a Distributed System", Communications of the ACM 12:7, 1978. PDF
(REQ) K. Mani Chandy and Leslie Lamport, "Distributed Snapshots: Determining Global States of Distributed Systems", ACM Transactions on Computer Systems, 3:1, 1985. PDF
Rob Strom and Shaula Yemini, "Optimistic recovery in distributed systems", ACM Transactions on Computer Systems, Aug 1985. DOI
D. Jefferson, B. Beckman, F. Wieland, L. Blume, and M. DiLoreto. "Time warp operating system." In Proceedings of the eleventh ACM Symposium on Operating systems principles, pp. 77-93. 1987. PDF
D. L. Mills, "Internet time synchronization: The network time protocol", IEEE Transactions on Communication, 39:10. DOI

Consistency and Replication

(REQ) C. Gray and D. Cheriton, "Leases: An Efficient Fault Tolerant Mechanism for Distributed File Cache Consistency", ACM Symposium on Operating Systems Principles, 1989. DOI
Ladin, Liskov, Shrira, and Ghemaway, "Providing High Availability with Lazy Replication", ACM TOCS 10:4, 1992. DOI
Renesse and Schneider, "Chain Replication for Supporting High Throughput and Availability", USENIX Symposium on Operating System Design and Implementation, 2004. PDF
(REQ) S. Gilbert and N. Lynch, "Perspectives on the CAP Theorem," in Computer, vol. 45, no. 2, pp. 30-36, Feb. 2012. PDF
(REQ) Joseph Hellerstein and Peter Alvaro, "Keeping CALM: when distributed consistency is easy", Communications of the ACM, September 2020, Vol. 63 No. 9. DOI.

Consensus

Hector Garcia-Molina, "Elections in a Distributed Computing System", IEEE Transactions on Computers, volume C-31, number 1, January 1982. DOI
Glenn Ricart and Ashok Agrawala, "An Optimal Algorithm Election for Mutual Exclusion in Computer Networks", Communications of the ACM, volume 24, number 1, January 1981. DOI
K. Mani Chandi, Jayadev Misra, Laura Hass, "Distributed Deadlock Detection", ACM Transactions on Computer Systems 1:2, 1983, DOI
(REQ) Michael Fischer, Nancy Lynch, Michael Paterson, "Impossibility of Distributed Consensus with One Faulty Process", Journal of the ACM, volume 32, issue 2, April 1985. DOI
(REQ) Leslie Lamport, Robert Shostak, and Marshall Pease, "The Byzantine Generals Problem", ACM Transactions on Programming Languages and Systems, volume 4, number 3, July 1982. PDF
Miguel Castro and Barbara Liskov, "Practical Byzantine Fault Tolerance", Proceedings of OSDI, February 1999. PDF
Leslie Lamport, "Paxos Made Simple", ACM SIGACT News, November 2001. PDF
(REQ) Diego Ongaro and John Osterhout, "(RAFT) In Search of an Understandable Consensus Algorithm", USENIX Annual Technical Conference, June 2014. PDF

Security and Trust

Roger Needham and Michael Schroeder,"Using encryption for authentication in large networks of computers", Communications of the ACM, 1978. DOI
Steiner, Neuman, and Schiller, Kerberos: An authentication service for open network systems, USENIX Winter Conference, 1988. PDF
(REQ) Satoshi Nakamoto, Bitcoin: A Peer-to-Peer Electronic Cash System, 2009. PDF

Case Studies in Distributed Systems

(REQ) Ion Stoica, Robert Morris, David Karger, M. Frans Kasshoek, Hari Balakrishnan, "Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications", ACM SIGCOMM Review, volume 31, issue 4, October 2001. DOI
Douglas Thain, Todd Tannenbaum, and Miron Livny, "Distributed Computing in Practice: The Condor Experience", Concurrency: Practice and Experience, 2004. DOI
Petros Maniatis, Mema Roussopoulos, T. J. Giuli, David S. H. Rosenthal, and Mary Baker, "The LOCKSS peer-to-peer digital preservation system", ACM Transactions on Computer Systems, 23:1, 2005. DOI
Sage Weil, Scott Brandy, Ethan Miller, Darrell Long, and Carlos Maltzahn, "Ceph: A scalable, high-performance distributed file system.", Proceedings of USENIX OSDI, 2006. HTML
Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Cluster", Communications of the ACM, volume 51, issue 1, January 2008. DOI
Hunt, Patrick, et al. "ZooKeeper: Wait-free Coordination for Internet-scale Systems." USENIX annual technical conference. Vol. 8. No. 9. 2010.
Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, and Ion Stoica, "Spark: Cluster Computing with Working Sets", Proceedings of USENIX HotCloud, 2010. PDF
Kay Osterhout, Patrick Wendell, Matei Zaharia, Ion Stoica, "Sparrow: Distributed, Low Latency Scheduling", Processings of SOSP 2013. DOI
(REQ) Philippe Dobbelaere and Kyumars Sheykh Esmaili, "Kafka versus RabbitMQ: A comparative study of two industry reference publish/subscribe implementations: Industry Paper", Proceedings of ACM Distributed and Event-based Systems (DEBS), 2017. DOI