Prof. Douglas Thain at Notre Dame

| About | Teaching | Research | Papers | Talks | Service | Blog | Lab |

Blog: Clusters, Grids, and Clouds:It's Turtles All the Way Down

01 Oct 2008 - Douglas Thain

In this blog, I'll discuss open problems and new developments in the field of distributed systems.

A distributed system is a set of independent computers that accomplish a meaningful task by working together. The field covers a huge range of interesting systems, from sensor networks to multi-tier web server farms. For the most part, I will be writing about the sort of large computing systems used to attack very large problems generated by big science and big business.

These systems go by many different names that mean very similar things:

Although people love to argue about these terms, the differences aren't terribly important. In fact, the terms often represent different facets of the same systems. When you submit a job to a cluster , you are treating it like a cloud , because you don't care on exactly which CPU the job runs. A grid is usually built up from multiple clusters and connected together by wide area networks and software. A cloud is usually contained in a single data center. If you need to access it remotely, then you need a grid .

If you want to really want to mix it up, read about Ed Walker's MyCluster , which is a system that uses a grid to allocate a personal cluster , which you can then use as a cloud!

Regardless of what we call these systems, the challenges are the same. How do we design the interactions between pieces so that the system is robust to failures, has acceptable performance, and protects the interests of all of the parties involved? Future posts in this column will focus on these fundamental problems.

So that you know where I am coming from, I am a professor at the University of Notre Dame , where I teach classes in distributed systems, operating systems, and compilers. I direct the Cooperative Computing Lab where a great group of students does the hard work to design and test new distributed systems. We develop solutions to problems in physics, chemistry, biometrics, and other fields, and them evaluate them on our shared testbed of 700 CPUs. I'll be writing more about these ideas in this column.


Next: Troubleshooting Distributed Systems via Data Mining »