Prof. Douglas Thain at Notre Dame
Blog: The Wavefront Abstraction
13 Nov 2008 - Douglas Thain
This is the third in a series of posts on the idea of abstractions for distributed computing on clusters, clouds, and grids. An abstraction is a simple interface that allows you to scale up well-structured problems to run on hundreds or thousands of computers at once.
The Wavefront abstraction came up in a discussion with several economists. You want to compute a recurrence relation where each result depends on one or more previous results. The user provides initial conditions along the edges of a matrix, and then you can compute F at position (1,1). Once you do that, then you can compute F at (1,2) and (2,1), and so on. The work progresses like a wave across the matrix, hence the name Wavefront. Here is what it looks like:
We have a first implementation of this abstraction that can run on a Condor pool of multicore machines. You simply run it by stating the function, size of the matrix, and providing some files that state the initial conditions:
wavefront func.exe 100 100
This abstraction is interesting for several reasons. First, it needs a variable number of CPUs over time. Even if you had an infinite number of CPUs, it can only use one in the first step, two in the second, and so on until the wavefront reaches the diagonal of the matrix, after which it decreases again. So, it would be impossible to program this efficiently in a system like MPI where you have to choose a fixed number of CPUs. Instead, you want to allocate more CPUs over time. For example, here is a timeline of a Wavefront run on a 64-CPU cluster. The red line shows the number of CPUs in use, and the green line shows the percent of the problem completed:
Second, the problem has a certain degree of asychrony built in. You do not need to run each diagonal slice of the system in lock step. Instead, each cell can be computed as soon as its neighbors are down. Because of this, different parts of the problem can be delegated to different nodes, allowing them to run and finish at their own pace. If each function is fast, then you can delegate an entire square chunk of the task to a remote processor, and allow it to complete independently.
You can see this in the progress images generated by our implementation of Wavefront. These images shows the state of a workload. Green boxes indicated completed cells, blue indicate cells currently running, yellow are ready to run, and red are not able to run. This is an example of a 10x10 Wavefront running on only five processors:
You can read more about the Wavefront abstraction at the
Cooperative Computing Lab
.
« Prev: An Abstraction for Ensemble Classifiers
Next: Visualizing a Large Distributed System with Enavis »