Introduction
I have a considerable interest in the use of massively parallel cloud systems for handling hard problems, but unfortunately they currently rely on extremely complex infrastructure that is quite intrusive when it comes to writing applications. My intention is, and has been, to find a simple framework that can be used to provide cloud computing capability for Haskell applications while remaining simple and lightweight.
I have discussed elsewhere my initial implementation of MapReduce as a monadic operation that works in parallel processes on a single machine. Here I take the first steps towards analysing what else is needed to turn this into a working cloud based system.
The basics
As I showed in my paper on MapReduce, nearly all massively parallel algorithms are some kind of version of the basic scatter-gather approach. This works as follows:
foreach processing step divide data into chunks give chunks to processing units processing units transform data gather results from processing units concatenate to form next step's data
It’s useful to reformulate this from the processing unit’s point of view, so we get:
foreach processing step request chunk of data while chunk allocated get chunk transform chunk set transformed chunk request chunk of data
This makes it clear that scatter-gather requires two basic distributed services.
The services
The datastore
First we need a simple central datastore that:
- Accepts chunks and concatenates them together to form a complete dataset
- Can return the complete dataset when asked to
This service is, of course, little more than a mildly intelligent front end on a database table.
Note that in standard MapReduce, the partitioning of data into chunks is performed by the central datastore. In view of our desire to run generalised MapReduce, we have to send the entire dataset to all processing nodes. In fact, if this is handled sensibly (e.g. via IP multicast) then there is no difference in network loading at all.
The broker
Next we need a broker, whose job is simply to wait for processing nodes to ask for chunks to be assigned to them, and either to do so or to inform them that the step is finished and they should proceed to the next step. So the broker has to:
- Keep a track on how many processing nodes there are
- Decide how many chunks to divide each step into
- Wait for nodes to request chunks and hand them off one by one
- Decide when the step is completed
There is also a processing node component to the broker, which in each step accepts the processing function, asks for a chunk and waits for a response. If a chunk is forthcoming, it gets the data, applies the function, puts back the result and asks for another chunk. If none is forthcoming it goes on to the next step, accepting a new processing function and proceeding as before.
This structure suggests that in fact what we have here is some kind of distributed monad or related structure. Indeed, as it is clearly stateful, it is more or less inevitable that it should be.
The next step
I have made some observations about implementation, noting that the datastore is a glorified database table, and the broker could be a kind of monad. In my next piece I will follow up on this, and discuss (at a high level) some aspects of how exactly the services could be built.
Have you seen the recent paper, “Haskell for the Cloud”, http://research.microsoft.com/en-us/um/people/simonpj/papers/parallel/remote.pdf
Thank you. That looks like a good basis on which to build what I want (I had been intending to use http://hackage.haskell.org/package/epass as infrastructure, but this looks much more like what I want.