MPI meets Multicore

Last month I had the great thrill of being asked to give a talk at the SPIE Astronomy Conference (thanks to Hilton Lewis for inviting me). I gave a broad talk on the state of scientific computing. The issue I got the most feedback on is an odd stress I've noticed between MPI and the proliferation of multicore chips. It goes like this:

MPI, the Message Passing Interface, is the predominant API for building applications that run on scientific clusters. There are Java versions; one of my favorites is an MPI-like system called MPJ Express. MPI has been around for years, and there's a large body of existing MPI applications in C and Fortran.
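
For those who haven't seen it, here's a minimal sketch of what MPJ-Express-style code looks like, using the mpiJava-flavored API that MPJ Express implements (the class name and the message contents are made up for illustration, and exact signatures can vary between versions):

    import mpi.MPI;

    public class Ping {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);                      // start the MPI runtime
            int rank = MPI.COMM_WORLD.Rank();    // this process's id within the job
            int size = MPI.COMM_WORLD.Size();    // total number of processes

            int[] buf = new int[1];
            if (rank == 0 && size > 1) {
                buf[0] = 42;
                // send one int to rank 1, message tag 0
                MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, 1, 0);
            } else if (rank == 1) {
                // matching receive from rank 0
                MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, 0, 0);
                System.out.println("rank " + rank + "/" + size + " received " + buf[0]);
            }
            MPI.Finalize();
        }
    }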

On the one hand, MPI is a wonderful thing because it enables scientific applications to take advantage of large clusters. On the other hand, it's horrible because it forces applications to ship large quantities of data around, and developers quickly find that communication overhead far overshadows the actual computation. An immense amount of cleverness goes into making the communication efficient and into minimizing how much of it there is. For example, computational fluid dynamics applications will sometimes do adaptive, non-uniform partitioning of the simulation volume so that regions with highly detailed interactions (like vortices) don't get split across compute nodes. Scientific applications generally get much better CPU utilization if they run in one big address space with all of the CPUs having direct access to it.
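
To make that overhead concrete, here's a rough sketch of the classic pattern: a 1-D domain decomposition where every simulation step starts by shipping boundary ("halo") values to the neighboring ranks. The sizes and names are invented for illustration; the point is that the exchange happens on every step, whether or not the local computation is big enough to hide it:

    import mpi.MPI;

    public class HaloStep {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();

            int n = 1000000;                       // interior cells owned by this rank
            double[] u = new double[n + 2];        // plus one ghost cell at each end
            int left  = (rank == 0)        ? MPI.PROC_NULL : rank - 1;
            int right = (rank == size - 1) ? MPI.PROC_NULL : rank + 1;

            for (int step = 0; step < 100; step++) {
                // Every step begins by shipping boundary values to the neighbours;
                // this is the communication that can swamp the actual arithmetic.
                MPI.COMM_WORLD.Sendrecv(u, 1,     1, MPI.DOUBLE, left,  0,
                                        u, n + 1, 1, MPI.DOUBLE, right, 0);
                MPI.COMM_WORLD.Sendrecv(u, n,     1, MPI.DOUBLE, right, 1,
                                        u, 0,     1, MPI.DOUBLE, left,  1);
                // ... local stencil update over u[1..n] goes here ...
            }
            MPI.Finalize();
        }
    }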

But there's a huge catch: these applications are mostly single threaded, having been designed for clusters where each node is a single-CPU machine. Now the shift to multicore is happening, and folks are building clusters of SMPs. The tendency is to run multiple copies of the application on each machine, one per core, using MPI within the machine to communicate between the copies. This is crazy, especially on machines with lots of cores, which will be the norm in not too many years. Even though the communication within a single machine is often highly optimized, there's still overhead.

To get the best CPU utilization, the apps have to be written to be multithreaded within a cluster node and to use MPI between nodes. This is the worst of both worlds because you have to architect for threading and clustering at the same time. The threading part is pretty straightforward in the Java world because we have great threading facilities, but folks with bags of Fortran code have trouble (auto-parallelizing matrix code doesn't help nearly enough).
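
Here's a rough sketch of what that hybrid looks like in Java: one MPI process per node, plain Java threads across that node's cores, and MPI only for the between-node step. The class name and the toy computation are made up for illustration:

    import mpi.MPI;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class HybridNode {
        public static void main(String[] args) throws Exception {
            MPI.Init(args);                                  // one MPI process per node
            int cores = Runtime.getRuntime().availableProcessors();
            ExecutorService pool = Executors.newFixedThreadPool(cores);

            final double[] chunk = new double[1 << 20];      // this node's slice of the data

            // Threads share the node's address space: no copying between cores.
            Future<?>[] work = new Future<?>[cores];
            for (int t = 0; t < cores; t++) {
                final int id = t;
                work[t] = pool.submit(() -> {
                    int lo = id * chunk.length / cores;
                    int hi = (id + 1) * chunk.length / cores;
                    for (int i = lo; i < hi; i++) chunk[i] = Math.sqrt(i);  // stand-in compute
                });
            }
            for (Future<?> f : work) f.get();                // wait for this node's threads

            // MPI is used only between nodes, e.g. to combine per-node results at rank 0.
            double[] local = { chunk[0] }, global = new double[1];
            MPI.COMM_WORLD.Reduce(local, 0, global, 0, 1, MPI.DOUBLE, MPI.SUM, 0);

            pool.shutdown();
            MPI.Finalize();
        }
    }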

This dual architecture is (almost) a non-issue for folks writing enterprise applications using the Java EE frameworks, because the containers deal with both threading and clustering.

June 23, 2006