
Informatics Algorithms

Informatics is the term commonly used for calculations that are more data-intensive than compute-intensive. Such calculations are often limited by the size of the data sets that can be handled rather than by the total amount of raw computation they require. Algorithms for data mining, classification, machine learning, and pattern matching can fall into this category.

Informatics is now being performed on terabyte and even petabyte data sets on large distributed cloud computing platforms via programming paradigms such as MapReduce, implemented in software packages like Hadoop. With my background in large-scale parallel computing, I'm interested in seeing whether traditional supercomputers can also be used effectively for these kinds of tasks.

To experiment with this, we've written two different software packages.

The first implements MapReduce on top of distributed-memory message passing (MPI). Our open-source software is called the MapReduce-MPI (MR-MPI) library, and can be downloaded here. The doc pages for the library describe the software in more detail.
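
As a rough illustration of the MapReduce pattern in a distributed-memory setting, here is a minimal sketch in plain Python. It does not use the MR-MPI API: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical, the "ranks" are simulated as list entries inside one process, and the shuffle step only stands in for the all-to-all message exchange that a real message-passing implementation would perform.

```python
from collections import defaultdict


def map_phase(chunks, map_fn):
    """Each simulated rank applies map_fn to its own chunk, emitting (key, value) pairs."""
    return [[pair for item in chunk for pair in map_fn(item)] for chunk in chunks]


def shuffle_phase(per_rank_pairs, nranks):
    """Send every pair to the rank that owns its key.

    Assumption: ownership is decided by hashing the key. In a real
    distributed-memory code this step would be an all-to-all exchange.
    """
    owned = [defaultdict(list) for _ in range(nranks)]
    for pairs in per_rank_pairs:
        for key, value in pairs:
            owned[hash(key) % nranks][key].append(value)
    return owned


def reduce_phase(owned, reduce_fn):
    """Each simulated rank reduces the list of values collected for each key it owns."""
    result = {}
    for table in owned:
        for key, values in table.items():
            result[key] = reduce_fn(key, values)
    return result


def word_count(chunks):
    """Count words across all chunks, one chunk per simulated rank."""
    nranks = len(chunks)
    mapped = map_phase(chunks, lambda line: [(word, 1) for word in line.split()])
    owned = shuffle_phase(mapped, nranks)
    return reduce_phase(owned, lambda key, values: sum(values))


# Two simulated ranks, each holding one line of text.
counts = word_count([["a b a"], ["b c"]])
```

Because every key is routed to exactly one owner before the reduce step, each owner sees all values for its keys, which is what lets the reduce run independently on every rank.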

Collaborators on the MapReduce-MPI library:

The second is a small open-source library that serves as a framework for running streaming calculations in a pipelined fashion, as a set of independent processes passing a continuous stream of data among themselves. The library is called PHISH, for Parallel Harness for Informatic Stream Hashing, because fish swim in a stream. The doc pages for the library describe the software in more detail.
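
The pipelined streaming idea can be sketched in a few lines of Python. This is not the PHISH API: the names (stage, run_pipeline, split_words, keep_long) are hypothetical, and threads connected by in-memory queues stand in for the independent processes and message channels a real streaming framework would use.

```python
import queue
import threading

DONE = object()  # sentinel that marks the end of the stream


def stage(fn, inbox, outbox):
    """Apply fn to each datum read from inbox and forward its outputs to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # pass the end-of-stream marker downstream
            return
        for out in fn(item):
            outbox.put(out)


def run_pipeline(source, fns):
    """Chain the stage functions with queues, feed the source, collect the final output."""
    queues = [queue.Queue() for _ in range(len(fns) + 1)]
    workers = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(fns)
    ]
    for worker in workers:
        worker.start()
    for item in source:
        queues[0].put(item)
    queues[0].put(DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for worker in workers:
        worker.join()
    return results


def split_words(line):
    """First stage: break a line into words."""
    return line.split()


def keep_long(word):
    """Second stage: pass along only words longer than three characters."""
    return [word] if len(word) > 3 else []


result = run_pipeline(["fish swim in a stream"], [split_words, keep_long])
```

Each stage starts work as soon as its first datum arrives, so the stages overlap in time instead of waiting for the whole input to be processed by the stage before them.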

Collaborator on the PHISH library:


This paper describes the MR-MPI library and several MapReduce graph algorithms for informatics problems.

MapReduce in MPI for Large-Scale Graph Algorithms, S. J. Plimpton and K. D. Devine, Parallel Computing, 37, 610-632 (2011). (abstract) (preprint)