### Informatics Algorithms and Big-Data Processing

Informatics is the term commonly used for calculations that are more
data-intensive than compute-intensive. Such calculations are often
limited by the size of the data sets that can be handled rather than
by the total amount of raw computation they require. Algorithms for
data mining, classification, machine learning, and pattern matching
can fall into this category.

Informatics is now being performed on terabyte and even petabyte data
sets on large distributed cloud computing platforms via
programming paradigms such as MapReduce, implemented in
software packages like Hadoop. With my background in
large-scale parallel computing, I'm interested in seeing whether
traditional supercomputers can also be used effectively for these
kinds of tasks.

To experiment with this, we've written two software packages.

The first implements MapReduce on top of distributed-memory message
passing (MPI). Our open-source software is called the
MapReduce-MPI (MR-MPI) library, and can
be downloaded here. The doc pages for the
library describe the
software in more detail.
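
For readers unfamiliar with the paradigm, here is a minimal single-process sketch of the MapReduce pattern in Python, using word counting as the example. It is not the MR-MPI API: the function names (`map_phase`, `group_by_key`, `reduce_phase`, `word_count`) are hypothetical and chosen only for illustration. In a distributed-memory setting, the grouping step is where key/value pairs with the same key would have to be communicated to the same process.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every record and collect the emitted (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs


def group_by_key(pairs):
    """Gather all values that share a key into one list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped, reduce_fn):
    """Call reduce_fn once per key with the full list of that key's values."""
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def emit_words(line):
    """Map function: emit (word, 1) for each whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def sum_counts(word, ones):
    """Reduce function: total the per-occurrence counts for one word."""
    return sum(ones)


def word_count(lines):
    """Chain the three phases: map, group by key, reduce."""
    return reduce_phase(group_by_key(map_phase(lines, emit_words)), sum_counts)
```

Calling `word_count(["a b a", "b c"])` returns a dictionary mapping each word to its number of occurrences.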

Collaborators on the MapReduce-MPI library:

The second is a small open-source library that serves as a framework
for running streaming calculations in a pipelined fashion, as a set of
independent processes passing a continuous stream of data among
themselves. The library is called PHISH, for Parallel
Harness for Informatic Stream Hashing, because fish swim in a stream.
The doc pages for the library describe the
software in more detail.
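
The pipelined style can be sketched with Python generators, where each stage consumes items from the previous stage one at a time. This is only a single-process stand-in for the idea and is not the PHISH API; PHISH runs the stages as independent processes. The stage names (`source`, `keep_even`, `square`, `run_pipeline`) are hypothetical.

```python
def source(items):
    """First stage: emit the input items one at a time."""
    for item in items:
        yield item


def keep_even(stream):
    """Middle stage: pass along only the even numbers."""
    for x in stream:
        if x % 2 == 0:
            yield x


def square(stream):
    """Last stage: transform each item as it arrives."""
    for x in stream:
        yield x * x


def run_pipeline(items):
    """Connect the stages; each item flows through all stages before the next is read."""
    return list(square(keep_even(source(items))))
```

Because every stage is lazy, no stage needs to hold the whole data set in memory, which is the property that makes the approach attractive for continuous streams.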

Collaborator on the PHISH library:

This paper describes the use of PHISH to identify connected
components in a stream of graph edges.

**Maintaining connected components for infinite graph streams**,
J. W. Berry, M. Oster, C. A. Phillips, S. J. Plimpton, T. M. Shead,
BigMine-13, a KDD13 workshop - 2nd International Workshop on Big Data,
Streams and Heterogeneous Source Mining, Chicago, IL, Aug
2013. (abstract)
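
To give a feel for the problem, here is a minimal sketch that tracks connected components as edges arrive one at a time, using a textbook union-find structure. It is an assumption-laden illustration, not the algorithm from the paper: it runs in a single process and keeps every vertex in memory, so it does not address infinite streams. The class name `StreamingComponents` and its methods are hypothetical.

```python
class StreamingComponents:
    """Union-find over vertices seen so far in an edge stream (illustrative only)."""

    def __init__(self):
        # Maps each vertex to its parent; a root is its own parent.
        self.parent = {}

    def find(self, v):
        """Return the root of v's component, registering v if it is new."""
        self.parent.setdefault(v, v)
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every vertex on the path directly at the root.
        while self.parent[v] != root:
            next_v = self.parent[v]
            self.parent[v] = root
            v = next_v
        return root

    def add_edge(self, u, v):
        """Process one edge from the stream, merging components if needed."""
        root_u, root_v = self.find(u), self.find(v)
        if root_u != root_v:
            self.parent[root_u] = root_v

    def connected(self, u, v):
        """Report whether u and v are currently in the same component."""
        return self.find(u) == self.find(v)
```

Feeding it the edges `(1, 2)`, `(3, 4)`, and `(2, 3)` in that order leaves vertices 1 through 4 in a single component.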

This paper describes the MR-MPI library and several MapReduce graph
algorithms for informatics problems.

**MapReduce in MPI for Large-Scale Graph Algorithms**, S. J. Plimpton
and K. D. Devine, Parallel Computing, 37, 610-632
(2011). (abstract)
(preprint)