Informatics Algorithms and Big-Data Processing

Informatics is the term commonly used for calculations that are more data-intensive than compute-intensive. Often such calculations are limited by the size of the data sets that can be handled rather than by the total amount of raw computation they require. Algorithms for data mining, classification, machine learning, and pattern matching can fall in this category.

Informatics is now being performed on terabyte and even petabyte data sets on large distributed cloud computing platforms via programming paradigms such as MapReduce, implemented in software packages like Hadoop. With my background in large-scale parallel computing, I'm interested in seeing whether traditional supercomputers can also be used effectively for these kinds of tasks.

To experiment with this, we've written two different software packages.

The first implements MapReduce on top of distributed-memory message passing (MPI). Our open-source software is called the MapReduce-MPI (MR-MPI) library, and can be downloaded here. The doc pages for the library describe the software in more detail.
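As a rough illustration of the programming model (not the MR-MPI API), a MapReduce computation can be viewed as three stages: a map that emits key/value pairs, a grouping step that brings all values for a key together, and a reduce that processes each key's values. The serial Python sketch below counts vertex degrees in a small edge list. The function names `map_edges`, `collate`, and `reduce_degree` are hypothetical, chosen only for this example; in a distributed-memory setting the grouping step is where data would have to move between processes.

```python
from collections import defaultdict

def map_edges(edges):
    """Map stage: emit a (vertex, 1) pair for each endpoint of each edge."""
    for u, v in edges:
        yield u, 1
        yield v, 1

def collate(pairs):
    """Grouping stage: gather all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_degree(groups):
    """Reduce stage: sum the values for each key to get a vertex degree."""
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical toy input: a four-edge graph.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
degrees = reduce_degree(collate(map_edges(edges)))
print(degrees)
```

Everything here runs in one process, so it shows only the data flow, not the parallel communication that a message-passing implementation would add.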

Collaborators on the MapReduce-MPI library:

Karen Devine, Sandia
Jon Berry, Sandia

The second is a small open-source library that serves as a framework for running streaming calculations in a pipelined fashion, as a set of independent processes passing a continuous stream of data among themselves. The library is called PHISH, for Parallel Harness for Informatic Stream Hashing, because fish swim in a stream. The doc pages for the library describe the software in more detail.
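The pipelined, hash-routed style of computation can be sketched in a few lines of Python. This is not the PHISH API: the names `source`, `route`, and `run_pipeline` are hypothetical, and the "workers" are plain in-process counters standing in for independent processes. The sketch assumes each datum carries a string key and that a stable hash of the key decides which downstream worker receives it, so all datums with the same key arrive at the same place.

```python
import zlib
from collections import Counter

def source(edges):
    """First stage: turn a stream of edges into a stream of vertex keys."""
    for u, v in edges:
        yield u
        yield v

def route(key, nworkers):
    """Pick a downstream worker from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % nworkers

def run_pipeline(edges, nworkers=2):
    """Send each key to one worker; each worker keeps its own running counts."""
    workers = [Counter() for _ in range(nworkers)]
    for key in source(edges):
        workers[route(key, nworkers)][key] += 1
    return workers

# Hypothetical toy stream of graph edges.
workers = run_pipeline([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")])
for rank, counts in enumerate(workers):
    print(rank, dict(counts))
```

Because the source is a generator, datums are consumed one at a time rather than being loaded up front, which is the property a streaming framework relies on. `zlib.crc32` is used instead of Python's built-in `hash` because the latter is randomized per run for strings.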

Collaborator on the PHISH library:

Tim Shead, Sandia

This paper describes the PHISH library and several streaming graph algorithms.

Streaming data analytics via message passing with application to graph algorithms, S. J. Plimpton and T. Shead, Journal of Parallel and Distributed Computing, 74, 2687-2698 (2014). (abstract)

This paper describes the use of PHISH to identify connected components in a stream of graph edges.

Maintaining connected components for infinite graph streams, J. W. Berry, M. Oster, C. A. Phillips, S. J. Plimpton, T. M. Shead, BigMine-13, a KDD13 workshop - 2nd International Workshop on Big Data, Streams and Heterogeneous Source Mining, Chicago, IL, Aug 2013. (abstract)
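To make the problem concrete, the sketch below maintains connected components as edges arrive one at a time, using a standard union-find (disjoint-set) structure. It is not the algorithm from the paper above: the class name `ConnectedComponents` and its methods are hypothetical, and the sketch stores every vertex it has seen, so it does not address the memory limits that an unbounded stream imposes.

```python
class ConnectedComponents:
    """Incremental connected components via union-find (illustrative only)."""

    def __init__(self):
        self.parent = {}  # vertex -> parent vertex; a root is its own parent

    def find(self, v):
        """Return the root of v's component, registering v if it is new."""
        self.parent.setdefault(v, v)
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every vertex on the path directly at the root.
        while self.parent[v] != root:
            self.parent[v], v = root, self.parent[v]
        return root

    def add_edge(self, u, v):
        """Process one edge from the stream, merging components if needed."""
        ru, rv = self.find(u), self.find(v)
        if ru != rv:
            self.parent[ru] = rv

    def connected(self, u, v):
        """Report whether u and v are currently in the same component."""
        return self.find(u) == self.find(v)

# Hypothetical toy stream of edges.
cc = ConnectedComponents()
for edge in [("a", "b"), ("c", "d"), ("b", "c"), ("e", "f")]:
    cc.add_edge(*edge)
print(cc.connected("a", "d"), cc.connected("a", "e"))
```

Queries can be interleaved with edge arrivals at any point, which is what makes this structure a natural starting point for thinking about streaming graph problems.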

This paper describes the MR-MPI library and several MapReduce graph algorithms for informatics problems.

MapReduce in MPI for Large-Scale Graph Algorithms, S. J. Plimpton and K. D. Devine, Parallel Computing, 37, 610-632 (2011). (abstract)