A publication of the Advanced Simulation & Computing Division, NA-121.2, NNSA Defense Programs

June 2009

NA-ASC-500-09—Issue 11

Los Alamos Tool Could Improve Checkpoint Bandwidth

Massively parallel applications currently protect themselves from component failure through periodic checkpointing, a process in which applications save their state to persistent storage. This protection becomes increasingly important as systems grow and the number of components increases. Los Alamos National Laboratory is implementing a tool that will dramatically improve checkpoint performance for many applications.

Following a failure, an application can resume computation from the last checkpoint state it saved. For many applications, saving this state into a single shared file is most convenient. With such an approach, writes are often small and not aligned with file system boundaries. Unfortunately for these applications, this preferred data layout results in pathologically poor performance from the underlying file system, which is optimized for large, aligned writes to nonshared files.
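To make the mismatch concrete, the short Python sketch below models one common way a shared checkpoint file gets written: every process writes fixed-size records at interleaved (strided) offsets. The function names, the strided layout, and the use of a single `block_size` as the alignment unit are illustrative assumptions, not details taken from the report. The sketch counts how many writes are fully aligned and how many blocks end up being written by more than one process.

```python
from collections import defaultdict


def strided_writes(num_procs, writes_per_proc, write_size):
    """Return (rank, offset) pairs for a hypothetical interleaved layout
    in which all ranks write fixed-size records into one shared file."""
    return [
        (rank, (step * num_procs + rank) * write_size)
        for step in range(writes_per_proc)
        for rank in range(num_procs)
    ]


def alignment_report(writes, write_size, block_size):
    """Return (aligned_writes, shared_blocks, total_blocks).

    A write counts as aligned when it starts on a block boundary and
    covers a whole number of blocks. A block counts as shared when more
    than one rank writes into it.
    """
    aligned = 0
    block_writers = defaultdict(set)
    for rank, offset in writes:
        if offset % block_size == 0 and write_size % block_size == 0:
            aligned += 1
        first = offset // block_size
        last = (offset + write_size - 1) // block_size
        for block in range(first, last + 1):
            block_writers[block].add(rank)
    shared = sum(1 for ranks in block_writers.values() if len(ranks) > 1)
    return aligned, shared, len(block_writers)


if __name__ == "__main__":
    # Hypothetical sizes: 47,001-byte records against 64 KiB blocks.
    writes = strided_writes(num_procs=4, writes_per_proc=2, write_size=47001)
    print(alignment_report(writes, write_size=47001, block_size=65536))
```

With record sizes that do not divide evenly into the block size, no write is aligned and neighboring processes share blocks, which is the situation the article describes as a poor fit for the underlying file system.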

To address this fundamental mismatch, a consortium of researchers has developed a parallel log-structured file system, PLFS, which sits between the applications and the underlying parallel file system. PLFS remaps an application’s write access pattern into one that is optimized for the underlying file system. In tests on the Panasas ActiveScale Storage System and IBM’s General Parallel File System at Los Alamos and on Lustre at Pittsburgh Supercomputing Center, the researchers have seen that this layer of indirection and reorganization can reduce checkpoint time by up to several orders of magnitude for several important benchmarks and real applications (Figure 1). At Los Alamos, PLFS is currently running on Roadrunner and is being tested on Redtail. See the full report at http://institute.lanl.gov/plfs.
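The idea of a remapping layer can be illustrated with a toy, in-memory Python sketch. It assumes one particular design: each writer appends to its own private log, and a small index records where each logical range landed. The article does not describe PLFS internals, so the class name, the index layout, and the read path here are hypothetical and should not be read as the PLFS implementation.

```python
class ToyLogStructuredFile:
    """In-memory stand-in for one logical shared file (illustrative only)."""

    def __init__(self):
        self.logs = {}   # writer id -> bytearray, an append-only data log
        self.index = []  # (logical_offset, length, writer, log_offset), in write order

    def write(self, writer, logical_offset, data):
        """Append data to the writer's own log and record where it belongs."""
        log = self.logs.setdefault(writer, bytearray())
        self.index.append((logical_offset, len(data), writer, len(log)))
        log.extend(data)  # every physical write is a sequential append

    def read(self, logical_offset, length):
        """Rebuild a logical byte range by replaying the index.

        Entries are applied in write order, so later writes overwrite
        earlier ones. Ranges never written read back as zero bytes.
        """
        out = bytearray(length)
        for off, n, writer, log_off in self.index:
            start = max(off, logical_offset)
            end = min(off + n, logical_offset + length)
            if start < end:
                src = log_off + (start - off)
                out[start - logical_offset:end - logical_offset] = (
                    self.logs[writer][src:src + (end - start)]
                )
        return bytes(out)


if __name__ == "__main__":
    f = ToyLogStructuredFile()
    # Two writers interleave small records in the logical file.
    f.write(0, 0, b"AAA")
    f.write(1, 3, b"BBB")
    f.write(0, 6, b"CCC")
    f.write(1, 9, b"DDD")
    print(f.read(0, 12))     # logical view of the shared file
    print(bytes(f.logs[0]))  # writer 0's private, contiguous log
```

In this sketch the application still sees a single shared file, while each writer's physical output is a contiguous, unshared stream, which is the kind of access pattern the article says the underlying file system handles well. The cost moves to the read path, which has to consult the index.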

Figure 1: A summary of the results, showing that the technique improves checkpoint bandwidth for all seven benchmarks and applications studied by up to several orders of magnitude.



Developed and maintained by Sandia National Laboratories for NA-121.2