Checkpoint/restart is the prevailing approach to application fault tolerance on large-scale systems. The largest HPC systems already consist of millions of cores, and we expect core counts to continue to increase. In these large-scale environments, a confluence of issues, including extremely low system mean time between failures (MTBF) and increased I/O pressure, has raised concerns about the continued viability of checkpoint/restart-based fault tolerance. The size of an application's checkpoint is the key driving factor in checkpoint/restart performance. To minimize checkpoint volume, we have developed a library that uses readily available compression algorithms to transparently compress application checkpoints on the fly. Using this library on a number of key DOE simulation workloads, we show that this method can greatly increase the viability of checkpoint/restart for current and future systems. Figure 1 shows the application efficiency (the percentage of application run time spent performing useful work) of this compression technique in comparison with a traditional approach (labeled “baseline”) and an optimal incremental checkpointing approach. This method significantly outperforms incremental checkpointing, and the best performance is obtained by combining incremental checkpointing with checkpoint compression.
March 1, 2013
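
The idea of compressing a checkpoint as it is written can be illustrated with a minimal sketch. This is not the library described above: the abstract does not name its API or the compression algorithms it uses, so the sketch assumes Python's standard zlib module as a stand-in for a "readily available" algorithm, and the names write_checkpoint, read_checkpoint, and state are hypothetical. It also compresses through an explicit call rather than transparently.

```python
import io
import pickle
import zlib


def write_checkpoint(state, stream, level=6):
    """Serialize `state`, compress it with zlib, and write it to `stream`.

    Returns (raw_bytes, compressed_bytes) so the caller can track the
    reduction in checkpoint volume. zlib is an assumed stand-in; the
    source does not say which algorithms its library uses.
    """
    raw = pickle.dumps(state, protocol=pickle.HIGHEST_PROTOCOL)
    compressed = zlib.compress(raw, level)
    stream.write(compressed)
    return len(raw), len(compressed)


def read_checkpoint(stream):
    """Read a compressed checkpoint from `stream` and rebuild the state."""
    return pickle.loads(zlib.decompress(stream.read()))


# Hypothetical application state: a highly repetitive field, chosen so
# that it compresses well. Real simulation data may compress less.
state = {"step": 1000, "field": [float(i % 16) for i in range(100_000)]}

buffer = io.BytesIO()  # stands in for a checkpoint file on stable storage
raw_size, compressed_size = write_checkpoint(state, buffer)

# Restart path: rewind and restore the state from the compressed bytes.
buffer.seek(0)
restored = read_checkpoint(buffer)
```

The ratio of compressed_size to raw_size is the fraction of the original checkpoint volume that still has to reach storage. How much this helps in practice depends on the data and on the time spent compressing, which this sketch does not measure.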