Implementing Software Resiliency in HPX for Extreme Scale Computing

Gupta, Nikunj; Mayo, Jackson M.; Lemoine, Adrian S.; Kaiser, Hartmut

The DOE Office of Science Exascale Computing Project (ECP) outlines the next milestones in the supercomputing domain. The target computing systems under the project will deliver 10x performance while keeping the power budget under 30 megawatts. With such large machines, the need to make applications resilient has become paramount. The benefits of adding resiliency to mission-critical and scientific applications include a reduced cost, in both time and power, of restarting a failed simulation. Most current software-level implementations of resiliency use Coordinated Checkpoint and Restart (C/R). This technique generates a consistent global snapshot, also called a checkpoint. Generating a snapshot requires global communication and coordination, which is achieved by synchronizing all running processes. The generated checkpoint is then stored in some form of persistent storage. On failure detection, the runtime initiates a global rollback to the most recently saved checkpoint. This involves aborting all running processes, rolling them back to the checkpointed state, and restarting them.
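The C/R cycle described above can be illustrated with a minimal single-threaded C++ sketch. It is not HPX code and not the report's implementation: the names `ProcessState`, `Checkpoint`, `RunStats`, and `run_with_checkpoint_restart` are hypothetical, the "processes" are plain structs stepped in lockstep, the checkpoint is held in memory rather than in persistent storage, and a single failure is injected at a caller-chosen step.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical state of one simulated process.
struct ProcessState {
    std::size_t step = 0;  // number of steps this process has completed
    long value = 0;        // stand-in for application data
};

// A coordinated checkpoint: every process captured at the same global step.
struct Checkpoint {
    std::size_t step = 0;
    std::vector<ProcessState> states;
};

struct RunStats {
    std::size_t checkpoints_taken = 0;
    std::size_t rollbacks = 0;
    std::size_t steps_executed = 0;  // counts re-executed steps as well
};

// Advance every process by one step (stand-in for real computation).
void advance(std::vector<ProcessState>& procs) {
    for (std::size_t rank = 0; rank < procs.size(); ++rank) {
        procs[rank].step += 1;
        procs[rank].value += static_cast<long>(rank + 1);
    }
}

// Run `total_steps` global steps, checkpointing every `interval` steps.
// Assumption for this sketch: one failure is injected right after global
// step `fail_at_step` completes; 0 means "no failure" (steps are 1-based).
std::vector<ProcessState> run_with_checkpoint_restart(
    std::size_t num_procs, std::size_t total_steps, std::size_t interval,
    std::size_t fail_at_step, RunStats& stats)
{
    assert(interval > 0);
    std::vector<ProcessState> procs(num_procs);

    Checkpoint last;          // the initial state acts as checkpoint 0
    last.states = procs;

    bool failure_pending = (fail_at_step != 0);
    std::size_t step = 0;

    while (step < total_steps) {
        advance(procs);
        ++step;
        ++stats.steps_executed;

        if (failure_pending && step == fail_at_step) {
            // Failure detected: discard the state of all processes and
            // restart them from the most recently saved checkpoint.
            failure_pending = false;
            procs = last.states;
            step = last.step;
            ++stats.rollbacks;
            continue;
        }

        if (step % interval == 0) {
            // All processes are at the same step here, so the copy is a
            // consistent global snapshot. A real runtime would need to
            // synchronize processes and write to persistent storage.
            last.step = step;
            last.states = procs;
            ++stats.checkpoints_taken;
        }
    }
    return procs;
}
```

For example, calling `run_with_checkpoint_restart(4, 10, 3, 8, stats)` takes checkpoints after steps 3 and 6, fails after step 8, rolls every process back to step 6, and re-executes steps 7 and 8. The final state matches a failure-free run, but 12 steps are executed instead of 10, which illustrates the recomputation cost that a global rollback imposes on all processes.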