Publications / Conference Poster

Optimization-based property-preserving solution recovery for fault-tolerant scalar transport

Ridzal, Denis R.; Bochev, Pavel B.

As the mean time between failures on future high-performance computing platforms is expected to decrease to just a few minutes, the development of “smart”, property-preserving checkpointing schemes becomes imperative to avoid dramatic decreases in application utilization. In this paper, we formulate a generic optimization-based approach for fault-tolerant computations that separates property preservation from the compression and recovery stages of the checkpointing process. We then specialize the approach to a model scalar transport equation, obtaining a fault recovery procedure that preserves local solution bounds and total mass. Numerical examples showing solution recovery from a corrupted application state for three different failure modes illustrate the potential of the approach.
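
To make the idea of property-preserving recovery concrete, the following minimal sketch shows one way such a step could look. It is an illustration under stated assumptions, not the formulation used in the paper: it assumes the recovered state is the weighted least-squares closest state to a corrupted or lossily decompressed state `u_tilde`, subject to per-cell bounds and a single total-mass constraint. The names `recover_state`, `u_tilde`, `lower`, `upper`, `weights`, and `total_mass` are hypothetical. Under these assumptions the optimal state is a uniform shift of `u_tilde` clipped to the bounds, and the shift can be found by bisection.

```python
def recover_state(u_tilde, lower, upper, weights, total_mass, iters=100):
    """Return the state closest to u_tilde (weighted least squares) with
    lower <= u <= upper and sum(weights * u) == total_mass.

    Assumptions (not from the paper): weights are positive cell measures,
    and every entry of u_tilde is finite.
    """
    def clipped(shift):
        # Shift every cell value by the same amount, then enforce local bounds.
        return [min(max(v + shift, lo), hi)
                for v, lo, hi in zip(u_tilde, lower, upper)]

    def mass(u):
        return sum(w * x for w, x in zip(weights, u))

    if not mass(lower) <= total_mass <= mass(upper):
        raise ValueError("target mass is incompatible with the bounds")

    # Bracket the shift: at lo_shift every cell sits at its lower bound,
    # at hi_shift every cell sits at its upper bound.
    lo_shift = min(lo - v for v, lo in zip(u_tilde, lower))
    hi_shift = max(hi - v for v, hi in zip(u_tilde, upper))

    # The mass of the clipped state is nondecreasing in the shift,
    # so bisection converges to a shift that matches the target mass.
    for _ in range(iters):
        mid = 0.5 * (lo_shift + hi_shift)
        if mass(clipped(mid)) < total_mass:
            lo_shift = mid
        else:
            hi_shift = mid
    return clipped(0.5 * (lo_shift + hi_shift))


# Hypothetical corrupted state on four equal cells: two values violate [0, 1].
u = recover_state(
    u_tilde=[0.2, 1.7, -0.3, 0.5],
    lower=[0.0] * 4,
    upper=[1.0] * 4,
    weights=[0.25] * 4,
    total_mass=0.5,
)
print(u)  # approximately [0.35, 1.0, 0.0, 0.65]
```

In this sketch the bounds and the mass target are supplied by the caller. Where those values come from in practice (for example, from a previous checkpoint) is outside what the example models.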