
An Exploration in Implementing Fault Tolerance in Scientific Simulation Application Software

Drake, Richard R.; Summers, Randall M.

The ability of scientific simulation software to detect and recover from errors and failures in the supporting hardware and software layers is becoming more important because of the pressure to shift from large, specialized, multi-million-dollar ASCI computing platforms to smaller, less expensive interconnected machines built from off-the-shelf hardware. As evidenced by the CPlant™ experiences, fault tolerance can be necessary even on such a homogeneous system and may also prove useful in the next generation of ASCI platforms. This report describes a research effort intended to study, implement, and test the feasibility of various fault tolerance mechanisms controlled at the simulation code level. Errors and failures would be detected by underlying software layers, communicated to the application through a convenient interface, and then handled by the simulation code itself. Targeted faults included corrupt communication messages, processor node dropouts, and unacceptable slowdown of service from processing nodes. Recovery techniques such as re-sending communication messages and dynamically reallocating failing processor nodes were considered. However, most fault tolerance mechanisms rely on underlying software layers, and these were found to be lacking to such a degree that mechanisms at the application level could not be implemented. The application-level effort has therefore been postponed, and the work has shifted to these supporting layers.
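As an illustration only, one of the recovery techniques named above (re-sending a corrupt communication message) might look roughly like the following sketch. It is not taken from the report, which states that application-level mechanisms could not be implemented. Every name here (`pack`, `unpack`, `send_with_retry`, `make_flaky_channel`, `CorruptMessageError`) is hypothetical, the "channel" is an in-process stand-in for a real message-passing layer, and CRC32 is an assumed choice of corruption check.

```python
import zlib


class CorruptMessageError(Exception):
    """Raised when a received frame fails its checksum (hypothetical name)."""


def pack(payload: bytes) -> bytes:
    # Prefix the payload with its CRC32 so the receiver can detect corruption.
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unpack(frame: bytes) -> bytes:
    # Recompute the checksum and compare it with the transmitted one.
    expected = int.from_bytes(frame[:4], "big")
    payload = frame[4:]
    if zlib.crc32(payload) != expected:
        raise CorruptMessageError("checksum mismatch")
    return payload


def send_with_retry(payload: bytes, channel, max_attempts: int = 3):
    """Send `payload` through `channel`, re-sending on detected corruption.

    `channel` is an assumed callable that takes the sent frame and returns
    the frame as it arrived. Returns (payload, attempts_used).
    """
    frame = pack(payload)
    for attempt in range(1, max_attempts + 1):
        try:
            return unpack(channel(frame)), attempt
        except CorruptMessageError:
            continue  # fault reported to the application; try again
    raise CorruptMessageError(f"gave up after {max_attempts} attempts")


def make_flaky_channel(corrupt_first_n: int):
    """Build a test channel that flips one bit in the first N transmissions."""
    state = {"calls": 0}

    def channel(frame: bytes) -> bytes:
        state["calls"] += 1
        if state["calls"] <= corrupt_first_n:
            return frame[:-1] + bytes([frame[-1] ^ 0x01])
        return frame

    return channel


if __name__ == "__main__":
    data, attempts = send_with_retry(b"state vector", make_flaky_channel(1))
    print(data, attempts)
```

In this sketch the detection step (`unpack`) plays the role of the underlying software layer, and the retry loop plays the role of the simulation code deciding how to respond. A real implementation would depend on exactly the kind of fault reporting from the communication layer that the report found to be missing.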