Resilience Requirements for Exascale and Beyond

Research shows that failures are common in current computer systems and are expected to become more frequent. The increase is largely attributable to future systems containing significantly more memory (such as DRAM and SRAM) and more complex processing units (such as GPGPUs and FPGAs). To manage these failures, systems must incorporate recovery mechanisms, which the unique scale and specific demands of high-performance computing (HPC) systems make particularly challenging. Evaluating how effective existing recovery techniques are, and whether they will suit future systems, requires an understanding of both current and future workloads and hardware.

Reliability is especially critical as we move beyond traditional computing in the post-Moore’s law era. Emerging technologies such as quantum and neuromorphic computing may provide the performance that next-generation workloads need, but they also introduce uncertainty about reliability and accuracy.

This project aims to define the reliability requirements for current and future large-scale systems. Building on previous studies of failures and workloads in leading systems, and using a failure data repository, we will assess the reliability of current HPC hardware and identify the areas of future systems most likely to be failure-prone.

Our ongoing work focuses on guiding reliability practices for exascale systems, influencing new programming models, and expanding our failure data repository to cover emerging computing technologies.

Contact

Ferreira, Kurt B., kbferre@sandia.gov