Improving Application Resilience to Memory Errors with Lightweight Compression

Researchers at Sandia National Laboratories have developed an application-independent library that can dramatically improve application performance by transparently correcting detected uncorrectable memory errors using lightweight compression mechanisms. The Department of Energy (DOE) has identified resilience to memory errors as a key challenge for next-generation, extreme-scale systems. Future systems are projected to contain many petabytes of memory. Beyond the sheer volume of memory required, device trends such as shrinking feature sizes and reduced supply voltages have the potential to significantly increase the frequency of memory errors. Using a number of key workloads critical to DOE, our results demonstrate the potential of this technique to correct memory errors transparently, without application intervention, and to speed up application time-to-solution by more than a factor of two compared with current state-of-the-art checkpoint-based approaches.

Application speedup over current checkpoint-based approaches when using lightweight compression to protect against memory errors, for failure rates ranging from a 30-minute to a 2-hour Mean Time to Interrupt (MTTI).
Contact
Kurt Brian Ferreira, kbferre@sandia.gov

January 1, 2017