CCR system software researchers Kurt Ferreira and Scott Levy are co-authors of the paper titled “Lifetime Memory Reliability Data From the Field,” which has been nominated for Best Paper at the 2017 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT). The paper analyzes fault data collected over five years from a large-scale high-performance computing system. The data show that the devices in the system exhibited no signs of aging over the lifetime of the machine. In particular, the rate of DRAM errors did not change significantly. A better understanding of component failure rates is needed to focus effort on the mitigation strategies that will most effectively increase the reliability of leadership-scale computing platforms deployed by the DOE. IEEE DFT is an annual symposium that combines new academic research with state-of-the-art industrial data to explore improvements in design, manufacturing, testing, reliability, and availability. This work was supported by NNSA’s Advanced Simulation and Computing program.
August 1, 2017