Publications Details
Understanding Memory Failures on a Petascale Arm System
Ferreira, Kurt B.; Levy, Scott L.; Hemmert, Joshua; Laros, James H.
New and novel HPC platforms provide interesting challenges and opportunities. Analysis of these systems can provide a better understanding of both the specific platform being studied as well as large-scale systems in general. Arm is one such architecture that has been explored in HPC for several years, however little is still known about its viability for supporting large-scale production workloads in terms of system reliability. The Astra system at Sandia National Laboratories was the first public peta-FLOPS Arm-based system on the Top500 and has been successfully running production HPC applications for a couple of years. In this paper, we analyze memory failure data collected from Astra while the system was in production running unclassified applications. This analysis revealed several interesting contributions related to both the Arm platform and to HPC systems in general. First, we outline the number of components replaced due to reliability issues in standing-up this first-of-its-kind, large-scale HPC system. We show the distribution differences between correctable DRAM faults and errors on Astra, showing that, not properly accounting for faults can lead to erroneous conclusions. Additionally, we characterize DRAM faults on the system and show contrary to existing work that memory faults are uniformly distributed across CPU socket, DRAM column, bank and rack region, but are not uniform across node, DIMM rank, DIMM slot on the motherboard, and system rack: some racks, ranks and DIMM slots experience more faults than others. Similarly, we show the impact of temperature and power on DRAM correctable errors. Finally, we make a detailed comparison of results presented here with the positional affects found in several previous large-scale reliability studies. The results of this analysis provide valuable guidance to organizations standing-up first-in- class platforms in HPC, organizations using Arm in HPC, and the entire large-scale HPC community in general.