Publications

Publications / Conference Poster

Improving DRAM Fault Characterization through Machine Learning

Baseman, Elisabeth; Debardeleben, Nathan; Ferreira, Kurt B.; Levy, Scott; Raasch, Steven; Sridharan, Vilas; Siddiqua, Taniya; Guan, Qiang

As high-performance computing systems continue to grow in scale and complexity, the study of faults and errors is critical to the design of future systems and mitigation schemes. Fault modes in system DRAM are a frequently-investigated key aspect of memory reliability. While current schemes require offline analysis for proper classification, current state-of-the-art mitigation techniques require accurate online prediction for optimal performance. In this work, we explore the predictive performance of an online machine learning-based approach in classifying DRAM fault modes from two leadership-class supercomputing facilities. Our results compare the predictive performance of this online approach with the current rule-based approach based on expert knowledge, finding a 12% predictive performance improvement. We also investigate the universality of our classifiers by evaluating predictive performance using training data from disparate computing systems to achieve a 7% improvement in predictive performance. Our work provides a critical analysis of this online learning technique and can benefit system designers to help inform best practices for dealing with reliability on future systems.