Publications / Conference Poster

Automating DRAM Fault Mitigation by Learning from Experience

Baseman, Elisabeth; Debardeleben, Nathan; Ferreira, Kurt B.; Sridharan, Vilas; Siddiqua, Taniya; Tkachenko, Olena

Current practice for mitigating DRAM hardware faults is to simply discard the entire faulty DIMM. However, this becomes increasingly expensive and wasteful as the price of memory hardware increases and moves physically closer to processing units. Accurately characterizing memory faults in real-time in order to pre-empt future potentially catastrophic failures is crucial to conserving resources by blacklisting small affected regions of memory rather than discarding an entire hardware component. We further evaluate and extend a machine learning method for DRAM fault characterization introduced in prior work by Baseman et al. at Los Alamos National Laboratory. We report on the usefulness of a variety of training sets, using a set of production-relevant metrics to evaluate the method on data from a leadership-class supercomputing facility. We observe an increase in percent of faults successfully mitigated as well as a decrease in percent of wasted blacklisted pages, regardless of training set, when using the learned algorithm as compared to a human-expert, deterministic, and rule-based approach.