Publications
A principled approach to HPC event monitoring
As high-performance computing (HPC) systems become larger and more complex, fault tolerance becomes a greater concern. At the same time, the volume of data collected to help understand and mitigate hardware and software faults and failures becomes prohibitively large. We argue that the HPC community must adopt more systematic approaches to system event logging in place of the current ad hoc strategies based on practitioner intuition and experience. Specifically, we show that event correlation and prediction can increase our understanding of fault behavior and can become critical components of effective fault tolerance strategies. While event correlation and prediction have been used in HPC contexts before, we offer new insights about their potential capabilities. Using event logs from the Computer Failure Data Repository (CFDR), (1) we use cross and partial correlations to observe conditional correlations in HPC event data; (2) we use information theory to understand the fundamental predictive power of HPC failure data; (3) we study neural networks for failure prediction; and (4) we use principal component analysis to understand to what extent dimensionality reduction can apply to HPC event data. This work yields the following insights that can inform HPC event monitoring: ad hoc correlations, or ones based only on direct correlations, can be deficient or even misleading; highly accurate failure prediction may require only small windows of failure event history; and principal component analysis can significantly reduce HPC event data without loss of relevant information.
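To illustrate why direct correlations can mislead, the following is a minimal sketch that contrasts a direct (Pearson) correlation with a first-order partial correlation. It is not the paper's code or data: the three series are synthetic, and the names `pearson`, `partial_corr`, `x`, `y`, and `z` are hypothetical. Here `z` stands in for a common underlying cause, and `x` and `y` for two event-count series that both depend on it.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Synthetic data (an assumption for illustration, not CFDR logs):
# z is a shared driver; x and y each equal z plus independent noise.
rng = random.Random(0)
z = [2.0 * rng.gauss(0, 1) for _ in range(2000)]
x = [v + rng.gauss(0, 1) for v in z]
y = [v + rng.gauss(0, 1) for v in z]

direct = pearson(x, y)           # high: x and y move together
partial = partial_corr(x, y, z)  # near zero once z is accounted for

print(f"direct correlation:  {direct:.3f}")
print(f"partial correlation: {partial:.3f}")
```

In this construction the two series appear strongly related when compared directly, but the relationship largely vanishes after conditioning on the shared driver. This is the kind of conditional structure that a direct correlation alone cannot reveal.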