A publication of the Office of Advanced Simulation & Computing, NNSA Defense Programs
NA-ASC-500-07—Issue 4
Improving Supercomputer Reliability via Data Mining
Computer logs often provide critical information about malfunction or misuse, but finding and correlating the clues interspersed among the millions of lines of time-stamped text messages that supercomputers generate is itself a challenging data mining task. The Sisyphus toolkit is the result of four years of research and development on how to efficiently find the important nuggets of information in supercomputer logs. Now in production use on Red Storm (sited at Sandia National Laboratories), it has automatically detected, and more importantly isolated, a wide range of problems, including failures (disks, I/O controllers, network interfaces, power supplies, and memory), misconfigurations (BIOS, RAID controller, system software, and inconsistent versions), and problematic user behavior (unbalanced RAID stripe usage, inappropriate remote monitoring). This has enabled focused proactive and reactive responses by system administrators, thus increasing system reliability.
Sisyphus is based on the premise that similar computers correctly executing similar workloads should produce similar logs, and thus that anomalies warrant investigation. It automatically ranks log files and colorizes words based on information theory, answering the questions, “What is the most unusual log file?” and “Exactly what makes this log file unusual?” It works with ASCII text (rather than the numerical data that most anomaly detectors require) and is general enough to be used with any computer logs. It provides useful file and word statistics in tabular and plot formats, and includes web and command-line interfaces. See http://www.cs.sandia.gov/~jrstear/sisyphus/ for video demos, downloads, and documentation.
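To make the ranking idea concrete, here is a minimal sketch of one information-theoretic way to score log files. It is not the Sisyphus implementation, whose internals are not described here. It assumes that timestamps have already been stripped, that words are whitespace-separated tokens, and that a word's weight is its surprisal, the negative base-2 logarithm of the fraction of files that contain it. The function names `word_weights`, `rank_logs`, and `unusual_words` and the sample node logs are hypothetical.

```python
import math
from collections import Counter


def word_weights(logs):
    """Weight each word by its surprisal across files (an assumed scoring rule).

    A word found in every file gets weight 0; a word found in only one
    file out of n gets the maximum weight, log2(n).
    """
    n = len(logs)
    doc_freq = Counter()
    for text in logs.values():
        doc_freq.update(set(text.split()))  # count each word once per file
    return {word: -math.log2(count / n) for word, count in doc_freq.items()}


def rank_logs(logs):
    """Return file names ordered from most to least unusual, plus the weights."""
    weights = word_weights(logs)
    scores = {
        name: sum(weights[word] for word in set(text.split()))
        for name, text in logs.items()
    }
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked, weights


def unusual_words(text, weights, top=3):
    """Return the highest-weight words in one file: what makes it unusual."""
    words = set(text.split())
    return sorted(words, key=lambda word: (-weights[word], word))[:top]


# Hypothetical logs from four similar nodes; only node4 reports a RAID problem.
logs = {
    "node1": "disk ok link up",
    "node2": "disk ok link up",
    "node3": "disk ok link up",
    "node4": "disk error raid timeout link up",
}

ranked, weights = rank_logs(logs)
print("most unusual file:", ranked[0])
print("its unusual words:", unusual_words(logs[ranked[0]], weights))
```

In this toy data the three healthy nodes produce identical logs, so the words they share carry little or no weight, while the words that appear only on the fourth node dominate its score. The same per-word weights could drive colorization, with heavier words shown in stronger colors.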
Automatically highlighted symptoms of a common cause failure (underlying RAID controller) on multiple Red Storm I/O nodes.