Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. One of the most common sources of failure in large-scale systems is memory. In this paper, we propose a novel runtime for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the viability of this approach by examining memory snapshots collected from eight high-performance computing (HPC) applications and two important HPC operating systems. Based on the characteristics of the similarity uncovered, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.
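As a minimal sketch of how a runtime might quantify this kind of similarity, the C program below hashes a memory snapshot in fixed-size pages and counts pages whose contents repeat. The page size, the FNV-1a hash, and the synthetic snapshot are illustrative assumptions, not the mechanism used in the paper.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MAX_PAGES 1024

/* FNV-1a hash over one page of memory. */
static uint64_t hash_page(const uint8_t *page)
{
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        h ^= page[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Count pages whose hash matches an earlier page: a rough
 * proxy for exploitable content similarity in a snapshot. */
static size_t count_duplicate_pages(const uint8_t *snapshot, size_t npages)
{
    uint64_t seen[MAX_PAGES];
    size_t dups = 0;
    for (size_t i = 0; i < npages && i < MAX_PAGES; i++) {
        uint64_t h = hash_page(snapshot + i * PAGE_SIZE);
        for (size_t j = 0; j < i; j++) {
            if (seen[j] == h) { dups++; break; }
        }
        seen[i] = h;
    }
    return dups;
}

int main(void)
{
    /* Illustrative snapshot: mostly identical (zero-filled) pages. */
    static uint8_t snapshot[16 * PAGE_SIZE];
    memset(snapshot, 0, sizeof(snapshot));
    snapshot[3 * PAGE_SIZE] = 0xAB;   /* make one page unique */

    size_t dups = count_duplicate_pages(snapshot, 16);
    printf("duplicate pages: %zu of 16\n", dups);
    return 0;
}
```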
Proceedings of the International Conference on Dependable Systems and Networks
Ibtesham, Dewan; Debonis, David; Arnold, Dorian; Ferreira, Kurt
As high-performance computing systems continue to grow in size and complexity, energy efficiency and reliability have emerged as first-order concerns. Researchers have shown that data movement is a significant contributing factor to power consumption on these systems. Additionally, rollback/recovery protocols like checkpoint/restart can generate large volumes of data traffic, exacerbating these energy and power concerns. In this work, we show that a coarse-grained model can be used effectively to speculate about the energy footprints of rollback/recovery protocols. Using our validated model, we evaluate the energy footprint of checkpoint compression, a method that incurs higher computational demand to reduce data volumes and data traffic. Specifically, we show that while checkpoint compression leads to more frequent checkpoints (as per the optimal checkpoint frequency) and increases per checkpoint energy cost, compression still yields a decrease in total application energy consumption due to the overall runtime decrease.
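The runtime side of this trade-off can be sketched with a back-of-the-envelope model. The C program below applies Young's optimal checkpoint interval, tau = sqrt(2 * delta * MTBF), to compare total energy with and without compression; all of the power and time parameters are assumed for illustration and this is not the paper's validated model.

```c
#include <math.h>
#include <stdio.h>

/* Coarse-grained sketch (illustrative parameters, not measured values):
 * total energy = average power * total runtime, where runtime is the
 * failure-free solve time plus checkpoint overhead at the optimal
 * checkpoint interval (Young's formula, tau = sqrt(2 * delta * MTBF)). */
static double total_energy_J(double solve_s, double ckpt_s,
                             double mtbf_s, double node_power_W)
{
    double tau = sqrt(2.0 * ckpt_s * mtbf_s);      /* optimal interval      */
    double n_ckpts = solve_s / tau;                /* checkpoints taken     */
    double runtime = solve_s + n_ckpts * ckpt_s;   /* ignore rework/restart */
    return node_power_W * runtime;
}

int main(void)
{
    double solve_s = 24.0 * 3600.0;   /* 24 h of useful work (assumed)    */
    double mtbf_s  = 6.0 * 3600.0;    /* 6 h mean time between failures   */
    double power_W = 200.0;           /* per-node power draw (assumed)    */

    /* Compression shrinks each checkpoint (less I/O) but adds CPU time. */
    double ckpt_plain_s      = 600.0;              /* 10 min to write          */
    double ckpt_compressed_s = 0.5 * 600.0 + 60.0; /* half the I/O + 1 min CPU */

    printf("no compression:   %.1f MJ\n",
           total_energy_J(solve_s, ckpt_plain_s, mtbf_s, power_W) / 1e6);
    printf("with compression: %.1f MJ\n",
           total_energy_J(solve_s, ckpt_compressed_s, mtbf_s, power_W) / 1e6);
    return 0;
}
```

Because the compressed checkpoint is cheaper, the optimal interval shrinks and checkpoints become more frequent, yet total energy still drops with the shorter runtime. A more faithful model would also charge the compression phase a higher power draw than checkpoint I/O, which is what makes the per-checkpoint energy cost rise even as total energy falls.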
Fault-tolerance has been identified as a major challenge for future extreme-scale systems. Current predictions suggest that, as systems grow in size, failures will occur more frequently. Because increases in failure frequency reduce the performance and scalability of these systems, significant effort has been devoted to developing and refining resilience mechanisms to mitigate the impact of failures. However, effective evaluation of these mechanisms has been challenging. Current systems are smaller and have significantly different architectural features (e.g., interconnect, persistent storage) than we expect to see in next-generation systems. To overcome these challenges, we propose the use of simulation. Simulation has been shown to be an effective tool for investigating performance characteristics of applications on future systems. In this work, we: identify the set of system characteristics that are necessary for accurate performance prediction of resilience mechanisms for HPC systems and applications; demonstrate how these system characteristics can be incorporated into an existing large-scale simulator; and evaluate the predictive performance of our modified simulator. We also describe how we were able to optimize the simulator for large temporal and spatial scales, allowing the simulator to run 4x faster and use over 100x less memory.
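To give a flavor of what such a resilience simulation predicts, the toy Monte Carlo sketch below estimates the wall-clock time of an application that checkpoints periodically under exponentially distributed failures. It is not the modified large-scale simulator described here, and every parameter is an assumption chosen for illustration.

```c
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

/* Draw an exponentially distributed time-to-failure with the given mean. */
static double draw_failure(double mtbf_s)
{
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* u in (0,1) */
    return -mtbf_s * log(u);
}

/* Simulate wall-clock time to finish `work_s` seconds of computation when
 * a checkpoint costing `ckpt_s` is taken every `interval_s` seconds and a
 * failure costs a restart plus all work done since the last checkpoint.
 * Deliberately coarse: failures during checkpoint or restart are folded
 * into the same loop. */
static double simulate(double work_s, double interval_s,
                       double ckpt_s, double restart_s, double mtbf_s)
{
    double done = 0.0, wall = 0.0;
    double next_failure = draw_failure(mtbf_s);

    while (done < work_s) {
        double segment = interval_s + ckpt_s;      /* compute + checkpoint */
        if (next_failure < segment) {
            wall += next_failure + restart_s;      /* lose this segment */
            next_failure = draw_failure(mtbf_s);
        } else {
            wall += segment;
            next_failure -= segment;
            done += interval_s;
        }
    }
    return wall;
}

int main(void)
{
    srand(42);
    double work_s = 24.0 * 3600.0, mtbf_s = 4.0 * 3600.0;   /* assumed */
    double total = 0.0;
    int trials = 1000;
    for (int i = 0; i < trials; i++)
        total += simulate(work_s, 1800.0, 300.0, 120.0, mtbf_s);
    printf("mean wall-clock time: %.1f h\n", total / trials / 3600.0);
    return 0;
}
```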
Increased HPC capability comes with increased complexity, part counts, and fault occurrences. Increasing the resilience of systems and applications to faults is a critical requirement facing the viability of exascale systems, as the overhead of traditional checkpoint/restart is projected to outweigh its benefits due to fault rates outpacing I/O bandwidths. As faults occur and propagate throughout hardware and software layers, pervasive notification and handling mechanisms are necessary. This report describes an initial investigation of fault types and programming interfaces to mitigate them. Proof-of-concept APIs are presented for the frequent and important cases of memory errors and node failures, and a strategy proposed for filesystem failures. These involve changes to the operating system, runtime, I/O library, and application layers. While a single API for fault handling among hardware and OS and application system-wide remains elusive, the effort increased our understanding of both the mountainous challenges and the promising trailheads.
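As a hypothetical illustration of a callback-based fault-notification interface of the kind discussed here, the sketch below lets an application register a handler that the runtime invokes on memory errors or node failures. Every name and type in it is an assumption, not the proof-of-concept API from the report.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical fault classes covering the two cases discussed above. */
typedef enum {
    FAULT_MEMORY_ERROR,   /* e.g., uncorrectable ECC error at an address */
    FAULT_NODE_FAILURE    /* e.g., a peer node reported dead             */
} fault_kind_t;

typedef struct {
    fault_kind_t kind;
    uintptr_t    address;   /* valid for FAULT_MEMORY_ERROR */
    int          node_id;   /* valid for FAULT_NODE_FAILURE */
} fault_event_t;

/* Applications register a handler; the runtime/OS invokes it on faults. */
typedef void (*fault_handler_t)(const fault_event_t *ev, void *arg);

static fault_handler_t g_handler;
static void *g_handler_arg;

int fault_register_handler(fault_handler_t h, void *arg)
{
    g_handler = h;
    g_handler_arg = arg;
    return 0;
}

/* Called by the (hypothetical) lower layers when a fault is detected. */
void fault_notify(const fault_event_t *ev)
{
    if (g_handler)
        g_handler(ev, g_handler_arg);
}

/* Example application handler: repair or abandon the affected state. */
static void app_handler(const fault_event_t *ev, void *arg)
{
    (void)arg;
    if (ev->kind == FAULT_MEMORY_ERROR)
        printf("memory error at %p: recomputing affected block\n",
               (void *)ev->address);
    else
        printf("node %d failed: shrinking communicator\n", ev->node_id);
}

int main(void)
{
    fault_register_handler(app_handler, NULL);
    fault_event_t ev = { FAULT_NODE_FAILURE, 0, 17 };
    fault_notify(&ev);
    return 0;
}
```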