Publications

Results 1–25 of 89

A Model-Based Case for Redundant Computation

Stearley, Jon S.; Robinson, David G.; Ferreira, Kurt

Despite its seemingly nonsensical cost, we show through modeling and simulation that redundant computation merits full consideration as a resilience strategy for next-generation systems. Prior work has shown that, without revolutionary breakthroughs in failure rates, part counts, or stable-storage bandwidths, the utility of exascale systems will be crushed by the overheads of traditional checkpoint/restart mechanisms. Alternate resilience strategies must be considered, and redundancy is a proven, unrivaled approach in many domains. We develop a distribution-independent model for job interrupts on systems of arbitrary redundancy, adapt Daly's model for total application runtime, and find that his estimate for the optimal checkpoint interval remains valid for redundant systems. We then identify the conditions under which redundancy is more cost effective than non-redundancy. We do this in the context of the number-one supercomputers of the last decade, showing that thorough consideration of redundant computation is timely, if not overdue.
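
For context on the model this abstract adapts: Daly's estimate gives the checkpoint interval that minimizes expected runtime under checkpoint/restart. The sketch below implements his published formulas; the numeric parameters (checkpoint cost, restart cost, mean time to interrupt) are illustrative values, not figures from the paper.

```python
import math

def daly_optimal_interval(delta, M):
    """Daly's higher-order estimate of the optimal checkpoint interval:
    delta is the checkpoint write cost and M the mean time to interrupt
    (same units). Valid for delta < 2M; otherwise tau = M is used."""
    if delta >= 2 * M:
        return M
    x = delta / (2 * M)
    return math.sqrt(2 * delta * M) * (1 + math.sqrt(x) / 3 + x / 9) - delta

def daly_total_runtime(T_s, tau, delta, R, M):
    """Daly's model of expected wall-clock time for a job with
    failure-free solve time T_s, checkpoint interval tau, checkpoint
    cost delta, restart cost R, and mean time to interrupt M."""
    return M * math.exp(R / M) * (math.exp((tau + delta) / M) - 1) * (T_s / tau)

# Illustrative only: 5 min checkpoints, 10 min restarts, 1 day MTTI.
delta, R, M = 300.0, 600.0, 86400.0
tau = daly_optimal_interval(delta, M)
print(f"optimal checkpoint interval: {tau / 3600:.2f} h")
print(f"expected runtime for 100 h of work: "
      f"{daly_total_runtime(100 * 3600, tau, delta, R, M) / 3600:.1f} h")
```

The paper's point is that this interval estimate remains valid when M is taken to be the (much longer) mean time to interrupt of a redundant system.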

A simulation infrastructure for examining the performance of resilience strategies at scale

Ferreira, Kurt; Levy, Scott L.

Fault tolerance is a major challenge for many current and future extreme-scale systems, with many studies showing it to be the key limiter of application scalability. While a number of studies have investigated the performance of various resilience mechanisms, they are typically limited to scales orders of magnitude smaller than those expected for next-generation systems and to simple benchmark problems. In this paper we show how, with very minor changes, a previously published and validated simulation framework for investigating the application performance impact of OS noise can be used to simulate the overheads of various resilience mechanisms at scale. Using this framework, we validate the simulator by comparing its failure-free performance against an analytic model, and we demonstrate its ability to simulate the performance of two popular rollback-recovery methods on traces from real applications.
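
As a rough illustration of what such a simulation computes (this is not the paper's simulator, and all parameter values are made up for the example), a minimal Monte Carlo model of coordinated checkpoint/restart under exponentially distributed interrupts might look like:

```python
import random

def simulate_run(T_s, tau, delta, R, mtti, rng):
    """One job run under coordinated checkpoint/restart. Simplifications:
    a checkpoint is written after every segment (even the last), failures
    follow an exponential distribution, and no failures occur during
    restart. T_s: solve time, tau: checkpoint interval, delta: checkpoint
    cost, R: restart cost, mtti: mean time to interrupt (seconds)."""
    done, wall = 0.0, 0.0
    next_fail = rng.expovariate(1.0 / mtti)
    while done < T_s:
        segment = min(tau, T_s - done) + delta   # compute + checkpoint
        if wall + segment <= next_fail:
            wall += segment
            done = min(done + tau, T_s)          # checkpoint commits the work
        else:
            wall = next_fail + R                 # segment lost; roll back
            next_fail = wall + rng.expovariate(1.0 / mtti)
    return wall

rng = random.Random(1)
runs = [simulate_run(100 * 3600, 7000.0, 300.0, 600.0, 86400.0, rng)
        for _ in range(200)]
print(f"mean simulated runtime: {sum(runs) / len(runs) / 3600:.1f} h")
```

Comparing the mean of such runs against an analytic model (for instance Daly's, sketched above) is the kind of cross-validation the abstract describes.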

A study of the viability of exploiting memory content similarity to improve resilience to memory errors

International Journal of High Performance Computing Applications

Levy, Scott; Ferreira, Kurt; Bridges, Patrick G.; Thompson, Aidan P.; Trott, Christian R.

Building the next generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. One of the most common sources of failure in large-scale systems is memory. In this paper, we propose a novel runtime for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the viability of this approach by examining memory snapshots collected from eight high-performance computing (HPC) applications and two important HPC operating systems. Based on the characteristics of the similarity we uncover, we conclude that the proposed approach shows promise for addressing system resilience in large-scale systems.
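
A minimal sketch of the kind of similarity analysis the abstract describes: split a snapshot into fixed-size pages, hash each page, and measure how many pages have at least one byte-identical copy elsewhere in memory (a duplicated page could, in principle, be restored from its twin after an uncorrectable error). The 4 KiB page size and SHA-256 hash are assumptions for illustration, not details from the paper.

```python
import hashlib
from collections import Counter

PAGE_SIZE = 4096  # assumed page granularity

def duplicated_page_fraction(snapshot: bytes) -> float:
    """Fraction of pages in the snapshot that have at least one
    byte-identical copy, found by hashing every page."""
    counts = Counter(
        hashlib.sha256(snapshot[i:i + PAGE_SIZE]).digest()
        for i in range(0, len(snapshot), PAGE_SIZE)
    )
    total = sum(counts.values())
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / total if total else 0.0

# Toy snapshot: eight zero-filled pages plus one unique page.
snap = bytes(PAGE_SIZE) * 8 + bytes(range(256)) * 16
print(f"{duplicated_page_fraction(snap):.0%} of pages have an identical copy")
```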

A tunable, software-based DRAM error detection and correction library for HPC

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Fiala, David; Ferreira, Kurt; Mueller, Frank; Engelmann, Christian

Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft errors, or bit-flips, are expected to increase greatly due to the increased memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for the expected soft-error rate. As a result, additional software support will be needed to address this challenge. In this paper we introduce LIBSDC, a tunable, transparent silent data corruption (SDC) detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by performing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that, once tuned, LIBSDC achieves SDC protection with 50% resource overhead, less than the 100% required by double modular redundancy.
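
LIBSDC's mechanism (trapping page accesses and verifying integrity on demand) lives below the application; the sketch below shows only the underlying bookkeeping idea, software checksums at page granularity, with explicit verify calls standing in for the library's transparent memory-protection traps. The class name, CRC32 choice, and page size are all assumptions for illustration.

```python
import zlib

PAGE = 4096  # assumed page granularity

class PageGuard:
    """Detect silent data corruption by checksumming each page of a
    buffer; a mismatch on a later check signals a bit-flip. Detection
    only: correction would additionally require a redundant copy."""
    def __init__(self, buf: bytearray):
        self.buf = buf
        self.sums = {}
        for off in range(0, len(buf), PAGE):
            self.seal(off)

    def seal(self, off: int) -> None:
        """Record the checksum of the page at 'off' (after writing it)."""
        self.sums[off] = zlib.crc32(self.buf[off:off + PAGE])

    def verify(self, off: int) -> bool:
        """Check the page at 'off' against its recorded checksum."""
        return zlib.crc32(self.buf[off:off + PAGE]) == self.sums[off]

data = bytearray(4 * PAGE)
guard = PageGuard(data)
data[PAGE + 17] ^= 0x04  # inject a single bit-flip into page 1
print([guard.verify(off) for off in range(0, len(data), PAGE)])
# -> [True, False, True, True]
```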

An extensible operating system design for large-scale parallel machines

Riesen, Rolf; Ferreira, Kurt

Running untrusted user-level code inside an operating system kernel was studied in the 1990s but never gained wide adoption. We believe the time has come to resurrect kernel extensions for operating systems that run on highly parallel clusters and supercomputers, because the usage model for these machines differs significantly from that of a desktop machine or server. In addition, vendors are starting to add features such as floating-point accelerators, multicore processors, and reconfigurable compute elements. An operating system for such machines must be adaptable to the requirements of specific applications and must provide abstractions for accessing next-generation hardware features, all without sacrificing performance or scalability.

Asking the right questions: Benchmarking fault-tolerant extreme-scale systems

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick W.; Ferreira, Kurt; Levy, Scott; Bridges, Patrick G.; Arnold, Dorian; Brightwell, Ronald B.

Much recent research has explored fault-tolerance mechanisms intended for current and future extreme-scale systems. Evaluations of the suitability of checkpoint-based solutions have typically been carried out using relatively uncomplicated computational kernels designed to measure floating-point performance. More recent investigations have added scaled-down "proxy" applications to more closely match the composition and behavior of deployed applications. However, the information obtained from these studies (whether floating-point performance or application runtime) is not necessarily the most valuable for evaluating resilience strategies. We observe that, even when using a more sophisticated metric, the results of evaluating uncoordinated checkpointing with microbenchmarks and with proxy applications do not agree. This implies not only that researchers might be asking the wrong questions, but that the answers to the right ones might be unexpected and potentially misleading. We seek to open a discussion on whether benchmarks designed to provide predictable performance evaluations of HPC hardware and toolchains provide the right feedback for evaluating fault tolerance in these applications, and more generally on how the benchmarking of resilience mechanisms ought to be approached in the exascale design space.
