Publications

Results 51–75 of 89

A tunable, software-based DRAM error detection and correction library for HPC

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Fiala, David; Ferreira, Kurt; Mueller, Frank; Engelmann, Christian

Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft errors, or bit flips, are expected to increase greatly due to the higher memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for the expected soft-error rate, so additional software support will be needed to address this challenge. In this paper, we introduce LIBSDC, a tunable, transparent silent data corruption (SDC) detection and correction library for HPC applications. LIBSDC provides comprehensive SDC protection for program memory by performing on-demand page integrity verification. Experimental benchmarks with Mantevo HPCCG show that, once tuned, LIBSDC achieves SDC protection with a 50% resource overhead, less than the 100% required for double modular redundancy. © 2012 Springer-Verlag Berlin Heidelberg.
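The library's actual mechanism (transparent page-protection hooks) is not reproduced here, but the idea of on-demand page integrity verification can be sketched in a few lines: keep a checksum per page, verify it before a page is consumed, and repair from a redundant copy when a mismatch is detected. All names below are hypothetical, and the full shadow copy is deliberately naive; LIBSDC's tuning is what brings the cost down to the 50% figure above.

```python
import zlib

PAGE = 4096

class GuardedMemory:
    """Minimal sketch of detect-and-correct via per-page checksums.
    Purely illustrative; LIBSDC interposes on page faults instead of
    exposing an explicit read/write API like this one."""

    def __init__(self, size: int) -> None:
        self._data = bytearray(size)
        self._shadow = bytearray(size)   # redundant copy used for correction
        npages = (size + PAGE - 1) // PAGE
        self._sums = [zlib.crc32(self._page(self._data, i)) for i in range(npages)]

    def _page(self, buf: bytearray, i: int) -> bytes:
        return bytes(buf[i * PAGE:(i + 1) * PAGE])

    def _verify(self, i: int) -> None:
        # On-demand check: recompute the checksum; on mismatch, try to
        # restore the page from the shadow copy (a detected bit flip).
        if zlib.crc32(self._page(self._data, i)) != self._sums[i]:
            if zlib.crc32(self._page(self._shadow, i)) == self._sums[i]:
                self._data[i * PAGE:(i + 1) * PAGE] = self._shadow[i * PAGE:(i + 1) * PAGE]
            else:
                raise RuntimeError(f"uncorrectable SDC in page {i}")

    def read(self, off: int, n: int) -> bytes:
        for i in range(off // PAGE, (off + n - 1) // PAGE + 1):
            self._verify(i)
        return bytes(self._data[off:off + n])

    def write(self, off: int, payload: bytes) -> None:
        self._data[off:off + len(payload)] = payload
        self._shadow[off:off + len(payload)] = payload
        for i in range(off // PAGE, (off + len(payload) - 1) // PAGE + 1):
            self._sums[i] = zlib.crc32(self._page(self._data, i))
```

Verifying only the pages an access actually touches, rather than sweeping all of memory, is what makes the "on-demand" approach tunable: protection cost scales with the working set.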


Libhashckpt: Hash-based incremental checkpointing using GPU's

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Ferreira, Kurt; Riesen, Rolf; Brightwell, Ronald B.; Bridges, Patrick; Arnold, Dorian

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications. © 2011 Springer-Verlag Berlin Heidelberg.
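The core trick, hashing blocks of application memory and persisting only those whose hash changed since the last checkpoint, fits in a few lines of Python (names are hypothetical; the real library narrows the candidate pages with page protection and computes the hashes on a GPU):

```python
import hashlib

BLOCK = 4096  # granularity at which changes are detected

def incremental_checkpoint(data: bytes, prev_hashes: dict, store: dict) -> dict:
    """Persist only the blocks whose content hash changed since the
    previous checkpoint; return the updated hash table."""
    new_hashes = {}
    for i in range(0, len(data), BLOCK):
        block = data[i:i + BLOCK]
        h = hashlib.sha1(block).digest()
        new_hashes[i] = h
        if prev_hashes.get(i) != h:   # block is dirty: write it out
            store[i] = block
    return new_hashes

# Usage: the first checkpoint writes everything, the second only the
# one block that was modified in between.
state = bytearray(4 * BLOCK)
prev, full_store = {}, {}
prev = incremental_checkpoint(bytes(state), prev, full_store)
state[BLOCK] ^= 0xFF                       # dirty exactly one block
delta_store = {}
prev = incremental_checkpoint(bytes(state), prev, delta_store)
assert len(full_store) == 4 and len(delta_store) == 1
```

Because a hash collision could, with vanishingly small probability, mask a real change, this style of incremental checkpointing is probabilistic, which is how the checkpoint/restart report below also characterizes it.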


Keeping checkpoint/restart viable for exascale systems

Ferreira, Kurt; Oldfield, Ron A.; Stearley, Jon S.; Laros, James H.; Pedretti, Kevin P.; Brightwell, Ronald B.

Next-generation exascale systems, those capable of performing a quintillion (10^18) operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults, like checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches over a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.


A Model-Based Case for Redundant Computation

Stearley, Jon S.; Robinson, David G.; Ferreira, Kurt

Despite its seemingly nonsensical cost, we show through modeling and simulation that redundant computation merits full consideration as a resilience strategy for next-generation systems. Without revolutionary breakthroughs in failure rates, part counts, or stable-storage bandwidths, it has been shown that the utility of exascale systems will be crushed by the overheads of traditional checkpoint/restart mechanisms. Alternate resilience strategies must be considered, and redundancy is a proven approach in many domains. We develop a distribution-independent model for job interrupts on systems of arbitrary redundancy, adapt Daly's model for total application runtime, and find that his estimate for the optimal checkpoint interval remains valid for redundant systems. We then identify conditions under which redundancy is more cost-effective than non-redundancy. We do so in the context of the top-ranked supercomputers of the last decade, showing that thorough consideration of redundant computation is timely, if not overdue.
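Daly's estimate can be evaluated directly. The sketch below implements his widely cited first-order approximation for the optimal checkpoint interval, tau_opt ~ sqrt(2*delta*M) - delta, together with his runtime model; the numeric inputs are illustrative assumptions, not values from the paper.

```python
from math import sqrt, exp

def daly_opt_interval(delta: float, M: float) -> float:
    """Daly's first-order approximation of the optimal compute time
    between checkpoints, tau_opt ~ sqrt(2*delta*M) - delta (delta << M)."""
    return sqrt(2 * delta * M) - delta

def daly_runtime(Ts: float, tau: float, delta: float, R: float, M: float) -> float:
    """Daly's model of expected wall-clock time to complete Ts seconds of
    useful work: checkpoint every tau seconds (commit time delta),
    restart time R, mean time to interrupt M."""
    return M * exp(R / M) * (exp((tau + delta) / M) - 1) * Ts / tau

# Illustrative inputs (assumptions, not from the paper):
delta, R = 300.0, 600.0          # 5 min checkpoint commit, 10 min restart
M = 24 * 3600.0                  # 1 day mean time to interrupt
tau = daly_opt_interval(delta, M)
print(f"optimal checkpoint interval: {tau / 3600:.2f} h")
print(f"one week of work takes: {daly_runtime(7 * 24 * 3600.0, tau, delta, R, M) / (24 * 3600):.2f} days")
```

Since tau_opt grows with sqrt(M), any technique that lengthens the mean time to interrupt, such as the redundancy analyzed here, directly stretches the optimal checkpoint interval.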


rMPI: increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Oldfield, Ron A.; Stearley, Jon S.; Laros, James H.; Pedretti, Kevin P.; Brightwell, Ronald B.

As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are unsuitable at these scales due to excessive overheads predicted to more than double an application's time to solution. Redundant computation, long used in distributed and mission-critical systems, has been suggested as an alternative to checkpoint-restart. In this paper, we describe the rMPI library, which enables portable and transparent redundant computation for MPI applications. We detail the design of the library as well as two replica-consistency protocols, measure the overheads of this library at scale on a number of real-world applications, and finally show the significant increase in an application's time to solution at extreme scale as well as the scenarios in which redundant computation makes sense.
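The paper's consistency protocols are not reproduced here, but the basic structural idea behind transparent redundancy, partitioning the physical ranks so that every logical application rank has a mirror and fanning each logical send out to both replicas of the destination, can be shown with a toy mapping (function names hypothetical; rMPI itself interposes on the MPI profiling interface):

```python
def split_rank(world_rank: int, app_size: int) -> tuple[int, int]:
    """Map a physical rank in a 2x-redundant job of 2*app_size processes
    to (logical application rank, replica id)."""
    return world_rank % app_size, world_rank // app_size

def physical_targets(logical_dest: int, app_size: int) -> list[int]:
    """A logical send is duplicated to both replicas of the destination,
    so either replica can carry the application forward after a failure."""
    return [logical_dest, logical_dest + app_size]

APP_SIZE = 4            # 8 physical ranks emulate a 4-rank application
for r in range(2 * APP_SIZE):
    app, rep = split_rank(r, APP_SIZE)
    print(f"physical rank {r} -> app rank {app}, replica {rep}")
print(physical_targets(2, APP_SIZE))   # a send to logical rank 2 goes to [2, 6]
```

What this sketch omits is exactly what makes the two replica-consistency protocols interesting: keeping nondeterministic events, such as receives from MPI_ANY_SOURCE, identical across both replicas.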


Redundant computing for exascale systems

Ferreira, Kurt; Stearley, Jon S.; Oldfield, Ron A.; Laros, James H.; Pedretti, Kevin P.; Brightwell, Ronald B.

Exascale systems will have hundreds of thousands of compute nodes and millions of components, which increases the likelihood of faults. Today, applications use checkpoint/restart to recover from these faults. Even under ideal conditions, applications running on more than 50,000 nodes will spend more than half of their total running time saving checkpoints, restarting, and redoing work that was lost. Redundant computing allows an application to continue working even when failures occur: instead of each failure causing an application interrupt, multiple failures can be absorbed until redundancy is exhausted. In this paper, we present a method to analyze the benefits of redundant computing, present simulation results of its cost, and compare it to other proposed methods for fault resilience.
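A back-of-the-envelope Monte Carlo makes the absorption effect concrete (the simulation and its numbers are illustrative, not the paper's): with every node mirrored, the application is interrupted only when both replicas of the same logical node have failed, so the number of failures absorbed per interrupt grows with node count, a birthday-problem effect.

```python
import random

def failures_until_interrupt(n_nodes: int, rng: random.Random) -> int:
    """Kill distinct replicas uniformly at random until both replicas of
    some logical node are dead; return how many failures were absorbed,
    counting the fatal one."""
    replicas = [(node, copy) for node in range(n_nodes) for copy in (0, 1)]
    rng.shuffle(replicas)
    dead = set()
    for k, (node, _) in enumerate(replicas, start=1):
        if node in dead:
            return k          # second replica of this node just died
        dead.add(node)
    return len(replicas)

rng = random.Random(0)
for n in (1_000, 10_000, 100_000):
    trials = [failures_until_interrupt(n, rng) for _ in range(100)]
    print(f"{n:>7} nodes: ~{sum(trials) / len(trials):.0f} failures absorbed per interrupt")
```

At a hundred thousand mirrored nodes, hundreds of failures pass before the first application interrupt, which is the benefit the analysis weighs against the doubled hardware cost.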


See applications run and throughput jump: The case for redundant computing in HPC

Proceedings of the International Conference on Dependable Systems and Networks

Riesen, Rolf; Ferreira, Kurt; Stearley, Jon S.

For future parallel-computing systems with as few as twenty thousand nodes, we propose redundant computing to reduce the number of application interrupts. The frequency of faults in exascale systems will be so high that traditional checkpoint/restart methods will break down: applications will be interrupted so often that they will spend more time restarting and recovering lost work than computing the solution. We show that redundant computation at large scale can be cost-effective and allows applications to complete their work in significantly less wall-clock time. On truly large systems, redundant computing can increase system throughput by an order of magnitude. © 2010 IEEE.


LDRD final report : a lightweight operating system for multi-core capability class supercomputers

Pedretti, Kevin P.; Levenhagen, Michael J.; Ferreira, Kurt; Brightwell, Ronald B.; Kelly, Suzanne M.; Bridges, Patrick G.

The two primary objectives of this LDRD project were to create a lightweight kernel (LWK) operating system (OS) designed to take maximum advantage of multi-core processors, and to leverage the virtualization capabilities in modern multi-core processors to create a more flexible and adaptable LWK environment. The most significant technical accomplishments of this project were the development of the Kitten lightweight kernel, the co-development of the SMARTMAP intra-node memory mapping technique, and the development and demonstration of a scalable virtualization environment for HPC. Each of these topics is presented in this report by the inclusion of a published or submitted research paper. The results of this project are being leveraged by several ongoing and new research projects.


Fault-tolerance for exascale systems

Ferreira, Kurt; Riesen, Rolf

Periodic, coordinated checkpointing to disk is the most prevalent fault-tolerance method used in modern large-scale, capability-class, high-performance computing (HPC) systems. Previous work has shown that as the system grows in size, the inherent synchronization of coordinated checkpoint/restart (CR) limits application scalability; at large node counts the application spends most of its time checkpointing instead of executing useful work. Furthermore, a single component failure forces an application restart from the last correct checkpoint. Suggested alternatives to coordinated CR include uncoordinated CR with message logging, redundant computation, and RAID-inspired, in-memory distributed checkpointing schemes. Each of these alternatives has differing overheads that depend on both the scale and the communication characteristics of the application. In this work, using the Structural Simulation Toolkit (SST) simulator, we compare the performance characteristics of each of these resilience methods for a number of HPC application patterns on proposed exascale machines. The results of this work provide valuable guidance on the most efficient resilience methods for exascale systems.
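As a back-of-the-envelope companion to the simulation study (everything below is an illustrative assumption, not an SST result), Daly's checkpoint model already suggests why the ranking between these methods shifts with scale: coordinated disk-based CR suffers from a large checkpoint commit time delta, in-memory schemes shrink delta, and redundancy sacrifices half the nodes but greatly inflates the mean time to interrupt M.

```python
from math import sqrt, exp

def efficiency(delta: float, M: float, R: float = 600.0) -> float:
    """Useful-work fraction under Daly's runtime model, evaluated at his
    first-order optimal checkpoint interval (assumes delta << M)."""
    tau = sqrt(2 * delta * M) - delta
    blowup = M * exp(R / M) * (exp((tau + delta) / M) - 1) / tau  # T_wall / T_work
    return 1.0 / blowup

# Illustrative regimes, not SST results:
M = 4 * 3600.0        # 4 h mean time to interrupt at full scale
M_rep = 50 * M        # replication inflates MTTI (assumed factor)
print(f"disk CR   (delta=600 s): {efficiency(600.0, M):.0%}")
print(f"in-memory (delta= 10 s): {efficiency(10.0, M):.0%}")
print(f"2x redundancy (half the nodes do unique work): {0.5 * efficiency(600.0, M_rep):.0%}")
```

As M shrinks with machine size, the disk-based figure collapses while the redundancy figure stays pinned near 50%, which is the crossover behavior the simulations quantify in detail.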
