Page 2 – Center for Computing Research (CCR)

Improving Application Resilience to Memory Errors with Lightweight Compression

Levy, Scott L.; Ferreira, Kurt B.; Bridges, Patrick G.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI DOI

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale

Levy, Scott L.; Ferreira, Kurt B.; Widener, Patrick W.; Bridges, Patrick G.; Mondragon, Oscar H.

Abstract not provided.

TYPE Presentation YEAR 2016

OSTI DOI

Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems

Levy, Scott L.

High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the first supercomputer capable of executing more than an exaflop, 10¹⁸ floating-point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extreme-scale systems. Coordinated checkpointing is currently the most widely used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains effective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniques include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and software-based memory fault correction. In this thesis, we examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, we evaluate the potential impact of rollback avoidance on these systems. We then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, we examine the feasibility of using this technique to protect against memory faults in kernel memory.

TYPE Thesis or Dissertation YEAR 2016

OSTI

How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging latent synchronization in MPI collective algorithms

Levy, Scott L.; Ferreira, Kurt B.; Widener, Patrick W.; Bridges, Patrick G.; Mondragon, Oscar H.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Similarity Engine: Using Content Similarity to Improve Memory Resilience

Levy, Scott L.; Ferreira, Kurt B.; Bridges, Patrick G.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Similarity Engine: Using Content Similarity to Improve Memory Resilience

Levy, Scott L.; Ferreira, Kurt B.; Bridges, Patrick G.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Using Machine Learning to Optimize Uncoordinated Checkpointing Performance

Ferreira, Kurt B.; Levy, Scott L.; Widener, Patrick W.; Arnold, Dorian A.

Abstract not provided.

TYPE Conference Poster YEAR 2014

OSTI

Exploring the effect of noise on the performance benefit of non-blocking MPI_Allreduce

Ferreira, Kurt; Levy, Scott L.

Abstract not provided.

TYPE Conference YEAR 2014

OSTI

Understanding the Effects of Communication and Coordination on Checkpointing at Scale

Ferreira, Kurt; Levy, Scott L.; Widener, Patrick W.

Abstract not provided.

TYPE Conference YEAR 2014

OSTI DOI

Characterizing the Impact of Rollback Avoidance at Extreme-Scale: A Modeling Approach

Ferreira, Kurt; Levy, Scott L.

Abstract not provided.

TYPE Conference YEAR 2014

OSTI

Using simulation to evaluate the performance of resilience strategies and process failures

Levy, Scott L.; Ferreira, Kurt; Widener, Patrick W.

Fault tolerance has been identified as a major challenge for future extreme-scale systems. Current predictions suggest that, as systems grow in size, failures will occur more frequently. Because increases in failure frequency reduce the performance and scalability of these systems, significant effort has been devoted to developing and refining resilience mechanisms to mitigate the impact of failures. However, effective evaluation of these mechanisms has been challenging. Current systems are smaller and have significantly different architectural features (e.g., interconnect, persistent storage) than we expect to see in next-generation systems. To overcome these challenges, we propose the use of simulation. Simulation has been shown to be an effective tool for investigating performance characteristics of applications on future systems. In this work, we: identify the set of system characteristics that are necessary for accurate performance prediction of resilience mechanisms for HPC systems and applications; demonstrate how these system characteristics can be incorporated into an existing large-scale simulator; and evaluate the predictive performance of our modified simulator. We also describe how we were able to optimize the simulator for large temporal and spatial scales, allowing the simulator to run 4x faster and use over 100x less memory.

TYPE SAND Report YEAR 2014

OSTI DOI

Understanding the Effects of Communication on Uncoordinated Checkpointing at Scale

Ferreira, Kurt; Widener, Patrick W.; Levy, Scott L.

Abstract not provided.

TYPE Conference YEAR 2014

OSTI

Predicting the Impact of Failure Avoidance on Checkpoint/Restart in Extreme-Scale Systems

Levy, Scott L.; Ferreira, Kurt

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

Predicting Coordinated and Uncoordinated Checkpoint/Restart Protocol Performance at Extreme Scales

Ferreira, Kurt; Levy, Scott L.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale

Levy, Scott L.; Ferreira, Kurt; Widener, Patrick W.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI DOI

A Holistic Approach to Modeling and Simulation for Resilience and Power Configuration

Ferreira, Kurt; Levy, Scott L.; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

A Simulation Infrastructure for Examining the Performance of Resilience Strategies at Scale

Ferreira, Kurt; Levy, Scott L.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

A simulation infrastructure for examining the performance of resilience strategies at scale

Ferreira, Kurt; Levy, Scott L.

Fault tolerance is a major challenge for many current and future extreme-scale systems, with many studies showing it to be the key limiter to application scalability. While there are a number of studies investigating the performance of various resilience mechanisms, these are typically limited to scales orders of magnitude smaller than expected for next-generation systems and simple benchmark problems. In this paper we show how, with very minor changes, a previously published and validated simulation framework for investigating application performance of OS noise can be used to simulate the overheads of various resilience mechanisms at scale. Using this framework, we compare the failure-free performance of this simulator against an analytic model to validate its performance and demonstrate its ability to simulate the performance of two popular rollback recovery methods on traces from real

TYPE SAND Report YEAR 2013

OSTI DOI