Page 2 – Center for Computing Research (CCR)

A simulation infrastructure for examining the performance of resilience strategies at scale

Fault-tolerance is a major challenge for many current and future extreme-scale systems, with many studies showing it to be the key limiter to application scalability. While there are a number of studies investigating the performance of various resilience mechanisms, these are typically limited to scales orders of magnitude smaller than expected for next-generation systems and simple benchmark problems. In this paper we show how, with very minor changes, a previously published and validated simulation framework for investigating application performance of OS noise can be used to simulate the overheads of various resilience mechanisms at scale. Using this framework, we compare the failure-free performance of this simulator against an analytic model to validate its performance and demonstrate its ability to simulate the performance of two popular rollback recovery methods on traces from real

TYPE SAND Report YEAR 2013

Checkpoint Compression - An Application Transparent Performance Optimization

Ferreira, Kurt

Abstract not provided.

TYPE Presentation YEAR 2013

Protect Yourself: Why Your OS Must Protect Against DRAM Failures

Ferreira, Kurt; Pedretti, Kevin P.; Levy, Scott L.

Abstract not provided.

TYPE Conference YEAR 2013

A Comparison of Compression and Increment-based Checkpoint Optimizations

Ferreira, Kurt; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2013

Accelerating Incremental Checkpointing for Extreme-Scale Computing

Proposed for publication in Future Generation Computer Systems.

Ferreira, Kurt; Brightwell, Ronald B.

Abstract not provided.

TYPE Journal Article YEAR 2013

Using Unreliable Virtual Hardware to Inject Errors in Extreme-Scale Systems

Levy, Scott L.; Ferreira, Kurt

Abstract not provided.

TYPE Conference YEAR 2013

An examination of content similarity within the memory of HPC applications

Ferreira, Kurt; Thompson, Aidan P.; Trott, Christian R.; Levy, Scott L.

Abstract not provided.

TYPE SAND Report YEAR 2013

A GPU-based Checkpoint Compression Study Size Does Matter -- More Than Speed Anyway

Ferreira, Kurt; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2013

Evaluating the Feasibility of Using Memory Content Similarity to Improve System Resilience

Ferreira, Kurt; Thompson, Aidan P.; Trott, Christian R.

Abstract not provided.

TYPE Conference YEAR 2013

Addressing Message-log Scalability for Extreme-scale Systems

Topp, Bryan E.; Ferreira, Kurt

Abstract not provided.

TYPE Conference YEAR 2012

An Operating System Resilient to DRAM Failures

Ferreira, Kurt; Pedretti, Kevin P.; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2012

The case for extensible operating systems for exascale

Ferreira, Kurt

Abstract not provided.

TYPE Conference YEAR 2012

Exploiting Content Similarity to Improve Memory Performance in Exascale Systems

Ferreira, Kurt; Fiala, David F.

Abstract not provided.

TYPE Conference YEAR 2012

On the Viability of Compression for Reducing the Overheads of Checkpoint/Restart-based Fault Tolerance

Ferreira, Kurt; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2012

The Viability of Using Compression to Decrease Message Log Sizes

Ferreira, Kurt; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2012

Cooperative application/OS DRAM fault recovery

Hoemmen, Mark F.; Ferreira, Kurt; Heroux, Michael A.; Brightwell, Ronald B.

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application/OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.

TYPE SAND Report YEAR 2012

Evaluating operating system vulnerability to memory errors

Ferreira, Kurt; Pedretti, Kevin P.; Brightwell, Ronald B.

Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically represents a small footprint of a compute node's physical memory, recent studies show more memory errors in this region of memory than the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.

TYPE SAND Report YEAR 2012

Checkpoint Compression for Improved Checkpoint/Restart

Ferreira, Kurt; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2012

Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing

Ferreira, Kurt; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2012

Alleviating Scalability Issues of Checkpointing Protocols

Ferreira, Kurt

Abstract not provided.

TYPE Conference YEAR 2012

Evaluating Operating System Vulnerability to Memory Errors

Ferreira, Kurt; Pedretti, Kevin P.; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2012

Does Partial Replication Pay Off?

Stearley, Jon S.; Ferreira, Kurt; Robinson, David G.

Abstract not provided.

TYPE Conference YEAR 2012

Improvements to the Structural Simulation Toolkit

Rodrigues, Arun; Leung, Vitus J.; Levenhagen, Michael J.; Ferreira, Kurt; Hemmert, Karl S.; Barrett, Brian B.

Abstract not provided.

TYPE Conference YEAR 2012

Demonstration of a Legacy Application's Path to Exascale - ASC L2 Milestone 4467

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2012

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels, allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467, Demonstration of a Legacy Application's Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE R&D in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight kernels (LWKs), virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications, Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point, as more than one issue was found with the two codes.
The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.

TYPE SAND Report YEAR 2012
