Publications

Results 51–100 of 146

Search results

Jump to search filters

Software Resilience using Kokkos Ecosystem

Miles, Jeffery S.; Morales, Nicolas; Teranishi, Keita; Trott, Christian R.

Due to the cost of hardware failures within mission critical and scientific applications, it is necessary for software to provide a mechanism to prevent or recover from interruptions. The Kokkos ecosystem is a programming environment that provides performance and portability to many applications that run on DOE supercomputers as well as smaller scale systems. These applications require a higher level of service due to the cost associated with each simulation or the critical nature of the mission. Software resilience enables an application of manage hardware failures reducing the cost of an interruption. Two different resilience methodologies have been added to the Kokkos ecosystem: checkpointing has been added for restart capabilities and a resilient execution model has been added to account for failures in compute devices. The design and implementation of each of these additions are described, and appropriate examples are included for end users.

More Details

Enabling Resilience in Asynchronous Many-Task Programming Models

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Paul, Sri R.; Hayashi, Akihiro; Slattengren, Nicole L.; Kolla, Hemanth; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita; Mayo, Jackson R.; Sarkar, Vivek

Resilience is an imminent issue for next-generation platforms due to projected increases in soft/transient failures as part of the inherent trade-offs among performance, energy, and costs in system design. In this paper, we introduce a comprehensive approach to enabling application-level resilience in Asynchronous Many-Task (AMT) programming models with a focus on remedying Silent Data Corruption (SDC) that can often go undetected by the hardware and OS. Our approach makes it possible for the application programmer to declaratively express resilience attributes with minimal code changes, and to delegate the complexity of efficiently supporting resilience to our runtime system. We have created a prototype implementation of our approach as an extension to the Habanero C/C++ library (HClib), where different resilience techniques including task replay, task replication, algorithm-based fault tolerance (ABFT), and checkpointing are available. Our experimental results show that task replay incurs lower overhead than task replication when an appropriate error checking function is provided. Further, task replay matches the low overhead of ABFT. Our results also demonstrate the ability to combine different resilience schemes. To evaluate the effectiveness of our resilience mechanisms in the presence of errors, we injected synthetic errors at different error rates (1.0%, and 10.0%) and found modest increase in execution times. In summary, the results show that our approach supports efficient and scalable recovery, and that our approach can be used to influence the design of future AMT programming models and runtime systems that aim to integrate first-class support for user-level resilience.

More Details

FY18 ASC P&EM L2 Milestone 6362: Local Failure Local Recovery (LFLR) Resiliency for Asynchronous Many Task (AMT) Programming and Execution Models: Executive Summary

Teranishi, Keita; Clay, Robert L.

The overall goal of this work was to perform an in-depth analysis of resilience schemes adapted to the Asynchronous Many-Task (AMT) programming and execution model with the goal of informing the Sandia Advanced Simulation and Computing (ASC) program's application development strategy for next generation platforms (NGPs).

More Details

FY18 ASC P&EM L2 Milestone 6362: Local Failure Local Recovery (LFLR) Resiliency for Asynchronous Many Task (AMT) Programming and Execution Models: Executive Summary

Teranishi, Keita; Clay, Robert L.

The overall goal of this work was to perform an in-depth analysis of resilience schemes adapted to the Asynchronous Many-Task (AMT) programming and execution model with the goal of informing the Sandia Advanced Simulation and Computing (ASC) program's application development strategy for next generation platforms (NGPs).

More Details

FY18 ASC CSSE L2 Milestone 6362: Local Failure Local Recovery (LFLR) Resiliency for Asynchronous Many Task (AMT) Programming and Execution Models

Teranishi, Keita; Clay, Robert L.

The overall goal of this work was to perform an in-depth analysis of resilience schemes adapted to the Asynchronous Many-Task (AMT) programming and execution model with the goal of informing the Sandia Advanced Simulation and Computing (ASC) program's application development strategy for next generation platforms (NGPs).

More Details

ASC CSSE Level 2 Milestone #6362: Resilient Asynchronous Many Task Programming Model

Teranishi, Keita; Kolla, Hemanth; Slattengren, Nicole L.; Whitlock, Matthew J.; Mayo, Jackson R.; Clay, Robert L.; Paul, Sri R.; Hayashi, Akihiro; Sarkar, Vivek

This report is an outcome of the ASC CSSE Level 2 Milestone 6362: Analysis of Re- silient Asynchronous Many-Task (AMT) Programming Model. It comprises a summary and in-depth analysis of resilience schemes adapted to the AMT programming model. Herein, performance trade-offs of a resilient-AMT prograrnming model are assessed through two ap- proaches: (1) an analytical model realized by discrete event simulations and (2) empirical evaluation of benchmark programs representing regular and irregular workloads of explicit partial differential equation solvers. As part of this effort, an AMT execution simulator and a prototype resilient-AMT programming framework have been developed. The former permits us to hypothesize the performance behavior of a resilient-AMT model, and has undergone a verification and validation (V&V) process. The latter allows empirical evaluation of the perfor- mance of resilience schemes under emulated program failures and enabled the aforementioned V&V process. The outcome indicates that (1) resilience techniques implemented within an AMT framework allow efficient and scalable recovery under frequent failures, that (2) the abstraction of task and data instances in the AMT programming model enables readily us- able Application Program Interfaces (APIs) for resilience, and that (3) this abstraction enables predicting the performance of resilient-AMT applications with a simple simulation infrastruc- ture. This outcome will provide guidance for the design of the AMT programming model and runtime systems, user-level resilience support, and application development for ASC's next generation platforms (NGPs).

More Details

RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

International Journal of Parallel Programming

Hukerikar, Saurabh; Teranishi, Keita; Diniz, Pedro C.; Lucas, Robert F.

In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance. In this paper we present RedThreads, an interface that provides application-level fault detection and correction based on RMT, but applies the thread-level redundancy adaptively. We describe the RedThreads syntax and semantics, and the supporting compiler infrastructure and runtime system. Our approach enables application programmers to scope the extent of redundant computation. Additionally, the runtime system permits the use of RMT to be dynamically enabled, or disabled, based on the resiliency needs of the application and the state of the system. Our experimental results demonstrate how adaptive RMT exploits programmer insight and runtime inference to dynamically navigate the trade-off space between an application’s resilience coverage and the associated performance overhead of redundant computation.

More Details

Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping

SIAM Journal on Scientific Computing

Gamell, Marc; Teranishi, Keita; Kolla, Hemanth; Mayo, Jackson R.; Heroux, Michael A.; Chen, Jacqueline H.; Parashar, Manish

In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still required to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduction of impact on total time to solution due to multiple failures. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank remapping to increase the probability of failure masking. To conclude, this paper shows the integration of all aforementioned mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every five seconds) with low overhead while sustaining performance at large scales. In addition, this demonstration also displays the failure masking probability increase resulting from the combination of both ghost region expansion and cell-to-rank remapping.

More Details

Modeling and simulating multiple failure masking enabled by local recovery for stencil-based applications at extreme scales

IEEE Transactions on Parallel and Distributed Systems

Gamell, Marc; Teranishi, Keita; Mayo, Jackson R.; Kolla, Hemanth; Heroux, Michael A.; Chen, Jacqueline H.; Parashar, Manish

Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online recovery is performed in a local manner further scalability is enabled, not only due to the intrinsic lower costs of recovering locally, but also due to derived effects when using some application types. In this paper we model one such effect, namely multiple failure masking, that manifests when running Stencil parallel computations on an environment when failures are recovered locally. First, the delay propagation shape of one or multiple failures recovered locally is modeled to enable several analyses of the probability of different levels of failure masking under certain Stencil application behaviors. Our results indicate that failure masking is an extremely desirable effect at scale which manifestation is more evident and beneficial as the machine size or the failure rate increase.

More Details

Specification of Fenix MPI Fault Tolerance library version 1.0.1

Gamble, Marc; Van Der Wijngaart, Rob; Teranishi, Keita; Parashar, Manish

This document provides a specification of Fenix, a software library compatible with the Message Passing Interface (MPI) to support fault recovery without application shutdown. The library consists of two modules. The first, termed process recovery , restores an application to a consistent state after it has suffered a loss of one or more MPI processes (ranks). The second specifies functions the user can invoke to store application data in Fenix managed redundant storage, and to retrieve it from that storage after process recovery.

More Details

Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques

Proceedings of the International Conference on Parallel Processing Workshops

Gamell, Marc; Katz, Daniel S.; Teranishi, Keita; Heroux, Michael A.; Van Der Wijngaart, Rob F.; Mattson, Timothy G.; Parashar, Manish

Exascale systems promise the potential for computation atunprecedented scales and resolutions, but achieving exascale by theend of this decade presents significant challenges. A key challenge isdue to the very large number of cores and components and the resultingmean time between failures (MTBF) in the order of hours orminutes. Since the typical run times of target scientific applicationsare longer than this MTBF, fault tolerance techniques will beessential. An important class of failures that must be addressed isprocess or node failures. While checkpoint/restart (C/R) is currentlythe most widely accepted technique for addressing processor failures, coordinated, stable-storage-based global C/R might be unfeasible atexascale when the time to checkpoint exceeds the expected MTBF. This paper explores transparent recovery via implicitly coordinated, diskless, application-driven checkpointing as a way to tolerateprocess failures in MPI applications at exascale. The discussedapproach leverages User Level Failure Mitigation (ULFM), which isbeing proposed as an MPI extension to allow applications to createpolicies for tolerating process failures. Specifically, this paper demonstrates how different implementations ofapplication-driven in-memory checkpoint storage and recovery comparein terms of performance and scalability. We also experimentally evaluate the effectiveness and scalability ofthe Fenix online global recovery framework on a production system-the Titan Cray XK7 at ORNL-and demonstrate the ability of Fenix totolerate dynamically injected failures using the execution of fourbenchmarks and mini-applications with different behaviors.

More Details

Exploring versioned distributed arrays for resilience in scientific applications: Global view resilience

International Journal of High Performance Computing Applications

Chien, Andrew A.; Balaji, Pavan; Dun, Nan; Fang, Aiman; Fujita, Hajime; Iskra, Kamil; Rubenstein, Zachary; Zheng, Ziming; Hammond, Jeff; Laguna, Ignacio; Richards, David F.; Dubey, Anshu; Van Straalen, Brian; Hoemmen, Mark F.; Heroux, Michael A.; Teranishi, Keita; Siegel, Andrew R.

Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and when it is most productive, and customize for each application structure independently. This control is portable, and its embedding in application source makes it natural to express and easy to maintain. The ability to name multiple versions and “partially materialize” them efficiently makes ambitious forward-recovery based on “data slices” across versions or data structures both easy to express and efficient. Using several large applications (OpenMC, preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programming effort to add resilience. The required changes are small (< 2% lines of code (LOC)), localized and machine-independent, and perhaps most important, require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads < 2% are generally achieved. This overhead suggests that GVR can be implemented in large-scale codes and support portable error recovery with modest investment and runtime impact. Our results are drawn from both IBM BG/Q and Cray XC30 experiments, demonstrating portability. We also present two case studies of flexible error recovery, illustrating how GVR can be used for multi-version rollback recovery, and several different forward-recovery schemes. GVR’s multi-version enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient. Lastly, our results suggest that GVR is scalable, portable, and efficient. GVR interfaces are flexible, supporting a variety of recovery schemes, and altogether GVR embodies a gentle-slope path to tolerate growing error rates in future extreme-scale systems.

More Details

Specification of Fenix MPI Fault Tolerance library (V. 1.0)

Gamble, Marc; Van Der Wijngaart, Rob; Teranishi, Keita; Parashar, Manish

This document provides a specification of Fenix, a software library compatible with the Message Passing Interface (MPI) to support fault recovery without application shutdown. The library consists of two modules. The first, termed process recovery, restores an application to a consistent state after it has suffered a loss of one or more MPI processes (ranks). The second specifies functions the user can invoke to store application data in Fenix managed redundant storage, and to retrieve it from that storage after process recovery.

More Details

Specification of Fenix MPI Fault Tolerance library (V.0.9)

Gammel, Marc; Van Der Wijngaart, Rob F.; Teranishi, Keita; Parashar, Manish

Fenix is a software library compatible with the Message Passing Interface (MPI) to support fault recovery without application shutdown. This specification is derived from a current implementation of Fenix that employs the User Level Fault Mitigation (ULFM) MPI fault tolerance module proposal. We only present the C library interface for Fenix; the Fortran interface will be added once the C version is complete.

More Details

DARMA 0.3.0-alpha Specification

Wilke, Jeremiah; Hollman, David S.; Slattengren, Nicole L.; Lifflander, Jonathan; Kolla, Hemanth; Rizzi, Francesco; Teranishi, Keita; Bennett, Janine C.

In this document, we provide the specifications for DARMA (Distributed Asynchronous Resilient Models and Applications), a co-design research vehicle for asynchronous many-task (AMT) programming models that serves to: 1) insulate applications from runtime system and hardware idiosyncrasies, 2) improve AMT runtime programmability by co-designing an application programmer interface (API) directly with application developers, 3) synthesize application co-design activities into meaningful requirements for runtime systems, and 4) facilitate AMT design space characterization and definition, accelerating the development of AMT best practices.

More Details
Results 51–100 of 146
Results 51–100 of 146