Publications Search

Fault tolerance in an inner-outer solver: A GVR-enabled case study

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Zheng, Ziming; Chien, Andrew A.; Teranishi, Keita

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.

TYPE Journal Article YEAR 2015

DOI OSTI Scopus

Failure Masking and Local Recovery for Stencil-based Applications at Extreme Scales

Gamell, Marc; Teranishi, Keita; Heroux, Michael A.; Mayo, Jackson R.; Kolla, Hemanth; Chen, Jacqueline H.; Parashar, Manish

Abstract not provided.

TYPE Conference Poster YEAR 2014

OSTI

Evaluation of Asynchronous Multitask Programming Models using Mini-Applications

Franko, Kenneth; Sjaardema, Gregory D.; Bennett, Janine C.; Kolla, Hemanth; Lin, Paul T.; Teranishi, Keita; Wilke, Jeremiah

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

Extreme-scale viability of collective communication for resilient task scheduling and work stealing

Proceedings of the International Conference on Dependable Systems and Networks

Wilke, Jeremiah; Bennett, Janine C.; Kolla, Hemanth; Teranishi, Keita; Slattengren, Nicole; Floren, John F.

Extreme-scale computing will bring significant changes to high performance computing system architectures. In particular, the increased number of system components is creating a need for software to demonstrate 'pervasive parallelism' and resiliency. Asynchronous, many-task programming models show promise in addressing both the scalability and resiliency challenges; however, they introduce an enormously challenging distributed, resilient consistency problem. In this work, we explore the viability of resilient collective communication in task scheduling and work stealing and, through simulation with SST/macro, the performance of these collectives on speculative extreme-scale architectures.

TYPE Conference YEAR 2014

DOI OSTI Scopus

Toward local failure local recovery resilience model using MPI-ULFM

ACM International Conference Proceeding Series

Teranishi, Keita; Heroux, Michael A.

The current system reaction to the loss of a single MPI process is to kill all the remaining processes and restart the application from the most recent checkpoint. This approach will become unfeasible for future extreme scale systems. We address this issue using an emerging resilient computing model called Local Failure Local Recovery (LFLR) that provides application developers with the ability to recover locally and continue application execution when a process is lost. We discuss the design of our software framework to enable the LFLR model using MPI-ULFM and demonstrate the resilient version of MiniFE that achieves a scalable recovery from process failures.

TYPE Conference Poster YEAR 2014

DOI OSTI Scopus

Local Recovery of PDE Solvers from Hard Failures

Teranishi, Keita; Heroux, Michael A.; Gamell Balmana, Marc; Parashar, Manish

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

DHARMA: Distributed asyncHronous Adaptive Resilient Management of Applications

Teranishi, Keita; Bennett, Janine C.; Floren, John F.; Slattengren, Nicole; Franko, Kenneth; Sjaardema, Gregory D.; Wilke, Jeremiah; Kolla, Hemanth; Clay, Robert L.; Hukerikar, Saurabh; Knight, Samuel

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

Resilient execution of sparse matrix kernels in a distributed many-task runtime

Kolla, Hemanth; Teranishi, Keita; Wilke, Jeremiah; Bennett, Janine C.; Slattengren, Nicole; Floren, John F.

Abstract not provided.

TYPE Conference Poster YEAR 2014

OSTI

Toward Local Failure Local Recovery (LFLR) Resilience Model Using MPI-ULFM

Teranishi, Keita; Heroux, Michael A.

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

DHARMA: Distributed asyncHronous Adaptive Resilient Management of Applications

Bennett, Janine C.; Clay, Robert L.; Floren, John F.; Franko, Kenneth; Hukerikar, Saurabh; Knight, Samuel; Kolla, Hemanth; Sjaardema, Gregory D.; Slattengren, Nicole; Teranishi, Keita; Wilke, Jeremiah

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

Local Recovery of PDE Solvers from Hard Failures

Teranishi, Keita; Gamell Balmana, Marc; Heroux, Michael A.; Parashar, Manish R.

Abstract not provided.

TYPE Conference Poster YEAR 2014

OSTI

An Evaluation of Lazy Fault Detection based on Adaptive Redundant Multithreading

Teranishi, Keita; Hukerikar, Saurabh; Diniz, Pedro; Lucas, Robert F.

Abstract not provided.

TYPE Conference Poster YEAR 2014

OSTI

Auto-Tuning for Unreliable HPC

Teranishi, Keita

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

Tolerating Hard Failures for Stencil-based Applications on HPC environments

Gamell Balmana, Marc; Teranishi, Keita

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

Report for the ASC CSSE L2 Milestone (4873) - Demonstration of Local Failure Local Recovery Resilient Programming Model

Heroux, Michael A.; Teranishi, Keita

Recovery from process loss during the execution of a distributed memory parallel application is presently achieved by restarting the program, typically from a checkpoint file. Future computer system trends indicate that the size of data to checkpoint, the lack of improvement in parallel file system performance, and the increase in process failure rates will lead to situations where checkpoint restart becomes infeasible. In this report we describe and prototype the use of a new application-level resilient computing model, Local Failure Local Recovery (LFLR), that manages persistent storage of local state for each process such that, if a process fails, recovery can be performed locally without requiring access to a global checkpoint file. LFLR provides application developers with an ability to recover locally and continue application execution when a process is lost. This report discusses what features are required from the hardware, OS, and runtime layers, and what approaches application developers might use in the design of future codes, including a demonstration of LFLR-enabled MiniFE code from the Mantevo mini-application suite.

TYPE SAND Report YEAR 2014

DOI OSTI