Publications

146 Results

Search results

Jump to search filters

Assembling Portable In-Situ Workflow from Heterogeneous Components using Data Reorganization

Proceedings - 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2022

Zhang, Bo; Subedi, Pradeep; Davis, Philip E.; Rizzi, Francesco; Teranishi, Keita; Parashar, Manish

Heterogeneous computing is becoming common in the HPC world. The fast-changing hardware landscape is pushing programmers and developers to rely on performance-portable programming models to rewrite old and legacy applications and develop new ones. While this approach is suitable for individual applications, outstanding challenges still remain when multiple applications are combined into complex workflows. One critical difficulty is the exchange of data between communicating applications where performance constraints imposed by heterogeneous hardware advantage different data layouts. We attempt to solve this problem by exploring asynchronous data layout conversions for applications requiring different memory access patterns for shared data. We implement the proposed solution within the DataSpaces data staging service, extending it to support heterogeneous application workflows across a broad spectrum of programming models. In addition, we integrate heterogeneous DataSpaces with the Kokkos programming model and propose the Kokkos Staging Space as an extension of the Kokkos data abstraction. This new abstraction enables us to express data on a virtual shared space for multiple Kokkos applications, thus guaranteeing the portability of each application when assembling them into an efficient heterogeneous workflow. We present performance results for the Kokkos Staging Space using a synthetic workflow emulator and three different scenarios representing access frequency and use patterns in shared data. The results show that the Kokkos Staging Space is a superior solution in terms of time-to-solution and scalability compared to existing file-based Kokkos data abstractions for inter-application data exchange.

More Details

Integrating process, control-flow, and data resiliency layers using a hybrid Fenix/Kokkos approach

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Whitlock, Matthew J.; Foulk, James W.; Bosilca, George; Bouteiller, Aurelien; Nicolae, Bogdan; Teranishi, Keita; Giem, Elisabeth; Sarkar, Vivek

Integrating recent advancements in resilient algorithms and techniques into existing codes is a singular challenge in fault tolerance - in part due to the underlying complexity of implementing resilience in the first place, but also due to the difficulty introduced when integrating the functionality of a standalone new strategy with the preexisting resilience layers of an application. We propose that the answer is not to build integrated solutions for users, but runtimes designed to integrate into a larger comprehensive resilience system and thereby enable the necessary jump to multi-layered recovery. Our work designs, implements, and verifies one such comprehensive system of runtimes. Utilizing Fenix, a process resilience tool with integration into preexisting resilience systems as a design priority, we update Kokkos Resilience and the use pattern of VeloC to support application-level integration of resilience runtimes. Our work shows that designing integrable systems rather than integrated systems allows for user-designed optimization and upgrading of resilience techniques while maintaining the simplicity and performance of all-in-one resilience solutions. More application-specific choice in resilience strategies allows for better long-term flexibility, performance, and - importantly - simplicity.

More Details

Design and Performance of Kokkos Staging Space toward Scalable Resilient Application Couplings

Zhang, Bo; Davis, Philip E.; Subedi, Pradeep; Parashar, Manish; Rizzi, Francesco; Foulk, James W.; Teranishi, Keita

With the growing number of applications designed for heterogeneous HPC devices, application programmers and users are finding it challenging to compose scalable workflows as ensembles of these applications, that are portable, performant and resilient. The Kokkos C++ library has been designed to simplify this cumbersome procedure by providing an intra-application uniform programming model and portable performance. However, assembling multiple Kokkos-enabled applications into a complex workflow is still a challenge. Although Kokkos enables a uniform programming model, the inter-application data exchange still remains a challenge from both performance and software development cost perspectives. In order to address this issue, we propose Kokkos data staging memory space, an extension of Kokkos' data abstraction (memory space) for heterogeneous computing systems. This new abstraction allows to express data on a virtual shared-space for multiple Kokkos applications, thus extending Kokkos to support inter-application data exchange to build an efficient application workflow. Additionally, we study the effectiveness of asynchronous data layout conversions for applications requiring different memory access patterns for the shared data. Our preliminary evaluation with a synthetic benchmark indicate the effectiveness of this conversion adapted to three different scenarios representing access frequency and use patterns of the shared data.

More Details

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

International Journal of High Performance Computing Applications

Benacchio, Tommaso; Bonaventura, Luca; Altenbernd, Mirco; Cantwell, Chris D.; Duben, Peter D.; Gillard, Mike; Giraud, Luc; Goddeke, Dominik; Raffin, Erwan; Teranishi, Keita; Wedi, Nils

Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.

More Details

Toward Resilient Heterogeneous Computing Workflow through Kokkos-DataSpaces Integration

Zhang, Bo; Davis, Philip E.; Parshar, Manish; Morales, Nicolas; Teranishi, Keita

With the growing number of applications designed for heterogeneous HPC devices, application programmers and users are finding it challenging to compose scalable workflows as ensembles of these applications, that are portable, performant and resilient. The Kokkos C++ library has been designed to simplify this cumbersome procedure by providing an intra-application uniform programming model and portable performance. However, assembling multiple Kokkos-enabled applications into a complex workflow is still a challenge. Although Kokkos enables a uniform programming model, the inter-application data exchange still remains a challenge from both performance and software development cost perspectives. In order to address this issue, we propose a Kokkos-DataSpaces Integration, with the goal of providing a virtual shared-space abstraction that can be accessed concurrently by all applications in an Kokkos workflow, thus extending Kokkos to support inter-application data exchange.

More Details

Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software (Extended Analysis)

Myers, Jeremy M.; Dunlavy, Daniel M.; Teranishi, Keita; Hollman, David S.

Tensor decomposition models play an increasingly important role in modern data science applications. One problem of particular interest is fitting a low-rank Canonical Polyadic (CP) tensor decomposition model when the tensor has sparse structure and the tensor elements are nonnegative count data. SparTen is a high-performance C++ library which computes a low-rank decomposition using different solvers: a first-order quasi-Newton or a second-order damped Newton method, along with the appropriate choice of runtime parameters. Since default parameters in SparTen are tuned to experimental results in prior published work on a single real-world dataset conducted using MATLAB implementations of these methods, it remains unclear if the parameter defaults in SparTen are appropriate for general tensor data. Furthermore, it is unknown how sensitive algorithm convergence is to changes in the input parameter values. This report addresses these unresolved issues with large-scale experimentation on three benchmark tensor data sets. Experiments were conducted on several different CPU architectures and replicated with many initial states to establish generalized profiles of algorithm convergence behavior.

More Details

Improving Scalability of Silent-Error Resilience for Message-Passing Solvers via Local Recovery and Asynchrony

Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Kolla, Hemanth; Mayo, Jackson R.; Teranishi, Keita; Armstrong, Robert C.

Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial-differential-equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.

More Details

Improving Scalability of Silent-Error Resilience for Message-Passing Solvers via Local Recovery and Asynchrony

Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Kolla, Hemanth; Mayo, Jackson R.; Teranishi, Keita; Armstrong, Robert C.

Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial-differential-equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.

More Details

Integrating Inter-Node Communication with a Resilient Asynchronous Many-Task Runtime System

Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Paul, Sri R.; Hayashi, Akihiro; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita; Mayo, Jackson R.; Grossman, Max; Sarkar, Vivek

Achieving fault tolerance is one of the significant challenges of exascale computing due to projected increases in soft/transient failures. While past work on software-based resilience techniques typically focused on traditional bulk-synchronous parallel programming models, we believe that Asynchronous Many-Task (AMT) programming models are better suited to enabling resiliency since they provide explicit abstractions of data and tasks which contribute to increased asynchrony and latency tolerance. In this paper, we extend our past work on enabling application-level resilience in single node AMT programs by integrating the capability to perform asynchronous MPI communication, thereby enabling resiliency across multiple nodes. We also enable resilience against fail-stop errors where our runtime will manage all re-execution of tasks and communication without user intervention. Our results show that we are able to add communication operations to resilient programs with low overhead, by offloading communication to dedicated communication workers and also recover from fail-stop errors transparently, thereby enhancing productivity.

More Details

SparTen: Leveraging Kokkos for On-node Parallelism in a Second-Order Method for Fitting Canonical Polyadic Tensor Models to Poisson Data

2020 IEEE High Performance Extreme Computing Conference, HPEC 2020

Teranishi, Keita; Dunlavy, Daniel M.; Myers, Jeremy M.; Barrett, Richard F.

Canonical Polyadic tensor decomposition using alternate Poisson regression (CP-APR) is an effective analysis tool for large sparse count datasets. One of the variants using projected damped Newton optimization for row subproblems (PDNR) offers quadratic convergence and is amenable to parallelization. Despite its potential effectiveness, PDNR performance on modern high performance computing (HPC) systems is not well understood. To remedy this, we have developed a parallel implementation of PDNR using Kokkos, a performance portable parallel programming framework supporting efficient runtime of a single code base on multiple HPC systems. We demonstrate that the performance of parallel PDNR can be poor if load imbalance associated with the irregular distribution of nonzero entries in the tensor data is not addressed. Preliminary results using tensors from the FROSTT data set indicate that using multiple kernels to address this imbalance when solving the PDNR row subproblems in parallel can improve performance, with up to 80% speedup on CPUs and 10-fold speedup on NVIDIA GPUs.

More Details

Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software

2020 IEEE High Performance Extreme Computing Conference, HPEC 2020

Myers, Jeremy M.; Dunlavy, Daniel M.; Teranishi, Keita; Hollman, David S.

Tensor decomposition models play an increasingly important role in modern data science applications. One problem of particular interest is fitting a low-rank Canonical Polyadic (CP) tensor decomposition model when the tensor has sparse structure and the tensor elements are nonnegative count data. SparTen is a high-performance C++ library which computes a low-rank decomposition using different solvers: a first-order quasi-Newton or a second-order damped Newton method, along with the appropriate choice of runtime parameters. Since default parameters in SparTen are tuned to experimental results in prior published work on a single real-world dataset conducted using MATLAB implementations of these methods, it remains unclear if the parameter defaults in SparTen are appropriate for general tensor data. Furthermore, it is unknown how sensitive algorithm convergence is to changes in the input parameter values. This report addresses these unresolved issues with large-scale experimentation on three benchmark tensor data sets. Experiments were conducted on several different CPU architectures and replicated with many initial states to establish generalized profiles of algorithm convergence behavior.

More Details

CoREC: Scalable and Resilient In-memory Data Staging for In-situWorkflows

ACM Transactions on Parallel Computing

Duan, Shaohua; Subedi, Pradeep; Davis, Philip; Teranishi, Keita; Kolla, Hemanth; Gamell, Marc; Parashar, Manish

The dramatic increase in the scale of current and planned high-end HPC systems is leading new challenges, such as the growing costs of data movement and IO, and the reduced mean time between failures (MTBF) of system components. In-situ workflows, i.e., executing the entire application workflows on the HPC system, have emerged as an attractive approach to address data-related challenges by moving computations closer to the data, and staging-based frameworks have been effectively used to support in-situ workflows at scale. However, the resilience of these staging-based solutions has not been addressed, and they remain susceptible to expensive data failures. Furthermore, naive use of data resilience techniques such as n-way replication and erasure codes can impact latency and/or result in significant storage overheads. In this article, we present CoREC, a scalable and resilient in-memory data staging runtime for large-scale in-situ workflows. CoREC uses a novel hybrid approach that combines dynamic replication with erasure coding based on data access patterns. It also leverages multiple levels of replications and erasure coding to support diverse data resiliency requirements. Furthermore, the article presents optimizations for load balancing and conflict-avoiding encoding, and a low overhead, lazy data recovery scheme. We have implemented the CoREC runtime and have deployed with the DataSpaces staging service on leadership class computing machines and present an experimental evaluation in the article. The experiments demonstrate that CoREC can tolerate in-memory data failures while maintaining low latency and sustaining high overall storage efficiency at large scales.

More Details

Report on local data recovery approaches suitable for weather and climate prediction (Deliverable 1.3) (V.1.0)

Benacchio, Tommaso; Bonaventura, Luca; Altenbernd, Mirco; Cantwell, Chris D.; Duben, Peter D.; Gillard, Mike; Giraud, Luc; Goddeke, Dominik; Raffin, Erwan; Teranishi, Keita; Wedi, Nils

Numerical weather and climate prediction rates as one of the scientific applications whose accuracy improvements greatly depend on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increasing average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys approaches for fault-tolerance in numerical algorithms and system resilience in parallel simulations from the perspective of numerical weather and climate prediction systems. A selection of existing strategies is analyzed, featuring interpolation-restart and compressed checkpointing for the numerics, in-memory checkpointing, user-level failure mitigation-based and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analyzed and some recommendations outlined for future developments.

More Details

Software Resilience using Kokkos Ecosystem

Miles, Jeffery S.; Morales, Nicolas; Teranishi, Keita; Trott, Christian R.

Due to the cost of hardware failures within mission critical and scientific applications, it is necessary for software to provide a mechanism to prevent or recover from interruptions. The Kokkos ecosystem is a programming environment that provides performance and portability to many applications that run on DOE supercomputers as well as smaller scale systems. These applications require a higher level of service due to the cost associated with each simulation or the critical nature of the mission. Software resilience enables an application of manage hardware failures reducing the cost of an interruption. Two different resilience methodologies have been added to the Kokkos ecosystem: checkpointing has been added for restart capabilities and a resilient execution model has been added to account for failures in compute devices. The design and implementation of each of these additions are described, and appropriate examples are included for end users.

More Details

Enabling Resilience in Asynchronous Many-Task Programming Models

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Paul, Sri R.; Hayashi, Akihiro; Slattengren, Nicole L.; Kolla, Hemanth; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita; Mayo, Jackson R.; Sarkar, Vivek

Resilience is an imminent issue for next-generation platforms due to projected increases in soft/transient failures as part of the inherent trade-offs among performance, energy, and costs in system design. In this paper, we introduce a comprehensive approach to enabling application-level resilience in Asynchronous Many-Task (AMT) programming models with a focus on remedying Silent Data Corruption (SDC) that can often go undetected by the hardware and OS. Our approach makes it possible for the application programmer to declaratively express resilience attributes with minimal code changes, and to delegate the complexity of efficiently supporting resilience to our runtime system. We have created a prototype implementation of our approach as an extension to the Habanero C/C++ library (HClib), where different resilience techniques including task replay, task replication, algorithm-based fault tolerance (ABFT), and checkpointing are available. Our experimental results show that task replay incurs lower overhead than task replication when an appropriate error checking function is provided. Further, task replay matches the low overhead of ABFT. Our results also demonstrate the ability to combine different resilience schemes. To evaluate the effectiveness of our resilience mechanisms in the presence of errors, we injected synthetic errors at different error rates (1.0%, and 10.0%) and found modest increase in execution times. In summary, the results show that our approach supports efficient and scalable recovery, and that our approach can be used to influence the design of future AMT programming models and runtime systems that aim to integrate first-class support for user-level resilience.

More Details

FY18 ASC P&EM L2 Milestone 6362: Local Failure Local Recovery (LFLR) Resiliency for Asynchronous Many Task (AMT) Programming and Execution Models: Executive Summary

Teranishi, Keita; Clay, Robert L.

The overall goal of this work was to perform an in-depth analysis of resilience schemes adapted to the Asynchronous Many-Task (AMT) programming and execution model with the goal of informing the Sandia Advanced Simulation and Computing (ASC) program's application development strategy for next generation platforms (NGPs).

More Details

FY18 ASC P&EM L2 Milestone 6362: Local Failure Local Recovery (LFLR) Resiliency for Asynchronous Many Task (AMT) Programming and Execution Models: Executive Summary

Teranishi, Keita; Clay, Robert L.

The overall goal of this work was to perform an in-depth analysis of resilience schemes adapted to the Asynchronous Many-Task (AMT) programming and execution model with the goal of informing the Sandia Advanced Simulation and Computing (ASC) program's application development strategy for next generation platforms (NGPs).

More Details

FY18 ASC CSSE L2 Milestone 6362: Local Failure Local Recovery (LFLR) Resiliency for Asynchronous Many Task (AMT) Programming and Execution Models

Teranishi, Keita; Clay, Robert L.

The overall goal of this work was to perform an in-depth analysis of resilience schemes adapted to the Asynchronous Many-Task (AMT) programming and execution model with the goal of informing the Sandia Advanced Simulation and Computing (ASC) program's application development strategy for next generation platforms (NGPs).

More Details

ASC CSSE Level 2 Milestone #6362: Resilient Asynchronous Many Task Programming Model

Teranishi, Keita; Kolla, Hemanth; Slattengren, Nicole L.; Whitlock, Matthew J.; Mayo, Jackson R.; Clay, Robert L.; Paul, Sri R.; Hayashi, Akihiro; Sarkar, Vivek

This report is an outcome of the ASC CSSE Level 2 Milestone 6362: Analysis of Re- silient Asynchronous Many-Task (AMT) Programming Model. It comprises a summary and in-depth analysis of resilience schemes adapted to the AMT programming model. Herein, performance trade-offs of a resilient-AMT prograrnming model are assessed through two ap- proaches: (1) an analytical model realized by discrete event simulations and (2) empirical evaluation of benchmark programs representing regular and irregular workloads of explicit partial differential equation solvers. As part of this effort, an AMT execution simulator and a prototype resilient-AMT programming framework have been developed. The former permits us to hypothesize the performance behavior of a resilient-AMT model, and has undergone a verification and validation (V&V) process. The latter allows empirical evaluation of the perfor- mance of resilience schemes under emulated program failures and enabled the aforementioned V&V process. The outcome indicates that (1) resilience techniques implemented within an AMT framework allow efficient and scalable recovery under frequent failures, that (2) the abstraction of task and data instances in the AMT programming model enables readily us- able Application Program Interfaces (APIs) for resilience, and that (3) this abstraction enables predicting the performance of resilient-AMT applications with a simple simulation infrastruc- ture. This outcome will provide guidance for the design of the AMT programming model and runtime systems, user-level resilience support, and application development for ASC's next generation platforms (NGPs).

More Details

RedThreads: An Interface for Application-Level Fault Detection/Correction Through Adaptive Redundant Multithreading

International Journal of Parallel Programming

Hukerikar, Saurabh; Teranishi, Keita; Diniz, Pedro C.; Lucas, Robert F.

In the presence of accelerated fault rates, which are projected to be the norm on future exascale systems, it will become increasingly difficult for high-performance computing (HPC) applications to accomplish useful computation. Due to the fault-oblivious nature of current HPC programming paradigms and execution environments, HPC applications are insufficiently equipped to deal with errors. We believe that HPC applications should be enabled with capabilities to actively search for and correct errors in their computations. The redundant multithreading (RMT) approach offers lightweight replicated execution streams of program instructions within the context of a single application process. However, the use of complete redundancy incurs significant overhead to the application performance. In this paper we present RedThreads, an interface that provides application-level fault detection and correction based on RMT, but applies the thread-level redundancy adaptively. We describe the RedThreads syntax and semantics, and the supporting compiler infrastructure and runtime system. Our approach enables application programmers to scope the extent of redundant computation. Additionally, the runtime system permits the use of RMT to be dynamically enabled, or disabled, based on the resiliency needs of the application and the state of the system. Our experimental results demonstrate how adaptive RMT exploits programmer insight and runtime inference to dynamically navigate the trade-off space between an application’s resilience coverage and the associated performance overhead of redundant computation.

More Details

Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping

SIAM Journal on Scientific Computing

Gamell, Marc; Teranishi, Keita; Kolla, Hemanth; Mayo, Jackson R.; Heroux, Michael A.; Chen, Jacqueline H.; Parashar, Manish

In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still required to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduction of impact on total time to solution due to multiple failures. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank remapping to increase the probability of failure masking. To conclude, this paper shows the integration of all aforementioned mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every five seconds) with low overhead while sustaining performance at large scales. In addition, this demonstration also displays the failure masking probability increase resulting from the combination of both ghost region expansion and cell-to-rank remapping.

More Details

Modeling and simulating multiple failure masking enabled by local recovery for stencil-based applications at extreme scales

IEEE Transactions on Parallel and Distributed Systems

Gamell, Marc; Teranishi, Keita; Mayo, Jackson R.; Kolla, Hemanth; Heroux, Michael A.; Chen, Jacqueline H.; Parashar, Manish

Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online recovery is performed in a local manner further scalability is enabled, not only due to the intrinsic lower costs of recovering locally, but also due to derived effects when using some application types. In this paper we model one such effect, namely multiple failure masking, that manifests when running Stencil parallel computations on an environment when failures are recovered locally. First, the delay propagation shape of one or multiple failures recovered locally is modeled to enable several analyses of the probability of different levels of failure masking under certain Stencil application behaviors. Our results indicate that failure masking is an extremely desirable effect at scale which manifestation is more evident and beneficial as the machine size or the failure rate increase.

More Details

Specification of Fenix MPI Fault Tolerance library version 1.0.1

Gamble, Marc; Van Der Wijngaart, Rob; Teranishi, Keita; Parashar, Manish

This document provides a specification of Fenix, a software library compatible with the Message Passing Interface (MPI) to support fault recovery without application shutdown. The library consists of two modules. The first, termed process recovery , restores an application to a consistent state after it has suffered a loss of one or more MPI processes (ranks). The second specifies functions the user can invoke to store application data in Fenix managed redundant storage, and to retrieve it from that storage after process recovery.

More Details

Evaluating Online Global Recovery with Fenix Using Application-Aware In-Memory Checkpointing Techniques

Proceedings of the International Conference on Parallel Processing Workshops

Gamell, Marc; Katz, Daniel S.; Teranishi, Keita; Heroux, Michael A.; Van Der Wijngaart, Rob F.; Mattson, Timothy G.; Parashar, Manish

Exascale systems promise the potential for computation atunprecedented scales and resolutions, but achieving exascale by theend of this decade presents significant challenges. A key challenge isdue to the very large number of cores and components and the resultingmean time between failures (MTBF) in the order of hours orminutes. Since the typical run times of target scientific applicationsare longer than this MTBF, fault tolerance techniques will beessential. An important class of failures that must be addressed isprocess or node failures. While checkpoint/restart (C/R) is currentlythe most widely accepted technique for addressing processor failures, coordinated, stable-storage-based global C/R might be unfeasible atexascale when the time to checkpoint exceeds the expected MTBF. This paper explores transparent recovery via implicitly coordinated, diskless, application-driven checkpointing as a way to tolerateprocess failures in MPI applications at exascale. The discussedapproach leverages User Level Failure Mitigation (ULFM), which isbeing proposed as an MPI extension to allow applications to createpolicies for tolerating process failures. Specifically, this paper demonstrates how different implementations ofapplication-driven in-memory checkpoint storage and recovery comparein terms of performance and scalability. We also experimentally evaluate the effectiveness and scalability ofthe Fenix online global recovery framework on a production system-the Titan Cray XK7 at ORNL-and demonstrate the ability of Fenix totolerate dynamically injected failures using the execution of fourbenchmarks and mini-applications with different behaviors.

More Details

Exploring versioned distributed arrays for resilience in scientific applications: Global view resilience

International Journal of High Performance Computing Applications

Chien, Andrew A.; Balaji, Pavan; Dun, Nan; Fang, Aiman; Fujita, Hajime; Iskra, Kamil; Rubenstein, Zachary; Zheng, Ziming; Hammond, Jeff; Laguna, Ignacio; Richards, David F.; Dubey, Anshu; Van Straalen, Brian; Hoemmen, Mark F.; Heroux, Michael A.; Teranishi, Keita; Siegel, Andrew R.

Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and when it is most productive, and customize for each application structure independently. This control is portable, and its embedding in application source makes it natural to express and easy to maintain. The ability to name multiple versions and “partially materialize” them efficiently makes ambitious forward-recovery based on “data slices” across versions or data structures both easy to express and efficient. Using several large applications (OpenMC, preconditioned conjugate gradient (PCG) solver, ddcMD, and Chombo), we evaluate the programming effort to add resilience. The required changes are small (< 2% lines of code (LOC)), localized and machine-independent, and perhaps most important, require no software architecture changes. We also measure the overhead of adding GVR versioning and show that overheads < 2% are generally achieved. This overhead suggests that GVR can be implemented in large-scale codes and support portable error recovery with modest investment and runtime impact. Our results are drawn from both IBM BG/Q and Cray XC30 experiments, demonstrating portability. We also present two case studies of flexible error recovery, illustrating how GVR can be used for multi-version rollback recovery, and several different forward-recovery schemes. GVR’s multi-version enables applications to survive latent errors (silent data corruption) with significant detection latency, and forward recovery can make that recovery extremely efficient. Lastly, our results suggest that GVR is scalable, portable, and efficient. GVR interfaces are flexible, supporting a variety of recovery schemes, and altogether GVR embodies a gentle-slope path to tolerate growing error rates in future extreme-scale systems.

More Details

Specification of Fenix MPI Fault Tolerance library (V. 1.0)

Gamble, Marc; Van Der Wijngaart, Rob; Teranishi, Keita; Parashar, Manish

This document provides a specification of Fenix, a software library compatible with the Message Passing Interface (MPI) to support fault recovery without application shutdown. The library consists of two modules. The first, termed process recovery, restores an application to a consistent state after it has suffered a loss of one or more MPI processes (ranks). The second specifies functions the user can invoke to store application data in Fenix managed redundant storage, and to retrieve it from that storage after process recovery.

More Details

Specification of Fenix MPI Fault Tolerance library (V.0.9)

Gammel, Marc; Van Der Wijngaart, Rob F.; Teranishi, Keita; Parashar, Manish

Fenix is a software library compatible with the Message Passing Interface (MPI) to support fault recovery without application shutdown. This specification is derived from a current implementation of Fenix that employs the User Level Fault Mitigation (ULFM) MPI fault tolerance module proposal. We only present the C library interface for Fenix; the Fortran interface will be added once the C version is complete.

More Details

DARMA 0.3.0-alpha Specification

Wilke, Jeremiah; Hollman, David S.; Slattengren, Nicole L.; Lifflander, Jonathan; Kolla, Hemanth; Rizzi, Francesco; Teranishi, Keita; Bennett, Janine C.

In this document, we provide the specifications for DARMA (Distributed Asynchronous Resilient Models and Applications), a co-design research vehicle for asynchronous many-task (AMT) programming models that serves to: 1) insulate applications from runtime system and hardware idiosyncrasies, 2) improve AMT runtime programmability by co-designing an application programmer interface (API) directly with application developers, 3) synthesize application co-design activities into meaningful requirements for runtime systems, and 4) facilitate AMT design space characterization and definition, accelerating the development of AMT best practices.

More Details

Local recovery and failure masking for stencil-based applications at extreme scales

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Gamell, Marc; Teranishi, Keita; Heroux, Michael A.; Mayo, Jackson R.; Kolla, Hemanth; Chen, Jacqueline H.; Parashar, Manish

Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray-XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262144 cores.

More Details

Exploring Asynchronous Many-Task Runtime Systems toward Extreme Scales

Knight, Samuel; Baker, Gavin M.; Gamell, Marc; Hollman, David; Sjaardema, Gregor; Kolla, Hemanth; Teranishi, Keita; Wilke, Jeremiah; Slattengren, Nicole; Bennett, Janine C.

Major exascale computing reports indicate a number of software challenges to meet the dramatic change of system architectures in near future. While several-orders-of-magnitude increase in parallelism is the most commonly cited of those, hurdles also include performance heterogeneity of compute nodes across the system, increased imbalance between computational capacity and I/O capabilities, frequent system interrupts, and complex hardware architectures. Asynchronous task-parallel programming models show a great promise in addressing these issues, but are not yet fully understood nor developed su ciently for computational science and engineering application codes. We address these knowledge gaps through quantitative and qualitative exploration of leading candidate solutions in the context of engineering applications at Sandia. In this poster, we evaluate MiniAero code ported to three leading candidate programming models (Charm++, Legion and UINTAH) to examine the feasibility of these models that permits insertion of new programming model elements into an existing code base.

More Details

ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms

Baker, Gavin M.; Bettencourt, Matthew T.; Bova, Steven W.; Franko, Ken; Gamell, Marc; Grant, Ryan; Hammond, Simon; Hollman, David S.; Knight, Samuel; Kolla, Hemanth; Lin, Paul T.; Olivier, Stephen L.; Sjaardema, Gregory D.; Slattengren, Nicole L.; Teranishi, Keita; Wilke, Jeremiah; Bennett, Janine C.; Clay, Robert L.; Kale, Laxkimant; Jain, Nikhil; Mikida, Eric; Aiken, Alex; Bauer, Michael; Lee, Wonchan; Slaughter, Elliott; Treichler, Sean; Berzins, Martin; Harman, Todd; Humphreys, Alan; Schmidt, John; Sunderland, Dan; Mccormick, Pat; Gutierrez, Samuel; Shulz, Martin; Gamblin, Todd; Bremer, Peer-Timo

Abstract not provided.

ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms

Baker, Gavin M.; Bettencourt, Matthew T.; Bova, Steven W.; Franko, Ken; Gamell, Marc; Grant, Ryan; Hammond, Simon; Hollman, David S.; Knight, Samuel; Kolla, Hemanth; Lin, Paul T.; Olivier, Stephen L.; Sjaardema, Gregory D.; Slattengren, Nicole L.; Teranishi, Keita; Wilke, Jeremiah; Bennett, Janine C.; Clay, Robert L.; Kale, Laxkimant; Jain, Nikhil; Mikida, Eric; Aiken, Alex; Bauer, Michael; Lee, Wonchan; Slaughter, Elliott; Treichler, Sean; Berzins, Martin; Harman, Todd; Humphreys, Alan; Schmidt, John; Sunderland, Dan; Mccormick, Pat; Gutierrez, Samuel; Shulz, Martin; Gamblin, Todd; Bremer, Peer T.

This report provides in-depth information and analysis to help create a technical road map for developing next-generation programming models and runtime systems that support Advanced Simulation and Computing (ASC) work- load requirements. The focus herein is on asynchronous many-task (AMT) model and runtime systems, which are of great interest in the context of "Oriascale7 computing, as they hold the promise to address key issues associated with future extreme-scale computer architectures. This report includes a thorough qualitative and quantitative examination of three best-of-class AIM] runtime systems – Charm-++, Legion, and Uintah, all of which are in use as part of the Centers. The studies focus on each of the runtimes' programmability, performance, and mutability. Through the experiments and analysis presented, several overarching Predictive Science Academic Alliance Program II (PSAAP-II) Asc findings emerge. From a performance perspective, AIV runtimes show tremendous potential for addressing extreme- scale challenges. Empirical studies show an AM runtime can mitigate performance heterogeneity inherent to the machine itself and that Message Passing Interface (MP1) and AM11runtimes perform comparably under balanced conditions. From a programmability and mutability perspective however, none of the runtimes in this study are currently ready for use in developing production-ready Sandia ASC applications. The report concludes by recommending a co- design path forward, wherein application, programming model, and runtime system developers work together to define requirements and solutions. Such a requirements-driven co-design approach benefits the community as a whole, with widespread community engagement mitigating risk for both application developers developers. and high-performance computing runtime systein

More Details

Exploring failure recovery for stencil-based applications at extreme scales

HPDC 2015 - Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing

Gamell Balmana, Marc; Teranishi, Keita; Heroux, Michael A.; Mayo, Jackson R.; Kolla, Hemanth; Chen, Jacqueline H.; Parashar, Manish

Application resilience is a key challenge that must be ad-dressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically re-duce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. In this paper we suggest going one step further, and explore how local recovery can be used for certain classes of applications to reduce the over-heads due to failures. Specifically we study the feasibility of local recovery for stencil-based parallel applications and we show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution.

More Details

Fault tolerance in an inner-outer solver: A GVR-enabled case study

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Zheng, Ziming; Chien, Andrew A.; Teranishi, Keita

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates.We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.

More Details

Fault tolerance in an inner-outer solver: A GVR-enabled case study

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Zheng, Ziming; Chien, Andrew A.; Teranishi, Keita

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates.We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Experimental results validate correct execution with low performance overhead under varied error conditions.

More Details

Extreme-scale viability of collective communication for resilient task scheduling and work stealing

Proceedings of the International Conference on Dependable Systems and Networks

Wilke, Jeremiah; Bennett, Janine C.; Kolla, Hemanth; Teranishi, Keita; Slattengren, Nicole L.; Floren, John F.

Extreme-scale computing will bring significant changes to high performance computing system architectures. In particular, the increased number of system components is creating a need for software to demonstrate 'pervasive parallelism' and resiliency. Asynchronous, many-task programming models show promise in addressing both the scalability and resiliency challenges, however, they introduce an enormously challenging distributed, resilient consistency problem. In this work, we explore the viability of resilient collective communication in task scheduling and work stealing and, through simulation with SST/macro, the performance of these collectives on speculative extreme-scale architectures.

More Details

Toward local failure local recovery resilience model using MPI-ULFM

ACM International Conference Proceeding Series

Teranishi, Keita; Heroux, Michael A.

The current system reaction to the loss of a single MPI process is to kill all the remaining processes and restart the application from the most recent checkpoint. This approach will become unfeasible for future extreme scale systems. We address this issue using an emerging resilient computing model called Local Failure Local Recovery (LFLR) that provides application developers with the ability to recover locally and continue application execution when a process is lost. We discuss the design of our software framework to enable the LFLR model using MPI-ULFM and demonstrate the resilient version of MiniFE that achieves a scalable recovery from process failures.

More Details

Report for the ASC CSSE L2 Milestone (4873) - Demonstration of Local Failure Local Recovery Resilient Programming Model

Heroux, Michael A.; Teranishi, Keita

Recovery from process loss during the execution of a distributed memory parallel application is presently achieved by restarting the program, typically from a checkpoint file. Future computer system trends indicate that the size of data to checkpoint, the lack of improvement in parallel file system performance and the increase in process failure rates will lead to situations where checkpoint restart becomes infeasible. In this report we describe and prototype the use of a new application level resilient computing model that manages persistent storage of local state for each process such that, if a process fails, recovery can be performed locally without requiring access to a global checkpoint file. LFLR provides application developers with an ability to recover locally and continue application execution when a process is lost. This report discusses what features are required from the hardware, OS and runtime layers, and what approaches application developers might use in the design of future codes, including a demonstration of LFLR-enabled MiniFE code from the Matenvo mini-application suite.

More Details
146 Results
146 Results