Integrating recent advances in resilient algorithms and techniques into existing codes is a singular challenge in fault tolerance, in part because of the underlying complexity of implementing resilience in the first place, but also because of the difficulty of integrating a standalone new strategy with the preexisting resilience layers of an application. We propose that the answer is not to build integrated solutions for users, but to build runtimes designed to integrate into a larger comprehensive resilience system and thereby enable the necessary jump to multi-layered recovery. Our work designs, implements, and verifies one such comprehensive system of runtimes. Using Fenix, a process resilience tool that treats integration into preexisting resilience systems as a design priority, we update Kokkos Resilience and the use pattern of VeloC to support application-level integration of resilience runtimes. Our work shows that designing integrable systems rather than integrated systems allows for user-designed optimization and upgrading of resilience techniques while maintaining the simplicity and performance of all-in-one resilience solutions. More application-specific choice in resilience strategies allows for better long-term flexibility, performance, and, importantly, simplicity.

TYPE Conference Paper YEAR 2022

Assembling Portable In-Situ Workflow from Heterogeneous Components using Data Reorganization

Proceedings - 22nd IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2022

Zhang, Bo; Subedi, Pradeep; Davis, Philip E.; Rizzi, Francesco; Teranishi, Keita; Parashar, Manish

Heterogeneous computing is becoming common in the HPC world. The fast-changing hardware landscape is pushing programmers and developers to rely on performance-portable programming models to rewrite legacy applications and develop new ones. While this approach is suitable for individual applications, challenges remain when multiple applications are combined into complex workflows. One critical difficulty is the exchange of data between communicating applications, where performance constraints imposed by heterogeneous hardware favor different data layouts. We attempt to solve this problem by exploring asynchronous data layout conversions for applications requiring different memory access patterns for shared data. We implement the proposed solution within the DataSpaces data staging service, extending it to support heterogeneous application workflows across a broad spectrum of programming models. In addition, we integrate heterogeneous DataSpaces with the Kokkos programming model and propose the Kokkos Staging Space as an extension of the Kokkos data abstraction. This new abstraction enables us to express data on a virtual shared space for multiple Kokkos applications, thus guaranteeing the portability of each application when assembling them into an efficient heterogeneous workflow. We present performance results for the Kokkos Staging Space using a synthetic workflow emulator and three different scenarios representing access frequency and use patterns in shared data. The results show that the Kokkos Staging Space is a superior solution in terms of time-to-solution and scalability compared to existing file-based Kokkos data abstractions for inter-application data exchange.

TYPE Conference Presentation YEAR 2022

Automated Test Generation for Performance Portable Programs Using Clang/LLVM and Formal Methods

Teranishi, Keita; Mukherjee, Shyamali; Pollard, Samuel D.; Evans, Noah; Orso, Alessandro; Sarkar, Vivek

Abstract not provided.

TYPE Conference Presentation YEAR 2021

Asynchrony and Failure Masking via Pseudo-Local Process Recovery in MPI Stencil Applications

Kolla, Hemanth; Mayo, Jackson R.; Whitlock, Matthew J.; Teranishi, Keita; Armstrong, Robert C.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

Performance-Portable Sparse Tensor Decomposition Kernels on Emerging Parallel Architectures

Geronimo Anderson, Sean I.; Teranishi, Keita; Dunlavy, Daniel M.; Choi, Jee

Abstract not provided.

TYPE Conference Presentation YEAR 2021

Design and Performance of Kokkos Staging Space toward Scalable Resilient Application Couplings

Zhang, Bo; Davis, Philip E.; Subedi, Pradeep; Parashar, Manish; Rizzi, Francesco; Foulk, James W.; Teranishi, Keita

With the growing number of applications designed for heterogeneous HPC devices, application programmers and users are finding it challenging to compose these applications into scalable workflows that are portable, performant, and resilient. The Kokkos C++ library has been designed to simplify this cumbersome procedure by providing a uniform intra-application programming model and portable performance. However, assembling multiple Kokkos-enabled applications into a complex workflow is still a challenge: although Kokkos enables a uniform programming model, inter-application data exchange remains difficult from both performance and software development cost perspectives. To address this issue, we propose the Kokkos data staging memory space, an extension of Kokkos' data abstraction (memory space) for heterogeneous computing systems. This new abstraction allows data to be expressed on a virtual shared space for multiple Kokkos applications, thus extending Kokkos to support inter-application data exchange and build an efficient application workflow. Additionally, we study the effectiveness of asynchronous data layout conversions for applications requiring different memory access patterns for the shared data. Our preliminary evaluation with a synthetic benchmark indicates the effectiveness of this conversion in three different scenarios representing access frequency and use patterns of the shared data.

TYPE SAND Report YEAR 2021

Performance-Portable Sparse Tensor Decomposition Kernels on Emerging Parallel Architectures

Geronimo Anderson, Sean I.; Teranishi, Keita; Dunlavy, Daniel M.; Choi, Jee

Abstract not provided.

TYPE Conference Paper YEAR 2021

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

International Journal of High Performance Computing Applications

Benacchio, Tommaso; Bonaventura, Luca; Altenbernd, Mirco; Cantwell, Chris D.; Duben, Peter D.; Gillard, Mike; Giraud, Luc; Goddeke, Dominik; Raffin, Erwan; Teranishi, Keita; Wedi, Nils

Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.

TYPE Journal Article YEAR 2021
