Publications

Results 1–25 of 123
Skip to search filters

Design and Performance of Kokkos Staging Space toward Scalable Resilient Application Couplings

Zhang, Bo Z.; Davis, Philip E.; Subedi, Pradeep S.; Manish, Parashar M.; Rizzi, Francesco R.; Teranishi, Keita T.

With the growing number of applications designed for heterogeneous HPC devices, application programmers and users are finding it challenging to compose scalable workflows as ensembles of these applications, that are portable, performant and resilient. The Kokkos C++ library has been designed to simplify this cumbersome procedure by providing an intra-application uniform programming model and portable performance. However, assembling multiple Kokkos-enabled applications into a complex workflow is still a challenge. Although Kokkos enables a uniform programming model, the inter-application data exchange still remains a challenge from both performance and software development cost perspectives. In order to address this issue, we propose Kokkos data staging memory space, an extension of Kokkos' data abstraction (memory space) for heterogeneous computing systems. This new abstraction allows to express data on a virtual shared-space for multiple Kokkos applications, thus extending Kokkos to support inter-application data exchange to build an efficient application workflow. Additionally, we study the effectiveness of asynchronous data layout conversions for applications requiring different memory access patterns for the shared data. Our preliminary evaluation with a synthetic benchmark indicate the effectiveness of this conversion adapted to three different scenarios representing access frequency and use patterns of the shared data.

More Details

Resilience and fault tolerance in high-performance computing for numerical weather and climate prediction

International Journal of High Performance Computing Applications

Benacchio, Tommaso; Bonaventura, Luca; Altenbernd, Mirco; Cantwell, Chris D.; Düben, Peter D.; Gillard, Mike; Giraud, Luc; Göddeke, Dominik; Raffin, Erwan; Teranishi, Keita T.; Wedi, Nils

Progress in numerical weather and climate prediction accuracy greatly depends on the growth of the available computing power. As the number of cores in top computing facilities pushes into the millions, increased average frequency of hardware and software failures forces users to review their algorithms and systems in order to protect simulations from breakdown. This report surveys hardware, application-level and algorithm-level resilience approaches of particular relevance to time-critical numerical weather and climate prediction systems. A selection of applicable existing strategies is analysed, featuring interpolation-restart and compressed checkpointing for the numerical schemes, in-memory checkpointing, user-level failure mitigation and backup-based methods for the systems. Numerical examples showcase the performance of the techniques in addressing faults, with particular emphasis on iterative solvers for linear systems, a staple of atmospheric fluid flow solvers. The potential impact of these strategies is discussed in relation to current development of numerical weather prediction algorithms and systems towards the exascale. Trade-offs between performance, efficiency and effectiveness of resiliency strategies are analysed and some recommendations outlined for future developments.

More Details

Toward Resilient Heterogeneous Computing Workflow through Kokkos-DataSpaces Integration

Zhang, Bo Z.; Davis, Philip E.; PArshar, Manish P.; Morales, Nicolas M.; Teranishi, Keita T.

With the growing number of applications designed for heterogeneous HPC devices, application programmers and users are finding it challenging to compose scalable workflows as ensembles of these applications, that are portable, performant and resilient. The Kokkos C++ library has been designed to simplify this cumbersome procedure by providing an intra-application uniform programming model and portable performance. However, assembling multiple Kokkos-enabled applications into a complex workflow is still a challenge. Although Kokkos enables a uniform programming model, the inter-application data exchange still remains a challenge from both performance and software development cost perspectives. In order to address this issue, we propose a Kokkos-DataSpaces Integration, with the goal of providing a virtual shared-space abstraction that can be accessed concurrently by all applications in an Kokkos workflow, thus extending Kokkos to support inter-application data exchange.

More Details

Integrating Inter-Node Communication with a Resilient Asynchronous Many-Task Runtime System

Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Paul, Sri R.; Hayashi, Akihiro; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita T.; Mayo, Jackson M.; Grossman, Max; Sarkar, Vivek

Achieving fault tolerance is one of the significant challenges of exascale computing due to projected increases in soft/transient failures. While past work on software-based resilience techniques typically focused on traditional bulk-synchronous parallel programming models, we believe that Asynchronous Many-Task (AMT) programming models are better suited to enabling resiliency since they provide explicit abstractions of data and tasks which contribute to increased asynchrony and latency tolerance. In this paper, we extend our past work on enabling application-level resilience in single node AMT programs by integrating the capability to perform asynchronous MPI communication, thereby enabling resiliency across multiple nodes. We also enable resilience against fail-stop errors where our runtime will manage all re-execution of tasks and communication without user intervention. Our results show that we are able to add communication operations to resilient programs with low overhead, by offloading communication to dedicated communication workers and also recover from fail-stop errors transparently, thereby enhancing productivity.

More Details

Improving Scalability of Silent-Error Resilience for Message-Passing Solvers via Local Recovery and Asynchrony

Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Kolla, Hemanth K.; Mayo, Jackson M.; Teranishi, Keita T.; Armstrong, Robert C.

Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial-differential-equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.

More Details

SparTen: Leveraging Kokkos for On-node Parallelism in a Second-Order Method for Fitting Canonical Polyadic Tensor Models to Poisson Data

2020 IEEE High Performance Extreme Computing Conference, HPEC 2020

Teranishi, Keita T.; Dunlavy, Daniel D.; Myers, Jeremy M.; Barrett, Richard F.

Canonical Polyadic tensor decomposition using alternate Poisson regression (CP-APR) is an effective analysis tool for large sparse count datasets. One of the variants using projected damped Newton optimization for row subproblems (PDNR) offers quadratic convergence and is amenable to parallelization. Despite its potential effectiveness, PDNR performance on modern high performance computing (HPC) systems is not well understood. To remedy this, we have developed a parallel implementation of PDNR using Kokkos, a performance portable parallel programming framework supporting efficient runtime of a single code base on multiple HPC systems. We demonstrate that the performance of parallel PDNR can be poor if load imbalance associated with the irregular distribution of nonzero entries in the tensor data is not addressed. Preliminary results using tensors from the FROSTT data set indicate that using multiple kernels to address this imbalance when solving the PDNR row subproblems in parallel can improve performance, with up to 80% speedup on CPUs and 10-fold speedup on NVIDIA GPUs.

More Details

Parameter Sensitivity Analysis of the SparTen High Performance Sparse Tensor Decomposition Software

2020 IEEE High Performance Extreme Computing Conference, HPEC 2020

Myers, Jeremy M.; Dunlavy, Daniel D.; Teranishi, Keita T.; Hollman, David S.

Tensor decomposition models play an increasingly important role in modern data science applications. One problem of particular interest is fitting a low-rank Canonical Polyadic (CP) tensor decomposition model when the tensor has sparse structure and the tensor elements are nonnegative count data. SparTen is a high-performance C++ library which computes a low-rank decomposition using different solvers: a first-order quasi-Newton or a second-order damped Newton method, along with the appropriate choice of runtime parameters. Since default parameters in SparTen are tuned to experimental results in prior published work on a single real-world dataset conducted using MATLAB implementations of these methods, it remains unclear if the parameter defaults in SparTen are appropriate for general tensor data. Furthermore, it is unknown how sensitive algorithm convergence is to changes in the input parameter values. This report addresses these unresolved issues with large-scale experimentation on three benchmark tensor data sets. Experiments were conducted on several different CPU architectures and replicated with many initial states to establish generalized profiles of algorithm convergence behavior.

More Details
Results 1–25 of 123
Results 1–25 of 123