Low-synchronization orthogonalization schemes for s-step and pipelined Krylov solvers in Trilinos
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
The effort to develop larger-scale computing systems introduces a set of related challenges: Large machines are more difficult to synchronize. The sheer quantity of hardware introduces more opportunities for errors. New approaches to hardware, such as low-energy or neuromorphic devices are not directly programmable by traditional methods.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
This report documents the completion of milestone STPRO4-4 Kokkos back-ends research, collaborations, development, optimization, and documentation. The Kokkos team updated its existing backend to support the software stack and hardware of DOE's Sierra, Summit and Astra machines. They also collaborated with ECP PathForward vendors on developing backends for possible exa-scale architectures. Furthermore, the team ramped up its engagement with the ISO/C++ committee to accelerate the adoption of features important for the HPC community into the C++ standard.
This report documents the completion of milestone STPRO4-4 Kokkos back-ends research, collaborations, development, optimization, and documentation. The Kokkos team updated its existing backend to support the software stack and hardware of DOE's Sierra, Summit and Astra machines. They also collaborated with ECP PathForward vendors on developing backends for possible exa-scale architectures. Furthermore, the team ramped up its engagement with the ISO/C++ committee to accelerate the adoption of features important for the HPC community into the C++ standard.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
SIAM Journal on Scientific Computing
In this study, quantifying simulation uncertainties is a critical component of rigorous predictive simulation. A key component of this is forward propagation of uncertainties in simulation input data to output quantities of interest. Typical approaches involve repeated sampling of the simulation over the uncertain input data, and can require numerous samples when accurately propagating uncertainties from large numbers of sources. Often simulation processes from sample to sample are similar and much of the data generated from each sample evaluation could be reused. We explore a new method for implementing sampling methods that simultaneously propagates groups of samples together in an embedded fashion, which we call embedded ensemble propagation. We show how this approach takes advantage of properties of modern computer architectures to improve performance by enabling reuse between samples, reducing memory bandwidth requirements, improving memory access patterns, improving opportunities for fine-grained parallelization, and reducing communication costs. We describe a software technique for implementing embedded ensemble propagation based on the use of C++ templates and describe its integration with various scientific computing libraries within Trilinos. We demonstrate improved performance, portability and scalability for the approach applied to the simulation of partial differential equations on a variety of CPU, GPU, and accelerator architectures, including up to 131,072 cores on a Cray XK7 (Titan).
Abstract not provided.
Abstract not provided.
Abstract not provided.
23rd AIAA Computational Fluid Dynamics Conference, 2017
High performance computing (HPC) is undergoing a dramatic change in computing architectures. Nextgeneration HPC systems are being based primarily on many-core processing units and general purpose graphics processing units (GPUs). A computing node on a next-generation system can be, and in practice is, heterogeneous in nature, involving multiple memory spaces and multiple execution spaces. This presents a challenge for the development of application codes that wish to compute at the extreme scales afforded by these next-generation HPC technologies and systems - the best parallel programming model for one system is not necessarily the best parallel programming model for another. This inevitably raises the following question: how does an application code achieve high performance on disparate computing architectures without having entirely different, or at least significantly different, code paths, one for each architecture? This question has given rise to the term ‘performance portability’, a notion concerned with porting application code performance from architecture to architecture using a single code base. In this paper, we present the work being done at Sandia National Labs to develop a performance portable compressible CFD code that is targeting the ‘leadership’ class supercomputers the National Nuclear Security Administration (NNSA) is acquiring over the course of the next decade.
Abstract not provided.
Abstract not provided.
23rd AIAA Computational Fluid Dynamics Conference, 2017
High performance computing (HPC) is undergoing a dramatic change in computing architectures. Nextgeneration HPC systems are being based primarily on many-core processing units and general purpose graphics processing units (GPUs). A computing node on a next-generation system can be, and in practice is, heterogeneous in nature, involving multiple memory spaces and multiple execution spaces. This presents a challenge for the development of application codes that wish to compute at the extreme scales afforded by these next-generation HPC technologies and systems - the best parallel programming model for one system is not necessarily the best parallel programming model for another. This inevitably raises the following question: how does an application code achieve high performance on disparate computing architectures without having entirely different, or at least significantly different, code paths, one for each architecture? This question has given rise to the term ‘performance portability’, a notion concerned with porting application code performance from architecture to architecture using a single code base. In this paper, we present the work being done at Sandia National Labs to develop a performance portable compressible CFD code that is targeting the ‘leadership’ class supercomputers the National Nuclear Security Administration (NNSA) is acquiring over the course of the next decade.
Abstract not provided.
Abstract not provided.
Abstract not provided.