Derivative computation is a key component of optimization, sensitivity analysis, uncertainty quantification, and the solving of nonlinear problems. Automatic differentiation (AD) is a powerful technique for evaluating such derivatives, and in recent years, has been integrated into programming environments such as Jax, PyTorch, and TensorFlow to support derivative computations needed for training of machine learning models, facilitating wide-spread use of these technologies. The C++ language has become the de facto standard for scientific computing due to numerous factors, yet language complexity has made the wide-spread adoption of AD technologies for C++ difficult, hampering the incorporation of powerful differentiable programming approaches into C++ scientific simulations. This is exacerbated by the increasing emergence of architectures, such as GPUs, with limited memory capabilities and requiring massive thread-level concurrency. C++ AD tools must effectively use these environments to bring novel scientific simulations to next-generation DOE experimental and observational facilities. In this project, we investigated source transformation-based automatic differentiation using LLVM compiler infrastructure to automatically generate portable and efficient gradient computations of Kokkos-based code. We have demonstrated that our proposed strategy is feasible by investigating the usage of a prototype LLVM-based source transformation tool to generate gradients of simple functions made of sequences of simple Kokkos parallel regions. Speedups of up to 500x compared to Sacado were observed on NVIDIA V100 GPU.
This is an investigation on two experimental datasets of laminar hypersonic flows, over a double-cone geometry, acquired in Calspan—University at Buffalo Research Center’s Large Energy National Shock (LENS)-XX expansion tunnel. These datasets have yet to be modeled accurately. A previous paper suggested that this could partly be due to mis-specified inlet conditions. The authors of this paper solved a Bayesian inverse problem to infer the inlet conditions of the LENS-XX test section and found that in one case they lay outside the uncertainty bounds specified in the experimental dataset. However, the inference was performed using approximate surrogate models. Here in this paper, the experimental datasets are revisited and inversions for the tunnel test-section inlet conditions are performed with a Navier–Stokes simulator. The inversion is deterministic and can provide uncertainty bounds on the inlet conditions under a Gaussian assumption. It was found that deterministic inversion yields inlet conditions that do not agree with what was stated in the experiments. An a posteriori method is also presented to check the validity of the Gaussian assumption for the posterior distribution. This paper contributes to ongoing work on the assessment of datasets from challenging experiments conducted in extreme environments, where the experimental apparatus is pushed to the margins of its design and performance envelopes.
This is an investigation on two experimental datasets of laminar hypersonic flows, over a double-cone geometry, acquired in Calspan—University at Buffalo Research Center’s Large Energy National Shock (LENS)-XX expansion tunnel. These datasets have yet to be modeled accurately. A previous paper suggested that this could partly be due to mis-specified inlet conditions. The authors of this paper solved a Bayesian inverse problem to infer the inlet conditions of the LENS-XX test section and found that in one case they lay outside the uncertainty bounds specified in the experimental dataset. However, the inference was performed using approximate surrogate models. In this paper, the experimental datasets are revisited and inversions for the tunnel test-section inlet conditions are performed with a Navier–Stokes simulator. The inversion is deterministic and can provide uncertainty bounds on the inlet conditions under a Gaussian assumption. It was found that deterministic inversion yields inlet conditions that do not agree with what was stated in the experiments. An a posteriori method is also presented to check the validity of the Gaussian assumption for the posterior distribution. This paper contributes to ongoing work on the assessment of datasets from challenging experiments conducted in extreme environments, where the experimental apparatus is pushed to the margins of its design and performance envelopes.
This is an investigation on two experimental datasets of laminar hypersonic flows, over a double-cone geometry, acquired in Calspan—University at Buffalo Research Center’s Large Energy National Shock (LENS)-XX expansion tunnel. These datasets have yet to be modeled accurately. A previous paper suggested that this could partly be due to mis-specified inlet conditions. The authors of this paper solved a Bayesian inverse problem to infer the inlet conditions of the LENS-XX test section and found that in one case they lay outside the uncertainty bounds specified in the experimental dataset. However, the inference was performed using approximate surrogate models. In this paper, the experimental datasets are revisited and inversions for the tunnel test-section inlet conditions are performed with a Navier–Stokes simulator. The inversion is deterministic and can provide uncertainty bounds on the inlet conditions under a Gaussian assumption. It was found that deterministic inversion yields inlet conditions that do not agree with what was stated in the experiments. An a posteriori method is also presented to check the validity of the Gaussian assumption for the posterior distribution. This paper contributes to ongoing work on the assessment of datasets from challenging experiments conducted in extreme environments, where the experimental apparatus is pushed to the margins of its design and performance envelopes.
Automatic differentiation (AD) is a well-known technique for evaluating analytic derivatives of calculations implemented on a computer, with numerous software tools available for incorporating AD technology into complex applications. However, a growing challenge for AD is the efficient differentiation of parallel computations implemented on emerging manycore computing architectures such as multicore CPUs, GPUs, and accelerators as these devices become more pervasive. In this work, we explore forward mode, operator overloading-based differentiation of C++ codes on these architectures using the widely available Sacado AD software package. In particular, we leverage Kokkos, a C++ tool providing APIs for implementing parallel computations that is portable to a wide variety of emerging architectures. We describe the challenges that arise when differentiating code for these architectures using Kokkos, and two approaches for overcoming them that ensure optimal memory access patterns as well as expose additional dimensions of fine-grained parallelism in the derivative calculation. We describe the results of several computational experiments that demonstrate the performance of the approach on a few contemporary CPU and GPU architectures. We then conclude with applications of these techniques to the simulation of discretized systems of partial differential equations.
The objective of this milestone was to finish integrating GenTen tensor software with combustion application Pele using the Ascent in situ analysis software, partnering with the ALPINE and Pele teams. Also, to demonstrate the usage of the tensor analysis as part of a combustion simulation.
The decomposition of higher-order joint cumulant tensors of spatio-temporal data sets is useful in analyzing multi-variate non-Gaussian statistics with a wide variety of applications (e.g. anomaly detection, independent component analysis, dimensionality reduction). Computing the cumulant tensor often requires computing the joint moment tensor of the input data first, which is very expensive using a naïve algorithm. The current state-of-the-art algorithm takes advantage of the symmetric nature of a moment tensor by dividing it into smaller cubic tensor blocks and only computing the blocks with unique values and thus reducing computation. We propose a refactoring of this algorithm by posing its computation as matrix operations, specifically Khatri-Rao products and standard matrix multiplications. An analysis of the computational and cache complexity indicates significant performance savings due to the refactoring. Implementations of our refactored algorithm in Julia show speedups up to 10x over the reference algorithm in single processor experiments. We describe multiple levels of hierarchical parallelism inherent in the refactored algorithm, and present an implementation using an advanced programming model that shows similar speedups in experiments run on a GPU.
In this paper, we develop a method which we call OnlineGCP for computing the Generalized Canonical Polyadic (GCP) tensor decomposition of streaming data. GCP differs from traditional canonical polyadic (CP) tensor decompositions as it allows for arbitrary objective functions which the CP model attempts to minimize. This approach can provide better fits and more interpretable models when the observed tensor data is strongly non-Gaussian. In the streaming case, tensor data is gradually observed over time and the algorithm must incrementally update a GCP factorization with limited access to prior data. In this work, we extend the GCP formalism to the streaming context by deriving a GCP optimization problem to be solved as new tensor data is observed, formulate a tunable history term to balance reconstruction of recently observed data with data observed in the past, develop a scalable solution strategy based on segregated solves using stochastic gradient descent methods, describe a software implementation that provides performance and portability to contemporary CPU and GPU architectures and integrates with Matlab for enhanced usability, and demonstrate the utility and performance of the approach and software on several synthetic and real tensor data sets.
In this work, we show that reduced communication algorithms for distributed stochastic gradient descent improve the time per epoch and strong scaling for the Generalized Canonical Polyadic (GCP) tensor decomposition, but with a cost, achieving convergence becomes more difficult. The implementation, based on MPI, shows that while one-sided algorithms offer a path to asynchronous execution, the performance benefits of optimized allreduce are difficult to best.
In this work, we show that reduced communication algorithms for distributed stochastic gradient descent improve the time per epoch and strong scaling for the Generalized Canonical Polyadic (GCP) tensor decomposition, but with a cost, achieving convergence becomes more difficult. The implementation, based on MPI, shows that while one-sided algorithms offer a path to asynchronous execution, the performance benefits of optimized allreduce are difficult to best.
In a previous work, embedded ensemble propagation was proposed to improve the efficiency of sampling-based uncertainty quantification methods of computational models on emerging computational architectures. It consists of simultaneously evaluating the model for a subset of samples together, instead of evaluating them individually. A first approach introduced to solve parametric linear systems with ensemble propagation is ensemble reduction. In Krylov methods for example, this reduction consists in coupling the samples together using an inner product that sums the sample contributions. Ensemble reduction has the advantages of being able to use optimized implementations of BLAS functions and having a stopping criterion which involves only one scalar. However, the reduction potentially decreases the rate of convergence due to the gathering of the spectra of the samples. In this paper, we investigate a second approach: ensemble propagation without ensemble reduction in the case of GMRES. This second approach solves each sample simultaneously but independently to improve the convergence compared to ensemble reduction. This raises two new issues which are solved in this paper: the fact that optimized implementations of BLAS functions cannot be used anymore and that ensemble divergence, whereby individual samples within an ensemble must follow different code execution paths, can occur. We tackle those issues by implementing a high-performing ensemble GEMV and by using masks. The proposed ensemble GEMV leads to a similar cost per GMRES iteration for both approaches, i.e. with and without reduction. For illustration, we study the performances of the new linear solver in the context of a mesh tying problem. This example demonstrates improved ensemble propagation speed-up without reduction.
As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a gpu back-end for these primitives makes such code non-portable. A unified, portable, Single Instruction Multiple Data (simd) primitive proposed in this work, allows intrinsics-based vectorization on cpus and many-core architectures such as Intel Knights Landing (knl), and also facilitates Single Instruction Multiple Threads (simt) based execution on gpus. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. The new simd primitive is used on different architectures to test the performance boost against hard-to-auto-vectorize baseline, to measure the overhead against efficiently vectroized baseline, and to evaluate the new feature called the “logical vector length” (lvl). The simd primitive provides portability across cpus and gpus without any performance degradation being observed experimentally.
The effort to develop larger-scale computing systems introduces a set of related challenges: Large machines are more difficult to synchronize. The sheer quantity of hardware introduces more opportunities for errors. New approaches to hardware, such as low-energy or neuromorphic devices are not directly programmable by traditional methods.
This manuscript comprises the final report for the 1-year, FY19 LDRD project "Rigorous Data Fusion for Computationally Expensive Simulations," wherein an alternative approach to Bayesian calibration was developed based a new sampling technique called VoroSpokes. Vorospokes is a novel quadrature and sampling framework defined with respect to Voronoi tessellations of bounded domains in $R^d$ developed within this project. In this work, we first establish local quadrature and sampling results on convex polytopes using randomly directed rays, or spokes, to approximate the quantities of interest for a specified target function. A theoretical justification for both procedures is provided along with empirical results demonstrating the unbiased convergence in the resulting estimates/samples. The local quadrature and sampling procedures are then extended to global procedures defined on more general domains by applying the local results to the cells of a Voronoi tessellation covering the domain in consideration. We then demonstrate how the proposed global sampling procedure can be used to define a natural framework for adaptively constructing Voronoi Piecewise Surrogate (VPS) approximations based on local error estimates. Finally, we show that the adaptive VPS procedure can be used to form a surrogate model approximation to a specified, potentially unnormalized, density function, and that the global sampling procedure can be used to efficiently draw independent samples from the surrogate density in parallel. The performance of the resulting VoroSpokes sampling framework is assessed on a collection of Bayesian inference problems and is shown to provide highly accurate posterior predictions which align with the results obtained using traditional methods such as Gibbs sampling and random-walk Markov Chain Monte Carlo (MCMC). Importantly, the proposed framework provides a foundation for performing Bayesian inference tasks which is entirely independent from the theory of Markov chains.
In this paper, we develop software for decomposing sparse tensors that is portable to and performant on a variety of multicore, manycore, and GPU computing architectures. The result is a single code whose performance matches optimized architecture-specific implementations. The key to a portable approach is to determine multiple levels of parallelism that can be mapped in different ways to different architectures, and we explain how to do this for the matricized tensor times Khatri-Rao product (MTTKRP), which is the key kernel in canonical polyadic tensor decomposition. Our implementation leverages the Kokkos framework, which enables a single code to achieve high performance across multiple architectures that differ in how they approach fine-grained parallelism. We also introduce a new construct for portable thread-local arrays, which we call compile-time polymorphic arrays. Not only are the specifics of our approaches and implementation interesting for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes. As a last step in optimizing performance, we modify the MTTKRP algorithm itself to do a permuted traversal of tensor nonzeros to reduce atomic-write contention. We test the performance of our implementation on 16- and 68-core Intel CPUs and the K80 and P100 NVIDIA GPUs, showing that we are competitive with state-of-the-art architecture-specific codes while having the advantage of being able to run on a variety of architectures.
This report documents the outcome from the ASC ATDM Level 2 Milestone 6358: Assess Status of Next Generation Components and Physics Models in EMPIRE. This Milestone is an assessment of the EMPIRE (ElectroMagnetic Plasma In Realistic Environments) application and three software components. The assessment focuses on the electromagnetic and electrostatic particle-in-cell solutions for EMPIRE and its associated solver, time integration, and checkpoint-restart components. This information provides a clear understanding of the current status of the EMPIRE application and will help to guide future work in FY19 in order to ready the application for the ASC ATDM L1 Milestone in FY20. It is clear from this assessment that performance of the linear solver will have to be a focus in FY19.
Previous work has demonstrated that propagating groups of samples, called ensembles, together through forward simulations can dramatically reduce the aggregate cost of sampling-based uncertainty propagation methods [E. Phipps, M. D'Elia, H. C. Edwards, M. Hoemmen, J. Hu, and S. Rajamanickam, SIAM J. Sci. Comput., 39 (2017), pp. C162--C193]. However, critical to the success of this approach when applied to challenging problems of scientific interest is the grouping of samples into ensembles to minimize the total computational work. For example, the total number of linear solver iterations for ensemble systems may be strongly influenced by which samples form the ensemble when applying iterative linear solvers to parameterized and stochastic linear systems. In this paper we explore sample grouping strategies for local adaptive stochastic collocation methods applied to PDEs with uncertain input data, in particular canonical anisotropic diffusion problems where the diffusion coefficient is modeled by truncated Karhunen--Loève expansions. Finally, we demonstrate that a measure of the total anisotropy of the diffusion coefficient is a good surrogate for the number of linear solver iterations for each sample and therefore provides a simple and effective metric for grouping samples.
This report is an outcome of the ASC ATDM Level 2 Milestone 6015: Asynchronous Many-Task Software Stack Demonstration. It comprises a summary and in depth analysis of DARMA and a DARMA-compliant Asynchronous Many-Task (AMT) runtime software stack. Herein performance and productivity of the over- all approach are assessed on benchmarks and proxy applications representative of the Sandia ATDM applications. As part of the effort to assess the perceived strengths and weaknesses of AMT models compared to more traditional methods, experiments were performed on ATS-1 (Advanced Technology Systems) test bed machines and Trinity. In addition to productivity and performance assessments, this report includes findings on the generality of DARMAs backend API as well as findings on interoperability with node- level and network-level system libraries. Together, this information provides a clear understanding of the strengths and limitations of the DARMA approach in the context of Sandias ATDM codes, to guide our future research and development in this area.
In this study, quantifying simulation uncertainties is a critical component of rigorous predictive simulation. A key component of this is forward propagation of uncertainties in simulation input data to output quantities of interest. Typical approaches involve repeated sampling of the simulation over the uncertain input data, and can require numerous samples when accurately propagating uncertainties from large numbers of sources. Often simulation processes from sample to sample are similar and much of the data generated from each sample evaluation could be reused. We explore a new method for implementing sampling methods that simultaneously propagates groups of samples together in an embedded fashion, which we call embedded ensemble propagation. We show how this approach takes advantage of properties of modern computer architectures to improve performance by enabling reuse between samples, reducing memory bandwidth requirements, improving memory access patterns, improving opportunities for fine-grained parallelization, and reducing communication costs. We describe a software technique for implementing embedded ensemble propagation based on the use of C++ templates and describe its integration with various scientific computing libraries within Trilinos. We demonstrate improved performance, portability and scalability for the approach applied to the simulation of partial differential equations on a variety of CPU, GPU, and accelerator architectures, including up to 131,072 cores on a Cray XK7 (Titan).
The emergence of high-concurrency architectures offering unprecedented performance has brought many high-performance partial differential equation (PDE) discretization codes to the precipice of a major refactor. To help address this challenge a workshop titled "Algorithms and Abstractions for Assembly in PDE Codes" was held in the Computer Science Research Institute at Sandia National Laboratories on May 12th-14th, 2014. This document summarizes the goals of the workshop and the results of the presentations and subsequent discussions.
Rigorous modeling of engineering systems relies on efficient propagation of uncertainty from input parameters to model outputs. In recent years, there has been substantial development of probabilistic polynomial chaos (PC) Uncertainty Quantification (UQ) methods, enabling studies in expensive computational models. One approach, termed ”intrusive”, involving reformulation of the governing equations, has been found to have superior computational performance compared to non-intrusive sampling-based methods in relevant large-scale problems, particularly in the context of emerging architectures. However, the utility of intrusive methods has been severely limited due to detrimental numerical instabilities associated with strong nonlinear physics. Previous methods for stabilizing these constructions tend to add unacceptably high computational costs, particularly in problems with many uncertain parameters. In order to address these challenges, we propose to adapt and improve numerical continuation methods for the robust time integration of intrusive PC system dynamics. We propose adaptive methods, starting with a small uncertainty for which the model has stable behavior and gradually moving to larger uncertainty where the instabilities are rampant, in a manner that provides a suitable solution.
We explore rearrangements of classical uncertainty quantification methods with the aim of achieving higher aggregate performance for uncertainty quantification calculations on emerging multicore and many core architectures. We show a rearrangement of the stochastic Galerkin method leads to improved performance and scalability on several computational architectures whereby uncertainty information is propagated at the lowest levels of the simulation code improving memory access patterns, exposing new dimensions of fine grained parallelism, and reducing communication. We also develop a general framework for implementing such rearrangements for a diverse set of uncertainty quantification algorithms as well as computational simulation codes to which they are applied.