Publications Search

Low-Communication Asynchronous Distributed Generalized Canonical Polyadic Tensor Decomposition

2021 IEEE High Performance Extreme Computing Conference, HPEC 2021

In this work, we show that reduced communication algorithms for distributed stochastic gradient descent improve the time per epoch and strong scaling for the Generalized Canonical Polyadic (GCP) tensor decomposition, but at a cost: achieving convergence becomes more difficult. The implementation, based on MPI, shows that while one-sided algorithms offer a path to asynchronous execution, the performance benefits of optimized allreduce are difficult to best.

TYPE Conference Paper YEAR 2021

DOI OSTI Scopus

Automatic Differentiation of C++ Codes with Sacado

Phipps, Eric T.

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

Finite Element Tools for Performance Portability of Implicit and IMEX Simulations on Next Generation Architectures

Pawlowski, Roger; Phipps, Eric T.; Trott, Christian R.; Cyr, Eric C.; Shadid, John N.

Abstract not provided.

TYPE Conference Presentation YEAR 2020

DOI OSTI

GMRES with embedded ensemble propagation for the efficient solution of parametric linear systems in uncertainty quantification of computational models

Computer Methods in Applied Mechanics and Engineering

Liegeois, Kim; Boman, Romain; Phipps, Eric T.; Wiesner, Tobias A.; Arnst, Maarten

In a previous work, embedded ensemble propagation was proposed to improve the efficiency of sampling-based uncertainty quantification methods of computational models on emerging computational architectures. It consists of simultaneously evaluating the model for a subset of samples together, instead of evaluating them individually. A first approach introduced to solve parametric linear systems with ensemble propagation is ensemble reduction. In Krylov methods, for example, this reduction consists of coupling the samples together using an inner product that sums the sample contributions. Ensemble reduction has the advantages of being able to use optimized implementations of BLAS functions and having a stopping criterion which involves only one scalar. However, the reduction potentially decreases the rate of convergence due to the gathering of the spectra of the samples. In this paper, we investigate a second approach: ensemble propagation without ensemble reduction in the case of GMRES. This second approach solves each sample simultaneously but independently to improve the convergence compared to ensemble reduction. This raises two new issues which are solved in this paper: the fact that optimized implementations of BLAS functions cannot be used anymore and that ensemble divergence, whereby individual samples within an ensemble must follow different code execution paths, can occur. We tackle those issues by implementing a high-performing ensemble GEMV and by using masks. The proposed ensemble GEMV leads to a similar cost per GMRES iteration for both approaches, i.e. with and without reduction. For illustration, we study the performance of the new linear solver in the context of a mesh tying problem. This example demonstrates improved ensemble propagation speed-up without reduction.

TYPE Journal Article YEAR 2020

DOI OSTI Scopus
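
The contrast drawn in the GMRES abstract above, between an inner product that sums over samples (ensemble reduction) and one kept separate per sample, with masks guarding updates for samples that follow a different path, can be illustrated with a small sketch. This is not the paper's implementation: the data layout (a list of vector entries, each holding one value per sample) and the function names `dot_reduced`, `dot_per_sample`, and `masked_axpy` are assumptions made for illustration only.

```python
# Illustrative sketch only; layout and names are hypothetical, not from the paper.
# An "ensemble vector" is a list of entries; each entry holds one value per sample.

def dot_reduced(x, y):
    """Inner product with ensemble reduction: sums over entries AND samples,
    producing a single scalar shared by the whole ensemble."""
    return sum(xs * ys for xe, ye in zip(x, y) for xs, ys in zip(xe, ye))

def dot_per_sample(x, y):
    """Inner product without reduction: one scalar per sample."""
    num_samples = len(x[0])
    return [sum(xe[k] * ye[k] for xe, ye in zip(x, y)) for k in range(num_samples)]

def masked_axpy(alpha, x, y, mask):
    """Return y + alpha * x computed per sample, but only where mask[k] is True.
    Samples with mask[k] False (e.g. already converged) are left unchanged."""
    return [
        [ye[k] + alpha[k] * xe[k] if mask[k] else ye[k] for k in range(len(mask))]
        for xe, ye in zip(x, y)
    ]

# Two vector entries, two samples per entry.
x = [[1.0, 2.0], [3.0, 4.0]]
y = [[1.0, 1.0], [1.0, 1.0]]

reduced = dot_reduced(x, y)          # one scalar for the whole ensemble
per_sample = dot_per_sample(x, y)    # one scalar per sample
updated = masked_axpy([2.0, 2.0], x, y, [True, False])  # sample 1 is masked out
```

With the reduced form, a single scalar drives the stopping test for every sample; with the per-sample form, each sample has its own scalar, which is why some mechanism such as the mask is needed once samples stop needing the same updates.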

Automatic Differentiation of C++ Codes on Emerging Manycore Architectures with Sacado

Phipps, Eric T.; Pawlowski, Roger; Trott, Christian R.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Streaming Tensors

Johnson, Nicholas; Kolda, Tamara G.; Phipps, Eric T.

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

Computing Generalized CP Decompositions on Emerging Parallel Architectures

Phipps, Eric T.; Kolda, Tamara G.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Sahasrabudhe, Damodar; Phipps, Eric T.; Rajamanickam, Sivasankaran; Berzins, Martin

As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a GPU back-end for these primitives makes such code non-portable. A unified, portable, Single Instruction Multiple Data (SIMD) primitive proposed in this work allows intrinsics-based vectorization on CPUs and many-core architectures such as Intel Knights Landing (KNL), and also facilitates Single Instruction Multiple Threads (SIMT) based execution on GPUs. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. The new SIMD primitive is used on different architectures to test the performance boost against a hard-to-auto-vectorize baseline, to measure the overhead against an efficiently vectorized baseline, and to evaluate the new feature called the “logical vector length” (LVL). The SIMD primitive provides portability across CPUs and GPUs without any performance degradation being observed experimentally.

TYPE Conference Poster YEAR 2020

OSTI Scopus

Hierarchic Parallelism Memory Management and Helpful Debugging Tools: Updates on Trilinos Discretization Packages

Pawlowski, Roger; Phipps, Eric T.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

A Portable SIMD Primitive in Kokkos for Heterogeneous Architectures

Sahasrabudhe, Damodar; Phipps, Eric T.; Rajamanickam, Sivasankaran; Berzins, Martin

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Dynamical System for Resilient Computing

Rothganger, Fredrick R.; Hoemmen, Mark F.; Phipps, Eric T.; Warrender, Christina E.

The effort to develop larger-scale computing systems introduces a set of related challenges: Large machines are more difficult to synchronize. The sheer quantity of hardware introduces more opportunities for errors. New approaches to hardware, such as low-energy or neuromorphic devices, are not directly programmable by traditional methods.

TYPE Other Report YEAR 2019

DOI OSTI

Rigorous Data Fusion for Computationally Expensive Simulations

Winovich, Nick; Rushdi, Ahmad; Phipps, Eric T.; Ray, Jaideep; Lin, Guang; Ebeida, Mohamed

This manuscript comprises the final report for the 1-year, FY19 LDRD project "Rigorous Data Fusion for Computationally Expensive Simulations," wherein an alternative approach to Bayesian calibration was developed based on a new sampling technique called VoroSpokes. VoroSpokes is a novel quadrature and sampling framework defined with respect to Voronoi tessellations of bounded domains in $\mathbb{R}^d$ developed within this project. In this work, we first establish local quadrature and sampling results on convex polytopes using randomly directed rays, or spokes, to approximate the quantities of interest for a specified target function. A theoretical justification for both procedures is provided along with empirical results demonstrating the unbiased convergence in the resulting estimates/samples. The local quadrature and sampling procedures are then extended to global procedures defined on more general domains by applying the local results to the cells of a Voronoi tessellation covering the domain under consideration. We then demonstrate how the proposed global sampling procedure can be used to define a natural framework for adaptively constructing Voronoi Piecewise Surrogate (VPS) approximations based on local error estimates. Finally, we show that the adaptive VPS procedure can be used to form a surrogate model approximation to a specified, potentially unnormalized, density function, and that the global sampling procedure can be used to efficiently draw independent samples from the surrogate density in parallel. The performance of the resulting VoroSpokes sampling framework is assessed on a collection of Bayesian inference problems and is shown to provide highly accurate posterior predictions which align with the results obtained using traditional methods such as Gibbs sampling and random-walk Markov Chain Monte Carlo (MCMC). Importantly, the proposed framework provides a foundation for performing Bayesian inference tasks which is entirely independent from the theory of Markov chains.

TYPE SAND Report YEAR 2019

DOI OSTI

High performance tensor decomposition on emerging manycore architectures

Phipps, Eric T.; Kolda, Tamara G.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Embedded UQ on Emerging Computing Architectures

Phipps, Eric T.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Parallel Sparse Tensor Decomposition with the Trilinos Parallel Linear Algebra Framework

Devine, Karen; Kolda, Tamara G.; Phipps, Eric T.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Software for sparse tensor decomposition on emerging computing architectures

SIAM Journal on Scientific Computing

Phipps, Eric T.; Kolda, Tamara G.

In this paper, we develop software for decomposing sparse tensors that is portable to and performant on a variety of multicore, manycore, and GPU computing architectures. The result is a single code whose performance matches optimized architecture-specific implementations. The key to a portable approach is to determine multiple levels of parallelism that can be mapped in different ways to different architectures, and we explain how to do this for the matricized tensor times Khatri-Rao product (MTTKRP), which is the key kernel in canonical polyadic tensor decomposition. Our implementation leverages the Kokkos framework, which enables a single code to achieve high performance across multiple architectures that differ in how they approach fine-grained parallelism. We also introduce a new construct for portable thread-local arrays, which we call compile-time polymorphic arrays. Not only are the specifics of our approaches and implementation interesting for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes. As a last step in optimizing performance, we modify the MTTKRP algorithm itself to do a permuted traversal of tensor nonzeros to reduce atomic-write contention. We test the performance of our implementation on 16- and 68-core Intel CPUs and the K80 and P100 NVIDIA GPUs, showing that we are competitive with state-of-the-art architecture-specific codes while having the advantage of being able to run on a variety of architectures.

TYPE Journal Article YEAR 2019

DOI OSTI Scopus
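
The MTTKRP kernel named in the sparse tensor decomposition abstract above can be sketched for a tensor stored as a list of nonzeros. This is a minimal serial illustration, not the paper's Kokkos implementation; the coordinate-list storage, the function name `mttkrp`, and the example data are assumptions chosen for clarity.

```python
# Minimal serial sketch of sparse MTTKRP; storage format and names are
# hypothetical and chosen for illustration, not taken from the paper.

def mttkrp(nonzeros, factors, mode):
    """Matricized tensor times Khatri-Rao product for a sparse tensor.

    nonzeros: list of (index_tuple, value) pairs, one per tensor nonzero.
    factors:  list of factor matrices (lists of rows), one per tensor mode,
              all with the same number of columns (the rank).
    mode:     the mode whose factor matrix is being updated.
    """
    rank = len(factors[0][0])
    out = [[0.0] * rank for _ in range(len(factors[mode]))]
    for idx, val in nonzeros:
        for r in range(rank):
            prod = val
            # Multiply by the matching row entry of every factor except `mode`.
            for m, factor in enumerate(factors):
                if m != mode:
                    prod *= factor[idx[m]][r]
            # Several nonzeros can share idx[mode]; in a parallel loop over
            # nonzeros this accumulation is where write contention arises.
            out[idx[mode]][r] += prod
    return out

# A 2 x 2 x 2 tensor with three nonzeros and rank-2 factor matrices.
A = [[1.0, 1.0], [1.0, 1.0]]   # mode-0 factor (values unused when mode == 0)
B = [[1.0, 2.0], [3.0, 4.0]]   # mode-1 factor
C = [[5.0, 6.0], [7.0, 8.0]]   # mode-2 factor
nonzeros = [((0, 0, 0), 2.0), ((1, 0, 1), 3.0), ((0, 1, 1), 1.0)]

result = mttkrp(nonzeros, [A, B, C], mode=0)
```

Two of the three nonzeros here write to output row 0, which is the situation that would need synchronized updates if the loop over nonzeros were run in parallel.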

Validation Assessment of Hypersonic Double-Cone Flow Simulations using UQ Sensitivity Analysis and Validation Metrics

Kieweg, Sarah; Ray, Jaideep; Weirs, V.G.; Carnes, Brian R.; Dinzl, Derek J.; Freno, Brian A.; Howard, Micah; Phipps, Eric T.; Rider, William J.; Smith, Thomas M.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Hierarchical Parallelism for Performant Embedded Automatic Differentiation on GPUs

Phipps, Eric T.; Pawlowski, Roger; Bettencourt, Matthew T.; Cyr, Eric C.; Roberts, Nathan V.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Validation Assessment of Hypersonic Double-Cone Flow Simulations using Uncertainty Quantification Sensitivity Analysis and Validation Metrics

Kieweg, Sarah; Ray, Jaideep; Weirs, Gregory; Carnes, Brian R.; Dinzl, Derek J.; Freno, Brian A.; Howard, Micah; Phipps, Eric T.; Rider, William J.; Smith, Thomas M.; Nompelis, Ioannis; Candler, Graham

Abstract not provided.

TYPE Conference Poster YEAR 2018

DOI OSTI

ExaLearn Application Interview

Rajamanickam, Sivasankaran; Wolf, Michael; Phipps, Eric T.; Ebeida, Mohamed; Debusschere, Bert

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

A Performance Portable SIMD Scalar Type for Effective Vectorization Across Heterogeneous Architectures

Sahasrabudhe, Damodar; Phipps, Eric T.; Rajamanickam, Sivasankaran; Berzins, Martin

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Tempus Time-Integration Package & Applications

Ober, Curtis C.; Pawlowski, Roger; Conde, Sidafa; Tezaur, Irina K.; Phipps, Eric T.; Phillips, Edward; Mota, Alejandro; Phlipot, Greg; Ridzal, Denis; Hansen, Michael A.; Fisher, Travis C.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

ASC ATDM Level 2 Milestone #6358: Assess Status of Next Generation Components and Physics Models in EMPIRE

Bettencourt, Matthew T.; Kramer, Richard M.J.; Cartwright, Keith; Phillips, Edward; Ober, Curtis C.; Pawlowski, Roger; Swan, Matthew S.; Tezaur, Irina K.; Phipps, Eric T.; Conde, Sidafa; Cyr, Eric C.; Ulmer, Craig; Kordenbrock, Todd; Levy, Scott L.N.; Templet, Gary J.; Hu, Jonathan J.; Lin, Paul T.; Glusa, Christian; Siefert, Christopher; Glass, Micheal W.

This report documents the outcome from the ASC ATDM Level 2 Milestone 6358: Assess Status of Next Generation Components and Physics Models in EMPIRE. This Milestone is an assessment of the EMPIRE (ElectroMagnetic Plasma In Realistic Environments) application and three software components. The assessment focuses on the electromagnetic and electrostatic particle-in-cell solutions for EMPIRE and its associated solver, time integration, and checkpoint-restart components. This information provides a clear understanding of the current status of the EMPIRE application and will help to guide future work in FY19 in order to ready the application for the ASC ATDM L1 Milestone in FY20. It is clear from this assessment that performance of the linear solver will have to be a focus in FY19.

TYPE SAND Report YEAR 2018

DOI OSTI