Publications Search

Advances in Mixed Precision Algorithms: 2021 Edition

Abdelfattah, Ahmad; Anzt, Hartwig; Ayala, Alan; Boman, Erik G.; Carson, Erin C.; Cayrols, Sebastien; Cojean, Terry; Dongarra, Jack J.; Falgout, Rob; Gates, Mark; Grützmacher, Thomas; Higham, Nicholas J.; Kruger, Scott E.; Li, Sherry; Lindquist, Neil; Liu, Yang; Loe, Jennifer A.; Nayak, Pratik; Osei-Kuffuor, Daniel; Pranesh, Sri; Rajamanickam, Sivasankaran; Ribizel, Tobias; Smith, Bryce; Swirydowicz, Kasia; Thomas, Stephen J.; Tomov, Stanimire; Tsai, Yaohung M.; Yamazaki, Ichitaro; Yang, Ulrike M.

Over the last year, the ECP xSDK-multiprecision effort has made tremendous progress in developing and deploying new mixed precision technology and customizing the algorithms for the hardware deployed in the ECP flagship supercomputers. The effort has also succeeded in creating a cross-laboratory community of scientists interested in mixed precision technology and now working together to deploy this technology for ECP applications. In this report, we highlight some of the most promising and impactful achievements of the last year. Among the highlights we present are: mixed precision IR using a dense LU factorization and achieving a 1.8× speedup on Spock; results and strategies for mixed precision IR using a sparse LU factorization; a mixed precision eigenvalue solver; mixed precision GMRES-IR being deployed in Trilinos and achieving a speedup of 1.4× over standard GMRES; Compressed Basis (CB) GMRES being deployed in Ginkgo and achieving an average 1.4× speedup over standard GMRES; preparing hypre for mixed precision execution; mixed precision sparse approximate inverse preconditioners achieving an average speedup of 1.2×; and a detailed description of the memory accessor, which separates the arithmetic precision from the memory precision and enables memory-bound low precision BLAS 1/2 operations to increase accuracy by using high precision in the computations without degrading performance. We emphasize that many of the highlights presented here have also been submitted to peer-reviewed journals or established conferences, and are under peer review or have already been published.

TYPE Other Report YEAR 2021

DOI OSTI

Properties of GMRES with Iterative Refinement on GPUs

Loe, Jennifer A.; Glusa, Christian; Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

Experimental Evaluation of Multiprecision Strategies for GMRES on GPUs

2021 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2021 - In conjunction with IEEE IPDPS 2021

Loe, Jennifer A.; Glusa, Christian; Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran

Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for multiprecision strategies at the linear algebra algorithms level if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of multiprecision strategies for accelerating this kernel on GPUs. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when multiprecision GMRES will be effective and to choose parameters for a multiprecision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of multiprecision approaches and demonstrate even further improvements are possible by optimizing low-level kernels.

TYPE Conference Presentation YEAR 2021

DOI OSTI Scopus
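
The iterative refinement strategy named in the abstract above can be illustrated with a minimal sketch. It is not the Trilinos/Belos implementation: it assumes a small dense system, and a float32 dense solve stands in for the low-precision inner GMRES solve. The function name `mixed_precision_ir` and its parameters are hypothetical. Only the pattern matters: residuals and solution updates are kept in double precision, while the correction is computed in single precision.

```python
import numpy as np

def mixed_precision_ir(A, b, tol=1e-10, max_iters=50):
    """Iterative refinement with float32 correction solves and float64 residuals.

    Hypothetical helper for illustration only; a dense float32 solve replaces
    the inner low-precision GMRES that a real solver would use.
    """
    A32 = A.astype(np.float32)              # low-precision copy for the inner solves
    x = np.zeros_like(b, dtype=np.float64)  # solution accumulated in double precision
    b_norm = np.linalg.norm(b)
    for k in range(max_iters):
        r = b - A @ x                       # residual computed in float64
        if np.linalg.norm(r) <= tol * b_norm:
            return x, k                     # converged after k refinement steps
        d = np.linalg.solve(A32, r.astype(np.float32))  # correction in float32
        x += d.astype(np.float64)           # update applied in float64
    return x, max_iters

# Usage on a small, well-conditioned test matrix (an assumption of this sketch).
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 50.0 * np.eye(50)
b = rng.standard_normal(50)
x, iters = mixed_precision_ir(A, b)
print(iters, np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```

For a well-conditioned matrix, each refinement step reduces the error by roughly the single-precision unit roundoff times the condition number, so a few steps are typically enough to reach double-precision-level residuals.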

One-Synch CGS2 Algorithm in the Context of QR and Arnoldi (DCGS2)

Bielich, Daniel R.W.; Langou, Julien; Thomas, Stephen; Swirydowicz, Kasia; Yamazaki, Ichitaro; Boman, Erik G.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

PEEKS Overview

Boman, Erik G.; Bielich, Daniel R.W.; Loe, Jennifer A.; Yamazaki, Ichitaro

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Sake: Solvers and Kernels for Exascale

Rajamanickam, Sivasankaran; Berger-Vergiat, Luc; Yamazaki, Ichitaro; Boman, Erik G.

Abstract not provided.

TYPE Conference Poster YEAR 2021

DOI OSTI

Solvers and Kernels for Exascale (SAKE) project: Exascale Trilinos Solvers project

Yamazaki, Ichitaro

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

ExaWind: Exascale Predictive Wind Plant Flow Physics Modeling

Sprague, Michael; Ananthan, Shreyas; Binyahib, Roba; Brazell, Michael; De Frahan, Marc H.; King, Ryan A.; Mullowney, Paul; Rood, Jon; Sharma, Ashesh; Thomas, Stephen A.; Vijayakumar, Ganesh; Crozier, Paul; Berger-Vergiat, Luc; Cheung, Lawrence; Dement, David C.; Develder, Nathaniel; Glaze, David J.; Hu, Jonathan J.; Knaus, Robert C.; Lee, Dong H.; Matula, Neil; Okusanya, Tolulope O.; Overfelt, James R.; Rajamanickam, Sivasankaran; Sakievich, Philip; Smith, Timothy A.; Vo, Johnathan; Williams, Alan B.; Yamazaki, Ichitaro; Turner, William J.; Prokopenko, Andrey; Wilson, Robert V.; Moser, Robert; Melvin, Jeremy; Sitaraman, Jay

Abstract not provided.

TYPE Conference Poster YEAR 2021

DOI OSTI

Sake: Solvers and Kernels for Exascale

Rajamanickam, Sivasankaran; Berger-Vergiat, Luc; Boman, Erik G.; Yamazaki, Ichitaro

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

Experimental Evaluation of Multiprecision Strategies for GMRES on GPUs

Loe, Jennifer A.; Glusa, Christian; Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Paper YEAR 2021

DOI OSTI

Multiprecision Krylov Solvers in Kokkos and Belos

Loe, Jennifer A.; Glusa, Christian; Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

Thick-Restart Lanczos with Explicit External Deflation for Computing Many Eigenpairs and its Communication-Avoiding Variant

Bai, Zhaojun; Lin, Chao-Ping; Lu, Ding; Yamazaki, Ichitaro

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

Multiprecision Krylov Solvers in Trilinos

Loe, Jennifer A.; Glusa, Christian; Boman, Erik G.; Yamazaki, Ichitaro; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

Multiprecision GMRES in Trilinos packages Belos and Kokkos

Loe, Jennifer A.; Glusa, Christian; Boman, Erik G.; Yamazaki, Ichitaro; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Presentation YEAR 2020

DOI OSTI

Mixed-Precision GMRES in Trilinos

Loe, Jennifer A.; Glusa, Christian; Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Presentation YEAR 2020

DOI OSTI

Performance Portable Supernode-based Sparse Triangular Solver for Manycore Architectures

ACM International Conference Proceeding Series

Yamazaki, Ichitaro; Rajamanickam, Sivasankaran; Ellingwood, Nathan D.

The sparse triangular solver is an important kernel in many computational applications. However, a fast, parallel, sparse triangular solver on a manycore architecture such as a GPU has been an open issue in the field for several years. In this paper, we develop a sparse triangular solver that takes advantage of the supernodal structures of the triangular matrices that come from the direct factorization of a sparse matrix. We implemented our solver using Kokkos and Kokkos Kernels such that our solver is portable to different manycore architectures. This has the additional benefit of allowing our triangular solver to use the team-level kernels and take advantage of the hierarchical parallelism available on the GPU. We compare the effects of different scheduling schemes on the performance and also investigate an algorithmic variant called the partitioned inverse. Our performance results on an NVIDIA V100 or P100 GPU demonstrate that our implementation can be 12.4× or 19.5× faster than the vendor-optimized implementation in NVIDIA's cuSPARSE library.

TYPE Conference Poster YEAR 2020

OSTI Scopus

Performance Portable Supernode-based Sparse Triangular Solver for Manycore Architecture

Yamazaki, Ichitaro; Rajamanickam, Sivasankaran; Ellingwood, Nathan D.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Supernode-based Sparse Triangular Solver using Kokkos

Yamazaki, Ichitaro; Rajamanickam, Sivasankaran; Ellingwood, Nathan D.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Compare linear-system solver and preconditioner stacks with emphasis on GPU performance and propose phase-2 NGP solver development pathway

Hu, Jonathan J.; Berger-Vergiat, Luc; Thomas, Stephen; Swirydowicz, Kasia; Yamazaki, Ichitaro; Mullowney, Paul; Rajamanickam, Sivasankaran; Sitaraman, Jay; Sprague, Michael

The goal of the ExaWind project is to enable predictive simulations of wind farms comprised of many megawatt-scale turbines situated in complex terrain. Predictive simulations will require computational fluid dynamics (CFD) simulations for which the mesh resolves the geometry of the turbines and captures the rotation and large deflections of blades. Whereas such simulations for a single turbine are arguably petascale class, multi-turbine wind farm simulations will require exascale-class resources. The primary physics codes in the ExaWind project are Nalu-Wind, which is an unstructured-grid solver for the acoustically incompressible Navier-Stokes equations, and OpenFAST, which is a whole-turbine simulation code. The Nalu-Wind model consists of the mass-continuity Poisson-type equation for pressure and a momentum equation for the velocity. For such modeling approaches, simulation times are dominated by linear-system setup and solution for the continuity and momentum systems. For the ExaWind challenge problem, the moving meshes greatly affect overall solver costs as reinitialization of matrices and recomputation of preconditioners is required at every time step. In this report we evaluated GPU-performance baselines for the linear solvers in the Trilinos and hypre solver stacks using two representative Nalu-Wind simulations: an atmospheric boundary layer precursor simulation on a structured mesh, and a fixed-wing simulation using unstructured overset meshes. Both strong-scaling and weak-scaling experiments were conducted on the OLCF supercomputer Summit and similar proxy clusters. We focused on the performance of multi-threaded Gauss-Seidel and two-stage Gauss-Seidel, which are extensions of classical Gauss-Seidel; of one-reduce GMRES, a communication-reducing variant of the Krylov GMRES; and of algebraic multigrid methods that incorporate the aforementioned methods.
The team has established that AMG methods are capable of solving linear systems arising from the fixed-wing overset meshes on CPU, a critical intermediate result for ExaWind FY20 Q3 and Q4 milestones. For the fixed-wing strong-scaling study (model with 3M grid points), the team identified that Nalu-Wind simulations with the new Trilinos and hypre solvers scale to modest GPU counts, maintaining above 70% efficiency up to 6 GPUs. However, significant performance bottlenecks remain: matrix assembly (hypre) and AMG setup (hypre and Trilinos). In the weak-scaling experiments (going from 0.4M to 211M grid points), the solver apply phases are shown to be faster on GPUs, but Nalu-Wind simulation times grow, primarily due to the multigrid-setup process. Finally, based on the report outcomes, we propose a linear solver path forward for the remainder of the ExaWind project. Near term, the NREL team will continue their work on GPU-based linear-system assembly. They will also investigate how the use of alternatives to the NVIDIA UVM (unified virtual memory) paradigm affects performance. Longer term, the NREL team will evaluate algorithmic performance on other types of accelerators and merge their improvements back to the main hypre repository branch. Near term, the Trilinos team will address performance bottlenecks identified in this milestone, such as implementing a GPU-based segregated momentum solve and reusing matrix graphs across linear-system assembly phases. Longer term, the Trilinos team will do detailed analysis and optimization of multigrid setup.

TYPE Other Report YEAR 2020

DOI OSTI

Preparing sparse solvers for exascale computing

Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences

Heroux, Michael A.; Anzt, Hartwig; Boman, Erik G.; Falgout, Rob; Ghysels, Pieter; Li, Xiaoye; McInnes, Lois C.; Mills, Richard T.; Rajamanickam, Sivasankaran; Rupp, Karl; Smith, Bryce; Yamazaki, Ichitaro; Yang, Ulrike M.

Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.

TYPE Journal Article YEAR 2020

DOI OSTI Scopus