Publications

Results 1–25 of 36

Mixed precision s-step Lanczos and conjugate gradient algorithms

Numerical Linear Algebra with Applications

Carson, Erin; Gergelits, Tomáš; Yamazaki, Ichitaro Y.

Compared to the classical Lanczos algorithm, the s-step Lanczos variant has the potential to improve performance by asymptotically decreasing the synchronization cost per iteration. However, this comes at a price; despite being mathematically equivalent, the s-step variant may behave quite differently in finite precision, potentially exhibiting greater loss of accuracy and slower convergence relative to the classical algorithm. It has previously been shown that the errors in the s-step version follow the same structure as the errors in the classical algorithm, but are amplified by a factor depending on the square of the condition number of the (Formula presented.) -dimensional Krylov bases computed in each outer loop. As the condition number of these s-step bases grows (in some cases very quickly) with s, this limits the s values that can be chosen and thus can limit the attainable performance. In this work, we show that if a select few computations in s-step Lanczos are performed in double the working precision, the error terms then depend only linearly on the conditioning of the s-step bases. This has the potential for drastically improving the numerical behavior of the algorithm with little impact on per-iteration performance. Our numerical experiments demonstrate the improved numerical behavior possible with the mixed precision approach, and also show that this improved behavior extends to mixed precision s-step CG. We present preliminary performance results on NVIDIA V100 GPUs that show that the overhead of extra precision is minimal if one uses precisions implemented in hardware.
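The effect of performing a few inner products in double the working precision can be sketched in a few lines. This is only an illustration under stated assumptions: binary32 stands in for the working precision and is emulated with the standard `struct` module, the helper names (`fl32`, `dot_working`, `dot_doubled`, `gram`) are hypothetical, and the choice of the Gram matrix of the basis vectors as the computation that gets extra precision is an assumption, since the abstract does not say which computations are selected.

```python
import struct

def fl32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_working(u, v):
    """Inner product with every multiply and add rounded to binary32."""
    acc = 0.0
    for a, b in zip(u, v):
        acc = fl32(acc + fl32(a * b))
    return acc

def dot_doubled(u, v):
    """Same inner product accumulated in binary64, rounded to binary32 once."""
    acc = 0.0
    for a, b in zip(u, v):
        acc += a * b
    return fl32(acc)

def gram(basis, dot):
    """Gram matrix G[i][j] = <basis[i], basis[j]> for a list of vectors."""
    return [[dot(u, v) for v in basis] for u in basis]

# Hypothetical data: one large entry followed by many small ones.
u = [1.0] + [2.0 ** -13] * 4096
print(dot_working(u, u))  # 1.0: each small term is absorbed by the rounding
print(dot_doubled(u, u))  # 1.00006103515625: the small terms survive
```

The input and output stay in the working precision in both variants; only the accumulator is widened, which is why the extra cost can be small when both precisions are implemented in hardware.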

Harnessing exascale for whole wind farm high-fidelity simulations to improve wind farm efficiency

Crozier, Paul C.; Adcock, Christiane A.; Ananthan, Shreyas A.; Berger-Vergiat, Luc B.; Brazell, Michael B.; Brunhart-Lupo, Nicholas B.; Henry de Frahan, Marc T.; Hu, Jonathan J.; Knaus, Robert C.; Melvin, Jeremy M.; Moser, Bob M.; Mullowney, Paul M.; Rood, Jon R.; Sharma, Ashesh S.; Thomas, Stephen T.; Vijayakumar, Ganesh V.; Williams, Alan B.; Wilson, Robert V.; Yamazaki, Ichitaro Y.; Sprague, Michael S.

Abstract not provided.

FY2021 Q4: Demonstrate moving-grid multi-turbine simulations primarily run on GPUs and propose improvements for successful KPP-2 [Slides]

Adcock, Christiane A.; Ananthan, Shreyas A.; Berger-Vergiat, Luc B.; Brazell, Michael B.; Brunhart-Lupo, Nicholas B.; Hu, Jonathan J.; Knaus, Robert C.; Melvin, Jeremy M.; Moser, Bob M.; Mullowney, Paul M.; Rood, Jon R.; Sharma, Ashesh S.; Thomas, Stephen T.; Vijayakumar, Ganesh V.; Williams, Alan B.; Wilson, Robert V.; Yamazaki, Ichitaro Y.; Sprague, Michael S.

Isocontours of Q-criterion with velocity visualized in the wake for two NREL 5-MW turbines operating under uniform-inflow wind speed of 8 m/s. Simulation performed with the hybrid-Nalu-Wind/AMR-Wind solver.

Demonstrate moving-grid multi-turbine simulations primarily run on GPUs and propose improvements for successful KPP-2

Adcock, Christiane A.; Ananthan, Shreyas A.; Berger-Vergiat, Luc B.; Brazell, Michael B.; Brunhart-Lupo, Nicholas B.; Hu, Jonathan J.; Knaus, Robert C.; Melvin, Jeremy M.; Moser, Bob M.; Mullowney, Paul M.; Rood, Jon R.; Sharma, Ashesh S.; Thomas, Stephen T.; Vijayakumar, Ganesh V.; Williams, Alan B.; Wilson, Robert V.; Yamazaki, Ichitaro Y.; Sprague, Michael S.

The goal of the ExaWind project is to enable predictive simulations of wind farms comprised of many megawatt-scale turbines situated in complex terrain. Predictive simulations will require computational fluid dynamics (CFD) simulations for which the mesh resolves the geometry of the turbines, capturing the thin boundary layers, and captures the rotation and large deflections of blades. Whereas such simulations for a single turbine are arguably petascale class, multi-turbine wind farm simulations will require exascale-class resources.

Advances in Mixed Precision Algorithms: 2021 Edition

Abdelfattah, Ahmad A.; Anzt, Hartwig A.; Ayala, Alan A.; Boman, Erik G.; Carson, Erin C.; Cayrols, Sebastien C.; Cojean, Terry C.; Dongarra, Jack D.; Falgout, Rob F.; Gates, Mark G.; Grützmacher, Thomas G.; Higham, Nicholas J.; Kruger, Scott E.; Li, Sherry L.; Lindquist, Neil L.; Liu, Yang L.; Loe, Jennifer A.; Nayak, Pratik N.; Osei-Kuffuor, Daniel O.; Pranesh, Sri P.; Rajamanickam, Sivasankaran R.; Ribizel, Tobias R.; Smith, Bryce B.; Swirydowicz, Kasia S.; Thomas, Stephen T.; Tomov, Stanimire T.; Tsai, Yaohung M.; Yamazaki, Ichitaro Y.; Yang, Ulrike M.

Over the last year, the ECP xSDK-multiprecision effort has made tremendous progress in developing and deploying new mixed precision technology and customizing the algorithms for the hardware deployed in the ECP flagship supercomputers. The effort has also succeeded in creating a cross-laboratory community of scientists who are interested in mixed precision technology and now work together to deploy it for ECP applications. In this report, we highlight some of the most promising and impactful achievements of the last year. Among the highlights we present are: mixed precision iterative refinement (IR) using a dense LU factorization, achieving a 1.8× speedup on Spock; results and strategies for mixed precision IR using a sparse LU factorization; a mixed precision eigenvalue solver; mixed precision GMRES-IR deployed in Trilinos, achieving a speedup of 1.4× over standard GMRES; compressed basis (CB) GMRES deployed in Ginkgo, achieving an average 1.4× speedup over standard GMRES; preparation of hypre for mixed precision execution; mixed precision sparse approximate inverse preconditioners achieving an average speedup of 1.2×; and a detailed description of the memory accessor, which separates the arithmetic precision from the memory precision and enables memory-bound low precision BLAS 1/2 operations to increase accuracy by computing in high precision without degrading performance. We emphasize that many of the highlights presented here have also been submitted to peer-reviewed journals or established conferences, and are under peer review or have already been published.
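As a rough illustration of mixed precision iterative refinement with a dense LU factorization, the sketch below factors and solves in an emulated binary32 and computes residuals and updates in binary64. It is a minimal example under stated assumptions, not the ECP implementation: all function names are hypothetical, binary32 is emulated with the standard `struct` module, and the fixed iteration count replaces a real convergence test.

```python
import struct

def fl32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_factor32(A):
    """LU with partial pivoting; every operation is rounded to binary32."""
    n = len(A)
    LU = [[fl32(v) for v in row] for row in A]
    piv = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            LU[i][k] = fl32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = fl32(LU[i][j] - fl32(LU[i][k] * LU[k][j]))
    return LU, piv

def lu_solve32(LU, piv, b):
    """Forward and backward substitution in binary32 using the stored factors."""
    n = len(LU)
    y = [fl32(b[p]) for p in piv]          # apply the row permutation
    for i in range(n):                     # unit lower triangular solve
        for j in range(i):
            y[i] = fl32(y[i] - fl32(LU[i][j] * y[j]))
    for i in reversed(range(n)):           # upper triangular solve
        for j in range(i + 1, n):
            y[i] = fl32(y[i] - fl32(LU[i][j] * y[j]))
        y[i] = fl32(y[i] / LU[i][i])
    return y

def refine(A, b, iters=5):
    """Low precision factorization, binary64 residual and update."""
    LU, piv = lu_factor32(A)
    x = lu_solve32(LU, piv, b)
    for _ in range(iters):
        r = [bi - sum(a * xj for a, xj in zip(row, x)) for row, bi in zip(A, b)]
        d = lu_solve32(LU, piv, r)         # correction from the cheap factors
        x = [xi + di for xi, di in zip(x, d)]
    return x
```

The factorization, which dominates the cost for dense matrices, is done once in the low precision; each refinement step reuses it and adds only a residual computation and two triangular solves.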

Experimental Evaluation of Multiprecision Strategies for GMRES on GPUs

2021 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2021 - In conjunction with IEEE IPDPS 2021

Loe, Jennifer A.; Glusa, Christian A.; Yamazaki, Ichitaro Y.; Boman, Erik G.; Rajamanickam, Sivasankaran R.

Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for multiprecision strategies at the linear algebra algorithms level if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of multiprecision strategies for accelerating this kernel on GPUs. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when multiprecision GMRES will be effective and to choose parameters for a multiprecision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of multiprecision approaches and demonstrate even further improvements are possible by optimizing low-level kernels.

ExaWind: Exascale Predictive Wind Plant Flow Physics Modeling

Sprague, Michael S.; Ananthan, Shreyas A.; Binyahib, Roba B.; Brazell, Michael B.; de Frahan, Marc H.; King, Ryan N.; Mullowney, Paul M.; Rood, Jon R.; Sharma, Ashesh S.; Thomas, Stephen T.; Vijayakumar, Ganesh V.; Crozier, Paul C.; Berger-Vergiat, Luc B.; Cheung, Lawrence C.; Dement, David C.; deVelder, Nathaniel d.; Glaze, D.J.; Hu, Jonathan J.; Knaus, Robert C.; Lee, Dong H.; Matula, Neil M.; Okusanya, Tolulope O.; Overfelt, James R.; Rajamanickam, Sivasankaran R.; Sakievich, Philip S.; Smith, Timothy A.; Vo, Johnathan V.; Williams, Alan B.; Yamazaki, Ichitaro Y.; Turner, William J.; Prokopenko, Andrey P.; Wilson, Robert V.; Moser, Bob M.; Melvin, Jeremy M.; Sitaraman

Abstract not provided.

Performance Portable Supernode-based Sparse Triangular Solver for Manycore Architectures

ACM International Conference Proceeding Series

Yamazaki, Ichitaro Y.; Rajamanickam, Sivasankaran R.; Ellingwood, Nathan D.

The sparse triangular solver is an important kernel in many computational applications. However, a fast, parallel sparse triangular solver on a manycore architecture such as a GPU has been an open issue in the field for several years. In this paper, we develop a sparse triangular solver that takes advantage of the supernodal structures of the triangular matrices that come from the direct factorization of a sparse matrix. We implemented our solver using Kokkos and Kokkos Kernels so that it is portable to different manycore architectures. This has the additional benefit of allowing our triangular solver to use the team-level kernels and take advantage of the hierarchical parallelism available on the GPU. We compare the effects of different scheduling schemes on performance and also investigate an algorithmic variant called the partitioned inverse. Our performance results on an NVIDIA V100 or P100 GPU demonstrate that our implementation can be 12.4× or 19.5× faster than the vendor-optimized implementation in NVIDIA's cuSPARSE library.
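To make the idea of a scheduling scheme concrete, here is a small sketch of level-set scheduling for a sparse lower triangular solve. Level-set scheduling is a common scheme, but the abstract does not name the schemes it compares, so treating it as one of them is an assumption; the row-list storage format and the function names are hypothetical, and the sketch works on single rows rather than on supernodal blocks.

```python
def level_sets(rows):
    """Group rows into levels; rows[i] lists (j, L_ij) with j <= i."""
    n = len(rows)
    level = [0] * n
    for i in range(n):
        deps = [level[j] for j, _ in rows[i] if j < i]
        # A row sits one level after the deepest row it depends on.
        level[i] = 1 + max(deps) if deps else 0
    sets = [[] for _ in range(max(level) + 1)] if n else []
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets

def solve_lower(rows, b):
    """Solve L x = b level by level."""
    x = [0.0] * len(rows)
    for rows_in_level in level_sets(rows):
        # Rows within one level depend only on earlier levels, so this
        # inner loop is the part a GPU could execute in parallel.
        for i in rows_in_level:
            acc, diag = b[i], None
            for j, v in rows[i]:
                if j == i:
                    diag = v
                else:
                    acc -= v * x[j]
            x[i] = acc / diag
    return x

# Hypothetical 4x4 lower triangular matrix stored row by row.
L = [
    [(0, 2.0)],
    [(1, 4.0)],
    [(0, 1.0), (2, 1.0)],
    [(1, 2.0), (2, 3.0), (3, 1.0)],
]
print(level_sets(L))                         # [[0, 1], [2], [3]]
print(solve_lower(L, [2.0, 4.0, 3.0, 6.0]))  # [1.0, 1.0, 2.0, -2.0]
```

The number of levels bounds the number of sequential steps, so matrices with long dependency chains expose little parallelism under this scheme; grouping rows into larger dense blocks, as a supernodal approach does, is one way to give each step more work.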
