Publications Search

The generalized Dryja-Smith-Widlund (GDSW) preconditioner is a two-level overlapping Schwarz domain decomposition (DD) preconditioner that couples a classical one-level overlapping Schwarz preconditioner with an energy-minimizing coarse space. When used to accelerate the convergence rate of Krylov subspace iterative methods, the GDSW preconditioner provides robustness and scalability for the solution of sparse linear systems arising from the discretization of a wide range of partial differential equations. In this paper, we present FROSch (Fast and Robust Schwarz), a domain decomposition solver package which implements GDSW-type preconditioners for both CPU and GPU clusters. To improve the solver performance on GPUs, we use a novel decomposition to run multiple MPI processes on each GPU, reducing both the solver's computational and storage costs and potentially improving the convergence rate. This allowed us to obtain competitive or faster performance using GPUs compared to using CPUs alone. We demonstrate the performance of FROSch on the Summit supercomputer with NVIDIA V100 GPUs, where we used NVIDIA Multi-Process Service (MPS) to implement our decomposition strategy. The solver has a wide variety of algorithmic and implementation choices, which poses both opportunities and challenges for its GPU implementation. We conduct a thorough experimental study with different solver options including the exact or inexact solution of the local overlapping subdomain problems on a GPU. We also discuss the effect of using the iterative variant of the incomplete LU factorization and sparse-triangular solve as the approximate local solver, and using lower precision for computing the whole FROSch preconditioner. Overall, the solve time was reduced by a factor of about 2× using GPUs, while the GPU acceleration of the numerical setup time depends on the solver options and the local matrix sizes.

TYPE Conference Paper YEAR 2023

DOI OSTI Scopus

High-Performance GMRES Multi-Precision Benchmark

Yamazaki, Ichitaro; Loe, Jennifer A.; Glusa, Christian; Rajamanickam, Sivasankaran; Luszczek, Piotr; Dongarra, Jack

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Low-synch Gram–Schmidt with delayed reorthogonalization for Krylov solvers

Parallel Computing

Bielich, Daniel; Langou, Julien; Thomas, Stephen; Swirydowicz, Kasia; Yamazaki, Ichitaro; Boman, Erik G.

The parallel strong-scaling of iterative methods is often determined by the number of global reductions at each iteration. Low-synch Gram–Schmidt algorithms are applied here to the Arnoldi algorithm to reduce the number of global reductions and therefore to improve the parallel strong-scaling of iterative solvers for nonsymmetric matrices such as the GMRES and the Krylov–Schur iterative methods. In the Arnoldi context, the QR factorization is “left-looking” and processes one column at a time. Among the methods for generating an orthogonal basis for the Arnoldi algorithm, the classical Gram–Schmidt algorithm with reorthogonalization (CGS2) requires three global reductions per iteration. A new variant of CGS2 that requires only one reduction per iteration is presented and applied to the Arnoldi algorithm. Delayed CGS2 (DCGS2) employs the minimum number of global reductions per iteration (one) for a one-column-at-a-time algorithm. The main idea behind the new algorithm is to group global reductions by rearranging the order of operations. DCGS2 must be carefully integrated into an Arnoldi expansion or a GMRES solver. Numerical stability experiments assess robustness for Krylov–Schur eigenvalue computations. Performance experiments on the ORNL Summit supercomputer then establish the superiority of DCGS2 over CGS2.
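As a rough illustration of the three-reductions-per-column count quoted for CGS2 in the abstract above, the following is a minimal serial sketch in pure Python. It is not the paper's code and it does not implement DCGS2. The function names `dot` and `cgs2` and the `reductions` counter are hypothetical, and each batch of inner products is counted as the one global reduction it would become in a distributed-memory run.

```python
import math


def dot(u, v):
    """Inner product; in a distributed run this would need a global reduction."""
    return sum(x * y for x, y in zip(u, v))


def cgs2(columns):
    """Orthonormalize `columns` with classical Gram-Schmidt applied twice.

    Returns the orthonormal vectors and the number of (would-be) global
    reductions, counting each batch of inner products as one reduction.
    """
    q = []
    reductions = 0
    for a in columns:
        w = list(a)
        for _ in range(2):  # projection pass, then reorthogonalization pass
            if q:
                # Classical GS: all coefficients come from the same vector w,
                # so they can be batched into a single reduction.
                coeffs = [dot(qi, w) for qi in q]
                reductions += 1
                for c, qi in zip(coeffs, q):
                    w = [wj - c * qj for wj, qj in zip(w, qi)]
        norm = math.sqrt(dot(w, w))  # normalization needs one more reduction
        reductions += 1
        q.append([wj / norm for wj in w])
    return q, reductions
```

For every column after the first, the sketch performs two projection reductions and one norm reduction, which is the count that the abstract says DCGS2 reduces to one by regrouping operations.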

TYPE Journal Article YEAR 2022

DOI OSTI Scopus

Mixed Precision Strategies for GMRES in Trilinos

Loe, Jennifer A.; Glusa, Christian; Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Presentation YEAR 2022

OSTI

Mixed Precision s-step Conjugate Gradient with Residual Replacements on GPUs

Yamazaki, Ichitaro; Carson, Erin; Kelley, Brian M.

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Polynomial Preconditioning GMRES with Mixed Precisions

Loe, Jennifer A.; Glusa, Christian; Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran; Morgan, Ronald

Abstract not provided.

TYPE Conference Presentation YEAR 2022

OSTI

Polynomial Preconditioning GMRES with Mixed Precisions

Loe, Jennifer A.; Glusa, Christian; Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Trilinos for Exascale

Boman, Erik G.; Rajamanickam, Sivasankaran; Teranishi, Keita; Yamazaki, Ichitaro

Abstract not provided.

TYPE Presentation YEAR 2022

OSTI

ExaWind: Exascale Predictive Wind Plant Flow Physics Modeling

Sprague, Michael A.; Brazell, Michael; Brunhart-Lupo, Nicholas; Mullowney, Paul; Rood, Jon; Sharma, Ashesh; Thomas, Stephen; Vijayakumar, Ganesh; Crozier, Paul; Berger-Vergiat, Luc; Cheung, Lawrence; Develder, Nathaniel; Hu, Jonathan J.; Knaus, Robert C.; Lee, Dong H.; Matula, Neil; Overfelt, James R.; Sakievich, Philip; Smith, Timothy A.; Williams, Alan B.; Yamazaki, Ichitaro; Turner, John A.; Prokopenko, Andrey; Wilson, Robert; Moser, Robert; Melvin, Jeremy

Abstract not provided.

TYPE Conference Poster YEAR 2022

DOI OSTI

Accelerating FROSch preconditioner using GPUs

Yamazaki, Ichitaro; Heinlein, Alexander; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Mixed Precision Strategies for GMRES

Loe, Jennifer A.; Glusa, Christian; Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Preparing Trilinos Solvers for Exascale Wind Farm Simulations

Hu, Jonathan J.; Berger-Vergiat, Luc; Yamazaki, Ichitaro

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

High-Performance GMRES Multi-Precision Benchmark: Design, Performance, and Challenges

Proceedings of PMBS 2022: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2022: The International Conference for High Performance Computing, Networking, Storage and Analysis

Yamazaki, Ichitaro; Glusa, Christian; Loe, Jennifer A.; Luszczek, Piotr; Rajamanickam, Sivasankaran; Dongarra, Jack

We propose a new benchmark for high-performance (HP) computers. Similar to High Performance Conjugate Gradient (HPCG), the new benchmark is designed to rank computers based on how fast they can solve a sparse linear system of equations, exhibiting computational and communication requirements typical in many scientific applications. The main novelty of the new benchmark is that it is now based on the Generalized Minimum Residual method (GMRES) (combined with a Geometric Multi-Grid preconditioner and Gauss-Seidel smoother) and provides the flexibility to utilize lower-precision arithmetic. This is motivated by new hardware architectures that deliver lower-precision arithmetic at higher performance. There are other machines that do not follow this trend. However, using lower-precision arithmetic reduces the required amount of data transfer, which alone could improve solver performance. Considering these trends, an HP benchmark that allows the use of different precisions for solving important scientific problems will be valuable for many different disciplines, and we also hope to promote the design of future HP computers that can utilize mixed-precision arithmetic for achieving high application performance. We present our initial design of the new benchmark, its reference implementation, and the performance of the reference mixed (double and single) precision Geometric Multi-Grid solvers on current top-ranked architectures. We also discuss challenges of designing such a benchmark, along with our preliminary numerical results using 16-bit numerical values (half and bfloat precisions) for solving a sparse linear system of equations.

TYPE Conference Paper YEAR 2022

DOI OSTI Scopus

Mixed Precision s-step Conjugate Gradient with Residual Replacement on GPUs

Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Yamazaki, Ichitaro; Carson, Erin; Kelley, Brian M.

The s-step Conjugate Gradient (CG) algorithm has the potential to reduce the communication cost of standard CG by a factor of s. However, though mathematically equivalent, s-step CG may be numerically less stable compared to standard CG in finite precision, exhibiting slower convergence and decreased attainable accuracy. This limits the use of s-step CG in practice. To improve the numerical behavior of s-step CG and overcome this potential limitation, we incorporate two techniques. First, we improve convergence behavior through the use of higher precision at critical parts of the s-step iteration; second, we integrate a residual replacement strategy into the resulting mixed precision s-step CG to improve attainable accuracy. Our experimental results on the Summit supercomputer demonstrate that when the higher precision is implemented in hardware, these techniques have virtually no overhead on the iteration time while improving both the convergence rate and the attainable accuracy of s-step CG. Even when the higher precision is implemented in software, these techniques may still reduce the time-to-solution (speedups of up to 1.8× in our experiments), especially when s-step CG suffers from numerical instability with a small step size and the latency cost becomes a significant part of its iteration time.
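To make the idea of residual replacement in the abstract above concrete, here is a minimal pure-Python sketch under simplifying assumptions. It uses standard CG rather than s-step CG, a single precision throughout, and a naive fixed-interval rule: every `replace_every` iterations the recursively updated residual is swapped for the true residual b - Ax. The paper's actual replacement criterion is not given in the abstract, so the rule, the function names, and the parameter `replace_every` are all hypothetical.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [dot(row, x) for row in A]


def cg_residual_replacement(A, b, tol=1e-12, max_iter=200, replace_every=10):
    """Standard CG for SPD A, with periodic replacement of the residual.

    Assumption: a fixed replacement interval, chosen only for illustration.
    """
    x = [0.0] * len(b)
    r = list(b)  # residual for the zero initial guess
    p = list(r)
    rr = dot(r, r)
    for k in range(1, max_iter + 1):
        if rr ** 0.5 <= tol:
            break
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        if k % replace_every == 0:
            # Residual replacement: recompute the true residual b - A x.
            r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        else:
            # Usual recursive update of the residual.
            r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x
```

Recomputing the true residual costs an extra matrix-vector product, which is why a practical strategy would replace only when needed rather than on a fixed schedule.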

TYPE Conference Paper YEAR 2022

DOI OSTI Scopus

Sake December 2021 ECP ST Project Review

Rajamanickam, Sivasankaran; Berger-Vergiat, Luc; Boman, Erik G.; Yamazaki, Ichitaro

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Mixed precision s–step Lanczos and conjugate gradient algorithms

Numerical Linear Algebra with Applications

Carson, Erin; Gergelits, Tomas; Yamazaki, Ichitaro

Compared to the classical Lanczos algorithm, the s-step Lanczos variant has the potential to improve performance by asymptotically decreasing the synchronization cost per iteration. However, this comes at a price; despite being mathematically equivalent, the s-step variant may behave quite differently in finite precision, potentially exhibiting greater loss of accuracy and slower convergence relative to the classical algorithm. It has previously been shown that the errors in the s-step version follow the same structure as the errors in the classical algorithm, but are amplified by a factor depending on the square of the condition number of the O(s)-dimensional Krylov bases computed in each outer loop. As the condition number of these s-step bases grows (in some cases very quickly) with s, this limits the s values that can be chosen and thus can limit the attainable performance. In this work, we show that if a select few computations in s-step Lanczos are performed in double the working precision, the error terms then depend only linearly on the conditioning of the s-step bases. This has the potential for drastically improving the numerical behavior of the algorithm with little impact on per-iteration performance. Our numerical experiments demonstrate the improved numerical behavior possible with the mixed precision approach, and also show that this improved behavior extends to mixed precision s-step CG. Here, we present preliminary performance results on NVIDIA V100 GPUs that show that the overhead of extra precision is minimal if one uses precisions implemented in hardware.

TYPE Journal Article YEAR 2021

DOI OSTI

Harnessing exascale for whole wind farm high-fidelity simulations to improve wind farm efficiency

Crozier, Paul; Adcock, Christiane; Ananthan, Shreyas; Berger-Vergiat, Luc; Brazell, Michael; Brunhart-Lupo, Nicholas; Henry De Frahan, Marc T.; Hu, Jonathan J.; Knaus, Robert C.; Melvin, Jeremy; Moser, Bob; Mullowney, Paul; Rood, Jon; Sharma, Ashesh; Thomas, Stephen; Vijayakumar, Ganesh; Williams, Alan B.; Wilson, Robert; Yamazaki, Ichitaro; Sprague, Michael A.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

Trilinos User Group Meeting Solvers Update

Rajamanickam, Sivasankaran; Heinlein, Alexander; Thornquist, Heidi K.; Yamazaki, Ichitaro

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

FROSch Preconditioners for Land Ice Simulations of Greenland and Antarctica

Heinlein, Alexander; Perego, Mauro; Rajamanickam, Sivasankaran; Yamazaki, Ichitaro

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

Demonstrate moving-grid multi-turbine simulations primarily run on GPUs and propose improvements for successful KPP-2

Adcock, Christiane; Ananthan, Shreyas; Berger-Vergiat, Luc; Brazell, Michael; Brunhart-Lupo, Nicholas; Hu, Jonathan J.; Knaus, Robert C.; Melvin, Jeremy; Moser, Bob; Mullowney, Paul; Rood, Jon; Sharma, Ashesh; Thomas, Stephen; Vijayakumar, Ganesh; Williams, Alan B.; Wilson, Robert; Yamazaki, Ichitaro; Sprague, Michael

The goal of the ExaWind project is to enable predictive simulations of wind farms composed of many megawatt-scale turbines situated in complex terrain. Predictive simulations will require computational fluid dynamics (CFD) simulations in which the mesh resolves the geometry of the turbines, captures the thin boundary layers, and follows the rotation and large deflections of the blades. Whereas such simulations for a single turbine are arguably petascale class, multi-turbine wind farm simulations will require exascale-class resources.

TYPE Other Report YEAR 2021

DOI OSTI

FY2021 Q4: Demonstrate moving-grid multi-turbine simulations primarily run on GPUs and propose improvements for successful KPP-2 [Slides]

Adcock, Christiane; Ananthan, Shreyas; Berger-Vergiat, Luc; Brazell, Michael; Brunhart-Lupo, Nicholas; Hu, Jonathan J.; Knaus, Robert C.; Melvin, Jeremy; Moser, Bob; Mullowney, Paul; Rood, Jon; Sharma, Ashesh; Thomas, Stephen; Vijayakumar, Ganesh; Williams, Alan B.; Wilson, Robert; Yamazaki, Ichitaro; Sprague, Michael

Isocontours of Q-criterion with velocity visualized in the wake for two NREL 5-MW turbines operating under a uniform-inflow wind speed of 8 m/s. Simulation performed with the hybrid Nalu-Wind/AMR-Wind solver.

TYPE Other Report YEAR 2021

DOI OSTI
