Polynomial Preconditioning GMRES with Mixed Precisions
Abstract not provided.
Anomalous behavior is ubiquitous in subsurface solute transport due to the presence of high degrees of heterogeneity at different scales in the media. Although fractional models have been extensively used to describe anomalous transport in various subsurface applications, their application is hindered by computational challenges. Simpler nonlocal models characterized by integrable kernels and finite interaction length represent a computationally feasible alternative to fractional models; yet, the informed choice of their kernel functions remains an open problem. We propose a general data-driven framework for the discovery of optimal kernels on the basis of very small and sparse data sets in the context of anomalous subsurface transport. Using spatially sparse breakthrough curves recovered from fine-scale particle-density simulations, we learn the best coarse-scale nonlocal model via a nonlocal operator regression technique. Predictions of the breakthrough curves obtained with the optimal nonlocal model show good agreement with fine-scale simulation results even at locations and time intervals different from those used to train the kernel, confirming the excellent generalization properties of the proposed algorithm. A comparison with trained classical models and with black-box deep neural networks confirms the superior predictive capability of the proposed model.
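For orientation, a common form of a nonlocal diffusion operator with an integrable kernel and finite interaction length (horizon) is sketched below; the notation is generic and only meant to fix ideas, and the specific kernel learned by the regression is not reproduced here.

\[
\mathcal{L}_\delta u(x) \;=\; \int_{B_\delta(x)} K(x,y)\,\bigl(u(y)-u(x)\bigr)\,dy,
\]

where $B_\delta(x)$ is the ball of radius $\delta$ (the horizon) centered at $x$. In this reading, the data-driven step selects the kernel $K$ so that the resulting coarse-scale nonlocal model best reproduces the fine-scale breakthrough curves.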
Proceedings of PMBS 2022: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2022: The International Conference for High Performance Computing, Networking, Storage and Analysis
We propose a new benchmark for high-performance (HP) computers. Like High Performance Conjugate Gradient (HPCG), the new benchmark is designed to rank computers based on how fast they can solve a sparse linear system of equations, exhibiting computational and communication requirements typical of many scientific applications. The main novelty of the new benchmark is that it is based on the Generalized Minimal Residual method (GMRES), combined with a Geometric Multi-Grid preconditioner and a Gauss-Seidel smoother, and provides the flexibility to utilize lower-precision arithmetic. This is motivated by new hardware architectures that deliver lower-precision arithmetic at higher performance. Even on machines that do not follow this trend, using lower-precision arithmetic reduces the required amount of data transfer, which alone can improve solver performance. Considering these trends, an HP benchmark that allows the use of different precisions for solving important scientific problems will be valuable for many different disciplines, and we also hope to promote the design of future HP computers that can utilize mixed-precision arithmetic to achieve high application performance. We present our initial design of the new benchmark, its reference implementation, and the performance of the reference mixed-precision (double and single) Geometric Multi-Grid solvers on current top-ranked architectures. We also discuss the challenges of designing such a benchmark, along with our preliminary numerical results using 16-bit numerical values (half and bfloat16 precisions) for solving a sparse linear system of equations.
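As a rough illustration of the kind of mixed-precision strategy the benchmark is meant to exercise, the sketch below applies a preconditioner held in single precision inside a double-precision GMRES iteration. This is a minimal SciPy sketch with a hypothetical 1D Poisson test matrix and a Jacobi smoother standing in for the benchmark's actual multigrid/Gauss-Seidel implementation.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Hypothetical test problem (1D Poisson), standing in for the benchmark's sparse system.
n = 1000
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Preconditioner data kept and applied entirely in single precision.
A32 = A.astype(np.float32)
diag32 = A32.diagonal()

def low_precision_smoother(r, sweeps=3):
    """A few Jacobi sweeps on A z = r, carried out in float32 (stand-in for a multigrid cycle)."""
    r32 = r.astype(np.float32)
    z = r32 / diag32
    for _ in range(sweeps):
        z = z + (r32 - A32 @ z) / diag32
    return z.astype(np.float64)          # hand the correction back in double precision

M = spla.LinearOperator(A.shape, matvec=low_precision_smoother)

# Outer GMRES iteration and residuals stay in double precision.
x, info = spla.gmres(A, b, M=M, restart=30)
```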
Computer Methods in Applied Mechanics and Engineering
We propose a domain decomposition method for the efficient simulation of nonlocal problems. Our approach is based on a multi-domain formulation of a nonlocal diffusion problem where the subdomains share “nonlocal” interfaces of the size of the nonlocal horizon. This system of nonlocal equations is first rewritten in terms of minimization of a nonlocal energy, then discretized with a meshfree approximation, and finally solved via a Lagrange multiplier approach in a way that resembles the finite element tearing and interconnecting (FETI) method. Specifically, we propose a distributed projected gradient algorithm for the solution of the Lagrange multiplier system, whose unknowns determine the nonlocal interface conditions between subdomains. Several two-dimensional numerical tests on problems as large as 191 million unknowns illustrate the strong and weak scalability of our algorithm, which outperforms the standard approach to the distributed numerical solution of the problem. Finally, this work is the first rigorous numerical study in a two-dimensional multi-domain setting for nonlocal operators with finite horizon and, as such, is a fundamental step towards increasing the use of nonlocal models in large-scale simulations.
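The distributed solver itself is not reproduced here, but the kind of iteration it builds on, a projected gradient method for a constrained quadratic problem in the Lagrange multipliers, can be sketched as follows. The constraint set, objective, and names below are illustrative only.

```python
import numpy as np

def projected_gradient(H, g, lam0, step, project, tol=1e-8, maxit=1000):
    """Minimize 0.5*lam' H lam - g' lam over a convex set via projected gradient.
    `project` maps an iterate back onto the feasible set (supplied by the caller)."""
    lam = project(lam0)
    for _ in range(maxit):
        grad = H @ lam - g                    # gradient of the quadratic objective
        lam_new = project(lam - step * grad)  # gradient step followed by projection
        if np.linalg.norm(lam_new - lam) <= tol * max(1.0, np.linalg.norm(lam)):
            return lam_new
        lam = lam_new
    return lam

# Example use with a nonnegativity constraint on the multipliers.
H = np.array([[2.0, 0.5], [0.5, 1.0]])
g = np.array([1.0, -1.0])
lam = projected_gradient(H, g, np.zeros(2), step=0.4,
                         project=lambda v: np.maximum(v, 0.0))
```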
Parallel Computing
Graph partitioning has been an important tool for partitioning work among processors to minimize communication cost and balance the workload. As accelerator-based supercomputers emerge as the standard, graph partitioning becomes even more important because applications are rapidly moving to these architectures. However, no distributed-memory-parallel, multi-GPU graph partitioner has been available for applications. We developed a spectral graph partitioner, Sphynx, using the portable, accelerator-friendly stack of the Trilinos framework. Sphynx allows the use of different preconditioners and exploits their unique advantages. We use Sphynx to systematically evaluate the algorithmic choices in spectral partitioning with a focus on GPU performance. We perform these evaluations on two distinct classes of graphs: regular (such as meshes and matrices from finite element methods) and irregular (such as social networks and web graphs), and show that different settings and preconditioners are needed for these graph classes. Experimental results on the Summit supercomputer show that Sphynx is the fastest alternative on irregular graphs in an application-friendly setting and obtains partitioning quality close to ParMETIS on regular graphs. When compared to nvGRAPH on a single GPU, Sphynx is faster and obtains better balance and better-quality partitions. Sphynx provides a good and robust partitioning method across a wide range of graphs for applications looking for a GPU-based partitioner.
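The basic building block of spectral partitioning, independent of Sphynx's Trilinos/Kokkos implementation and its preconditioned eigensolvers, is computing a few eigenvectors of the graph Laplacian and splitting vertices by the Fiedler vector. The SciPy sketch below shows that idea on a toy graph; it is a schematic, not the Sphynx algorithm.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def spectral_bisection(adj):
    """Split a graph (symmetric sparse adjacency matrix) into two parts
    using the Fiedler vector of the combinatorial Laplacian L = D - A."""
    deg = np.asarray(adj.sum(axis=1)).ravel()
    L = sp.diags(deg) - adj
    # two smallest eigenpairs of L; the eigenvector of the second-smallest
    # eigenvalue is the Fiedler vector
    vals, vecs = spla.eigsh(L, k=2, which="SM")
    fiedler = vecs[:, np.argmax(vals)]
    return fiedler >= np.median(fiedler)   # boolean part label per vertex

# Example: a path graph on 6 vertices splits into {0,1,2} and {3,4,5}.
A = sp.diags([1.0, 1.0], [-1, 1], shape=(6, 6), format="csr")
print(spectral_bisection(A))
```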
2021 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2021 - In conjunction with IEEE IPDPS 2021
Support for lower-precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement, and increased computational performance. However, computational science and engineering (CSE) problems in several domains require double-precision accuracy. This conflict between hardware trends and application needs has resulted in a need for multiprecision strategies at the level of linear algebra algorithms if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of multiprecision strategies for accelerating this kernel on GPUs. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when multiprecision GMRES will be effective and to choose parameters for a multiprecision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of multiprecision approaches and show that even further improvements are possible by optimizing low-level kernels.
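One of the strategies discussed, iterative refinement with a lower-precision inner solver, can be sketched in a few lines of SciPy. This is a schematic NumPy version on a hypothetical model problem, not the Trilinos/Kokkos Kernels implementation studied in the paper.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Hypothetical model problem (1D Poisson) standing in for an application matrix.
n = 200
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

A32 = A.astype(np.float32)                     # lower-precision copy of the operator
x = np.zeros(n)                                # solution accumulated in double precision
for it in range(20):
    r = b - A @ x                              # residual computed in double precision
    if np.linalg.norm(r) <= 1e-10 * np.linalg.norm(b):
        break
    # inner correction solve in single precision using restarted GMRES
    d32, info = spla.gmres(A32, r.astype(np.float32), restart=30, maxiter=10)
    x = x + d32.astype(np.float64)             # apply the correction in double precision
```

The outer loop keeps the residual and the accumulated solution in double precision, so the cheaper single-precision inner solves only need to produce a rough correction each pass.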
Communications in Computational Physics
In this paper we introduce EMPIRE-PIC, a finite element method particle-in-cell (FEM-PIC) application developed at Sandia National Laboratories. The code has been developed in C++ using the Trilinos library and the Kokkos Performance Portability Framework to enable running on multiple modern compute architectures while only requiring maintenance of a single codebase. EMPIRE-PIC is capable of solving both electrostatic and electromagnetic problems in two and three dimensions to second-order accuracy in space and time. In this paper we validate the code against three benchmark problems: a simple electron orbit, an electrostatic Langmuir wave, and a transverse electromagnetic wave propagating through a plasma. We demonstrate the performance of EMPIRE-PIC on four different architectures: Intel Haswell CPUs, Intel Xeon Phi Knights Landing, ARM ThunderX2 CPUs, and NVIDIA Tesla V100 GPUs attached to IBM POWER9 processors. This analysis demonstrates scalability of the code up to more than two thousand GPUs and more than one hundred thousand CPUs.
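The first validation case, a single electron orbit, is the kind of problem a textbook Boris particle push handles. The sketch below shows that update for one particle in uniform fields; it is a generic scheme for illustration only, not EMPIRE-PIC's finite element field solve or its actual pusher.

```python
import numpy as np

def boris_push(x, v, E, B, q, m, dt):
    """Advance one particle by dt with the standard Boris scheme (uniform E and B assumed)."""
    qmdt2 = q * dt / (2.0 * m)
    v_minus = v + qmdt2 * E                   # first half of the electric field kick
    t = qmdt2 * B
    s = 2.0 * t / (1.0 + np.dot(t, t))
    v_prime = v_minus + np.cross(v_minus, t)
    v_plus = v_minus + np.cross(v_prime, s)   # rotation around the magnetic field
    v_new = v_plus + qmdt2 * E                # second half of the electric field kick
    return x + dt * v_new, v_new

# Electron gyrating in a uniform magnetic field along z (SI units).
q, m = -1.602e-19, 9.109e-31
x, v = np.zeros(3), np.array([1.0e6, 0.0, 0.0])
E, B = np.zeros(3), np.array([0.0, 0.0, 0.01])
for _ in range(1000):
    x, v = boris_push(x, v, E, B, q, m, dt=1.0e-12)
```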
SIAM Journal on Scientific Computing
The purpose of this paper is to study a Helmholtz problem with a spectral fractional Laplacian, instead of the standard Laplacian. Recently, it has been established that such a fractional Helmholtz problem better captures the underlying behavior in geophysical electromagnetics. We establish the well-posedness and regularity of this problem. We introduce a hybrid spectral-finite element approach to discretize it and show well-posedness of the discrete system. In addition, we derive a priori discretization error estimates. Finally, we introduce an efficient solver that scales as well as the best possible solver for the classical integer-order Helmholtz equation. We conclude with several illustrative examples that confirm our theoretical findings.
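For reference, the spectral fractional Laplacian in this setting is typically defined through the eigenpairs of the standard Laplacian on the domain; the notation below is generic, with boundary conditions and scaling as in the paper.

\[
(-\Delta)^s u \;=\; \sum_{k=1}^{\infty} \lambda_k^{\,s}\,(u,\varphi_k)_{L^2(\Omega)}\,\varphi_k,
\qquad 0 < s < 1,
\]

where $(\lambda_k,\varphi_k)$ are the eigenpairs of $-\Delta$ on $\Omega$ with the chosen boundary conditions. The fractional Helmholtz problem then replaces the Laplacian in the usual Helmholtz operator by this fractional power.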
SIAM Journal on Numerical Analysis
We consider the integral definition of the fractional Laplacian and analyze a linear-quadratic optimal control problem for the so-called fractional heat equation; control constraints are also considered. We derive existence and uniqueness results, first-order optimality conditions, and regularity estimates for the optimal variables. To discretize the state equation we propose a fully discrete scheme that relies on an implicit finite difference discretization in time combined with a piecewise linear finite element discretization in space. We derive stability results and a novel $L^2(0,T;L^2(\Omega))$ a priori error estimate. On the basis of the aforementioned solution technique, we propose a fully discrete scheme for our optimal control problem that discretizes the control variable with piecewise constant functions, and we derive a priori error estimates for it. We illustrate the theory with one- and two-dimensional numerical experiments.
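The integral form of the fractional Laplacian referred to here is the standard singular-integral definition; the control problem's exact functional and constraints are as in the paper and not restated.

\[
(-\Delta)^s u(x) \;=\; C(n,s)\;\mathrm{p.v.}\!\int_{\mathbb{R}^n} \frac{u(x)-u(y)}{|x-y|^{n+2s}}\,dy,
\qquad 0 < s < 1,
\]

with the usual normalization constant $C(n,s)$ and principal-value integral. The fractional heat equation replaces the Laplacian in the heat equation by this operator.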
SIAM Journal on Scientific Computing
Parallel implementations of linear iterative solvers generally alternate between phases of data exchange and phases of local computation. Increasingly large problem sizes and more heterogeneous compute architectures make load balancing and the design of low-latency network interconnects that can satisfy the communication requirements of linear solvers very challenging tasks. In particular, global communication patterns such as inner products become increasingly limiting at scale. We explore the use of asynchronous communication based on one-sided Message Passing Interface primitives in the context of domain decomposition solvers. In particular, a scalable asynchronous two-level Schwarz method is presented. We discuss practical issues encountered in the development of a scalable solver and show experimental results obtained on a state-of-the-art supercomputer that illustrate the benefits of asynchronous solvers in both load-balanced and load-imbalanced scenarios. Using the new method, we observe speedups of up to four times over its classical synchronous equivalent.
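A minimal flavor of one-sided MPI communication is sketched below using mpi4py. The paper's solver is not reproduced, and for brevity this sketch uses collective fence synchronization rather than the passive-target epochs a truly asynchronous solver would rely on; buffer sizes and names are illustrative.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank exposes a small buffer that its neighbor can write into directly.
recv = np.zeros(4)
win = MPI.Win.Create(recv, comm=comm)

boundary = np.full(4, float(rank))   # stand-in for local subdomain boundary data
neighbor = (rank + 1) % size

win.Fence()                          # open an RMA access epoch
win.Put(boundary, neighbor)          # one-sided write into the neighbor's window
win.Fence()                          # close the epoch; recv now holds the neighbor's data
print(f"rank {rank} received data from rank {(rank - 1) % size}: {recv}")
win.Free()
```

With one-sided primitives, the origin rank deposits boundary data without the target posting a matching receive, which is what allows the iterative phases of a solver to proceed without tight pairwise synchronization.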