Batched sparse linear algebra operations in general, and solvers in particular, have become a major algorithmic development activity and the foremost performance engineering effort for numerical software libraries targeting modern hardware with accelerators such as GPUs. Many applications, ECP and non-ECP alike, require the simultaneous solution of many small linear systems of equations that are structurally sparse in one form or another. To move toward high hardware utilization, these applications need interface designs that are both functionally efficient and performance portable, and that give full access to batched sparse solvers running on the hardware accelerators that have been prevalent across DOE supercomputing sites since the inception of ECP. To this end, we present a summary of recent advances in the interface designs used by HPC software libraries supporting batched sparse linear algebra and in the development of batched sparse kernels for solvers and preconditioners. We also address interoperability opportunities that keep the corresponding software portable across the major hardware accelerators from AMD, Intel, and NVIDIA, while maintaining disclosure levels conforming to the active NDA agreements. The presented interface specifications cover batched band, sparse iterative, and sparse direct solvers, together with the accompanying functionality that application codes already require or that we anticipate will be needed in the near future. This report summarizes progress in Kokkos Kernels and the xSDK libraries MAGMA, Ginkgo, hypre, PETSc, and SuperLU.
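As a purely illustrative sketch of what such a batched interface can look like (the names below are hypothetical and do not correspond to the actual APIs of Kokkos Kernels, MAGMA, Ginkgo, hypre, PETSc, or SuperLU), a batch of small sparse systems sharing one sparsity pattern might be described and solved in a single call:

```cpp
// Hypothetical batched sparse solver interface, for illustration only.
#include <cstddef>
#include <vector>

// One batch of N small sparse systems sharing a CSR sparsity pattern,
// with the values of all systems stacked contiguously.
struct BatchedCsrMatrix {
  std::size_t num_systems;     // N systems in the batch
  std::size_t num_rows;        // rows per system
  std::vector<int> row_ptr;    // shared CSR row pointers (num_rows + 1)
  std::vector<int> col_idx;    // shared CSR column indices (nnz)
  std::vector<double> values;  // num_systems * nnz values
};

struct BatchedSolveOptions {
  int max_iters = 100;
  double tolerance = 1e-10;
};

// Solve A_i x_i = b_i for i = 0..N-1 in one call, so the backend can map
// the entire batch onto a single kernel launch on the accelerator.
void batched_gmres_solve(const BatchedCsrMatrix& A,
                         const std::vector<double>& b,  // N * num_rows
                         std::vector<double>& x,        // N * num_rows
                         const BatchedSolveOptions& opts);
```

Grouping all systems into one call is what allows the backend to saturate the accelerator with a single batched kernel launch, which is the utilization goal described above.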
Derivative computation is a key component of optimization, sensitivity analysis, uncertainty quantification, and the solution of nonlinear problems. Automatic differentiation (AD) is a powerful technique for evaluating such derivatives and has in recent years been integrated into programming environments such as JAX, PyTorch, and TensorFlow to support the derivative computations needed for training machine learning models, facilitating widespread use of these technologies. The C++ language has become the de facto standard for scientific computing due to numerous factors, yet language complexity has made the widespread adoption of AD technologies for C++ difficult, hampering the incorporation of powerful differentiable programming approaches into C++ scientific simulations. This is exacerbated by the increasing emergence of architectures, such as GPUs, that have limited memory capacity and require massive thread-level concurrency. C++ AD tools must use these environments effectively to bring novel scientific simulations to next-generation DOE experimental and observational facilities. In this project, we investigated source transformation-based automatic differentiation using the LLVM compiler infrastructure to automatically generate portable and efficient gradient computations for Kokkos-based code. We demonstrated the feasibility of the proposed strategy by using a prototype LLVM-based source transformation tool to generate gradients of simple functions composed of sequences of simple Kokkos parallel regions. Speedups of up to 500x over Sacado were observed on an NVIDIA V100 GPU.
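To make the idea concrete, the adjoint of a simple Kokkos parallel region can itself be expressed as Kokkos parallel code. The sketch below is hand-written for exposition (it is not output of the prototype tool) and assumes the toy function f(x) = sum_i x_i^2:

```cpp
#include <Kokkos_Core.hpp>

// Primal: f(x) = sum_i x_i^2, written as a Kokkos parallel reduction.
double f(const Kokkos::View<double*>& x) {
  double s = 0.0;
  Kokkos::parallel_reduce(
      "f", x.extent(0),
      KOKKOS_LAMBDA(const int i, double& acc) { acc += x(i) * x(i); }, s);
  return s;
}

// Adjoint: df/dx_i = 2 * x_i, written the way a source-transformation
// tool might emit it: a second parallel region that reuses the primal
// input and scales by the incoming adjoint seed.
void grad_f(const Kokkos::View<double*>& x, double seed,
            const Kokkos::View<double*>& dx) {
  Kokkos::parallel_for(
      "grad_f", x.extent(0),
      KOKKOS_LAMBDA(const int i) { dx(i) = seed * 2.0 * x(i); });
}

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    Kokkos::View<double*> x("x", 1000), dx("dx", 1000);
    Kokkos::deep_copy(x, 1.5);
    double y = f(x);     // primal value
    grad_f(x, 1.0, dx);  // adjoint pass with seed dy = 1
    (void)y;
  }
  Kokkos::finalize();
  return 0;
}
```

Because both the primal and adjoint passes are ordinary Kokkos parallel regions, the generated derivative code inherits the same performance portability as the original function.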
The properties of electrons in matter are of fundamental importance. They give rise to virtually all material properties and determine the physics at play in objects ranging from semiconductor devices to the interiors of giant gas planets. Modeling and simulation of such diverse applications rely primarily on density functional theory (DFT), which has become the principal method for predicting the electronic structure of matter. While DFT calculations have proven to be very useful, their computational scaling limits them to small systems. We have developed a machine learning framework for predicting the electronic structure at any length scale. It delivers speedups of up to three orders of magnitude on systems where DFT is tractable and, more importantly, enables predictions on scales where DFT calculations are infeasible. Our work demonstrates how machine learning circumvents a long-standing computational bottleneck and advances materials science to frontiers intractable with any current solutions.
Solving large numbers of small linear systems is increasingly becoming a bottleneck in computational science applications. While dense linear solvers for such systems have been studied before, batched sparse linear solvers are only starting to emerge. In this paper, we discuss algorithms for solving batched sparse linear systems and their implementation in the Kokkos Kernels library. The new algorithms are performance portable and map well to the hierarchical parallelism available in modern accelerator architectures. The sparse matrix-vector product (SPMV) kernel is the main performance bottleneck of the Krylov solvers we implement in this work, so the implementation of the batched SPMV and its performance are discussed thoroughly in this paper. The implemented kernels are tested on different Central Processing Unit (CPU) and Graphics Processing Unit (GPU) architectures. We also develop batched Conjugate Gradient (CG) and batched Generalized Minimum Residual (GMRES) solvers on top of the batched SPMV. Our proposed solver solves 20,000 sparse linear systems on V100 GPUs with a mean speedup of 76x over a parallel sparse solver applied to a single block-diagonal system assembled from all the small systems, and 924x over solving the small systems one at a time. We see a mean speedup of 0.51x compared to the dense batched solver in cuSOLVER on the V100, while using far less memory. A thorough performance evaluation on three different architectures and an analysis of the results are presented.
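To illustrate how a batched SPMV can be mapped onto hierarchical parallelism, the following is a minimal sketch (not the actual Kokkos Kernels kernel) for the common case where all systems in the batch share one CSR sparsity pattern and only the values differ: one team per system, threads over rows, vector lanes over the nonzeros of a row.

```cpp
#include <Kokkos_Core.hpp>

// Illustrative batched SPMV: y_b = A_b * x_b for each system b in the batch.
void batched_spmv(int num_systems, int num_rows,
                  Kokkos::View<const int*> row_ptr,      // shared CSR (num_rows + 1)
                  Kokkos::View<const int*> col_idx,      // shared CSR (nnz)
                  Kokkos::View<const double**> values,   // (system, nnz)
                  Kokkos::View<const double**> x,        // (system, row)
                  Kokkos::View<double**> y) {            // (system, row)
  using team_policy = Kokkos::TeamPolicy<>;
  using member_type = team_policy::member_type;

  Kokkos::parallel_for(
      "batched_spmv", team_policy(num_systems, Kokkos::AUTO),
      KOKKOS_LAMBDA(const member_type& team) {
        const int b = team.league_rank();  // which system in the batch
        // Threads of the team split the rows of system b.
        Kokkos::parallel_for(
            Kokkos::TeamThreadRange(team, num_rows), [&](const int row) {
              double sum = 0.0;
              // Vector lanes split the nonzeros of this row.
              Kokkos::parallel_reduce(
                  Kokkos::ThreadVectorRange(team, row_ptr(row), row_ptr(row + 1)),
                  [&](const int k, double& acc) {
                    acc += values(b, k) * x(b, col_idx(k));
                  },
                  sum);
              y(b, row) = sum;
            });
      });
}
```

Because the sparsity pattern is shared across the batch, the row pointers and column indices are stored once, which is part of why the batched sparse approach uses far less memory than a dense batched solver.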
The generalized Dryja-Smith-Widlund (GDSW) preconditioner is a two-level overlapping Schwarz domain decomposition (DD) preconditioner that couples a classical one-level overlapping Schwarz preconditioner with an energy-minimizing coarse space. When used to accelerate the convergence of Krylov subspace iterative methods, the GDSW preconditioner provides robustness and scalability for the solution of sparse linear systems arising from the discretization of a wide range of partial differential equations. In this paper, we present FROSch (Fast and Robust Schwarz), a domain decomposition solver package that implements GDSW-type preconditioners for both CPU and GPU clusters. To improve solver performance on GPUs, we use a novel decomposition that runs multiple MPI processes on each GPU, reducing both the solver's computational and storage costs and potentially improving the convergence rate. This allowed us to obtain competitive or faster performance using GPUs compared to using CPUs alone. We demonstrate the performance of FROSch on the Summit supercomputer with NVIDIA V100 GPUs, where we used the NVIDIA Multi-Process Service (MPS) to implement our decomposition strategy. The solver has a wide variety of algorithmic and implementation choices, which poses both opportunities and challenges for its GPU implementation. We conduct a thorough experimental study with different solver options, including the exact or inexact solution of the local overlapping subdomain problems on a GPU. We also discuss the effect of using the iterative variant of incomplete LU factorization and sparse triangular solve as the approximate local solver, and of using lower precision to compute the whole FROSch preconditioner. Overall, the solve time was reduced by factors of about 2× using GPUs, while the GPU acceleration of the numerical setup time depends on the solver options and the local matrix sizes.
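For reference, the two-level additive Schwarz operator underlying GDSW-type preconditioners can be written in the standard form (notation here is generic rather than FROSch-specific):

$$
M_{\mathrm{GDSW}}^{-1} \;=\; \Phi \left( \Phi^{T} K \Phi \right)^{-1} \Phi^{T} \;+\; \sum_{i=1}^{N} R_i^{T} \left( R_i K R_i^{T} \right)^{-1} R_i ,
$$

where $K$ is the system matrix, $R_i$ restricts to the $i$-th overlapping subdomain, and the columns of $\Phi$ span the energy-minimizing coarse space. The "exact" and "inexact" subdomain solver options studied above correspond to applying $(R_i K R_i^{T})^{-1}$ either with a direct factorization or with an approximate local solver such as incomplete LU.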
The long-standing problem of predicting the electronic structure of matter on ultra-large scales (beyond 100,000 atoms) is solved with machine learning.
Artificial intelligence and machine learning (AI/ML) are becoming important tools for scientific modeling and simulation, as they already are in fields such as image analysis and natural language processing. ML techniques can leverage the computing power available in modern systems and reduce the human effort needed to configure experiments, interpret and visualize results, draw conclusions from huge quantities of raw data, and build surrogates for physics-based models. Domain scientists in fields like fluid dynamics, microelectronics, and chemistry can automate many of their most difficult and repetitive tasks or improve design times by using faster ML surrogates. However, modern ML and traditional scientific high-performance computing (HPC) tend to use completely different software ecosystems. While ML frameworks like PyTorch and TensorFlow provide Python APIs, most HPC applications and libraries are written in C++. Direct interoperability between the two languages is possible but tedious and error-prone. In this work, we show that a compiler-based approach can bridge the gap between ML frameworks and scientific software with less developer effort and better efficiency. We use the MLIR (multi-level intermediate representation) ecosystem to compile a pre-trained convolutional neural network (CNN) in PyTorch to freestanding C++ source code in the Kokkos programming model. Kokkos is a programming model widely used in HPC to write portable, shared-memory parallel code that can natively target a variety of CPU and GPU architectures. Our compiler-generated source code can be directly integrated into any Kokkos-based application with no dependencies on Python or cross-language interfaces.
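To indicate the target form of such generated code (hand-written here for exposition; not actual output of the MLIR pipeline), a single fully connected layer with a ReLU activation could be lowered to freestanding Kokkos C++ roughly as follows:

```cpp
#include <Kokkos_Core.hpp>

// Illustrative lowering of y = relu(W x + b) to a Kokkos parallel region.
// Weight and bias values would be baked in or loaded by the generated code.
void dense_relu(Kokkos::View<const float**> W,  // (out, in) weights
                Kokkos::View<const float*> b,   // (out) bias
                Kokkos::View<const float*> x,   // (in) input activations
                Kokkos::View<float*> y) {       // (out) output activations
  const int out = W.extent_int(0);
  const int in  = W.extent_int(1);
  Kokkos::parallel_for(
      "dense_relu", out, KOKKOS_LAMBDA(const int i) {
        float acc = b(i);
        for (int j = 0; j < in; ++j) acc += W(i, j) * x(j);
        y(i) = acc > 0.0f ? acc : 0.0f;  // ReLU
      });
}
```

Code of this shape depends only on Kokkos, so an HPC application can call it directly without embedding a Python interpreter or crossing a language boundary at runtime.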
The focus of this project is to accelerate and transform the workflow of multiscale materials modeling by developing an integrated toolchain that seamlessly combines DFT, SNAP, and LAMMPS (shown in Figure 1-1) with a machine-learning (ML) model that more efficiently extracts information from a smaller set of first-principles calculations. Our ML model enables us to accelerate first-principles data generation by interpolating existing high-fidelity data, and to extend the simulation scale by extrapolating high-fidelity data (10^2 atoms) to the mesoscale (10^4 atoms). It encodes the underlying physics of atomic interactions on the microscopic scale by adapting a variety of ML techniques, such as deep neural networks (DNNs) and graph neural networks (GNNs). We developed a new surrogate model for density functional theory using deep neural networks. The developed ML surrogate is demonstrated in a workflow that generates accurate band energies, total energies, and densities for the 298 K and 933 K aluminum systems. Furthermore, the models can be used to predict the quantities of interest for systems with more atoms than those in the training data set. We have demonstrated that the ML model can compute the quantities of interest for systems with 100,000 Al atoms. For a 2,000-atom Al system, the new surrogate model is as accurate as DFT but three orders of magnitude faster. We also explored optimal experimental design techniques to choose the training data and novel graph neural networks to train on smaller data sets. These are promising methods that warrant further exploration.
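As orientation for the quantities mentioned above (a textbook relation stated here for context, not a formula taken from the report), the band energy is obtained from the electronic density of states $D(\epsilon)$ via

$$
E_{\mathrm{band}} \;=\; \int_{-\infty}^{\infty} \epsilon \, D(\epsilon)\, f_{\beta}(\epsilon - \mu)\, d\epsilon ,
$$

where $f_{\beta}$ is the Fermi-Dirac distribution at inverse temperature $\beta = 1/(k_B T)$ and $\mu$ is the chemical potential. A surrogate that accurately reproduces the relevant electronic-structure quantities at 298 K and 933 K therefore yields the corresponding band energies at those temperatures through integrals of this form.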