Publications Search

This report includes a compilation of several slide presentations: 1) Interatomic Potentials for Materials Science and Beyond–Advances in Machine Learned Spectral Neighborhood Analysis Potentials (Wood); 2) Agile Materials Science and Advanced Manufacturing through AI/ML (de Oca Zapiain); 3) Machine Learning for DFT Calculations (Rajamanickam); 4) Structure-preserving ML discovery of a quantum-to-continuum codesign stack (Trask); and 5) IBM Overview of Accelerated Discovery Technology (Pitera)

More Details

TYPE Other Report YEAR 2021

DOI OSTI

Understanding the Design Space of Sparse/Dense Multiphase Dataflows for Mapping Graph Neural Networks on Spatial Accelerators

Garg, Raveesh; Qin, Eric; Martinez, Francisco M.; Guirado, Robert; Jain, Akshay; Abadal, Sergi; Abellan, Jose L.; Acacio, Manuel E.; Alarcon, Eduard; Rajamanickam, Sivasankaran; Krishna, Tushar

Graph Neural Networks (GNNs) have garnered a lot of recent interest because of their success in learning representations from graph-structured data across several critical applications in cloud and HPC. Owing to their unique compute and memory characteristics that come from an interplay between dense and sparse phases of computations, the emergence of reconfigurable dataflow (aka spatial) accelerators offers promise for acceleration by mapping optimized dataflows (i.e., computation order and parallelism) for both phases. The goal of this work is to characterize and understand the design-space of dataflow choices for running GNNs on spatial accelerators in order for the compilers to optimize the dataflow based on the workload. Specifically, we propose a taxonomy to describe all possible choices for mapping the dense and sparse phases of GNNs spatially and temporally over a spatial accelerator, capturing both the intra-phase dataflow and the inter-phase (pipelined) dataflow. Using this taxonomy, we do deep-dives into the cost and benefits of several dataflows and perform case studies on implications of hardware parameters for dataflows and value of flexibility to support pipelined execution.

More Details

TYPE Other Report YEAR 2021

DOI OSTI

Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems

Parallel Computing

Acer, Seher; Boman, Erik G.; Glusa, Christian; Rajamanickam, Sivasankaran

Graph partitioning has been an important tool to partition the work among several processors to minimize the communication cost and balance the workload. While accelerator-based supercomputers are emerging to be the standard, the use of graph partitioning becomes even more important as applications are rapidly moving to these architectures. However, there is no distributed-memory-parallel, multi-GPU graph partitioner available for applications. We developed a spectral graph partitioner, Sphynx, using the portable, accelerator-friendly stack of the Trilinos framework. In Sphynx, we allow using different preconditioners and exploit their unique advantages. We use Sphynx to systematically evaluate the various algorithmic choices in spectral partitioning with a focus on the GPU performance. We perform those evaluations on two distinct classes of graphs: regular (such as meshes, matrices from finite element methods) and irregular (such as social networks and web graphs), and show that different settings and preconditioners are needed for these graph classes. The experimental results on the Summit supercomputer show that Sphynx is the fastest alternative on irregular graphs in an application-friendly setting and obtains a partitioning quality close to ParMETIS on regular graphs. When compared to nvGRAPH on a single GPU, Sphynx is faster and obtains better balance and better quality partitions. Sphynx provides a good and robust partitioning method across a wide range of graphs for applications looking for a GPU-based partitioner.

More Details

TYPE Conference Presentation YEAR 2021

DOI OSTI Scopus

The Kokkos EcoSystem: Comprehensive Performance Portability for High Performance Computing

Computing in Science and Engineering

Trott, Christian R.; Berger-Vergiat, Luc; Poliakoff, David; Rajamanickam, Sivasankaran; Lebrun-Grandie, Damien; Madsen, Jonathan; Al Awar, Nader; Gligoric, Milos; Shipman, Galen; Womeldorff, Geoff

State-of-the-art engineering and science codes have grown in complexity dramatically over the last two decades. Application teams have adopted more sophisticated development strategies, leveraging third party libraries, deploying comprehensive testing, and using advanced debugging and profiling tools. In today's environment of diverse hardware platforms, these applications also desire performance portability-avoiding the need to duplicate work for various platforms. The Kokkos EcoSystem provides that portable software stack. Based on the Kokkos Core Programming Model, the EcoSystem provides math libraries, interoperability capabilities with Python and Fortran, and Tools for analyzing, debugging, and optimizing applications. In this article, we overview the components, discuss some specific use cases, and highlight how codesigning these components enables a more developer friendly experience.

More Details

TYPE Journal Article YEAR 2021

DOI OSTI Scopus DOI OSTI Scopus

Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators

Jeong, Geonhwa; Kestor, Gokcen; Chatarsi, Prasanth; Parashar, Angshuman; Tsai, Po-An; Rajamanickam, Sivasankaran; Gioiosa, Roberto; Krishna, Tushar

Abstract not provided.

More Details

TYPE Conference Presentation YEAR 2021

DOI OSTI

Advances in Mixed Precision Algorithms: 2021 Edition

Abdelfattah, Ahmad; Anzt, Hartwig; Ayala, Alan; Boman, Erik G.; Carson, Erin C.; Cayrols, Sebastien; Cojean, Terry; Dongarra, Jack J.; Falgout, Rob; Gates, Mark; G, R\{U}Tzmacher; Higham, Nicholas J.; Kruger, Scott E.; Li, Sherry; Lindquist, Neil; Liu, Yang; Loe, Jennifer A.; Nayak, Pratik; Osei-Kuffuor, Daniel; Pranesh, Sri; Rajamanickam, Sivasankaran; Ribizel, Tobias; Smith, Bryce; Swirydowicz, Kasia; Thomas, Stephen J.; Tomov, Stanimire; Tsai, Yaohung M.; Yamazaki, Ichitaro; Yang, Urike M.

Over the last year, the ECP xSDK-multiprecision effort has made tremendous progress in developing and deploying new mixed precision technology and customizing the algorithms for the hardware deployed in the ECP flagship supercomputers. The effort also has succeeded in creating a cross-laboratory community of scientists interested in mixed precision technology and now working together in deploying this technology for ECP applications. In this report, we highlight some of the most promising and impactful achievements of the last year. Among the highlights we present are: Mixed precision IR using a dense LU factorization and achieving a 1.8× speedup on Spock; results and strategies for mixed precision IR using a sparse LU factorization; a mixed precision eigenvalue solver; Mixed Precision GMRES-IR being deployed in Trilinos, and achieving a speedup of 1.4× over standard GMRES; compressed Basis (CB) GMRES being deployed in Ginkgo and achieving an average 1.4× speedup over standard GMRES; preparing hypre for mixed precision execution; mixed precision sparse approximate inverse preconditioners achieving an average speedup of 1.2×; and detailed description of the memory accessor separating the arithmetic precision from the memory precision, and enabling memory-bound low precision BLAS 1/2 operations to increase the accuracy by using high precision in the computations without degrading the performance. We emphasize that many of the highlights presented here have also been submitted to peer-reviewed journals or established conferences, and are under peer-review or have already been published.

More Details

TYPE Other Report YEAR 2021

DOI OSTI DOI OSTI

Sphynx: a parallel multi-GPU graph partitioner

Acer, Seher; Boman, Erik G.; Glusa, Christian; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Presentation YEAR 2021

DOI OSTI

Kokkos 3: Programming Model Extensions for the Exascale Era

IEEE Transactions on Parallel and Distributed Systems

Trott, Christian R.; Lebrun-Grandie, Damien; Arndt, Daniel; Ciesko, Jan; Dang, Vinh Q.; Ellingwood, Nathan D.; Gayatri, Rahulkumar; Harvey, Evan C.; Hollman, Daisy S.; Ibanez, Daniel A.; Liber, Nevin; Madsen, Jonathan; Miles, Jeff S.; Poliakoff, David; Powell, Amy J.; Rajamanickam, Sivasankaran; Simberg, Mikael; Sunderland, Dan; Turcksin, Bruno; Wilke, Jeremiah

As the push towards exascale hardware has increased the diversity of system architectures, performance portability has become a critical aspect for scientific software. We describe the Kokkos Performance Portable Programming Model that allows developers to write single source applications for diverse high performance computing architectures. Kokkos provides key abstractions for both the compute and memory hierarchy of modern hardware. Here, we describe the novel abstractions that have been added to Kokkos recently such as hierarchical parallelism, containers, task graphs, and arbitrary-sized atomic operations. We demonstrate the performance of these new features with reproducible benchmarks on CPUs and GPUs.

More Details

TYPE Journal Article YEAR 2021

DOI OSTI

Evaluating Spatial Accelerator Architectures with Tiled Matrix-Matrix Multiplication

Moon, Gordon E.; Kwon, Hyoukjun; Jeong, Geonhwa; Chatarasi, Prasanth; Rajamanickam, Sivasankaran; Krishna, Tushar

There is a growing interest in custom spatial accelerators for machine learning applications. These accelerators employ a spatial array of processing elements (PEs) interacting via custom buffer hierarchies and networks-on-chip. The efficiency of these accelerators comes from employing optimized dataflow (i.e., spatial/temporal partitioning of data across the PEs and fine-grained scheduling) strategies to optimize data reuse. The focus of this work is to evaluate these accelerator architectures using a tiled general matrix-matrix multiplication (GEMM) kernel. To do so, we develop a framework that finds optimized mappings (dataflow and tile sizes) for a tiled GEMM for a given spatial accelerator and workload combination, leveraging an analytical cost model for runtime and energy. Our evaluations over five spatial accelerators demonstrate that the tiled GEMM mappings systematically generated by our framework achieve high performance on various GEMM workloads and accelerators.

More Details

TYPE Other Report YEAR 2021

DOI DOI OSTI OSTI

Learning Transferable DFT Neural Network Surrogates

Lennon, Kyle; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Presentation YEAR 2021

OSTI

Kokkos Kernels 3.4

Rajamanickam, Sivasankaran; Berger-Vergiat, Luc; Dang, Vinh Q.; Ellingwood, Nathan D.; Harvey, Evan C.; Kelley, Brian M.; Trott, Christian R.

Abstract not provided.

More Details

TYPE Presentation YEAR 2021

OSTI

Properties of GMRES with Iterative Refinement on GPUs

Loe, Jennifer A.; Glusa, Christian; Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Presentation YEAR 2021

DOI OSTI

Experimental Evaluation of Multiprecision Strategies for GMRES on GPUs

2021 IEEE International Parallel and Distributed Processing Symposium Workshops Ipdpsw 2021 in Conjunction with IEEE IPDPS 2021

Loe, Jennifer A.; Glusa, Christian; Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran

Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for multiprecision strategies at the linear algebra algorithms level if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of multiprecision strategies for accelerating this kernel on GPUs. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when multiprecision GMRES will be effective and to choose parameters for a multiprecision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of multiprecision approaches and demonstrate even further improvements are possible by optimizing low-level kernels.

More Details

TYPE Conference Paper YEAR 2021

DOI OSTI Scopus

Accelerating Finite-Temperature Kohn-Sham Density Functional Theory with Deep Neural Networks

Ellis, J.A.; Fielder, Lenz; Popoola, Gabriel A.; Modine, Normand; Stephens, John A.; Thompson, Aidan P.; Rajamanickam, Sivasankaran

We present a numerical modeling workflow based on machine learning (ML) which reproduces the total energies produced by Kohn-Sham density functional theory (DFT) at finite electronic temperature to within chemical accuracy at negligible computational cost. Based on deep neural networks, our workflow yields the local density of states (LDOS) for a given atomic configuration. From the LDOS, spatially-resolved, energy-resolved, and integrated quantities can be calculated, including the DFT total free energy, which serves as the Born-Oppenheimer potential energy surface for the atoms. We demonstrate the efficacy of this approach for both solid and liquid metals and compare results between independent and unified machine-learning models for solid and liquid aluminum. Our machine-learning density functional theory framework opens up the path towards multiscale materials modeling for matter under ambient and extreme conditions at a computational scale and cost that is unattainable with current algorithms.

More Details

TYPE Other Report YEAR 2021

DOI OSTI DOI OSTI

Experimental Evaluation of Multiprecision Strategies for GMRES on GPUs

2021 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2021 - In conjunction with IEEE IPDPS 2021

Loe, Jennifer A.; Glusa, Christian; Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran

Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for multiprecision strategies at the linear algebra algorithms level if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of multiprecision strategies for accelerating this kernel on GPUs. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when multiprecision GMRES will be effective and to choose parameters for a multiprecision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of multiprecision approaches and demonstrate even further improvements are possible by optimizing low-level kernels.

More Details

TYPE Conference Presentation YEAR 2021

DOI OSTI Scopus

Extending sparse tensor accelerators to support multiple compression formats

Proceedings - 2021 IEEE 35th International Parallel and Distributed Processing Symposium, IPDPS 2021

Qin, Eric; Jeong, Geonhwa; Won, William; Kao, Sheng C.; Kwon, Hyoukjun; Das, Dipankar; Moon, Gordon E.; Rajamanickam, Sivasankaran; Krishna, Tushar

Sparsity, which occurs in both scientific applications and Deep Learning (DL) models, has been a key target of optimization within recent ASIC accelerators due to the potential memory and compute savings. These applications use data stored in a variety of compression formats. We demonstrate that both the compactness of different compression formats and the compute efficiency of the algorithms enabled by them vary across tensor dimensions and amount of sparsity. Since DL and scientific workloads span across all sparsity regions, there can be numerous format combinations for optimizing memory and compute efficiency. Unfortunately, many proposed accelerators operate on one or two fixed format combinations. This work proposes hardware extensions to accelerators for supporting numerous format combinations seamlessly and demonstrates ∼ 4 × speedup over performing format conversions in software.

More Details

TYPE Conference Paper YEAR 2021

DOI OSTI Scopus

Exagraph: Combinatorial Methods for Enabling Exascale Science

Halappanavar, Mahantesh; Acer, Seher; Boman, Erik G.; Buluc, Aydin; Ekanayate, Saliya; Feerdous, Sm; Gawande, Nitin; Ghosh, Sayan; Khan, Arif; Minotoli, Marco; Pothen, Alex; Rajamanickam, Sivasankaran; Selvitopi, Oguz; Tallent, Nathan; Tumeo, Antonio

Abstract not provided.

More Details

TYPE Presentation YEAR 2021

OSTI