Publications Search

Towards Reverse Mode Automatic Differentiation of Kokkos-Based Code Using the LLVM Compiler Infrastructure

Liegeois, Kim A.J.; Kelley, Brian M.; Phipps, Eric T.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Presentation YEAR 2024

DOI OSTI

Parallel, Portable Sparse Code Generation with MLIR and Kokkos

Kelley, Brian M.; Liegeois, Kim A.J.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Presentation YEAR 2024

DOI OSTI

Towards reverse mode automatic differentiation of Kokkos-based codes

Liegeois, Kim A.J.; Kelley, Brian M.; Phipps, Eric T.; Rajamanickam, Sivasankaran

Derivative computation is a key component of optimization, sensitivity analysis, uncertainty quantification, and the solving of nonlinear problems. Automatic differentiation (AD) is a powerful technique for evaluating such derivatives, and in recent years it has been integrated into programming environments such as JAX, PyTorch, and TensorFlow to support the derivative computations needed for training machine learning models, facilitating widespread use of these technologies. The C++ language has become the de facto standard for scientific computing due to numerous factors, yet language complexity has made the widespread adoption of AD technologies for C++ difficult, hampering the incorporation of powerful differentiable programming approaches into C++ scientific simulations. This is exacerbated by the increasing emergence of architectures, such as GPUs, with limited memory capabilities and requiring massive thread-level concurrency. C++ AD tools must effectively use these environments to bring novel scientific simulations to next-generation DOE experimental and observational facilities. In this project, we investigated source transformation-based automatic differentiation using the LLVM compiler infrastructure to automatically generate portable and efficient gradient computations of Kokkos-based code. We have demonstrated that our proposed strategy is feasible by investigating the usage of a prototype LLVM-based source transformation tool to generate gradients of simple functions made of sequences of simple Kokkos parallel regions. Speedups of up to 500x compared to Sacado were observed on an NVIDIA V100 GPU.
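As a rough illustration of what reverse-mode AD produces for a function built from a sequence of simple element-wise regions, the sketch below pairs a small function with a hand-written adjoint. It is plain Python, not Kokkos or LLVM output; the names `f` and `grad_f` and the two list comprehensions standing in for parallel regions are assumptions made for this example only.

```python
import math

def f(x):
    """Scalar function built from two element-wise 'regions' and a sum."""
    y = [xi * xi for xi in x]          # region 1: y_i = x_i^2
    z = [math.sin(yi) for yi in y]     # region 2: z_i = sin(y_i)
    return sum(z)

def grad_f(x):
    """Hand-written reverse-mode gradient of f (illustrative only)."""
    # Forward sweep: recompute the intermediate needed by the reverse sweep.
    y = [xi * xi for xi in x]
    # Reverse sweep: propagate adjoints from the output back to the input.
    z_bar = [1.0] * len(x)                                   # d(sum z)/dz_i
    y_bar = [zb * math.cos(yi) for zb, yi in zip(z_bar, y)]  # through sin
    x_bar = [yb * 2.0 * xi for yb, xi in zip(y_bar, x)]      # through square
    return x_bar
```

The reverse sweep visits the regions in the opposite order to the forward sweep, which is the defining feature of reverse mode; a source transformation tool would aim to emit the equivalent of `grad_f` automatically.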

TYPE LDRD Report YEAR 2024

DOI OSTI

Unified Language Frontend for Physic-Informed AI/ML

Kelley, Brian M.; Rajamanickam, Sivasankaran

Artificial intelligence and machine learning (AI/ML) are becoming important tools for scientific modeling and simulation, as in several other fields such as image analysis and natural language processing. ML techniques can leverage the computing power available in modern systems and reduce the human effort needed to configure experiments, interpret and visualize results, draw conclusions from huge quantities of raw data, and build surrogates for physics-based models. Domain scientists in fields like fluid dynamics, microelectronics, and chemistry can automate many of their most difficult and repetitive tasks or improve design times by using faster ML surrogates. However, modern ML and traditional scientific high-performance computing (HPC) tend to use completely different software ecosystems. While ML frameworks like PyTorch and TensorFlow provide Python APIs, most HPC applications and libraries are written in C++. Direct interoperability between the two languages is possible but is tedious and error-prone. In this work, we show that a compiler-based approach can bridge the gap between ML frameworks and scientific software with less developer effort and better efficiency. We use the MLIR (multi-level intermediate representation) ecosystem to compile a pre-trained convolutional neural network (CNN) in PyTorch to freestanding C++ source code in the Kokkos programming model. Kokkos is a programming model widely used in HPC to write portable, shared-memory parallel code that can natively target a variety of CPU and GPU architectures. Our compiler-generated source code can be directly integrated into any Kokkos-based application with no dependencies on Python or cross-language interfaces.

TYPE Other Report YEAR 2022

DOI OSTI

Mixed Precision s-step Conjugate Gradient with Residual Replacements on GPUs

Yamazaki, Ichitaro; Carson, Erin; Kelley, Brian M.

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Kokkos Kernels (Sake project)

Berger-Vergiat, Luc; Rajamanickam, Sivasankaran; Dang, Vinh Q.; Kelley, Brian M.; Ellingwood, Nathan D.; Loe, Jennifer A.; Harvey, Evan C.; Pearson, Carl; Foucar, James G.; Liegeois, Kim A.J.

Abstract not provided.

TYPE Conference Poster YEAR 2022

DOI OSTI

Kokkos Kernels Math Library

Berger-Vergiat, Luc; Rajamanickam, Sivasankaran; Loe, Jennifer A.; Kelley, Brian M.; Harvey, Evan C.; Foucar, James G.; Ellingwood, Nathan D.; Dang, Vinh Q.; Liegeois, Kim A.J.; Pearson, Carl

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Parallel, Portable Algorithms for Distance-2 Maximal Independent Set and Graph Coarsening

Kelley, Brian M.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Paper YEAR 2022

DOI OSTI

Parallel, Portable Algorithms for Distance-2 Maximal Independent Set and Graph Coarsening

Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Kelley, Brian M.; Rajamanickam, Sivasankaran

Given a graph, finding the distance-2 maximal independent set (MIS-2) of the vertices is a problem that is useful in several contexts such as algebraic multigrid coarsening or multilevel graph partitioning. Such multilevel methods rely on finding the independent vertices so they can be used as seeds for aggregation in a multilevel scheme. We present a parallel MIS-2 algorithm to improve performance on modern accelerator hardware. This algorithm is implemented using the Kokkos programming model to enable performance portability. We demonstrate the portability of the algorithm and the performance on a variety of architectures (x86/ARM CPUs and NVIDIA/AMD GPUs). The resulting algorithm is also deterministic, producing an identical result for a given input across all of these platforms. The new MIS-2 implementation outperforms implementations in state-of-the-art libraries like CUSP and ViennaCL by 3-8x while producing similar quality results. We further demonstrate the benefits of this approach by developing parallel graph coarsening schemes for two different use cases. First, we develop an algebraic multigrid (AMG) aggregation scheme using parallel MIS-2 and demonstrate its benefits over previous approaches used in the MueLu multigrid package in Trilinos. We also describe an approach for implementing a parallel multicolor 'cluster' Gauss-Seidel preconditioner using this MIS-2 coarsening, and demonstrate better performance with an efficient, parallel, multicolor Gauss-Seidel algorithm.
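To make the MIS-2 definition concrete, here is a minimal sequential greedy sketch in Python. It is not the parallel Kokkos algorithm described in the abstract; the function name `greedy_mis2` and the adjacency-dictionary input format are assumptions chosen for illustration.

```python
def greedy_mis2(adj):
    """Return a distance-2 maximal independent set of an undirected graph.

    adj maps each vertex to an iterable of its neighbours. Vertices are
    visited in sorted order, so the result is deterministic for a given input.
    """
    selected = []
    blocked = set()  # vertices within distance 2 of an already selected vertex
    for v in sorted(adj):
        if v in blocked:
            continue
        selected.append(v)
        blocked.add(v)
        for u in adj[v]:
            blocked.add(u)          # distance-1 neighbours of v
            blocked.update(adj[u])  # distance-2 neighbours of v
    return selected


# Usage: a path graph 0-1-2-3-4-5-6.
path = {i: [j for j in (i - 1, i + 1) if 0 <= j <= 6] for i in range(7)}
seeds = greedy_mis2(path)
```

Every selected vertex is at distance greater than 2 from every other selected vertex, and every unselected vertex lies within distance 2 of some selected one, which is what lets the selected vertices act as aggregation seeds.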

TYPE Conference Presentation YEAR 2022

DOI OSTI Scopus

Mixed Precision s-step Conjugate Gradient with Residual Replacement on GPUs

Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Yamazaki, Ichitaro; Carson, Erin; Kelley, Brian M.

The s-step Conjugate Gradient (CG) algorithm has the potential to reduce the communication cost of standard CG by a factor of s. However, though mathematically equivalent, s-step CG may be numerically less stable compared to standard CG in finite precision, exhibiting slower convergence and decreased attainable accuracy. This limits the use of s-step CG in practice. To improve the numerical behavior of s-step CG and overcome this potential limitation, we incorporate two techniques. First, we improve convergence behavior through the use of higher precision at critical parts of the s-step iteration. Second, we integrate a residual replacement strategy into the resulting mixed precision s-step CG to improve attainable accuracy. Our experimental results on the Summit supercomputer demonstrate that when the higher precision is implemented in hardware, these techniques have virtually no overhead on the iteration time while improving both the convergence rate and the attainable accuracy of s-step CG. Even when the higher precision is implemented in software, these techniques may still reduce the time-to-solution (speedups of up to 1.8× in our experiments), especially when s-step CG suffers from numerical instability with a small step size and the latency cost becomes a significant part of its iteration time.
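The idea of residual replacement can be sketched on standard (not s-step, not mixed precision) CG: the recursively updated residual is periodically overwritten with the explicitly computed residual b - Ax. The fixed replacement period `replace_every` below is a simplifying assumption for illustration, not the replacement criterion used in the paper, and all function names are hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cg_with_residual_replacement(A, b, tol=1e-12, max_iter=200, replace_every=10):
    """Solve A x = b for symmetric positive definite A with plain CG."""
    x = [0.0] * len(b)
    r = list(b)               # residual for the zero initial guess
    p = list(r)
    rs = dot(r, r)
    b_norm = dot(b, b) ** 0.5 or 1.0
    for k in range(1, max_iter + 1):
        if rs ** 0.5 <= tol * b_norm:
            break
        Ap = matvec(A, p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]   # recursive update
        if k % replace_every == 0:
            # Residual replacement: recompute the true residual b - A x.
            r = [bi - axi for bi, axi in zip(b, matvec(A, x))]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x
```

Replacing the residual costs one extra matrix-vector product per replacement, which is why practical strategies trigger it selectively rather than on every iteration.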

TYPE Conference Paper YEAR 2022

DOI OSTI Scopus

Trilinos Support on AMD and Intel GPUs

Kelley, Brian M.; Berger-Vergiat, Luc

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Parallel, Portable Algorithms for Distance-2 Maximal Independent Set and Graph Coarsening

Kelley, Brian M.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Paper YEAR 2021

DOI OSTI

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Brandt, James M.; Cook, Jeanine; Aaziz, Omar R.; Allan, Benjamin A.; Devine, Karen; Foulk, James W.; Gentile, Ann C.; Hammond, Simon; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Foulk, James W.; Poliakoff, David; Pawlowski, Roger; Regier, Phillip; Schmitz, Mark E.; Schwaller, Benjamin; Surjadidjaja, Vanessa; Swan, Matthew S.; Tucker, Tom; Tucker, Nick; Vaughan, Courtenay T.; Walton, Sara P.

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine; Devine, Karen; Elliott, James E.; Gentile, Ann C.; Hammond, Simon; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Foulk, James W.; Poliakoff, David; Pawlowski, Roger; Regier, Phillip; Schmitz, Mark E.; Schwaller, Benjamin; Surjadidjaja, Vanessa; Swan, Matthew S.; Tucker, Nick; Tucker, Thomas; Vaughan, Courtenay T.; Walton, Sara P.

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments (EMPIRE) modeling and design tool and resulted in 1 TB of application data and 50 TB of system data. EMPIRE developers remarked that this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for an expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase the total scientific output of Sandia's HPC users.

TYPE SAND Report YEAR 2021

DOI OSTI