Publications

Results 76–100 of 354

A Taxonomy for Classification and Comparison of Dataflows for GNN Accelerators

Garg, Raveesh; Qin, Eric; Martinez, Francisco M.; Guirado, Robert; Jain, Akshay; Abadal, Sergi; Abellan, Jose L.; Acacio, Manuel E.; Alarcon, Eduard; Rajamanickam, Sivasankaran; Krishna, Tushar

Recently, Graph Neural Networks (GNNs) have received a lot of interest because of their success in learning representations from graph-structured data. However, GNNs exhibit different compute and memory characteristics compared to traditional Deep Neural Networks (DNNs). Graph convolutions require feature aggregations from neighboring nodes (known as the aggregation phase), which leads to highly irregular data accesses. GNNs also have a very regular compute phase that can be broken down into matrix multiplications (known as the combination phase). All recently proposed GNN accelerators utilize different dataflows and microarchitecture optimizations for these two phases. Different communication strategies between the two phases have also been used. However, as more custom GNN accelerators are proposed, it becomes harder to qualitatively classify them and quantitatively contrast them. In this work, we present a taxonomy to describe several diverse dataflows for running GNN inference on accelerators. This provides a structured way to describe and compare the design space of GNN accelerators.
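
To make the two phases concrete, here is a minimal Python/NumPy sketch (ours, not from the paper) of a single GNN layer: the aggregation phase performs irregular, graph-dependent gathers, while the combination phase is one regular dense matrix multiplication.

    import numpy as np

    def gnn_layer(features, neighbors, weight):
        # features: (N, F) node features; neighbors: list of neighbor-id lists;
        # weight: (F, F_out) dense layer weights.
        aggregated = np.zeros_like(features)
        # Aggregation phase: irregular data accesses driven by graph structure.
        for v, nbrs in enumerate(neighbors):
            for u in nbrs:
                aggregated[v] += features[u]
        # Combination phase: a single regular dense matrix multiplication.
        return np.maximum(aggregated @ weight, 0.0)  # ReLU activation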

Using MLIR Framework for Codesign of ML Architectures, Algorithms, and Simulation Tools

Lewis, Cannada; Hughes, Clayton; Hammond, Simon; Rajamanickam, Sivasankaran

MLIR (Multi-Level Intermediate Representation) is an extensible compiler framework that supports high-level data structures and operation constructs. These higher-level code representations are particularly applicable to the artificial intelligence and machine learning (AI/ML) domain, allowing developers to more easily support upcoming heterogeneous AI/ML accelerators and to develop flexible domain-specific compilers/frameworks with higher-level intermediate representations (IRs) and advanced compiler optimizations. Using MLIR within the LLVM compiler framework is expected to yield a significant improvement in the quality of the generated machine code, which in turn will result in improved performance and hardware efficiency.
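
As a toy illustration of the multi-level idea (our own sketch; it uses neither MLIR's actual APIs nor its dialects), a high-level operation can be represented and optimized at its own abstraction level before a lowering pass rewrites it as an explicit loop nest:

    from dataclasses import dataclass

    @dataclass
    class MatMul:
        # High-level "dialect" op: C = A @ B with symbolic dimensions.
        m: int
        n: int
        k: int

    def lower_to_loops(op: MatMul) -> str:
        # A lowering pass: rewrite the high-level op into a lower-level
        # loop-nest representation, where loop transformations can apply.
        return (f"for i in range({op.m}):\n"
                f"  for j in range({op.n}):\n"
                f"    for p in range({op.k}):\n"
                f"      C[i][j] += A[i][p] * B[p][j]")

    print(lower_to_loops(MatMul(m=4, n=4, k=4)))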

ADELUS: A Performance-Portable Dense LU Solver for Distributed-Memory Hardware-Accelerated Systems

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Dang, Vinh Q.; Kotulski, Joseph D.; Rajamanickam, Sivasankaran

Solving dense systems of linear equations is essential in applications encountered in physics, mathematics, and engineering. This paper describes our current efforts toward the development of the ADELUS package for current and next generation distributed, accelerator-based, high-performance computing platforms. The package solves dense linear systems using partial pivoting LU factorization on distributed-memory systems with CPUs/GPUs. The matrix is block-mapped onto distributed memory on CPUs/GPUs and is solved as if it were torus-wrapped for an optimal balance of computation and communication. A permutation operation is performed to restore the results so that the torus-wrap distribution is transparent to the user. This package targets performance portability by leveraging the abstractions provided in the Kokkos and Kokkos Kernels libraries. Comparisons of the performance gains over the state-of-the-art SLATE and DPLASMA GESV functionalities on the Summit supercomputer are provided. Preliminary performance results from large-scale electromagnetic simulations using ADELUS are also presented. The solver achieves 7.7 Petaflops on 7600 GPUs of the Sierra supercomputer, translating to 16.9% efficiency.
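
A hypothetical sketch of the torus-wrap idea (ours, not the package's actual interface): blocks are assigned to a process grid cyclically, which keeps the work balanced as LU elimination shrinks the active trailing submatrix.

    def torus_wrap_owner(bi, bj, prows, pcols):
        # Owner of matrix block (bi, bj) on a prows x pcols process grid:
        # a 2-D cyclic ("torus-wrapped") assignment.
        return (bi % prows, bj % pcols)

    # On a 2x3 grid, consecutive block rows/columns land on different
    # processes, so no process runs out of work early during elimination.
    for bi in range(4):
        print([torus_wrap_owner(bi, bj, 2, 3) for bj in range(6)])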

Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators

Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT

Jeong, Geonhwa; Kestor, Gokcen; Chatarasi, Prasanth; Parashar, Angshuman; Tsai, Po A.; Rajamanickam, Sivasankaran; Gioiosa, Roberto; Krishna, Tushar

To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these “domain-specific” accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. There are several challenges when designing new algorithms and mapping approaches to execute those algorithms for a target problem on new hardware. Previous works have addressed these challenges individually. To address these challenges as a whole, in this work we present an HW-SW co-design ecosystem for spatial accelerators called Union within the popular MLIR compiler infrastructure. Our framework allows exploring different algorithms and their mappings on several accelerator cost models. Union also includes a plug-and-play library of accelerator cost models and mappers that can easily be extended. The algorithms and accelerator cost models are connected via a novel mapping abstraction that captures the map space of spatial accelerators, which can be systematically pruned based on constraints from the hardware, workload, and mapper. We demonstrate the value of Union for the community with several case studies which examine offloading different tensor operations (CONV/GEMM/Tensor Contraction) on diverse accelerator architectures using different mapping schemes.
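
A minimal sketch of the map-space pruning idea (our own illustration; Union's actual mapping abstraction is richer): enumerate candidate GEMM tilings and discard those that violate a hardware constraint, here an assumed on-chip buffer capacity, before any cost model is consulted.

    from itertools import product

    M = N = K = 64
    BUFFER_WORDS = 2048  # assumed on-chip buffer capacity (hypothetical)

    def divisors(x):
        return [d for d in range(1, x + 1) if x % d == 0]

    def tile_footprint(tm, tn, tk):
        # Words resident on chip for one (tm, tn, tk) tile of C = A @ B.
        return tm * tk + tk * tn + tm * tn

    map_space = product(divisors(M), divisors(N), divisors(K))
    valid = [t for t in map_space if tile_footprint(*t) <= BUFFER_WORDS]
    print(f"{len(valid)} mappings survive the buffer-capacity constraint")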

FROSch Preconditioners for Land Ice Simulations of Greenland and Antarctica

Heinlein, Alexander; Perego, Mauro; Rajamanickam, Sivasankaran

Numerical simulations of Greenland and Antarctic ice sheets involve the solution of large-scale, highly nonlinear systems of equations on complex shallow geometries. This work is concerned with the construction of Schwarz preconditioners for the solution of the associated tangent problems, which are challenging for solvers mainly because of the strong anisotropy of the meshes and wildly changing boundary conditions that can lead to poorly constrained problems on large portions of the domain. Here, two-level GDSW (Generalized Dryja–Smith–Widlund) type Schwarz preconditioners are applied to different land ice problems, i.e., a velocity problem, a temperature problem, as well as the coupling of the two. We employ the MPI-parallel implementation of multi-level Schwarz preconditioners provided by the package FROSch (Fast and Robust Schwarz) from the Trilinos library. The strength of the proposed preconditioner is that it yields out-of-the-box scalable and robust preconditioners for the single-physics problems. To our knowledge, this is the first time two-level Schwarz preconditioners have been applied to the ice sheet problem and a scalable preconditioner has been used for the coupled problem. The preconditioner for the coupled problem differs from previous monolithic GDSW preconditioners in that decoupled extension operators are used to compute the values in the interior of the subdomains. Several approaches for improving the performance, such as reuse strategies and shared-memory OpenMP parallelization, are explored as well. In our numerical study we consider both uniform meshes of varying resolution for the Antarctic ice sheet and nonuniform meshes for the Greenland ice sheet. We present several weak and strong scaling studies confirming the robustness of the approach and the parallel scalability of the FROSch implementation. Among the highlights of the numerical results are a weak scaling study for up to 32K processor cores (8K MPI ranks with 4 OpenMP threads each) and 566M degrees of freedom for the velocity problem, as well as a strong scaling study for up to 4K processor cores (and MPI ranks) and 68M degrees of freedom for the coupled problem.
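
The structure of a two-level additive Schwarz application can be sketched as follows (a toy NumPy illustration with dense local solves, not FROSch's implementation): local solves on overlapping subdomains plus a coarse correction built from a coarse basis Phi, e.g., GDSW-type functions.

    import numpy as np

    def two_level_schwarz_apply(A, r, subdomains, Phi):
        # A: (n, n) system matrix; r: residual vector;
        # subdomains: list of overlapping index arrays; Phi: (n, nc) coarse basis.
        z = np.zeros_like(r)
        # First level: solve local problems on the overlapping subdomains.
        for idx in subdomains:
            z[idx] += np.linalg.solve(A[np.ix_(idx, idx)], r[idx])
        # Second level: coarse correction; this is what keeps iteration
        # counts bounded as the number of subdomains grows.
        A0 = Phi.T @ A @ Phi
        z += Phi @ np.linalg.solve(A0, Phi.T @ r)
        return z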

Scalable Asynchronous Domain Decomposition Solvers

SIAM Journal on Scientific Computing

Glusa, Christian; Boman, Erik G.; Chow, Edmond; Rajamanickam, Sivasankaran; Szyld, Daniel B.

Parallel implementations of linear iterative solvers generally alternate between phases of data exchange and phases of local computation. Increasingly large problem sizes and more heterogeneous compute architectures make load balancing, and the design of low-latency network interconnects able to satisfy the communication requirements of linear solvers, very challenging tasks. In particular, global communication patterns such as inner products become increasingly limiting at scale. We explore the use of asynchronous communication based on one-sided Message Passing Interface primitives in the context of domain decomposition solvers. In particular, a scalable asynchronous two-level Schwarz method is presented. We discuss practical issues encountered in the development of a scalable solver and show experimental results obtained on a state-of-the-art supercomputer that illustrate the benefits of asynchronous solvers in both load-balanced and load-imbalanced scenarios. Using the novel method, we observe speedups of up to four times over its classical synchronous equivalent.
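
The effect of asynchrony can be modeled with a small sketch (ours, not the paper's solver): Jacobi sweeps that read possibly stale iterates, standing in for one-sided MPI updates (MPI_Put/MPI_Get) that land whenever the network delivers them rather than at a synchronization point.

    import numpy as np

    def delayed_jacobi(A, b, sweeps=200, delay=3):
        # Jacobi where each sweep may read values up to `delay` sweeps old,
        # a simple model of updates arriving late via one-sided communication.
        D = np.diag(A)
        history = [np.zeros(len(b))]
        for _ in range(sweeps):
            stale = history[max(0, len(history) - 1 - delay)]
            history.append((b - A @ stale + D * stale) / D)
        return history[-1]

    # Diagonally dominant test system: convergence tolerates bounded delays.
    A = np.array([[4., 1., 0.], [1., 4., 1.], [0., 1., 4.]])
    b = np.array([1., 2., 3.])
    print(delayed_jacobi(A, b), np.linalg.solve(A, b))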
