Publications Search

We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by every processor. We present various numerical results to demonstrate the versatility and scalability of the parallel algorithm.

More Details

TYPE Journal Article YEAR 2018

DOI OSTI Scopus

Scalable community detection benchmark generation

Berry, Jonathan; Phillips, Cynthia A.; Rajamanickam, Sivasankaran; Slota, George M.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Scheduling Parallel Tasks using Graph Coloring

Boman, Erik G.; Chen, Chao; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments

Deveci, Mehmet; Hammond, Simon; Wolf, Michael; Rajamanickam, Sivasankaran

Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. So called multi-level memories offer differing characteristics for each memory component including variation in bandwidth, latency and capacity. This paper investigates the performance of sparse matrix multiplication kernels on two leading highperformance computing architectures — Intel's Knights Landing processor and NVIDIA's Pascal GPU. We describe a data placement method and a chunking-based algorithm for our kernels that exploits the existence of the multiple memory spaces in each hardware platform. We evaluate the performance of these methods w.r.t. standard algorithms using the auto-caching mechanisms Our results show that standard algorithms that exploit cache reuse performed as well as multi-memory-aware algorithms for architectures such as Ki\iLs where the memory subsystems have similar latencies. However, for architectures such as GPUS where memory subsystems differ significantly in both bandwidth and latency, multi-memory-aware methods are crucial for good performance. In addition, our new approaches permit the user to run problems that require larger capacities than the fastest memory of each compute node without depending on the software-managed cache mechanisms.

More Details

TYPE Other Report YEAR 2018

DOI OSTI

Exploiting Geometric Partitioning in Task Mapping for Parallel Computes

Deveci, Mehmet; Devine, Karen; Foulk, James W.; Taylor, Mark A.; Rajamanickam, Sivasankaran; Catalyurek, Umit V.

We present a new method for mapping applications' MPI tasks to cores of a parallel computer such that applications' communication time is reduced. We address the case of sparse node allocation, where the nodes assigned to a job are not necessarily located in a contiguous block nor within close proximity to each other in the network, although our methods generalize to contiguous allocations as well. The goal is to assign tasks to cores so that interdependent tasks are performed by "nearby' cores, thus lowering the distance messages must travel, the amount of congestion in the network, and the overall cost of communication. Our new method applies a geometric partitioning algorithm to both the tasks and the processors, and assigns task parts to the corresponding processor parts. We also present a number of algorithmic optimizations that exploit specific features of the network or application. We show that, for the structured finite difference mini-application MiniGhost, our mapping methods reduced communication time up to 75% relative to MiniGhost's default mapping on 128K cores of a Cray XK7 with sparse allocation. For the atmospheric modeling code E3SM/HOMME, our methods reduced communication time up to 31% on 32K cores of an IBM BlueGene/Q with contiguous allocation.

More Details

TYPE Other Report YEAR 2018

DOI OSTI

Multi-threaded Sparse Matrix Matrix Multiplication with Applications in Scientific Computing and Graph Analytics

Deveci, Mehmet; Wolf, Michael; Berry, Jonathan; Rajamanickam, Sivasankaran; Boman, Erik G.; Trott, Christian R.; Hammond, Simon; Olivier, Stephen L.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

ExaGraph at Sandia: Graph Coloring Clustering and Partitioning for Exascale Computing

Boman, Erik G.; Deveci, Mehmet; Devine, Karen; Rajamanickam, Sivasankaran; Wolf, Michael

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

A Hierarchical Low-Rank Solver for Sparse Linear Systems and Its Variations

Boman, Erik G.; Chen, Chao; Darve, Eric; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Vector-friendly Batched BLAS and LAPACK Kernels : Design and Applications

Rajamanickam, Sivasankaran; Kim, Kyungjoo; Bradley, Andrew M.; Deveci, Mehmet; Trott, Christian R.; Hammond, Simon

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

ExaGraph: Combinatorial Methods for Enabling Exascale Applications

Author, No; Halappanavar, Mahantesh; Buluc, Aydin; Boman, Erik G.; Pothen, Alex; Tumeo, Antonino; Azad, Ariful; Khan, Arif; Ferdous, Sm; Rajamanickam, Sivasankaran; Wolf, Michael; Deveci, Mehmet; Devine, Karen

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

KokkosKernels Overview

Rajamanickam, Sivasankaran; Deveci, Mehmet; Kim, Kyungjoo; Trott, Christian R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

ShyLU and Kokkoskernels

Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Experimental Design of Work Chunking for Graph Algorithms on High Bandwidth Memory Architectures

Slota, George; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI

Ensemble Grouping Strategies for Embedded Stochastic Collocation Methods Applied to Anisotropic Diffusion Problems

SIAM/ASA Journal on Uncertainty Quantification

D'Elia, Marta; Phipps, Eric T.; Edwards, Harold C.; Hu, Jonathan J.; Rajamanickam, Sivasankaran

Previous work has demonstrated that propagating groups of samples, called ensembles, together through forward simulations can dramatically reduce the aggregate cost of sampling-based uncertainty propagation methods [E. Phipps, M. D'Elia, H. C. Edwards, M. Hoemmen, J. Hu, and S. Rajamanickam, SIAM J. Sci. Comput., 39 (2017), pp. C162--C193]. However, critical to the success of this approach when applied to challenging problems of scientific interest is the grouping of samples into ensembles to minimize the total computational work. For example, the total number of linear solver iterations for ensemble systems may be strongly influenced by which samples form the ensemble when applying iterative linear solvers to parameterized and stochastic linear systems. In this paper we explore sample grouping strategies for local adaptive stochastic collocation methods applied to PDEs with uncertain input data, in particular canonical anisotropic diffusion problems where the diffusion coefficient is modeled by truncated Karhunen--Loève expansions. Finally, we demonstrate that a measure of the total anisotropy of the diffusion coefficient is a good surrogate for the number of linear solver iterations for each sample and therefore provides a simple and effective metric for grouping samples.

More Details

TYPE Journal Article YEAR 2018

DOI OSTI

Multi-threaded Sparse Matrix Sparse Matrix Multiplication for Many-Core and GPU Architectures

Deveci, Mehmet; Trott, Christian R.; Rajamanickam, Sivasankaran

Sparse Matrix-Matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix- matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depend on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.

More Details

TYPE Other Report YEAR 2017

DOI OSTI

Kokkoskernels: Portable Math and Graph Kernels

Rajamanickam, Sivasankaran; Kim, Kyungjoo; Deveci, Mehmet; Trott, Christian R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Using the Basker Linear Solvers in Xyce

Thornquist, Heidi K.; Mei, Ting; Rajamanickam, Sivasankaran; Ellingwood, Nathan D.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Tacho: Memory-Scalable Task Parallel Sparse Cholesky Factorization

Kim, Kyungjoo; Edwards, Harold C.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI

Designing vector-friendly compact BLAS and LAPACK kernels

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017

Kim, Kyungjoo; Costa, Timothy B.; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran

Many applications, such as PDE based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14x, 45x, and 27x speedup against OpenMP loops around optimized DGEMM, DTRSM and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus