Publications Search

We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by every processor. We present various numerical results to demonstrate the versatility and scalability of the parallel algorithm.

TYPE Journal Article YEAR 2018

DOI OSTI Scopus

Scalable community detection benchmark generation

Berry, Jonathan; Phillips, Cynthia A.; Rajamanickam, Sivasankaran; Slota, George M.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Scheduling Parallel Tasks using Graph Coloring

Boman, Erik G.; Chen, Chao; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments

Deveci, Mehmet; Hammond, Simon; Wolf, Michael; Rajamanickam, Sivasankaran

Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. So-called multi-level memories offer differing characteristics for each memory component, including variation in bandwidth, latency, and capacity. This paper investigates the performance of sparse matrix multiplication kernels on two leading high-performance computing architectures: Intel's Knights Landing processor and NVIDIA's Pascal GPU. We describe a data placement method and a chunking-based algorithm for our kernels that exploit the existence of the multiple memory spaces in each hardware platform. We evaluate the performance of these methods with respect to standard algorithms using the auto-caching mechanisms. Our results show that standard algorithms that exploit cache reuse performed as well as multi-memory-aware algorithms for architectures such as KNLs, where the memory subsystems have similar latencies. However, for architectures such as GPUs, where memory subsystems differ significantly in both bandwidth and latency, multi-memory-aware methods are crucial for good performance. In addition, our new approaches permit the user to run problems that require larger capacities than the fastest memory of each compute node without depending on the software-managed cache mechanisms.

TYPE Other Report YEAR 2018

DOI OSTI

Exploiting Geometric Partitioning in Task Mapping for Parallel Computers

Deveci, Mehmet; Devine, Karen; Foulk, James W.; Taylor, Mark A.; Rajamanickam, Sivasankaran; Catalyurek, Umit V.

We present a new method for mapping applications' MPI tasks to cores of a parallel computer such that applications' communication time is reduced. We address the case of sparse node allocation, where the nodes assigned to a job are not necessarily located in a contiguous block nor within close proximity to each other in the network, although our methods generalize to contiguous allocations as well. The goal is to assign tasks to cores so that interdependent tasks are performed by "nearby" cores, thus lowering the distance messages must travel, the amount of congestion in the network, and the overall cost of communication. Our new method applies a geometric partitioning algorithm to both the tasks and the processors, and assigns task parts to the corresponding processor parts. We also present a number of algorithmic optimizations that exploit specific features of the network or application. We show that, for the structured finite difference mini-application MiniGhost, our mapping methods reduced communication time by up to 75% relative to MiniGhost's default mapping on 128K cores of a Cray XK7 with sparse allocation. For the atmospheric modeling code E3SM/HOMME, our methods reduced communication time by up to 31% on 32K cores of an IBM BlueGene/Q with contiguous allocation.

TYPE Other Report YEAR 2018

DOI OSTI
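
As an illustration of the idea in the abstract above (partition both the tasks and the processors geometrically, then match corresponding parts), the following is a minimal Python sketch. It uses plain recursive coordinate bisection as a stand-in for the geometric partitioner, assumes exactly one task per core, and the function names `bisect_order` and `map_tasks_to_cores` as well as the toy coordinates are hypothetical rather than taken from the report.

```python
def bisect_order(points, ids=None):
    """Order point ids by recursive coordinate bisection (widest dimension first)."""
    if ids is None:
        ids = list(range(len(points)))
    if len(ids) <= 1:
        return list(ids)
    dims = len(points[ids[0]])
    # Split along the dimension with the largest extent.
    extents = [max(points[i][d] for i in ids) - min(points[i][d] for i in ids)
               for d in range(dims)]
    d = extents.index(max(extents))
    ordered = sorted(ids, key=lambda i: (points[i][d], i))
    half = len(ordered) // 2
    return bisect_order(points, ordered[:half]) + bisect_order(points, ordered[half:])


def map_tasks_to_cores(task_coords, core_coords):
    """Pair the k-th task in bisection order with the k-th core in bisection order."""
    if len(task_coords) != len(core_coords):
        raise ValueError("this sketch assumes one task per core")
    task_order = bisect_order(task_coords)
    core_order = bisect_order(core_coords)
    return dict(zip(task_order, core_order))


# Toy example: a 2x2 task grid mapped onto four cores with scattered network coordinates.
tasks = [(0, 0), (0, 1), (1, 0), (1, 1)]
cores = [(10, 0), (0, 0), (10, 7), (0, 7)]
mapping = map_tasks_to_cores(tasks, cores)  # task id -> core id
```

Because both point sets are cut in the same way, tasks that are neighbors in the task grid land on cores that are close in the (assumed) network coordinates.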

Multi-threaded Sparse Matrix Matrix Multiplication with Applications in Scientific Computing and Graph Analytics

Deveci, Mehmet; Wolf, Michael; Berry, Jonathan; Rajamanickam, Sivasankaran; Boman, Erik G.; Trott, Christian R.; Hammond, Simon; Olivier, Stephen L.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

ExaGraph at Sandia: Graph Coloring Clustering and Partitioning for Exascale Computing

Boman, Erik G.; Deveci, Mehmet; Devine, Karen; Rajamanickam, Sivasankaran; Wolf, Michael

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

A Hierarchical Low-Rank Solver for Sparse Linear Systems and Its Variations

Boman, Erik G.; Chen, Chao; Darve, Eric; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Vector-friendly Batched BLAS and LAPACK Kernels: Design and Applications

Rajamanickam, Sivasankaran; Kim, Kyungjoo; Bradley, Andrew M.; Deveci, Mehmet; Trott, Christian R.; Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

ExaGraph: Combinatorial Methods for Enabling Exascale Applications

Halappanavar, Mahantesh; Buluc, Aydin; Boman, Erik G.; Pothen, Alex; Tumeo, Antonino; Azad, Ariful; Khan, Arif; Ferdous, Sm; Rajamanickam, Sivasankaran; Wolf, Michael; Deveci, Mehmet; Devine, Karen

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

KokkosKernels Overview

Rajamanickam, Sivasankaran; Deveci, Mehmet; Kim, Kyungjoo; Trott, Christian R.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

ShyLU and Kokkoskernels

Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Experimental Design of Work Chunking for Graph Algorithms on High Bandwidth Memory Architectures

Slota, George; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2018

DOI OSTI

Ensemble Grouping Strategies for Embedded Stochastic Collocation Methods Applied to Anisotropic Diffusion Problems

SIAM/ASA Journal on Uncertainty Quantification

D'Elia, Marta; Phipps, Eric T.; Edwards, Harold C.; Hu, Jonathan J.; Rajamanickam, Sivasankaran

Previous work has demonstrated that propagating groups of samples, called ensembles, together through forward simulations can dramatically reduce the aggregate cost of sampling-based uncertainty propagation methods [E. Phipps, M. D'Elia, H. C. Edwards, M. Hoemmen, J. Hu, and S. Rajamanickam, SIAM J. Sci. Comput., 39 (2017), pp. C162–C193]. However, critical to the success of this approach when applied to challenging problems of scientific interest is the grouping of samples into ensembles to minimize the total computational work. For example, the total number of linear solver iterations for ensemble systems may be strongly influenced by which samples form the ensemble when applying iterative linear solvers to parameterized and stochastic linear systems. In this paper we explore sample grouping strategies for local adaptive stochastic collocation methods applied to PDEs with uncertain input data, in particular canonical anisotropic diffusion problems where the diffusion coefficient is modeled by truncated Karhunen–Loève expansions. Finally, we demonstrate that a measure of the total anisotropy of the diffusion coefficient is a good surrogate for the number of linear solver iterations for each sample and therefore provides a simple and effective metric for grouping samples.

TYPE Journal Article YEAR 2018

DOI OSTI
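
The grouping idea in the abstract above can be sketched as follows: sort samples by a cheap surrogate for solver iterations and form ensembles from neighbors in that order. This is only an illustrative Python sketch. The cost model (every sample in an ensemble pays for the slowest sample's iteration count), the function names, and all numbers are assumptions made here, not results or code from the paper.

```python
def group_by_surrogate(surrogate, ensemble_size):
    """Sort sample ids by surrogate value and cut the order into ensembles."""
    order = sorted(range(len(surrogate)), key=lambda s: (surrogate[s], s))
    return [order[i:i + ensemble_size] for i in range(0, len(order), ensemble_size)]


def ensemble_cost(groups, iterations):
    """Assumed cost model: each ensemble runs as long as its slowest sample."""
    return sum(len(g) * max(iterations[s] for s in g) for g in groups)


# Made-up anisotropy surrogate and iteration counts for four samples.
surrogate = [1.0, 9.0, 1.2, 8.5]
iterations = [10, 50, 12, 48]

grouped = group_by_surrogate(surrogate, ensemble_size=2)
naive = [[0, 1], [2, 3]]  # grouping in sample order, for comparison
grouped_cost = ensemble_cost(grouped, iterations)
naive_cost = ensemble_cost(naive, iterations)
```

Under this assumed cost model, grouping samples with similar surrogate values avoids pairing a fast-converging sample with a slow one.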

Multi-threaded Sparse Matrix Sparse Matrix Multiplication for Many-Core and GPU Architectures

Deveci, Mehmet; Trott, Christian R.; Rajamanickam, Sivasankaran

Sparse matrix-matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depends on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two-phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.

TYPE Other Report YEAR 2017

DOI OSTI
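
To make the notion of an accumulator in the abstract above concrete, here is a small Python sketch of row-wise sparse matrix-matrix multiplication with two interchangeable accumulators: a dense array indexed by column, and a hash map keyed by column. The row representation (lists of `(column, value)` pairs) and the function name `spgemm` are choices made for this sketch; it does not reproduce kkSpGEMM or its selection logic.

```python
def spgemm(A, B, n_cols_b, accumulator="hash"):
    """Row-wise C = A * B; A and B are lists of rows of (column, value) pairs."""
    C = []
    if accumulator == "dense":
        values = [0.0] * n_cols_b      # O(n_cols_b) memory, direct indexing
        marked = [False] * n_cols_b
    for a_row in A:
        if accumulator == "dense":
            touched = []               # columns written while forming this row
            for k, a_val in a_row:
                for j, b_val in B[k]:
                    if not marked[j]:
                        marked[j] = True
                        touched.append(j)
                    values[j] += a_val * b_val
            row = [(j, values[j]) for j in sorted(touched)]
            for j in touched:          # reset only the entries that were used
                values[j] = 0.0
                marked[j] = False
        else:
            acc = {}                   # hash accumulator: memory grows with row nonzeros
            for k, a_val in a_row:
                for j, b_val in B[k]:
                    acc[j] = acc.get(j, 0.0) + a_val * b_val
            row = sorted(acc.items())
        C.append(row)
    return C


A = [[(0, 1.0), (1, 2.0)], [(1, 3.0)]]
B = [[(0, 4.0)], [(0, 5.0), (1, 6.0)]]
C_hash = spgemm(A, B, n_cols_b=2, accumulator="hash")
C_dense = spgemm(A, B, n_cols_b=2, accumulator="dense")
```

Both variants compute the same product; they differ in memory footprint and access pattern, which is the kind of trade-off the abstract refers to.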

Kokkoskernels: Portable Math and Graph Kernels

Rajamanickam, Sivasankaran; Kim, Kyungjoo; Deveci, Mehmet; Trott, Christian R.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Using the Basker Linear Solvers in Xyce

Thornquist, Heidi K.; Mei, Ting; Rajamanickam, Sivasankaran; Ellingwood, Nathan D.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Tacho: Memory-Scalable Task Parallel Sparse Cholesky Factorization

Kim, Kyungjoo; Edwards, Harold C.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

DOI OSTI

Designing vector-friendly compact BLAS and LAPACK kernels

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017

Kim, Kyungjoo; Costa, Timothy B.; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran

Many applications, such as PDE-based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14x, 45x, and 27x speedup against OpenMP loops around optimized DGEMM, DTRSM and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus
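
The compact layout described in the abstract above interleaves matrices in blocks of the SIMD vector length, so that the same entry of several matrices sits in adjacent memory. The Python sketch below shows one possible index mapping. The row-major ordering inside a pack, the zero padding of the last pack, and the names `compact_index` and `pack_compact` are assumptions of this sketch, not the layout definition used by KokkosKernels or Intel MKL.

```python
def compact_index(pack, i, j, lane, rows, cols, vector_length):
    """Flat offset of entry (i, j) of the matrix stored in `lane` of `pack`."""
    return ((pack * rows + i) * cols + j) * vector_length + lane


def pack_compact(matrices, vector_length):
    """Interleave equally sized matrices in packs of `vector_length`."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    n_packs = -(-len(matrices) // vector_length)  # ceiling division
    # Unused lanes of the last pack stay zero (padding).
    flat = [0.0] * (n_packs * rows * cols * vector_length)
    for m, mat in enumerate(matrices):
        pack, lane = divmod(m, vector_length)
        for i in range(rows):
            for j in range(cols):
                flat[compact_index(pack, i, j, lane, rows, cols, vector_length)] = mat[i][j]
    return flat


# Three 2x2 matrices packed with a (hypothetical) vector length of 2.
mats = [
    [[1, 2], [3, 4]],
    [[10, 20], [30, 40]],
    [[100, 200], [300, 400]],
]
flat = pack_compact(mats, vector_length=2)
```

In the packed array, entry (0, 0) of the first two matrices is stored side by side, then entry (0, 1) of both, and so on, so a single vector instruction can apply the same scalar operation to one entry of several matrices at once.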

SIMD Scalar Types for Outer-loop Vectorization

Phipps, Eric T.; Kim, Kyungjoo; Rajamanickam, Sivasankaran; Tupek, Michael R.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Applications of Compact Batched Kernels

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Deveci, Mehmet; Kim, Kyungjoo; Trott, Christian R.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Trilinos Linear Solver Product Inception Deck

Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Kokkoskernels

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Deveci, Mehmet; Kim, Kyungjoo; Trott, Christian R.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

KokkosKernels: Performance-Portable Sparse Dense and Graph Kernels

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Deveci, Mehmet; Hoemmen, Mark F.; Hammond, Simon; Kim, Kyungjoo; Trott, Christian R.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Basker: A Threaded Sparse LU factorization utilizing Hierarchical Parallelism and Data Layouts

Rajamanickam, Sivasankaran; Thornquist, Heidi K.; Booth, Joshua

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Partitioning Trillion-edge Graphs in Minutes

Slota, George; Rajamanickam, Sivasankaran; Devine, Karen; Madduri, Kamesh

Abstract not provided.

TYPE Conference Poster YEAR 2017

DOI OSTI

Designing Vector-Friendly Compact BLAS and LAPACK Kernels

Rajamanickam, Sivasankaran; Story, Shane; Knepper, Sarah; Guney, Murat; Hammond, Simon; Bradley, Andrew M.; Deveci, Mehmet; Costa, Tim; Kim, Kyungjoo

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Fast linear algebra-based triangle counting with KokkosKernels

2017 IEEE High Performance Extreme Computing Conference, HPEC 2017

Wolf, Michael; Deveci, Mehmet; Berry, Jonathan; Hammond, Simon; Rajamanickam, Sivasankaran

Triangle counting serves as a key building block for a set of important graph algorithms in network science. In this paper, we address the IEEE HPEC Static Graph Challenge problem of triangle counting, focusing on obtaining the best parallel performance on a single multicore node. Our implementation uses a linear algebra-based approach to triangle counting that has grown out of work related to our miniTri data analytics miniapplication [1] and our efforts to pose graph algorithms in the language of linear algebra. We leverage KokkosKernels to implement this approach efficiently on multicore architectures. Our performance results are competitive with the fastest known graph traversal-based approaches and are significantly faster than the Graph Challenge reference implementations, up to 670,000 times faster than the C++ reference and 10,000 times faster than the Python reference on a single Intel Haswell node.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus
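
A common way to pose triangle counting in the language of linear algebra, in the spirit of the abstract above, is to take the strictly lower-triangular part L of the adjacency matrix and sum the entries of (L·L) masked by L. The Python sketch below implements that masked product with sets. It is a generic formulation written for illustration, the name `count_triangles` is hypothetical, and it should not be read as the exact algorithm implemented with KokkosKernels.

```python
def count_triangles(edges):
    """Count triangles as sum((L @ L) * L), L = strictly lower-triangular adjacency."""
    lower = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self loops
        hi, lo = max(u, v), min(u, v)
        lower.setdefault(hi, set()).add(lo)  # L[hi][lo] = 1
    total = 0
    for i, row_i in lower.items():
        for k in row_i:                      # L[i][k] = 1
            for j in lower.get(k, ()):       # L[k][j] = 1
                if j in row_i:               # mask: keep only where L[i][j] = 1
                    total += 1
    return total


k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
path = [(0, 1), (1, 2), (2, 3)]
k4_triangles = count_triangles(k4)
path_triangles = count_triangles(path)
```

Restricting to the lower triangle means each triangle with vertices i > k > j is counted exactly once.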

KKTri: Fast Linear Algebra-Based Triangle Counting with KokkosKernels

Wolf, Michael; Deveci, Mehmet; Berry, Jonathan; Hammond, Simon; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Performance Portable Line Smoother for Multiphysics Problems using Compact Batched BLAS

Kim, Kyungjoo; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Asynchronous Iterative Solvers: The ACHILES library and domain decomposition methods

Boman, Erik G.; Rajamanickam, Sivasankaran; Glusa, Christian; Chow, Edmond; Ramanan, Paritosh

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

A Parallel Hierarchical Low-Rank Solver for General Sparse Matrices

Boman, Erik G.; Chen, Chao; Darve, Eric; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

A Hierarchical Low-Rank Solver for Large Sparse Linear Systems

Boman, Erik G.; Chen, Chao; Darve, Eric; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Performance Portable Sparse Matrix Matrix Multiplication with Applications in Scientific Computing and Graph Analytics

Deveci, Mehmet; Trott, Christian R.; Hammond, Simon; Wolf, Michael; Berry, Jonathan; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Fast Linear Algebra-Based Triangle Counting with KokkosKernels

Wolf, Michael; Deveci, Mehmet; Berry, Jonathan; Hammond, Simon; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

DOI OSTI

Task Placement to Reduce Application Communication Costs

Devine, Karen; Brandt, James M.; Deveci, Mehmet; Gentile, Ann C.; Leung, Vitus J.; Olivier, Stephen L.; Foulk, James W.; Rajamanickam, Sivasankaran; Taylor, Mark A.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Fast Linear Algebra-Based Triangle Counting with KokkosKernels

Wolf, Michael; Deveci, Mehmet; Berry, Jonathan; Hammond, Simon; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

DOI OSTI

Performance-portable sparse matrix-matrix multiplication for many-core architectures

Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017

Deveci, Mehmet; Trott, Christian R.; Rajamanickam, Sivasankaran

We consider the problem of writing a performance-portable sparse matrix-sparse matrix multiplication (SPGEMM) kernel for many-core architectures. We approach the SPGEMM kernel from the perspectives of algorithm design and implementation, and its practical usage. First, we design a hierarchical, memory-efficient SPGEMM algorithm. We then design and implement thread-scalable data structures that enable us to develop a portable SPGEMM implementation. We show that the method achieves performance portability on massively threaded architectures, namely Intel's Knights Landing processors (KNLs) and NVIDIA's Graphic Processing Units (GPUs), by comparing its performance to specialized implementations. Second, we study an important aspect of SPGEMM's usage in practice by reusing the structure of input matrices, and show speedups up to 3× compared to the best specialized implementation on KNLs. We demonstrate that the portable method outperforms 4 native methods on 2 different GPU architectures (up to 17× speedup), and it is highly thread scalable on KNLs, in which it obtains 101× speedup on 256 threads.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus
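
The structure reuse mentioned in the abstract above can be pictured as a two-phase multiplication: a symbolic phase that computes only the sparsity pattern of the product, and a numeric phase that fills in values and can be repeated when the input values change but their patterns do not. The Python sketch below assumes rows stored as `{column: value}` dictionaries; the names `spgemm_symbolic` and `spgemm_numeric` are hypothetical and do not correspond to a specific library API.

```python
def spgemm_symbolic(a_pattern, b_pattern):
    """Return the sorted column pattern of each row of C = A * B."""
    return [sorted({j for k in a_row for j in b_pattern[k]}) for a_row in a_pattern]


def spgemm_numeric(A, B, c_pattern):
    """Fill the values of C given its precomputed pattern."""
    C = []
    for a_row, cols in zip(A, c_pattern):
        acc = dict.fromkeys(cols, 0.0)       # slots are known in advance
        for k, a_val in a_row.items():
            for j, b_val in B[k].items():
                acc[j] += a_val * b_val
        C.append(acc)
    return C


B = [{0: 4.0}, {0: 5.0, 1: 6.0}]
A1 = [{0: 1.0, 1: 2.0}, {1: 3.0}]
A2 = [{0: 2.0, 1: 0.5}, {1: 1.0}]   # same pattern as A1, new values

pattern = spgemm_symbolic(A1, B)    # computed once
C1 = spgemm_numeric(A1, B, pattern)
C2 = spgemm_numeric(A2, B, pattern) # symbolic result reused
```

Iterating over a dictionary yields its keys, so the symbolic phase can be called on the same row dictionaries and simply ignores the values.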

Order or shuffle: Empirically evaluating vertex order impact on parallel graph computations

Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017

Slota, George M.; Rajamanickam, Sivasankaran; Madduri, Kamesh

The in-memory graph layout affects performance of distributed-memory graph computations. Graph layout could refer to partitioning or replication of vertex and edge arrays, selective replication of data structures that hold meta-data, and reordering vertex and edge identifiers. In this work, we consider one-dimensional graph layouts, where disjoint sets of vertices and their adjacencies are partitioned among processors. Using the PuLP graph partitioning method and a breadth-first search (BFS)-based vertex ordering strategy, we empirically evaluate the impact of this graph layout on a collection of five distributed-memory graph computations. Our evaluation considers several objective metrics in addition to execution time, and we observe a considerable performance improvement over randomization.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus
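
As a small illustration of the BFS-based vertex ordering mentioned in the abstract above, the Python sketch below relabels vertices by their breadth-first visit order and rebuilds the adjacency lists under the new labels. It is a single-process sketch with hypothetical function names (`bfs_order`, `relabel`); the partitioning step and the distributed-memory setting of the paper are not modeled.

```python
from collections import deque


def bfs_order(adj, root=0):
    """Return new_id[v], the position of v in BFS visit order.

    Vertices unreachable from `root` are visited from later start vertices.
    """
    n = len(adj)
    new_id = [-1] * n
    next_id = 0
    for start in [root] + [v for v in range(n) if v != root]:
        if new_id[start] != -1:
            continue
        new_id[start] = next_id
        next_id += 1
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if new_id[w] == -1:
                    new_id[w] = next_id
                    next_id += 1
                    queue.append(w)
    return new_id


def relabel(adj, new_id):
    """Rebuild adjacency lists using the new vertex identifiers."""
    out = [[] for _ in adj]
    for u, nbrs in enumerate(adj):
        out[new_id[u]] = sorted(new_id[w] for w in nbrs)
    return out


# A path 0 - 3 - 2 - 1 whose original labels scatter neighbors.
adj = [[3], [2], [1, 3], [0, 2]]
new_id = bfs_order(adj, root=0)
reordered = relabel(adj, new_id)
```

After relabeling, neighboring vertices receive nearby identifiers, which is the locality effect an ordering strategy of this kind aims for.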

Embedded ensemble propagation for improving performance, portability, and scalability of uncertainty quantification on emerging computational architectures

SIAM Journal on Scientific Computing

Phipps, Eric T.; Edwards, Harold C.; Hoemmen, Mark F.; Hu, Jonathan J.; Rajamanickam, Sivasankaran

Quantifying simulation uncertainties is a critical component of rigorous predictive simulation. A key component of this is forward propagation of uncertainties in simulation input data to output quantities of interest. Typical approaches involve repeated sampling of the simulation over the uncertain input data, and can require numerous samples when accurately propagating uncertainties from large numbers of sources. Often simulation processes from sample to sample are similar and much of the data generated from each sample evaluation could be reused. We explore a new method for implementing sampling methods that simultaneously propagates groups of samples together in an embedded fashion, which we call embedded ensemble propagation. We show how this approach takes advantage of properties of modern computer architectures to improve performance by enabling reuse between samples, reducing memory bandwidth requirements, improving memory access patterns, improving opportunities for fine-grained parallelization, and reducing communication costs. We describe a software technique for implementing embedded ensemble propagation based on the use of C++ templates and describe its integration with various scientific computing libraries within Trilinos. We demonstrate improved performance, portability and scalability for the approach applied to the simulation of partial differential equations on a variety of CPU, GPU, and accelerator architectures, including up to 131,072 cores on a Cray XK7 (Titan).

TYPE Journal Article YEAR 2017

DOI OSTI

ShyLU: A Collection of Node-Scalable Sparse Linear Solvers

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Kim, Kyungjoo; Boman, Erik G.; Deveci, Mehmet

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Enabling Low Mach Fluid Simulations Using Trilinos

Hu, Jonathan J.; Devine, Karen; Hoemmen, Mark F.; Lin, Paul T.; Rajamanickam, Sivasankaran; Roberts, Nathan V.; Siefert, Christopher; Trott, Christian R.; Prokopenko, Andrey

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

A Hierarchical Low-rank Solver for Sparse Linear Systems

Boman, Erik G.; Chen, Chao; Darve, Eric; Pouransari, Hadi; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

KokkosKernels: Compact Layouts for Batched BLAS and Sparse Matrix-Matrix multiply

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Kim, Kyungjoo; Deveci, Mehmet; Trott, Christian R.; Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Hierarchical Matrices and Low-Rank Methods for Extreme-Scale Solvers

Boman, Erik G.; Chen, Chao; Darve, Eric; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Designing Vector-Friendly Compact BLAS and LAPACK Kernels

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Kim, Kyungjoo; Costa, Timothy B.; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran

Many applications, such as PDE-based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14×, 45×, and 27× speedup against OpenMP loops around optimized DGEMM, DTRSM and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Sparse Matrix-matrix multiplication for modern manycore architecture

Deveci, Mehmet; Rajamanickam, Sivasankaran; Boman, Erik G.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

FastILU: Finegrained ASynchronous iterative ILU

Boman, Erik G.; Patel, Aftab; Chow, Edmond; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Performance Portable Sparse Matrix-Matrix Multiplication on Intel Knights Landing and NVIDIA GPUs

Rajamanickam, Sivasankaran; Deveci, Mehmet; Trott, Christian R.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Tacho: Two-level Task Parallel Cholesky Factorization

Kim, Kyungjoo; Rajamanickam, Sivasankaran; Edwards, Harold C.; Dohrmann, Clark R.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Scalable Incomplete Factorization Utilizing Combinatorial Methods to Reduce Overheads

Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Distributing linear systems for parallel computation

Devine, Karen; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Kokkos Task API: A Use Case in Tacho

Kim, Kyungjoo; Rajamanickam, Sivasankaran; Edwards, Harold C.; Olivier, Stephen L.; Stelle, George W.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Trilinos NGP Planning

Rajamanickam, Sivasankaran; Devine, Karen; Hu, Jonathan J.; Hoemmen, Mark F.

Abstract not provided.

TYPE Presentation YEAR 2016

OSTI

Sparse Matrix-matrix multiplication for modern manycore architecture

Deveci, Mehmet; Rajamanickam, Sivasankaran; Boman, Erik G.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

KokkosKernels Introduction: Design API and Performance

Deveci, Mehmet; Rajamanickam, Sivasankaran; Kim, Kyungjoo; Bradley, Andrew M.; Trott, Christian R.; Hoemmen, Mark F.; Boman, Erik G.

Abstract not provided.

TYPE Presentation YEAR 2016

OSTI

Hierarchical Task-Data Parallelism using Kokkos and Qthreads

Edwards, Harold C.; Olivier, Stephen L.; Berry, Jonathan; Mackey, Greg E.; Rajamanickam, Sivasankaran; Wolf, Michael; Kim, Kyungjoo; Stelle, George W.

This report describes a new capability for hierarchical task-data parallelism using Sandia's Kokkos and Qthreads, and an evaluation of this capability with sparse matrix Cholesky factorization and social network triangle enumeration mini-applications. Hierarchical task-data parallelism consists of a collection of tasks with executes-after dependences, where each task contains data-parallel operations performed on a team of hardware threads. The collection of tasks and dependences forms a directed acyclic graph of tasks (a task DAG). Major challenges of this research and development effort include portability and performance across multicore CPU, manycore Intel Xeon Phi, and NVIDIA GPU architectures; scalability with respect to hardware concurrency and size of the task DAG; and usability of the application programmer interface (API).

TYPE SAND Report YEAR 2016

DOI OSTI
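
The task DAG described in the abstract above (tasks linked by executes-after dependences, each task doing data-parallel work) can be mimicked in a few lines of Python with the standard-library `graphlib` module (Python 3.9 or later). This is a sequential sketch for illustration only: the task names, the `run_task_dag` helper, and the list comprehensions standing in for team-level data parallelism are all hypothetical, and nothing here uses Kokkos or Qthreads.

```python
from graphlib import TopologicalSorter


def run_task_dag(tasks, executes_after):
    """Run callables in an order that respects executes-after dependences.

    tasks: name -> callable; executes_after: name -> set of predecessor names.
    """
    sorter = TopologicalSorter({name: executes_after.get(name, set()) for name in tasks})
    sorter.prepare()
    order = []
    while sorter.is_active():
        # Everything returned here is independent and could run concurrently.
        for name in sorted(sorter.get_ready()):
            tasks[name]()
            order.append(name)
            sorter.done(name)
    return order


state = {}
tasks = {
    "load": lambda: state.update(x=[1, 2, 3]),
    # Each list comprehension stands in for a data-parallel loop inside a task.
    "square": lambda: state.update(sq=[v * v for v in state["x"]]),
    "negate": lambda: state.update(neg=[-v for v in state["x"]]),
    "combine": lambda: state.update(
        out=[a + b for a, b in zip(state["sq"], state["neg"])]
    ),
}
deps = {"square": {"load"}, "negate": {"load"}, "combine": {"square", "negate"}}
order = run_task_dag(tasks, deps)
```

Here `square` and `negate` both depend only on `load`, so a parallel runtime would be free to execute them at the same time before `combine` runs.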

Parallel Graph Coloring for Manycore Architectures

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Deveci, Mehmet; Boman, Erik G.; Devine, Karen; Rajamanickam, Sivasankaran

Graph algorithms are challenging to parallelize on manycore architectures due to complex data dependencies and irregular memory access. We consider the well-studied problem of coloring the vertices of a graph. In many applications it is important to compute a coloring with few colors in near-linear time. In parallel, the optimistic (speculative) coloring method by Gebremedhin and Manne is the preferred approach, but it needs to be modified for manycore architectures. We discuss a range of implementation issues for this vertex-based optimistic approach. We also propose a novel edge-based optimistic approach that has more parallelism and is better suited to GPUs. We study the performance empirically on two architectures (Xeon Phi and GPU) and across many data sets (from finite element problems to social networks). Our implementation uses the Kokkos library, so it is portable across platforms. We show that on GPUs, we significantly reduce the number of colors (geometric mean 4X, but up to 48X) as compared to the widely used cuSPARSE library. In addition, our edge-based algorithm is 1.5 times faster on average than cuSPARSE, where it has speedups up to 139X on a circuit problem. We also show the effect of the coloring on a conjugate gradient solver using the multi-colored Symmetric Gauss-Seidel method as preconditioner; the higher coloring quality found by the proposed methods reduces the overall solve time by up to 33% compared to cuSPARSE.

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus
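
The vertex-based optimistic (speculative) scheme referred to in the abstract above alternates two steps: color all pending vertices as if no one else were coloring, then detect conflicts and retry the losers. The Python sketch below simulates the rounds sequentially by letting every vertex read a snapshot of the colors taken at the start of the round. The tie-breaking rule (the higher-numbered endpoint of a conflicting edge is recolored) and the name `speculative_coloring` are assumptions of this sketch, and the edge-based variant proposed in the paper is not shown.

```python
def speculative_coloring(adj):
    """Color vertices in speculative rounds; returns (colors, number of rounds)."""
    n = len(adj)
    color = [-1] * n
    worklist = list(range(n))
    rounds = 0
    while worklist:
        rounds += 1
        snapshot = list(color)  # what concurrently running threads would see
        # Speculative phase: smallest color not used by neighbors in the snapshot.
        for v in worklist:
            forbidden = {snapshot[w] for w in adj[v]}
            c = 0
            while c in forbidden:
                c += 1
            color[v] = c
        # Conflict detection: the higher-numbered endpoint gives up its color.
        conflicts = [v for v in worklist
                     if any(w < v and color[w] == color[v] for w in adj[v])]
        for v in conflicts:
            color[v] = -1
        worklist = conflicts
    return color, rounds


# A triangle (0, 1, 2) with a pendant vertex 3 attached to vertex 2.
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
colors, rounds = speculative_coloring(adj)
```

The lowest-numbered vertex in each worklist always keeps its color, so the worklist shrinks every round and the loop terminates.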

Basker: A threaded sparse LU factorization utilizing hierarchical parallelism and data layouts

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Booth, Joshua D.; Rajamanickam, Sivasankaran; Thornquist, Heidi K.

Scalable sparse LU factorization is critical for efficient numerical simulation of circuits and electrical power grids. In this work, we present a new scalable sparse direct solver called Basker. Basker introduces a new algorithm to parallelize the Gilbert-Peierls algorithm for sparse LU factorization. As architectures evolve, there exists a need for algorithms that are hierarchical in nature to match the hierarchy in thread teams, individual threads, and vector level parallelism. Basker is designed to map well to this hierarchy in architectures. There is also a need for data layouts to match multiple levels of hierarchy in memory. Basker uses a two-dimensional hierarchical structure of sparse matrices that maps to the hierarchy in the memory architectures and to the hierarchy in parallelism. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulations. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to KLU. Basker outperforms Intel MKL Pardiso (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation.

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

A comparison of high-level programming choices for incomplete sparse factorization across different architectures

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Booth, Joshua D.; Kim, Kyungjoo; Rajamanickam, Sivasankaran

All many-core systems require fine-grained shared memory parallelism; however, the most efficient way to extract such parallelism is far from trivial. Fine-grained parallel algorithms face various performance trade-offs related to tasking, accesses to global data-structures, and use of shared cache. While programming models provide high level abstractions, such as data and task parallelism, algorithmic choices still remain open on how to best implement irregular algorithms, such as sparse factorizations, while taking into account the trade-offs mentioned above. In this paper, we compare these performance trade-offs for task and data parallelism on different hardware architectures such as Intel Sandy Bridge, Intel Xeon Phi, and IBM Power8. We do this by comparing the scaling of a new task-parallel incomplete sparse Cholesky factorization called Tacho and a new data-parallel incomplete sparse LU factorization called Basker. Both solvers utilize the Kokkos programming model and were developed within the ShyLU package of Trilinos. Using these two codes we demonstrate how high-level programming changes affect performance and overhead costs on multiple multi/many-core systems. We find that Kokkos is able to provide comparable performance with both parallel-for and task/futures on traditional x86 multicores. However, the choice of which high-level abstraction to use on many-core systems depends on both the architectures and input matrices.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

Sparse Matrix-Matrix Multiplication for Modern Architectures

Deveci, Mehmet; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Robust Solvers for Circuit Simulation on Modern Architectures

Rajamanickam, Sivasankaran; Booth, Joshua; Thornquist, Heidi K.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Partitioning and Task Placement with Zoltan2

Deveci, Mehmet; Devine, Karen; Boman, Erik G.; Leung, Vitus J.; Rajamanickam, Sivasankaran; Taylor, Mark A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Parallel Graph Coloring for Many Core Architectures

Deveci, Mehmet; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Task and Data Parallelism Based Direct Solvers and Preconditioners in Manycore Architecture: Efforts in Trilinos/ShyLU

Booth, Joshua D.; Rajamanickam, Sivasankaran; Bradley, Andrew M.; Boman, Erik G.; Kim, Kyungjoo

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Ensemble Grouping strategies for embedded Stochastic Collocation methods applied to anisotropic diffusion problems

Edwards, Harold C.; Hu, Jonathan J.; Phipps, Eric T.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Multi-Jagged: A Scalable Parallel Spatial Partitioning Algorithm

IEEE Transactions on Parallel and Distributed Systems

Deveci, Mehmet; Rajamanickam, Sivasankaran; Devine, Karen; Catalyurek, Umit V.

Geometric partitioning is fast and effective for load-balancing dynamic applications, particularly those requiring geometric locality of data (particle methods, crash simulations). We present, to our knowledge, the first parallel implementation of a multidimensional-jagged geometric partitioner. In contrast to the traditional recursive coordinate bisection algorithm (RCB), which recursively bisects subdomains perpendicular to their longest dimension until the desired number of parts is obtained, our algorithm does recursive multi-section with a given number of parts in each dimension. By computing multiple cut lines concurrently and intelligently deciding when to migrate data while computing the partition, we minimize data movement compared to efficient implementations of recursive bisection. We demonstrate the algorithm's scalability and quality relative to the RCB implementation in Zoltan on both real and synthetic datasets. Our experiments show that the proposed algorithm performs and scales better than RCB in terms of run-time without degrading the load balance. Our implementation partitions 24 billion points into 65,536 parts within a few seconds and exhibits near perfect weak scaling up to 6K cores.

More Details

TYPE Journal Article YEAR 2016

DOI OSTI Scopus

Parallel Preconditioners and Solvers for Modern Architectures

Boman, Erik G.; Deveci, Mehmet; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Embedded Ensemble Propagation for Improving Performance Portability and Scalability of Uncertainty Quantification on Emerging Computational Architectures

Phipps, Eric T.; Edwards, Harold C.; Hoemmen, Mark F.; Hu, Jonathan J.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Performance Portability for Linear Algebra with Kokkos

Trott, Christian R.; Edwards, Harold C.; Ellingwood, Nathan D.; Hammond, Simon; Deveci, Mehmet; Boman, Erik G.; Bradley, Andrew M.; Hoemmen, Mark F.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Dungeon Session Application: BDDC Solver Library

Hammond, Simon; Dohrmann, Clark R.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

Kim, Kyungjoo; Rajamanickam, Sivasankaran; Stelle, George W.; Edwards, Harold C.; Olivier, Stephen L.

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-by-blocks approach induces a task graph for the factorization. These tasks are inter-related to each other through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms, i.e., Kokkos. A performance evaluation is presented on both Intel Sandy Bridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate the merits of the proposed task-based factorization. Experimental results demonstrate that, using 56 threads on the Intel Xeon Phi processor, our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Cholesky-by-blocks and 19.2x speedup over serial Cholesky, which does not carry tasking overhead, for sparse matrices arising from various application problems.

More Details

TYPE Other Report YEAR 2015

DOI OSTI

Preconditioning Communication-Avoiding Krylov Methods

Rajamanickam, Sivasankaran; Yamazaki, I.; Boman, Erik G.; Prokopenko, Andrey V.; Heroux, Michael A.; Dongarra, J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

Booth, Joshua D.; Rajamanickam, Sivasankaran; Thornquist, Heidi K.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

DOI OSTI

Task-parallel Sparse Incomplete Cholesky Factorization using Kokkos Portable APIs

Kim, Kyungjoo; Rajamanickam, Sivasankaran; Edwards, Harold C.; Olivier, Stephen L.; Stelle, George W.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

ShyLU and Thread Scalable Subdomain Solvers

Rajamanickam, Sivasankaran; Boman, Erik G.; Bradley, Andrew M.; Booth, Joshua D.; Deveci, Mehmet; Kim, Kyungjoo; Dohrmann, Clark R.; Thornquist, Heidi K.; Chow, Edmond; Patel, Aftab

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

ShyLU: On node Solvers and Kokkos-Kernels

Rajamanickam, Sivasankaran; Boman, Erik G.; Bradley, Andrew M.; Booth, Joshua D.; Kim, Kyungjoo; Deveci, Mehmet

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Preconditioning Communication-Avoiding Krylov Methods

Rajamanickam, Sivasankaran; Yamazaki, Ichitaro; Boman, Erik G.; Hoemmen, Mark F.; Heroux, Michael A.; Tomov, Stan; Dongarra, Jack

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Communication-Avoiding Preconditioners for s-step Krylov Methods

Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Basker: A Scalable Sparse Direct Linear Solver for Many-Core Architectures

Booth, Joshua D.; Rajamanickam, Sivasankaran; Boman, Erik G.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Architecture-aware Task Placement

Deveci, Mehmet; Devine, Karen; Leung, Vitus J.; Prokopenko, Andrey V.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Web Graph Analysis on the Blue Waters Supercomputer

Slota, George M.; Rajamanickam, Sivasankaran; Madduri, Kamesh

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Irregular Graph Algorithms on Parallel Processing Systems

Slota, George M.; Rajamanickam, Sivasankaran; Madduri, Kamesh

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

High-Performance Graph Analytics on Manycore Processors

Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015

Slota, George M.; Rajamanickam, Sivasankaran; Madduri, Kamesh

The divergence in the computer architecture landscape has resulted in different architectures being considered mainstream at the same time. For application and algorithm developers, a dilemma arises when one must focus on using underlying architectural features to extract the best performance on each of these architectures, while writing portable code at the same time. We focus on this problem with graph analytics as our target application domain. In this paper, we present an abstraction-based methodology for performance-portable graph algorithm design on manycore architectures. We demonstrate our approach by systematically optimizing algorithms for the problems of breadth-first search, color propagation, and strongly connected components. We use Kokkos, a manycore library and programming model, for prototyping our algorithms. Our portable implementation of the strongly connected components algorithm on the NVIDIA Tesla K40M is up to 3.25× faster than a state-of-the-art parallel CPU implementation on a dual-socket Sandy Bridge compute node.

More Details

TYPE Conference Poster YEAR 2015

DOI OSTI Scopus

High-Performance Computing for Extreme-Scale Data Analytics

Boman, Erik G.; Madduri, Kamesh; Rajamanickam, Sivasankaran; Wolf, Michael

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Distributing Linear Systems for Parallel Computation

Devine, Karen; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Supercomputing for Web Graph Analytics

Slota, George M.; Rajamanickam, Sivasankaran; Madduri, Kamesh

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Embedded Sampling-Based Uncertainty Quantification Approaches for Emerging Computer Architectures

D'Elia, Marta; Phipps, Eric T.; Edwards, Harold C.; Hu, Jonathan J.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Parallel Graph Coloring

Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Preconditioning Communication-Avoiding Krylov Methods

Rajamanickam, Sivasankaran; Yamazaki, Ichitaro; Boman, Erik G.; Hoemmen, Mark F.; Heroux, Michael A.; Tomov, Stanimire; Dongarra, Jack

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Exploring Embedded Uncertainty Quantification Methods on Next-Generation Computer Architectures

Phipps, Eric T.; D'Elia, Marta; Hu, Jonathan J.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

The Zoltan2 Toolkit: Partitioning, Task Placement, Coloring, and Ordering

Devine, Karen; Boman, Erik G.; Rajamanickam, Sivasankaran; Leung, Vitus J.; Riesen, Lee A.; Deveci, Mehmet; Catalyurek, Umit

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

A hybrid approach for parallel transistor-level full-chip circuit simulation

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Thornquist, Heidi K.; Rajamanickam, Sivasankaran

The computer-aided design (CAD) applications that are fundamental to the electronic design automation industry need to harness the available hardware resources to be able to perform full-chip simulation for modern technology nodes (45 nm and below). We present a hybrid (MPI+threads) approach for parallel transistor-level transient circuit simulation that achieves scalable performance for some challenging large-scale integrated circuits. This approach focuses on the computationally expensive part of the simulator: the linear system solve. Hybrid versions of two iterative linear solver strategies are presented: one takes advantage of block triangular form structure, while the other uses a Schur complement technique. Results indicate up to a 27x improvement in total simulation time on 256 cores.

More Details

TYPE Conference YEAR 2015

Scopus OSTI DOI