Triangle counting serves as a key building block for a set of important graph algorithms in network science. In this paper, we address the IEEE HPEC Static Graph Challenge problem of triangle counting, focusing on obtaining the best parallel performance on a single multicore node. Our implementation uses a linear algebra-based approach to triangle counting that has grown out of work related to our miniTri data analytics miniapplication [1] and our efforts to pose graph algorithms in the language of linear algebra. We leverage KokkosKernels to implement this approach efficiently on multicore architectures. Our performance results are competitive with the fastest known graph traversal-based approaches and are significantly faster than the Graph Challenge reference implementations, up to 670,000 times faster than the C++ reference and 10,000 times faster than the Python reference on a single Intel Haswell node.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

KKTri: Fast Linear Algebra-Based Triangle Counting with KokkosKernels

Wolf, Michael; Deveci, Mehmet; Berry, Jonathan; Hammond, Simon; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Performance Portable Line Smoother for Multiphysics Problems using Compact Batched BLAS

Kim, Kyungjoo; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Asynchronous Iterative Solvers: The ACHILES library and domain decomposition methods

Boman, Erik G.; Rajamanickam, Sivasankaran; Glusa, Christian; Chow, Edmond; Ramanan, Paritosh

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

A Parallel Hierarchical Low-Rank Solver for General Sparse Matrices

Boman, Erik G.; Chen, Chao; Darve, Eric; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

A Hierarchical Low-Rank Solver for Large Sparse Linear Systems

Boman, Erik G.; Chen, Chao; Darve, Eric; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Performance Portable Sparse Matrix Matrix Multiplication with Applications in Scientific Computing and Graph Analytics

Deveci, Mehmet; Trott, Christian R.; Hammond, Simon; Wolf, Michael; Berry, Jonathan; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Fast Linear Algebra-Based Triangle Counting with KokkosKernels

Wolf, Michael; Deveci, Mehmet; Berry, Jonathan; Hammond, Simon; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

DOI OSTI

Task Placement to Reduce Application Communication Costs

Devine, Karen; Brandt, James M.; Deveci, Mehmet; Gentile, Ann C.; Leung, Vitus J.; Olivier, Stephen L.; Foulk, James W.; Rajamanickam, Sivasankaran; Taylor, Mark A.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Fast Linear Algebra-Based Triangle Counting with KokkosKernels

Wolf, Michael; Deveci, Mehmet; Berry, Jonathan; Hammond, Simon; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

DOI OSTI

Performance-portable sparse matrix-matrix multiplication for many-core architectures

Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017

Deveci, Mehmet; Trott, Christian R.; Rajamanickam, Sivasankaran

We consider the problem of writing a performance-portable sparse matrix-sparse matrix multiplication (SPGEMM) kernel for many-core architectures. We approach the SPGEMM kernel from the perspectives of algorithm design and implementation, and its practical usage. First, we design a hierarchical, memory-efficient SPGEMM algorithm. We then design and implement thread-scalable data structures that enable us to develop a portable SPGEMM implementation. We show that the method achieves performance portability on massively threaded architectures, namely Intel's Knights Landing processors (KNLs) and NVIDIA's Graphics Processing Units (GPUs), by comparing its performance to specialized implementations. Second, we study an important aspect of SPGEMM's usage in practice by reusing the structure of input matrices, and show speedups up to 3× compared to the best specialized implementation on KNLs. We demonstrate that the portable method outperforms 4 native methods on 2 different GPU architectures (up to 17× speedup), and it is highly thread scalable on KNLs, in which it obtains 101× speedup on 256 threads.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Order or shuffle: Empirically evaluating vertex order impact on parallel graph computations

Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2017

Slota, George M.; Rajamanickam, Sivasankaran; Madduri, Kamesh

The in-memory graph layout affects performance of distributed-memory graph computations. Graph layout could refer to partitioning or replication of vertex and edge arrays, selective replication of data structures that hold meta-data, and reordering vertex and edge identifiers. In this work, we consider one-dimensional graph layouts, where disjoint sets of vertices and their adjacencies are partitioned among processors. Using the PuLP graph partitioning method and a breadth-first search (BFS)-based vertex ordering strategy, we empirically evaluate the impact of this graph layout on a collection of five distributed-memory graph computations. Our evaluation considers several objective metrics in addition to execution time, and we observe a considerable performance improvement over randomization.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Embedded ensemble propagation for improving performance, portability, and scalability of uncertainty quantification on emerging computational architectures

SIAM Journal on Scientific Computing

Phipps, Eric T.; Edwards, Harold C.; Hoemmen, Mark F.; Hu, Jonathan J.; Rajamanickam, Sivasankaran

Quantifying simulation uncertainties is a critical component of rigorous predictive simulation. A key part of this is forward propagation of uncertainties in simulation input data to output quantities of interest. Typical approaches involve repeated sampling of the simulation over the uncertain input data, and can require numerous samples when accurately propagating uncertainties from large numbers of sources. Often simulation processes from sample to sample are similar and much of the data generated from each sample evaluation could be reused. We explore a new method for implementing sampling methods that simultaneously propagates groups of samples together in an embedded fashion, which we call embedded ensemble propagation. We show how this approach takes advantage of properties of modern computer architectures to improve performance by enabling reuse between samples, reducing memory bandwidth requirements, improving memory access patterns, improving opportunities for fine-grained parallelization, and reducing communication costs. We describe a software technique for implementing embedded ensemble propagation based on the use of C++ templates and describe its integration with various scientific computing libraries within Trilinos. We demonstrate improved performance, portability and scalability for the approach applied to the simulation of partial differential equations on a variety of CPU, GPU, and accelerator architectures, including up to 131,072 cores on a Cray XK7 (Titan).

TYPE Journal Article YEAR 2017

DOI OSTI

ShyLU: A Collection of Node-Scalable Sparse Linear Solvers

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Kim, Kyungjoo; Boman, Erik G.; Deveci, Mehmet

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Enabling Low Mach Fluid Simulations Using Trilinos

Hu, Jonathan J.; Devine, Karen; Hoemmen, Mark F.; Lin, Paul T.; Rajamanickam, Sivasankaran; Roberts, Nathan V.; Siefert, Christopher; Trott, Christian R.; Prokopenko, Andrey

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

A Hierarchical Low-rank Solver for Sparse Linear Systems

Boman, Erik G.; Chen, Chao; Darve, Eric; Pouransari, Hadi; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

KokkosKernels: Compact Layouts for Batched Blas and Sparse Matrix-Matrix multiply

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Kim, Kyungjoo; Deveci, Mehmet; Trott, Christian R.; Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Hierarchical Matrices and Low-Rank Methods for Extreme-Scale Solvers

Boman, Erik G.; Chen, Chao; Darve, Eric; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Designing Vector-Friendly Compact BLAS and LAPACK Kernels

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Kim, Kyungjoo; Costa, Timothy B.; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran

Many applications, such as PDE-based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14×, 45×, and 27× speedup against OpenMP loops around optimized DGEMM, DTRSM and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus
