Publications Search

This report describes a new capability for hierarchical task-data parallelism using Sandia's Kokkos and Qthreads, and evaluation of this capability with sparse matrix Cholesky factorization and social network triangle enumeration mini-applications. Hierarchical task-data parallelism consists of a collection of tasks with executes-after dependences where each task contains data parallel operations performed on a team of hardware threads. The collection of tasks and dependences form a directed acyclic graph of tasks - a task DAG. Major challenges of this research and development effort include: portability and performance across multicore CPU; manycore Intel Xeon Phi, and NVIDIA GPU architectures; scalability with respect to hardware concurrency and size of the task DAG; and usability of the application programmer interface (API).

More Details

TYPE SAND Report YEAR 2016

DOI OSTI

Parallel Graph Coloring for Manycore Architectures

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Deveci, Mehmet; Boman, Erik G.; Devine, Karen; Rajamanickam, Sivasankaran

Graph algorithms are challenging to parallelize on manycore architectures due to complex data dependencies and irregular memory access. We consider the well studied problem of coloring the vertices of a graph. In many applications it is important to compute a coloring with few colors in near-lineartime. In parallel, the optimistic (speculative) coloring method by Gebremedhin and Manne is the preferred approach but it needs to be modified for manycore architectures. We discuss a range of implementation issues for this vertex-based optimistic approach. We also propose a novel edge-based optimistic approach that has more parallelism and is better suited to GPUs. We study the performance empirically on two architectures(Xeon Phi and GPU) and across many data sets (from finite element problems to social networks). Our implementation uses the Kokkos library, so it is portable across platforms. We show that on GPUs, we significantly reduce the number of colors (geometric mean 4X, but up to 48X) as compared to the widely used cuSPARSE library. In addition, our edge-based algorithm is 1.5 times faster on average than cuSPARSE, where it hasspeedups up to 139X on a circuit problem. We also show the effect of the coloring on a conjugate gradient solver using multi-colored Symmetric Gauss-Seidel method as preconditioner, the higher coloring quality found by the proposed methods reduces the overall solve time up to 33% compared to cuSPARSE.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

Basker: A threaded sparse LU factorization utilizing hierarchical parallelism and data layouts

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Booth, Joshua D.; Rajamanickam, Sivasankaran; Thornquist, Heidi K.

Scalable sparse LU factorization is critical for efficient numerical simulation of circuits and electrical power grids. In this work, we present a new scalable sparse direct solver called Basker. Basker introduces a new algorithm to parallelize the Gilbert-Peierls algorithm for sparse LU factorization. As architectures evolve, there exists a need for algorithms that are hierarchical in nature to match the hierarchy in thread teams, individual threads, and vector level parallelism. Basker is designed to map well to this hierarchy in architectures. There is also a need for data layouts to match multiple levels of hierarchy in memory. Basker uses a two-dimensional hierarchical structure of sparse matrices that maps to the hierarchy in the memory architectures and to the hierarchy in parallelism. We present performance evaluations of Basker on the Intel SandyBridge and Xeon Phi platforms using circuit and power grid matrices taken from the University of Florida sparse matrix collection and from Xyce circuit simulations. Basker achieves a geometric mean speedup of 5.91× on CPU (16 cores) and 7.4× on Xeon Phi (32 cores) relative to KLU. Basker outperforms Intel MKL Pardiso (PMKL) by as much as 30× on CPU (16 cores) and 7.5× on Xeon Phi (32 cores) for low fill-in circuit matrices. Furthermore, Basker provides 5.4× speedup on a challenging matrix sequence taken from an actual Xyce simulation.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

A comparison of high-level programming choices for incomplete sparse factorization across different architectures

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Booth, Joshua D.; Kim, Kyungjoo; Rajamanickam, Sivasankaran

All many-core systems require fine-grained shared memory parallelism, however the most efficient way to extract such parallelism is far from trivial. Fine-grained parallel algorithms face various performance trade-offs related to tasking, accesses to global data-structures, and use of shared cache. While programming models provide high level abstractions, such as data and task parallelism, algorithmic choices still remain open on how to best implement irregular algorithms, such as sparse factorizations, while taking into account the trade-offs mentioned above. In this paper, we compare these performance trade-offs for task and data parallelism on different hardware architectures such as Intel Sandy Bridge, Intel Xeon Phi, and IBM Power8. We do this by comparing the scaling of a new task-parallel incomplete sparse Cholesky factorization called Tacho and a new data-parallel incomplete sparse LU factorization called Basker. Both solvers utilize Kokkos programming model and were developed within the ShyLU package of Trilinos. Using these two codes we demonstrate how high-level programming changes affect performance and overhead costs on multiple multi/many-core systems. We find that Kokkos is able to provide comparable performance with both parallel-for and task/futures on traditional x86 multicores. However, the choice of which high-level abstraction to use on many-core systems depends on both the architectures and input matrices.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

Sparse Matrix-Matrix Multiplication for Modern Architectures

Deveci, Mehmet; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Robust Solvers for Circuit Simulation on Modern Architectures

Rajamanickam, Sivasankaran; Booth, Joshua; Thornquist, Heidi K.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Partitioning and Task Placement with Zoltan2

Deveci, Mehmet; Devine, Karen; Boman, Erik G.; Leung, Vitus J.; Rajamanickam, Sivasankaran; Taylor, Mark A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Parallel Graph Coloring for Many Core Architectures

Deveci, Mehmet; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Task and Data Parallelism Based Direct Solvers and Preconditioners in Manycore Architecture: Efforts in Trilinos/ShyLU

Booth, Joshua D.; Rajamanickam, Sivasankaran; Bradley, Andrew M.; Boman, Erik G.; Kim, Kyungjoo

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Ensemble Grouping strategies for embedded Stochastic Collocation methods applied to anisotropic diffusion problems

Edwards, Harold C.; Hu, Jonathan J.; Phipps, Eric T.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Multi-Jagged: A Scalable Parallel Spatial Partitioning Algorithm

IEEE Transactions on Parallel and Distributed Systems

Deveci, Mehmet; Rajamanickam, Sivasankaran; Devine, Karen; Catalyurek, Umit V.

Geometric partitioning is fast and effective for load-balancing dynamic applications, particularly those requiring geometric locality of data (particle methods, crash simulations). We present, to our knowledge, the first parallel implementation of a multidimensional-jagged geometric partitioner. In contrast to the traditional recursive coordinate bisection algorithm (RCB), which recursively bisects subdomains perpendicular to their longest dimension until the desired number of parts is obtained, our algorithm does recursive multi-section with a given number of parts in each dimension. By computing multiple cut lines concurrently and intelligently deciding when to migrate data while computing the partition, we minimize data movement compared to efficient implementations of recursive bisection. We demonstrate the algorithm's scalability and quality relative to the RCB implementation in Zoltan on both real and synthetic datasets. Our experiments show that the proposed algorithm performs and scales better than RCB in terms of run-time without degrading the load balance. Our implementation partitions 24 billion points into 65,536 parts within a few seconds and exhibits near perfect weak scaling up to 6K cores.

More Details

TYPE Journal Article YEAR 2016

DOI OSTI Scopus

Parallel Preconditioners and Solvers for Modern Architectures

Boman, Erik G.; Deveci, Mehmet; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Embedded Ensemble Propagation for Improving Performance Portability and Scalability of Uncertainty Quantification on Emerging Computational Architectures

Phipps, Eric T.; Edwards, Harold C.; Hoemmen, Mark F.; Hu, Jonathan J.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Performance Portability for Linear Algebra with Kokkos

Trott, Christian R.; Edwards, Harold C.; Ellingwood, Nathan D.; Hammond, Simon; Deveci, Mehmet; Boman, Erik G.; Bradley, Andrew M.; Hoemmen, Mark F.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Dungeon Session Application: BDDC Solver Library

Hammond, Simon; Dohrmann, Clark R.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

Kim, Kyungjoo; Rajamanickam, Sivasankaran; Stelle, George W.; Edwards, Harold C.; Olivier, Stephen L.

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-byblocks approach induces a task graph for the factorization. These tasks are inter-related to each other through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms i.e., Kokkos. A performance evaluation is presented on both Intel Sandybridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Choleskyby- blocks and 19.2x speedup over serial Cholesky performance which does not carry tasking overhead using 56 threads on the Intel Xeon Phi processor for sparse matrices arising from various application problems.

More Details

TYPE Other Report YEAR 2015

DOI OSTI

Preconditioning Communication-Avoiding Krylov Methods

Rajamanickam, Sivasankaran; Yamazaki, I.; Boman, Erik G.; Prokopenko, Andrey V.; Heroux, Michael A.; Dongarra, J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Basker: A Threaded Sparse LU Factorization Utilizing Hierarchical Parallelism and Data Layouts

Booth, Joshua D.; Rajamanickam, Sivasankaran; Thornquist, Heidi K.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

DOI OSTI

Task-parallel Sparse Incomplete Cholesky Factorization using Kokkos Portable APIs

Kim, Kyungjoo; Rajamanickam, Sivasankaran; Edwards, Harold C.; Olivier, Stephen L.; Stelle, George W.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

ShyLU and Thread Scalable Subdomain Solvers

Rajamanickam, Sivasankaran; Boman, Erik G.; Bradley, Andrew M.; Booth, Joshua D.; Deveci, Mehmet; Kim, Kyungjoo; Dohrmann, Clark R.; Thornquist, Heidi K.; Chow, Edmond; Patel, Aftab

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

ShyLU: On node Solvers and Kokkos-Kernels

Rajamanickam, Sivasankaran; Boman, Erik G.; Bradley, Andrew M.; Booth, Joshua D.; Kim, Kyungjoo; Deveci, Mehmet

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Preconditioning Communication-Avoiding Krylov Methods

Rajamanickam, Sivasankaran; Yamazaki, Ichitaro; Boman, Erik G.; Hoemmen, Mark F.; Heroux, Michael A.; Tomov, Stan; Dongarra, Jack

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Communication-Avoiding Preconditioners for s-step Krylov Methods

Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Basker: A Scalable Sparse Direct Linear Solver for Many-Core Architectures

Booth, Joshua D.; Rajamanickam, Sivasankaran; Boman, Erik G.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Architecture-aware Task Placement

Deveci, Mehmet; Devine, Karen; Leung, Vitus J.; Prokopenko, Andrey V.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

WebGraphAnalysisontheBlueWaters Supercomputer

Slota, George M.; Rajamanickam, Sivasankaran; Madduri, Kamesh

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Irregular Graph Algorithms on Parallel Processing Systems

Slota, George M.; Rajamanickam, Sivasankaran; Madduri, Kamesh

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

High-Performance Graph Analytics on Manycore Processors

Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015

Slota, George M.; Rajamanickam, Sivasankaran; Madduri, Kamesh

The divergence in the computer architecture landscape has resulted in different architectures being considered mainstream at the same time. For application and algorithm developers, a dilemma arises when one must focus on using underlying architectural features to extract the best performance on each of these architectures, while writing portable code at the same time. We focus on this problem with graph analytics as our target application domain. In this paper, we present an abstraction-based methodology for performance-portable graph algorithm design on manicure architectures. We demonstrate our approach by systematically optimizing algorithms for the problems of breadth-first search, color propagation, and strongly connected components. We use Kokkos, a manicure library and programming model, for prototyping our algorithms. Our portable implementation of the strongly connected components algorithm on the NVIDIA Tesla K40M is up to 3.25× faster than a state-of-the-art parallel CPU implementation on a dual-socket Sandy Bridge compute node.

More Details

TYPE Conference Poster YEAR 2015

DOI OSTI Scopus

High-Performance Computing for Extreme-Scale Data Analytics

Boman, Erik G.; Madduri, Kamesh; Rajamanickam, Sivasankaran; Wolf, Michael

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Distributing Linear Systems for Parallel Computation

Devine, Karen; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Supercomputing for Web Graph Analytics

Slota, George M.; Rajamanickam, Sivasankaran; Madduri, Kamesh

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Embedded Sampling?Based Uncertainty Quantification Approaches for Emerging Computer Architectures

D'Elia, Marta; Phipps, Eric T.; Edwards, Harold C.; Hu, Jonathan J.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Parallel Graph Coloring

Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Preconditioning Communication-Avoiding Krylov Methods

Rajamanickam, Sivasankaran; Yamazaki, Ichitaro; Boman, Erik G.; Hoemmen, Mark F.; Heroux, Michael A.; Tomov, Stanimire; Dongarra, Jack

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Exploring Embedded Uncertainty Quantification Methods on Next-Generation Computer Architectures

Phipps, Eric T.; D'Elia, Marta; Hu, Jonathan J.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

The Zoltan2 Toolkit: Partitioning Task Placement Coloring and Ordering

Devine, Karen; Boman, Erik G.; Rajamanickam, Sivasankaran; Leung, Vitus J.; Riesen, Lee A.; Deveci, Mehmet; Catalyurek, Umit

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

A hybrid approach for parallel transistor-level full-chip circuit simulation

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Thornquist, Heidi K.; Rajamanickam, Sivasankaran

The computer-aided design (CAD) applications that are fundamental to the electronic design automation industry need to harness the available hardware resources to be able to perform full-chip simulation for modern technology nodes (45nm and below). We will present a hybrid (MPI+threads) approach for parallel transistor-level transient circuit simulation that achieves scalable performance for some challenging large-scale integrated circuits. This approach focuses on the computationally expensive part of the simulator: the linear system solve. Hybrid versions of two iterative linear solver strategies are presented, one takes advantage of block triangular form structure while the other uses a Schur complement technique. Results indicate up to a 27x improvement in total simulation time on 256 cores.

More Details

TYPE Conference YEAR 2015

Scopus OSTI DOI