Publications Search

Using architecture information and real-time resource state to reduce power consumption and communication costs in parallel applications

Brandt, James M.; Devine, Karen; Gentile, Ann C.; Leung, Vitus J.; Olivier, Stephen L.; Foulk, James W.; Rajamanickam, Sivasankaran; Bunde, David P.; Deveci, Mehmet; Catalyurek, Umit V.

As computer systems grow in both size and complexity, the need for applications and run-time systems to adjust to their dynamic environment also grows. The goal of the RAAMP LDRD was to combine static architecture information and real-time system state with algorithms to conserve power, reduce communication costs, and avoid network contention. We developed new data collection and aggregation tools to extract static hardware information (e.g., node/core hierarchy, network routing) as well as real-time performance data (e.g., CPU utilization, power consumption, memory bandwidth saturation, percentage of used bandwidth, number of network stalls). We created application interfaces that allowed this data to be used easily by algorithms. Finally, we demonstrated the benefit of integrating system and application information for two use cases. The first used real-time power consumption and memory bandwidth saturation data to throttle concurrency to save power without increasing application execution time. The second used static or real-time network traffic information to reduce or avoid network congestion by remapping MPI tasks to allocated processors. Results from our work are summarized in this report; more details are available in our publications [2, 6, 14, 16, 22, 29, 38, 44, 51, 54].

TYPE SAND Report YEAR 2014

DOI OSTI

Kokkos: a Manycore Device Performance Portability Library for C++ HPC Applications

Rajamanickam, Sivasankaran; Edwards, Harold C.; Trott, Christian R.; Sunderland, Daniel

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

Computer Science Research Institute (CSRI) Summer Proceedings 2013

Rajamanickam, Sivasankaran; Parks, Michael L.; Collis, Samuel S.

The Computer Science Research Institute (CSRI) brings university faculty and students to Sandia National Laboratories for focused collaborative research on computer science, computational science, and mathematics problems that are critical to the mission of the laboratories, the Department of Energy, and the United States. The CSRI provides a mechanism by which university researchers learn about and impact national- and global-scale problems while simultaneously bringing new ideas from the academic research community to bear on these important problems. A key component of CSRI programs over the last decade has been an active and productive summer program where students from around the country conduct internships at CSRI. Each student is paired with a Sandia staff member who serves as technical advisor and mentor. The goals of the summer program are to expose the students to research in mathematical and computer sciences at Sandia and to conduct a meaningful and impactful summer research project with their Sandia mentor. Every effort is made to align summer projects with the student's research objectives, and all work is coordinated with the ongoing research activities of the Sandia mentor in alignment with Sandia technical thrusts. For the 2013 CSRI Proceedings, research articles have been organized into the following broad technical focus areas: Computational Mathematics and Algorithms, Combinatorial Algorithms and Visualization, Advanced Architectures and Systems Software, and Computational Applications. These areas are well aligned with Sandia's strategic thrusts in computer and information sciences.

TYPE Other Report YEAR 2014

DOI OSTI

PULP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks

Rajamanickam, Sivasankaran; Slota, George M.; Madduri, Kamesh

Abstract not provided.

TYPE Conference Poster YEAR 2014

OSTI

Towards extreme-scale simulations for low-Mach fluids with second-generation Trilinos

Parallel Processing Letters

Lin, Paul T.; Bettencourt, Matthew T.; Domino, Stefan P.; Fisher, Travis C.; Hoemmen, Mark F.; Hu, Jonathan J.; Phipps, Eric T.; Prokopenko, Andrey V.; Rajamanickam, Sivasankaran; Siefert, Christopher; Kennon, Stephen R.

Abstract not provided.

TYPE Journal Article YEAR 2014

DOI OSTI

FASTMath Partitioning and Task Placement

Devine, Karen; Diamond, Gerrett; Ibanez, Dan; Leung, Vitus J.; Prokopenko, Andrey V.; Rajamanickam, Sivasankaran; Shephard, Mark; Smith, Cameron

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

Zoltan Three-Slide Overview for ATPESC 2014

Devine, Karen; Rajamanickam, Sivasankaran; Prokopenko, Andrey V.; Boman, Erik G.

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

Towards Extreme-scale Simulations with Next-Generation Trilinos: a low Mach application case study

Lin, Paul T.; Siefert, Christopher; Cyr, Eric C.; Bettencourt, Matthew T.; Domino, Stefan P.; Fisher, Travis C.; Hoemmen, Mark F.; Hu, Jonathan J.; Phipps, Eric T.; Prokopenko, Andrey V.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference YEAR 2014

OSTI DOI

Building Blocks for Graph Based Network Analysis

Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference YEAR 2014

OSTI

Improving Parallel Performance of Coarse Grids in an Algebraic Multigrid

Prokopenko, Andrey V.; Hu, Jonathan J.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference YEAR 2014

OSTI

Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on Distributed GPUs

Boman, Erik G.; Heroux, Michael A.; Hoemmen, Mark F.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference YEAR 2014

OSTI

Zoltan2: Exploiting Geometric Partitioning in Task Mapping for Parallel Computers

Leung, Vitus J.; Rajamanickam, Sivasankaran; Pedretti, Kevin; Olivier, Stephen L.; Devine, Karen

Abstract not provided.

TYPE Conference YEAR 2014

OSTI

Parallel Strongly Connected Components in Shared Memory Architectures

Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference YEAR 2014

OSTI

Reducing Coarse Grids Contention in a Parallel Algebraic Multigrid

Prokopenko, Andrey V.; Hu, Jonathan J.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference YEAR 2014

OSTI

Domain Decomposition Preconditioners for Communication-Avoiding Krylov Methods on a Hybrid CPU/GPU Cluster

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Yamazaki, Ichitaro; Rajamanickam, Sivasankaran; Boman, Erik G.; Hoemmen, Mark F.; Heroux, Michael A.; Tomov, Stanimire

Krylov subspace projection methods are widely used iterative methods for solving large-scale linear systems of equations. Researchers have demonstrated that communication-avoiding (CA) techniques can improve Krylov methods' performance on modern computers, where communication is becoming increasingly expensive compared to arithmetic operations. In this paper, we extend these studies with two major contributions. First, we present our implementation of a CA variant of the Generalized Minimum Residual (GMRES) method, called CA-GMRES, for solving nonsymmetric linear systems of equations on a hybrid CPU/GPU cluster. Our performance results on up to 120 GPUs show that CA-GMRES gives a speedup of up to 2.5x in total solution time over standard GMRES on a hybrid cluster with twelve Intel Xeon CPUs and three Nvidia Fermi GPUs on each node. We then outline a domain decomposition framework to introduce a family of preconditioners that are suitable for CA Krylov methods. Our preconditioners do not incur any additional communication and allow the easy reuse of existing algorithms and software for the subdomain solves. Experimental results on the hybrid CPU/GPU cluster demonstrate that CA-GMRES with preconditioning achieves a speedup of up to 7.4x over CA-GMRES without preconditioning, and a speedup of up to 1.7x over GMRES with preconditioning, in total solution time. These results confirm the potential of our framework to develop a practical and effective preconditioned CA Krylov method.

TYPE Conference YEAR 2014

Scopus OSTI

BFS and coloring-based parallel algorithms for strongly connected components and related problems

Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS

Slota, George M.; Rajamanickam, Sivasankaran; Madduri, Kamesh

Finding the strongly connected components (SCCs) of a directed graph is a fundamental graph-theoretic problem. Tarjan's algorithm is an efficient serial algorithm to find SCCs, but relies on the hard-to-parallelize depth-first search (DFS). We observe that implementations of several parallel SCC detection algorithms show poor parallel performance on modern multicore platforms and large-scale networks. This paper introduces the Multistep method, a new approach that avoids work inefficiencies seen in prior SCC approaches. It does not rely on DFS, but instead uses a combination of breadth-first search (BFS) and a parallel graph coloring routine. We show that the Multistep method scales well on several real-world graphs, with performance fairly independent of topological properties such as the size of the largest SCC and the total number of SCCs. On a 16-core Intel Xeon platform, our algorithm achieves a 20X speedup over the serial approach on a 2 billion edge graph, fully decomposing it in under two seconds. For our collection of test networks, we observe that the Multistep method is 1.92X faster (mean speedup) than the state-of-the-art Hong et al. SCC method. In addition, we modify the Multistep method to find connected and weakly connected components, as well as introduce a novel algorithm for determining articulation vertices of biconnected components. These approaches all utilize the same underlying BFS and coloring routines. © 2014 IEEE.

TYPE Conference YEAR 2014

Scopus OSTI
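
The abstract above says the Multistep method relies on BFS rather than DFS. One well-known way to use BFS for SCC detection is the forward-backward idea: the SCC containing a pivot vertex is the set of vertices reachable from the pivot intersected with the set of vertices that can reach it. The sketch below is a minimal serial illustration of that idea only. It is not the authors' implementation, it omits the coloring phase and all parallelism, and the function names `bfs_reach` and `scc_of_pivot` are hypothetical.

```python
from collections import deque


def bfs_reach(adj, start):
    """Return the set of vertices reachable from `start` via BFS over `adj`."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def scc_of_pivot(edges, pivot):
    """SCC containing `pivot`: forward-reachable set intersected with backward-reachable set."""
    fwd, bwd = {}, {}
    for u, v in edges:
        fwd.setdefault(u, []).append(v)  # original edge direction
        bwd.setdefault(v, []).append(u)  # reversed edge direction
    return bfs_reach(fwd, pivot) & bfs_reach(bwd, pivot)


# Small example: cycle 0 -> 1 -> 2 -> 0, plus a second cycle 3 <-> 4 reached from 2.
edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4), (4, 3)]
pivot_scc = scc_of_pivot(edges, 0)
```

Both searches are plain BFS traversals, which is why this style of step is easier to parallelize than a DFS-based algorithm such as Tarjan's.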

PuLP: Scalable multi-objective multi-constraint partitioning for small-world networks

Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014

Slota, George M.; Madduri, Kamesh; Rajamanickam, Sivasankaran

We present PuLP, a parallel and memory-efficient graph partitioning method specifically designed to partition low-diameter networks with skewed degree distributions. Graph partitioning is an important Big Data problem because it impacts the execution time and energy efficiency of graph analytics on distributed-memory platforms. Partitioning determines the in-memory layout of a graph, which affects locality, intertask load balance, communication time, and overall memory utilization of graph analytics. A novel feature of our method PuLP (Partitioning using Label Propagation) is that it optimizes for multiple objective metrics simultaneously, while satisfying multiple partitioning constraints. Using our method, we are able to partition a web crawl with billions of edges on a single compute server in under a minute. For a collection of test graphs, we show that PuLP uses 8-39× less memory than state-of-the-art partitioners and is up to 14.5× faster, on average, than alternate approaches (with 16-way parallelism). We also achieve better partitioning quality results for the multi-objective scenario.

TYPE Conference Poster YEAR 2014

DOI OSTI Scopus
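
PuLP, as the abstract above states, is built on label propagation. For readers unfamiliar with the basic technique, here is a minimal serial sketch of plain label propagation: every vertex starts with its own label and repeatedly adopts the most frequent label among its neighbors. This is a generic textbook version, not PuLP itself. It has no balance constraints or multiple objectives, and the function name `label_propagation`, the `max_iters` parameter, and the smallest-label tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter


def label_propagation(adj, max_iters=10):
    """Plain asynchronous label propagation on an undirected graph.

    `adj` maps each vertex to a list of its neighbors.
    Ties are broken toward the smallest label so the result is deterministic.
    """
    labels = {v: v for v in adj}  # every vertex starts in its own group
    for _ in range(max_iters):
        changed = False
        for v in sorted(adj):
            if not adj[v]:
                continue  # isolated vertex keeps its own label
            counts = Counter(labels[u] for u in adj[v])
            best = max(counts.values())
            new_label = min(l for l, c in counts.items() if c == best)
            if new_label != labels[v]:
                labels[v] = new_label
                changed = True
        if not changed:
            break  # converged: no vertex changed its label in this sweep
    return labels


# Two disconnected triangles should end up with two distinct labels.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
result = label_propagation(graph)
```

A partitioner built on this idea would additionally restrict label changes so that part sizes stay within the balance constraints.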

Enabling extreme-scale simulations with next-generation Trilinos for Sierra low Mach fluid application code

Lin, Paul T.; Siefert, Christopher; Bettencourt, Matthew T.; Domino, Stefan P.; Fisher, Travis C.; Hoemmen, Mark F.; Hu, Jonathan J.; Phipps, Eric T.; Prokopenko, Andrey V.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Presentation YEAR 2013

OSTI

Using 2D Matrix Distributions in Trilinos

Devine, Karen; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

Exploiting Geometric Partitioning in Task Mapping for Parallel Computers

Rajamanickam, Sivasankaran; Leung, Vitus J.; Pedretti, Kevin P.; Olivier, Stephen L.; Devine, Karen

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

The Zoltan Toolkits: Parallel Partitioning, Load Balancing, Coloring, and Ordering

Devine, Karen; Boman, Erik G.; Rajamanickam, Sivasankaran; Leung, Vitus J.

Abstract not provided.

TYPE Presentation YEAR 2013

OSTI

Computing Strongly Connected Components in Modern Architectures

Rajamanickam, Sivasankaran; Slota, George M.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

Multi-jagged: A Scalable Multi-section based Spatial Partitioning Algorithm

Rajamanickam, Sivasankaran; Devine, Karen

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

Scalable Matrix Computations on Large Scale-Free Graphs Using 2D Graph Partitioning

Boman, Erik G.; Devine, Karen; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference YEAR 2013

DOI OSTI

Combinatorial Scientific Computing for Exascale Systems and Applications

Devine, Karen; Rajamanickam, Sivasankaran; Boman, Erik G.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

Neuron Simulation and Analysis with Xyce

Schiek, Richard; Mei, Ting; Rajamanickam, Sivasankaran; Keiter, Eric R.; Warrender, Christina E.; Aimone, James B.; Thornquist, Heidi K.; Russo, Thomas V.; Verley, Jason C.; Crossno, Patricia J.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

Multithreaded Sparse Kernels for Solution of Sparse Linear Systems

Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

Experiences with Xeon Phi

Hammond, Simon; Rajamanickam, Sivasankaran; Ang, James A.; Barrett, Richard F.; Doerfler, Douglas W.; Heroux, Michael A.; Laros, James H.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

Scalable matrix computations on large scale-free graphs using 2D graph partitioning

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Boman, Erik G.; Devine, Karen; Rajamanickam, Sivasankaran

Scalable parallel computing is essential for processing large scale-free (power-law) graphs. The distribution of data across processes becomes important on distributed-memory computers with thousands of cores. It has been shown that two dimensional layouts (edge partitioning) can have significant advantages over traditional one-dimensional layouts. However, simple 2D block distribution does not use the structure of the graph, and more advanced 2D partitioning methods are too expensive for large graphs. We propose a new two-dimensional partitioning algorithm that combines graph partitioning with 2D block distribution. The computational cost of the algorithm is essentially the same as 1D graph partitioning. We study the performance of sparse matrix-vector multiplication (SpMV) for scale-free graphs from the web and social networks using several different partitioners and both 1D and 2D data layouts. We show that SpMV run time is reduced by exploiting the graph's structure. Contrary to popular belief, we observe that current graph and hypergraph partitioners often yield relatively good partitions on scale-free graphs. We demonstrate that our new 2D partitioning method consistently outperforms the other methods considered, for both SpMV and an eigensolver, on matrices with up to 1.6 billion nonzeros using up to 16,384 cores. Copyright 2013 ACM.

TYPE Conference YEAR 2013

DOI OSTI Scopus
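
To make the baseline in the abstract above concrete, the sketch below shows the simple 2D block distribution that the abstract contrasts with the proposed method. Matrix entry (i, j) goes to the process at grid position (row block of i, column block of j) in a `pr` by `pc` process grid. The names `block_of` and `owner_2d`, and the row-major numbering of processes, are assumptions for illustration. The paper's partition-aware algorithm is not reproduced here.

```python
def block_of(index, n, nblocks):
    """Contiguous block that holds `index` when n items are split into nblocks blocks."""
    size = -(-n // nblocks)  # ceiling division: items per block
    return index // size


def owner_2d(i, j, n, pr, pc):
    """Rank of the process owning matrix entry (i, j) on a pr x pc grid (row-major ranks)."""
    return block_of(i, n, pr) * pc + block_of(j, n, pc)


# An 8 x 8 matrix on a 2 x 2 process grid.
n, pr, pc = 8, 2, 2
owners_of_row_0 = {owner_2d(0, j, n, pr, pc) for j in range(n)}
```

With this layout the entries of any one row are spread over at most `pc` processes and the entries of any one column over at most `pr` processes, which bounds the number of communication partners in SpMV. Because `block_of` uses only the index, the layout ignores graph structure, which is the limitation the abstract points out.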

Multi-jagged: A Scalable Multi-section based Spatial Partitioning Algorithm

Rajamanickam, Sivasankaran; Devine, Karen

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

ShyLU: A hybrid-hybrid solver for multicore platforms

Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012

Rajamanickam, Sivasankaran; Boman, Erik G.; Heroux, Michael A.

With the ubiquity of multicore processors, it is crucial that solvers adapt to the hierarchical structure of modern architectures. We present ShyLU, a "hybrid-hybrid" solver for general sparse linear systems that is hybrid in two ways: First, it combines direct and iterative methods. The iterative part is based on approximate Schur complements, where we compute the approximate Schur complement using a value-based dropping strategy or a structure-based probing strategy. Second, the solver uses two levels of parallelism via hybrid programming (MPI+threads). ShyLU is useful both in shared-memory environments and on large parallel computers with distributed memory. In the latter case, it should be used as a subdomain solver. We argue that with the increasing complexity of compute nodes, it is important to exploit multiple levels of parallelism even within a single compute node. We show the robustness of ShyLU against other algebraic preconditioners. ShyLU scales well up to 384 cores for a given problem size. We also study the MPI-only performance of ShyLU against a hybrid implementation and conclude that on present multicore nodes an MPI-only implementation is better. However, for future multicore machines (96 or more cores) hybrid/hierarchical algorithms and implementations are important for sustained performance. © 2012 IEEE.

TYPE Conference YEAR 2012

OSTI Scopus
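
For reference, the Schur complement mentioned in the abstract above can be written out for a generic 2 x 2 block system. The block names below and the threshold \tau are notation assumed here for illustration, and the dropping rule shown is only one simple form of value-based dropping. It is not necessarily the exact criterion used in ShyLU.

```latex
% Block system after reordering (notation assumed for illustration)
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
=
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix}

% Eliminating x_1 gives the Schur complement system for x_2
S = A_{22} - A_{21} A_{11}^{-1} A_{12}, \qquad
S x_2 = b_2 - A_{21} A_{11}^{-1} b_1

% Back-substitution recovers x_1
x_1 = A_{11}^{-1} \left( b_1 - A_{12} x_2 \right)

% One simple value-based dropping rule with a hypothetical threshold \tau
\tilde{S}_{ij} =
\begin{cases}
S_{ij} & \text{if } |S_{ij}| > \tau, \\
0 & \text{otherwise.}
\end{cases}
```

In a hybrid scheme of this kind, the solves with A_{11} are handled by a direct method, while the system in S is solved iteratively using the sparser approximation \tilde{S} as a preconditioner.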

Multithreaded algorithms for maximum matching in bipartite graphs

Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012

Azad, Ariful; Halappanavar, Mahantesh; Rajamanickam, Sivasankaran; Boman, Erik G.; Khan, Arif; Pothen, Alex

We design, implement, and evaluate algorithms for computing a matching of maximum cardinality in a bipartite graph on multicore and massively multithreaded computers. As computers with larger numbers of slower cores dominate the commodity processor market, the design of multithreaded algorithms to solve large matching problems becomes a necessity. Recent work on serial algorithms for the matching problem has shown that their performance is sensitive to the order in which the vertices are processed for matching. In a multithreaded environment, imposing a serial order in which vertices are considered for matching would lead to loss of concurrency and performance. But this raises the question: Would parallel matching algorithms on multithreaded machines improve performance over a serial algorithm? We answer this question in the affirmative. We report efficient multithreaded implementations of three classes of algorithms based on their manner of searching for augmenting paths: breadth-first-search, depth-first-search, and a combination of both. The Karp-Sipser initialization algorithm is used to make the parallel algorithms practical. We report extensive results and insights using three shared-memory platforms (a 48-core AMD Opteron, a 32-core Intel Nehalem, and a 128-processor Cray XMT) on a representative set of real-world and synthetic graphs. To the best of our knowledge, this is the first study of augmentation-based parallel algorithms for bipartite cardinality matching that demonstrates good speedups on multithreaded shared memory multiprocessors. © 2012 IEEE.

TYPE Conference YEAR 2012

Scopus OSTI
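
The abstract above classifies matching algorithms by how they search for augmenting paths. As a point of reference, the sketch below is a minimal serial depth-first augmenting-path algorithm for maximum cardinality bipartite matching. It is a standard textbook version, not the authors' multithreaded code, and it omits the Karp-Sipser initialization. The function name `max_bipartite_matching` and the input layout are assumptions for illustration.

```python
def max_bipartite_matching(adj, n_right):
    """Maximum cardinality matching via DFS augmenting paths.

    `adj[u]` lists the right-side vertices adjacent to left vertex u.
    Returns (matching size, match_right), where match_right[v] is the left
    vertex matched to right vertex v, or -1 if v is unmatched.
    """
    match_right = [-1] * n_right

    def try_augment(u, visited):
        # Search for an augmenting path that starts at left vertex u.
        for v in adj[u]:
            if v in visited:
                continue
            visited.add(v)
            # v is free, or its current partner can be rematched elsewhere.
            if match_right[v] == -1 or try_augment(match_right[v], visited):
                match_right[v] = u
                return True
        return False

    size = 0
    for u in range(len(adj)):
        if try_augment(u, set()):
            size += 1
    return size, match_right


# Three left vertices, three right vertices; a perfect matching exists.
adj = [[0, 1], [0], [1, 2]]
size, match_right = max_bipartite_matching(adj, 3)
```

The outer loop processes left vertices in a fixed serial order. That ordering is exactly what a multithreaded variant has to relax, which is the tension the abstract describes.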

ShyLU: A Hybrid-Hybrid Solver

Rajamanickam, Sivasankaran; Boman, Erik G.; Heroux, Michael A.; Thornquist, Heidi K.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Zoltan2: Next-Generation Combinatorial Toolkit

Boman, Erik G.; Devine, Karen; Leung, Vitus J.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Data Partitioning for Scientific Applications and Emerging Architectures

Devine, Karen; Leung, Vitus J.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Parallel Partitioning with Zoltan: Is Hypergraph Partitioning Worth It?

Rajamanickam, Sivasankaran; Boman, Erik G.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Multithreaded Maximum Matching in Bipartite Graphs

Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

ShyLU: A Hybrid-hybrid Solver for Multicore Platforms

Rajamanickam, Sivasankaran; Boman, Erik G.; Heroux, Michael A.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Multithreaded Algorithms for Maximum Matching in Bipartite Graphs

Rajamanickam, Sivasankaran; Boman, Erik G.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Partitioning for Hybrid Solvers: ShyLU and HIPS

Boman, Erik G.; Rajamanickam, Sivasankaran; Gaidamour, Jeremie

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Towards Efficient Preconditioning in Manycore Architectures

Rajamanickam, Sivasankaran; Heroux, Michael A.; Boman, Erik G.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

An Evaluation of the Zoltan Parallel Graph and Hypergraph Partitioners

Rajamanickam, Sivasankaran; Boman, Erik G.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

A hybrid-hybrid solver for manycore platforms

SC'11 - Proceedings of the 2011 High Performance Computing Networking, Storage and Analysis Companion, Co-located with SC'11

Rajamanickam, Sivasankaran; Boman, Erik G.; Heroux, Michael A.

With the increasing levels of parallelism in a compute node, it is important to exploit multiple levels of parallelism even within a single compute node. We present ShyLU (pronounced "Shy-loo" for Scalable Hybrid LU), a "hybrid-hybrid" solver for general sparse linear systems that is hybrid in two ways: First, it combines direct and iterative methods. The iterative method is based on approximate Schur complements. Second, the solver uses two levels of parallelism via hybrid programming (MPI+threads). Our solver is useful both in shared-memory environments and on large parallel computers with distributed memory (as a subdomain solver). We compare the robustness of ShyLU against other algebraic preconditioners. ShyLU scales well up to 192 cores for a given problem size. We compare flat MPI performance of ShyLU against a hybrid implementation. We conclude that on present multicore nodes flat MPI is better. However, for future manycore machines (48 or more cores) hybrid/hierarchical algorithms and implementations are important for sustained performance. Copyright is held by the author/owner(s).

TYPE Conference YEAR 2011

OSTI Scopus