Scalable triangle counting on distributed-memory systems

2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Acer, Seher; Yasar, Abdurrahman; Rajamanickam, Sivasankaran; Wolf, Michael; Catalyurek, Umit V.

Triangle counting is a foundational graph-analysis kernel in network science. It has also been one of the challenge problems for the 'Static Graph Challenge'. In this work, we propose a novel, hybrid, parallel triangle counting algorithm based on its linear algebra formulation. Our framework uses MPI and Cilk to exploit the benefits of distributed-memory and shared-memory parallelism, respectively. The problem is partitioned among MPI processes using a two-dimensional (2D) Cartesian block partitioning. One-dimensional (1D) rowwise partitioning is used within the Cartesian blocks for shared-memory parallelism using the Cilk programming model. Besides exhibiting very good strong scaling behavior on almost all tested graphs, our algorithm achieves the fastest time on the 1.4B-edge real-world Twitter graph, which is 3.217 seconds, on 1,092 cores. In comparison to past distributed-memory parallel winners of the Graph Challenge, we demonstrate a speedup of 2.7× on this Twitter graph. This is also the fastest time reported for parallel triangle counting on the Twitter graph when the graph is not replicated.
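
As a rough illustration of what a linear algebra formulation of triangle counting can look like, the serial sketch below evaluates sum((L · Lᵀ) ∘ L), where L is the strictly lower-triangular part of the adjacency matrix and ∘ is the element-wise product. This is one common formulation and is an assumption here; the abstract does not say which variant the paper uses. The function name `count_triangles` and the dict-of-sets storage are hypothetical, and the MPI/Cilk partitioning is not modeled.

```python
def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs."""
    # Build L: row i holds the set of neighbors j with j < i
    # (the strictly lower-triangular part of the adjacency matrix).
    lower = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        i, j = (u, v) if u > v else (v, u)
        lower.setdefault(i, set()).add(j)

    total = 0
    # Row-wise evaluation of sum((L @ L^T) * L): for every nonzero L[i][j],
    # count the columns k where both L[i][k] and L[j][k] are nonzero.
    # Each triangle a > b > c is counted exactly once, at (i, j) = (a, b), k = c.
    for i, row in lower.items():
        for j in row:
            total += len(row & lower.get(j, set()))
    return total


# The complete graph on 4 vertices contains 4 triangles.
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(k4_edges))
```

The outer loop over rows is independent per row, which is the kind of work a 1D rowwise partitioning could hand to separate threads.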

TYPE Conference Poster YEAR 2019

Scalable inference for sparse deep neural networks using Kokkos Kernels

2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Ellis, John A.; Rajamanickam, Sivasankaran

Over the last decade, hardware advances have led to the feasibility of training and inference for very large deep neural networks. Sparsified deep neural networks (DNNs) can greatly reduce memory costs and increase throughput of standard DNNs, if loss of accuracy can be controlled. The IEEE HPEC Sparse Deep Neural Network Graph Challenge serves as a testbed for algorithmic and implementation advances to maximize computational performance of sparse deep neural networks. We base our sparse network for DNNs, KK-SpDNN, on the sparse linear algebra kernels within the Kokkos Kernels library. Using the sparse matrix-matrix multiplication in Kokkos Kernels allows us to reuse a highly optimized kernel. We focus on reducing the single-node and multi-node runtimes for 12 sparse networks. We test KK-SpDNN on Intel Skylake and Knights Landing architectures and see a 120-500x improvement in single-node performance over the serial reference implementation. We run in data-parallel mode with MPI to further speed up network inference, ultimately obtaining an edge processing rate of 1.16e+12 on 20 Skylake nodes. This translates to a 13x speedup on 20 nodes compared to our highly optimized multithreaded implementation on a single Skylake node.
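
To make the role of sparse matrix-matrix multiplication in inference concrete, the sketch below applies a stack of sparse layers to a sparse input, computing ReLU(Y · W + bias) per layer. It is a plain-Python illustration under stated assumptions: the `{row: {col: value}}` storage, the helper names `spgemm`, `sparse_layer`, and `infer`, and the scalar bias applied only to stored entries are all hypothetical and are not taken from KK-SpDNN or the Kokkos Kernels API.

```python
def spgemm(a, b):
    """Multiply two sparse matrices stored as {row: {col: value}}."""
    out = {}
    for i, a_row in a.items():
        acc = {}
        for k, a_ik in a_row.items():
            for j, b_kj in b.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            out[i] = acc
    return out


def sparse_layer(y, w, bias):
    """One layer: ReLU(y @ w + bias), keeping only positive entries."""
    z = spgemm(y, w)
    out = {}
    for i, row in z.items():
        # Assumption: the bias is added only to entries produced by the product.
        kept = {j: v + bias for j, v in row.items() if v + bias > 0.0}
        if kept:
            out[i] = kept
    return out


def infer(y, weights, bias):
    """Run the input through every layer in order."""
    for w in weights:
        y = sparse_layer(y, w, bias)
    return y


y0 = {0: {0: 1.0, 1: 2.0}}                 # one input sample, two features
w0 = {0: {0: 1.0}, 1: {0: -1.0, 1: 0.5}}   # a single 2x2 sparse layer
print(infer(y0, [w0], bias=-0.25))
```

Because each input row is processed independently, splitting the rows of `y` across processes is one way a data-parallel mode can be organized.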

TYPE Conference Poster YEAR 2019

A parallel graph algorithm for detecting mesh singularities in distributed memory ice sheet simulations

ACM International Conference Proceeding Series

Bogle, Ian; Devine, Karen; Perego, Mauro; Rajamanickam, Sivasankaran; Slota, George M.

We present a new, distributed-memory parallel algorithm for detection of degenerate mesh features that can cause singularities in ice sheet mesh simulations. Identifying and removing mesh features such as disconnected components (icebergs) or hinge vertices (peninsulas of ice detached from the land) can significantly improve the convergence of iterative solvers. Because the ice sheet evolves during the course of a simulation, it is important that the detection algorithm can run in situ with the simulation (running in parallel and taking a negligible amount of computation time) so that degenerate features (e.g., calving icebergs) can be detected as they develop. We present a distributed memory, BFS-based label-propagation approach to degenerate feature detection that is efficient enough to be called at each step of an ice sheet simulation, while correctly identifying all degenerate features of an ice sheet mesh. Our method finds all degenerate features in a mesh with 13 million vertices in 0.0561 seconds on 1536 cores in the MPAS Albany Land Ice (MALI) model. Compared to the previously used serial pre-processing approach, we observe a 46,000x speedup for our algorithm, and provide additional capability to do dynamic detection of degenerate features in the simulation.
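
The idea of BFS-based label propagation can be sketched serially for the simplest case, disconnected components. The example below assumes a set of "grounded" seed vertices, propagates a reached label outward from them by breadth-first search, and reports every vertex the label never reaches as detached. The function name `find_detached`, the adjacency-dict input, and the notion of grounded seeds are illustrative assumptions; hinge-vertex detection and the distributed-memory communication are not modeled.

```python
from collections import deque


def find_detached(adjacency, grounded):
    """Return the vertices that no BFS from a grounded vertex can reach.

    adjacency: {vertex: iterable of neighboring vertices} (hypothetical format)
    grounded:  iterable of seed vertices assumed to be attached to land
    """
    reached = set(grounded)
    queue = deque(reached)
    while queue:
        v = queue.popleft()
        for n in adjacency.get(v, ()):
            if n not in reached:
                reached.add(n)   # propagate the label to the neighbor
                queue.append(n)
    return set(adjacency) - reached


# Vertices 0-1-2 form a chain anchored at grounded vertex 0;
# vertices 3-4 form a separate component with no grounded vertex.
mesh = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
print(sorted(find_detached(mesh, grounded=[0])))
```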

TYPE Conference Poster YEAR 2019

Math Libraries Workstream - Overview

Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Presentation YEAR 2019

Scalable Triangle Counting on Distributed-Memory Systems

Acer, Seher; Yasar, Abdurrahman; Rajamanickam, Sivasankaran; Wolf, Michael; Catalyurek, Umit

Abstract not provided.

TYPE Conference Poster YEAR 2019

Scalable inference for sparse neural networks using Kokkos Kernels

Ellis, John A.; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2019

FASTMath: Kokkos Kernels and Linear Solvers

Rajamanickam, Sivasankaran; Bogle, Ian; Hu, Jonathan J.; Devine, Karen; Slota, George M.; Perego, Mauro; Kim, Kyungjoo

Abstract not provided.

TYPE Conference Poster YEAR 2019

Batched Linear Algebra in Kokkos Kernels

Rajamanickam, Sivasankaran; Berger-Vergiat, Luc; Dang, Vinh Q.; Ellingwood, Nathan D.; Kim, Kyungjoo; Mclendon, William; Trott, Christian R.; Wilke, Jeremiah

Abstract not provided.

TYPE Conference Poster YEAR 2019

SNL+DOE need/ask for NVIDIA Math Libraries

Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Presentation YEAR 2019

Kokkos Kernels and Trilinos Solvers in FASTMath

Rajamanickam, Sivasankaran; Hu, Jonathan J.; Yang, Ulrike

Abstract not provided.

TYPE Conference Poster YEAR 2019

SpaND: An Algebraic Sparsified Nested Dissection Algorithm Using Low-Rank Approximations

Cambier, Leopold; Chen, Chao; Boman, Erik G.; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.; Darve, Eric

Abstract not provided.

TYPE Conference Poster YEAR 2019

A Parallel Graph Algorithm for Detecting Mesh Singularities in Distributed Memory Ice Sheet Simulations

Bogle, Ian; Devine, Karen; Perego, Mauro; Rajamanickam, Sivasankaran; Slota, George M.

Abstract not provided.

TYPE Conference Poster YEAR 2019

SpaND: An Algebraic Sparsified Nested Dissection Algorithm Using Low-Rank Approximations

Boman, Erik G.; Cambier, Leopold; Chen, Chao; Darve, Eric; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.

Abstract not provided.

TYPE Conference Poster YEAR 2019

ExaWind

Berger-Vergiat, Luc; Rajamanickam, Sivasankaran; Hu, Jonathan J.; Luchini, Christopher B.

Abstract not provided.

TYPE Presentation YEAR 2019

Kokkos Kernels

Rajamanickam, Sivasankaran; Berger-Vergiat, Luc; Dang, Vinh Q.; Ellingwood, Nathan D.; Kim, Kyungjoo; Trott, Christian R.; Wilke, Jason; Mclendon, William

Abstract not provided.

TYPE Conference Poster YEAR 2019

Large Scale Parallel Solution Methods for Electromagnetic Simulations

Hu, Jonathan J.; Glusa, Christian; Lin, Paul T.; Phillips, Edward; Bays, Nathan R.; Siefert, Christopher; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2019

Understanding the Machine Learning Needs of ECP Applications

Ellis, John A.; Rajamanickam, Sivasankaran

To support the codesign needs of ECP applications on current and future hardware in the area of machine learning, the ExaLearn team at Sandia studied the different machine learning use cases in three different ECP applications. This report is a summary of the needs of the three applications. The Sandia ExaLearn team will develop a proxy application representative of ECP application needs, specifically the ExaSky and EXAALT ECP projects. The proxy application will allow us to demonstrate performance portable kernels within machine learning codes. Furthermore, current training scalability of machine learning networks in these applications is negatively affected by large batch sizes. Training throughput of the network will increase as batch size increases, but network accuracy and generalization worsen. The proxy application will contain hybrid model- and data-parallelism to improve training efficiency while maintaining network accuracy. The proxy application will also target optimizing 3D convolutional layers, specific to scientific machine learning, which have not been as thoroughly explored by industry.

TYPE Other Report YEAR 2019
