Publications

Abstract not provided.

More Details

TYPE Conference Presentation YEAR 2020

DOI OSTI

Panel:. How to broaden compiler/architecture research participants by utilizing modern infrastructures

Dang, Vinh Q.; Kotulski, Joseph D.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Presentation YEAR 2020

DOI OSTI

ADELUS: A Performance-Portable Dense LU Solver for Distributed-Memory Hardware-Accelerated Systems

Abstract not provided.

More Details

TYPE Conference Paper YEAR 2020

DOI OSTI

A Block-Based Triangle Counting Algorithm on Heterogeneous Environments

Yasar, Abdurrahman; Rajamanickam, Sivasankaran; Berry, Jonathan; Catalyurek, Umit V.

Triangle counting is a fundamental building block in graph algorithms. In this paper, we propose a block-based triangle counting algorithm to reduce data movement during both sequential and parallel execution. Our block-based formulation makes the algorithm naturally suitable for heterogeneous architectures. The problem of partitioning the adjacency matrix of a graph is well-studied. Our task decomposition goes one step further: it partitions the set of triangles in the graph. By streaming these small tasks to compute resources, we can solve problems that do not fit on a device. We demonstrate the effectiveness of our approach by providing an implementation on a compute node with multiple sockets, cores and GPUs. The current state-of-the-art in triangle enumeration processes the Friendster graph in 2.1 seconds, not including data copy time between CPU and GPU. Using that metric, our approach is 20 percent faster. When copy times are included, our algorithm takes 3.2 seconds. This is 5.6 times faster than the fastest published CPU-only time.

More Details

TYPE Other Report YEAR 2020

DOI OSTI

Distributed Memory Graph Coloring Algorithms for Multiple GPUs

Bogle, Ian; Boman, Erik G.; Devine, Karen; Rajamanickam, Sivasankaran; Slota, George M.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Performance Portable Supernode-based Sparse Triangular Solver for Manycore Architectures

ACM International Conference Proceeding Series

Yamazaki, Ichitaro; Rajamanickam, Sivasankaran; Ellingwood, Nathan D.

Sparse triangular solver is an important kernel in many computational applications. However, a fast, parallel, sparse triangular solver on a manycore architecture such as GPU has been an open issue in the field for several years. In this paper, we develop a sparse triangular solver that takes advantage of the supernodal structures of the triangular matrices that come from the direct factorization of a sparse matrix. We implemented our solver using Kokkos and Kokkos Kernels such that our solver is portable to different manycore architectures. This has the additional benefit of allowing our triangular solver to use the team-level kernels and take advantage of the hierarchical parallelism available on the GPU. We compare the effects of different scheduling schemes on the performance and also investigate an algorithmic variant called the partitioned inverse. Our performance results on an NVIDIA V100 or P100 GPU demonstrate that our implementation can be 12.4 × or 19.5 × faster than the vendor optimized implementation in NVIDIA's CuSPARSE library.

More Details

TYPE Conference Poster YEAR 2020

OSTI Scopus

Performance Portable Supernode-based Sparse Triangular Solver for Manycore Architecture

Yamazaki, Ichitaro; Rajamanickam, Sivasankaran; Ellingwood, Nathan D.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Recent experiences withMachine Learning Perspectives fromAlgorithms Architectures and Applications

Abstract not provided.

More Details

TYPE Presentation YEAR 2020

OSTI

Accelerating Multiscale Materials Modeling with Machine Learning

Abstract not provided.

More Details

TYPE Presentation YEAR 2020

OSTI

Scalable Inference for Sparse Deep Neural Networks using Kokkos Kernels

Yamazaki, Ichitaro; Rajamanickam, Sivasankaran; Ellingwood, Nathan D.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

DOI OSTI

Supernode-based Sparse Triangular Solver using Kokkos

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Practices and Challenges of Software Development for a Performance Portable Ecosystem

Ellingwood, Nathan D.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Compare linear-system solver and preconditioner stacks with emphasis on GPU performance and propose phase-2 NGP solver development pathway

Hu, Jonathan J.; Berger-Vergiat, Luc; Thomas, Stephen; Swirydowicz, Kasia; Yamazaki, Ichitaro; Mullowney, Paul; Rajamanickam, Sivasankaran; Sitaraman, Jay; Sprague, Michael

The goal of the ExaWind project is to enable predictive simulations of wind farms comprised of many megawatt-scale turbines situated in complex terrain. Predictive simulations will require computational fluid dynamics (CFD) simulations for which the mesh resolves the geometry of the turbines and captures the rotation and large deflections of blades. Whereas such simulations for a single turbine are arguably petascale class, multi-turbine wind farm simulations will require exascale-class resources. The primary physics codes in the ExaWind project are Nalu-Wind, which is an unstructured-grid solver for the acoustically incompressible Navier-Stokes equations, and OpenFAST, which is a whole-turbine simulation code. The Nalu-Wind model consists of the mass-continuity Poisson-type equation for pressure and a momentum equation for the velocity. For such modeling approaches, simulation times are dominated by linear-system setup and solution for the continuity and momentum systems. For the ExaWind challenge problem, the moving meshes greatly affect overall solver costs as reinitialization of matrices and recomputation of preconditioners is required at every time step. In this report we evaluated GPU-performance baselines for the linear solvers in the Trilinos and hypre solver stacks using two representative Nalu-Wind simulations: an atmospheric boundary layer precursor simulation on a structured mesh, and a fixed-wing simulation using unstructured overset meshes. Both strong-scaling and weak-scaling experiments were conducted on the OLCF supercomputer Summit and similar proxy clusters. We focused on the performance of multi-threaded Gauss-Seidel and two-stage Gauss-Seidel that are extensions of classical Gauss-Seidel; of one-reduce GMRES, a communication-reducing variant of the Krylov GMRES; and algebraic multigrid methods that incorporate the afore-mentioned methods. The team has established that AMG methods are capable of solving linear systems arising from the fixed-wing overset meshes on CPU, a critical intermediate result for ExaWind FY20 Q3 and Q4 milestones. For the fixed-wing strong-scaling study (model with 3M grid-points), the team identified that Nalu-Wind simulations with the new Trilinos and hypre solvers scale to modest GPU counts, maintaining above 70% efficiency up to 6 GPUs. However, there still remain significant bottlenecks to performance: matrix assembly (hypre), AMG setup (hypre and Trilinos) In the weak-scaling experiments (going from 0.4M to 211M gridpoints), it's shown that the solver apply phases are faster on GPUs, but that Nalu-Wind simulation times grow, primarily due to the multigrid-setup process. Finally, based on the report outcomes, we propose a linear solver path-forward for the remainder of the ExaWind project. Near term, the NREL team will continue their work on GPU-based linear-system assembly. They will also investigate how the use of alternatives to the NVIDIA UVM (unified virtual memory) paradigm affects performance. Longer term, the NREL team will evaluate algorithmic performance on other types of accelerators and merge their improvements back to the main hypre repository branch. Near term, the Trilinos team will address performance bottlenecks identified in this milestone, such as implementing a GPU-based segregated momentum solve and reusing matrix graphs across linear-system assembly phases. Longer term, the Trilinos team will do detailed analysis and optimization of multigrid setup.

More Details

TYPE Other Report YEAR 2020

DOI OSTI

SPHYNX: Spectral partitioning for HYbrid and aXelerator-enabled systems

Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020

Acer, Seher; Boman, Erik G.; Rajamanickam, Sivasankaran

Graph partitioning has been an important tool to partition the work among several processors to minimize the communication cost and balance the workload. While accelerator-based supercomputers are emerging to be the standard, the use of graph partitioning becomes even more important as applications are rapidly moving to these architectures. However, there is no scalable, distributed-memory, multi-GPU graph partitioner available for applications. We developed a spectral graph partitioner, Sphynx, using the portable, accelerator-friendly stack of the Trilinos framework. We use Sphnyx to systematically evaluate the various algorithmic choices in spectral partitioning with a focus on GPU performance. We perform those evaluations on irregular graphs, because state-of-the-art partitioners have the most difficulty on them. We demonstrate that Sphynx is up to 17x faster on GPUs compared to the case on CPUs, and up to 580x faster compared to a state-of-the-art multilevel partitioner. Sphynx provides a robust alternative for applications looking for a GPU-based partitioner.

More Details

TYPE Conference Poster YEAR 2020

OSTI Scopus

Utilizing Spatial Accelerators for Machine Learning and Linear Algebra Kernels

Moon, Gordon E.; Rajamanickam, Sivasankaran; Krishna, Tushar; Kwon, Hyoukjun; Chatarasi, Prasanth; Qin, Eric

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

MINT: Microarchitecture for Efficient and Interchangeable CompressioN Formats on Tensor Algebra

Qin, Eric; Jeong, Geonhwa; Won, William; Kao, Sheng-Chun; Kwon, Hyoukjun; Srinivasan, Sudarshan; Das, Dipankar; Moon, Gordon E.; Rajamanickam, Sivasankaran; Krishna, Tushar

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

SPHYNX: Spectral Partitioning for HYbrid aNd aXelerator-based systems

Acer, Seher; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

ADELUS: A Performance-Portable Dense LU Solver for Distributed-Memory Hardware-Accelerated Systems

Kotulski, Joseph D.; Dang, Vinh Q.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

ECP Report: Update on Proxy Applications and Vendor Interactions

Ang, Jim; Sweeney, Christine; Wolf, Michael; Ellis, John A.; Ghosh, Sayan; Kagawa, Ai; Huang, Yunzhi; Rajamanickam, Sivasankaran; Ramakrishnaiah, Vinay; Schram, Malachi; Yoo, Shinjae

The ExaLearn miniGAN team (Ellis and Rajamanickam) have released miniGAN, a generative adversarial network(GAN) proxy application, through the ECP proxy application suite. miniGAN is the first machine learning proxy application in the suite (note: the ECP CANDLE project did previously release some benchmarks) and models the performance for training generator and discriminator networks. The GAN's generator and discriminator generate plausible 2D/3D maps and identify fake maps, respectively. miniGAN aims to be a proxy application for related applications in cosmology (CosmoFlow, ExaGAN) and wind energy (ExaWind). miniGAN has been developed so that optimized mathematical kernels (e.g., kernels provided by Kokkos Kernels) can be plugged into to the proxy application to explore potential performance improvements. miniGAN has been released as open source software and is available through the ECP proxy application website (https://proxyapps.exascaleproject.ordecp-proxy-appssuite/) and on GitHub (https://github.com/SandiaMLMiniApps/miniGAN). As part of this release, a generator is provided to generate a data set (series of images) that are inputs to the proxy application.

More Details

TYPE Other Report YEAR 2020

DOI OSTI

Preparing sparse solvers for exascale computing

Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences

Heroux, Michael A.; Anzt, Hartwig; Boman, Erik G.; Falgout, Rob; Ghysels, Pieter; Li, Xiaoye; Mcinnes, Lois C.; Mills, Richard T.; Rajamanickam, Sivasankaran; Rupp, Karl; Smith, Bryce; Yamazaki, Ichitaro; Yang, Ulrike M.

Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.

More Details

TYPE Journal Article YEAR 2020

DOI OSTI Scopus

How Robust are Graph Neural Networks to Structural Noise?

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Kokkos Kernels: Performance Portable Kernels for Sparse/Dense Linear Algebra Graph and Machine Learning Kernels

Zinser, Brian F.; Blake, Samuel A.; Pfeiffer, Robert A.; Huang, Andy; Himbele, John J.; Freno, Brian A.; Dang, Vinh Q.; Kotulski, Joseph D.; Rajamanickam, Sivasankaran; Johnson, William A.; Campione, Salvatore; Langston, William L.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Gemma: A Sandia National Laboratories Electromagnetic Code for Heterogeneous Computer Architectures

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Kokkos Kernels

Rajamanickam, Sivasankaran; Acer, Seher; Berger-Vergiat, Luc; Dang, Vinh Q.; Ellingwood, Nathan D.; Kelley, Brian M.; Kim, Kyungjoo; Trott, Christian R.; Wilke, Jeremiah

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Machine Learning in the Presence of Noise : Early Experiments

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

How Robust Are Graph Neural Networks to Structural Noise?

Acer, Seher; Boman, Erik G.; Rajamanickam, Sivasankaran; Bogle, Ian; Slota, George M.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

CLOVER

Boman, Erik G.; Anzt, Hartwig; Dongarra, Jack; Gates, Mark; Rajamanickam, Sivasankaran; Tomov, Stan

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

ExaGraph: Parallel Partitioning and Coloring for Exascale Applications

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Scalable Triangle Counting on Distributed-Memory Systems

Acer, Seher; Yasar, Abdurrahman; Rajamanickam, Sivasankaran; Berry, Jonathan; Wolf, Michael; Catalyurek, Umit

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

DOI OSTI

Mixed-precision preconditioned Krylov solvers with Trilinos

Yamazaki, Ichitaro; Boman, Erik G.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Distributed Biconnectivity

Bogle, Ian; Slota, George M.; Rajamanickam, Sivasankaran; Devine, Karen

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Linear Solvers for Plasma Simulations on Advanced Architectures

Hu, Jonathan J.; Glusa, Christian; Moore, Stan G.; Lin, Paul; Phillips, Edward; Bettencourt, Matthew T.; Foulk, James W.; Siefert, Christopher; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Evolution of the Kokkos Ecosystem

Miles, Jeffery S.; Trott, Christian R.; Brunini, Victor; Bettencourt, Matthew T.; Poliakoff, David; Rajamanickam, Sivasankaran; Hollman, David S.; Wolf, Michael; Glass, Micheal W.

Abstract not provided.

More Details

TYPE Presentation YEAR 2020

OSTI

Exascale Node-Level Parallel Programming Environments: Overview and Deciding What's Right for You

Rajamanickam, Sivasankaran; Heroux, Michael A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

How Robust Are Graph Neural Networks to Structural Noise?

Graph neural networks (GNNs) are an emerging model for learning graph embeddings and making predictions on graph structured data. However, robustness of graph neural networks is not yet well-understood. In this work, we focus on node structural identity predictions, where a representative GNN model is able to achieve near-perfect accuracy. We also show that the same GNN model is not robust to addition of structural noise, through a controlled dataset and set of experiments. Finally, we show that under the right conditions, graph-augmented training is capable of significantly improving robustness to structural noise.

More Details

TYPE Other Report YEAR 2020

DOI OSTI

FROSch: A Fast And Robust Overlapping Schwarz Domain Decomposition Preconditioner Based on Xpetra in Trilinos

Lecture Notes in Computational Science and Engineering

Heinlein, Alexander; Klawonn, Axel; Rajamanickam, Sivasankaran; Rheinbach, Oliver

This article describes a parallel implementation of a two-level overlapping Schwarz preconditioner with the GDSW (Generalized Dryja–Smith–Widlund) coarse space described in previous work [12, 10, 15] into the Trilinos framework; cf. [16]. The software is a significant improvement of a previous implementation [12]; see Sec. 4 for results on the improved performance.

More Details

TYPE Other Report YEAR 2020

DOI OSTI Scopus

An algebraic sparsified nested dissection algorithm using low-rank approximations

SIAM Journal on Matrix Analysis and Applications

Cambier, Leopold; Boman, Erik G.; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.; Darve, Eric

We propose a new algorithm for the fast solution of large, sparse, symmetric positive-definite linear systems, spaND (sparsified Nested Dissection). It is based on nested dissection, sparsification, and low-rank compression. After eliminating all interiors at a given level of the elimination tree, the algorithm sparsifies all separators corresponding to the interiors. This operation reduces the size of the separators by eliminating some degrees of freedom but without introducing any fill-in. This is done at the expense of a small and controllable approximation error. The result is an approximate factorization that can be used as an efficient preconditioner. We then perform several numerical experiments to evaluate this algorithm. We demonstrate that a version using orthogonal factorization and block-diagonal scaling takes fewer CG iterations to converge than previous similar algorithms on various kinds of problems. Furthermore, this algorithm is provably guaranteed to never break down and the matrix stays symmetric positive-definite throughout the process. We evaluate the algorithm on some large problems show it exhibits near-linear scaling. The factorization time is roughly \scrO (N), and the number of iterations grows slowly with N.

More Details

TYPE Journal Article YEAR 2020

DOI OSTI Scopus

FROSch: A Fast And Robust Overlapping Schwarz Domain Decomposition Preconditioner Based on Xpetra in Trilinos

Lecture Notes in Computational Science and Engineering

Heinlein, Alexander; Klawonn, Axel; Rajamanickam, Sivasankaran; Rheinbach, Oliver

This article describes a parallel implementation of a two-level overlapping Schwarz preconditioner with the GDSW (Generalized Dryja–Smith–Widlund) coarse space described in previous work [12, 10, 15] into the Trilinos framework; cf. [16]. The software is a significant improvement of a previous implementation [12]; see Sec. 4 for results on the improved performance.

More Details

TYPE Conference Poster YEAR 2020

OSTI Scopus

A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Sahasrabudhe, Damodar; Phipps, Eric T.; Rajamanickam, Sivasankaran; Berzins, Martin

As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a gpu back-end for these primitives makes such code non-portable. A unified, portable, Single Instruction Multiple Data (simd) primitive proposed in this work, allows intrinsics-based vectorization on cpus and many-core architectures such as Intel Knights Landing (knl), and also facilitates Single Instruction Multiple Threads (simt) based execution on gpus. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. The new simd primitive is used on different architectures to test the performance boost against hard-to-auto-vectorize baseline, to measure the overhead against efficiently vectroized baseline, and to evaluate the new feature called the “logical vector length” (lvl). The simd primitive provides portability across cpus and gpus without any performance degradation being observed experimentally.

More Details

TYPE Conference Poster YEAR 2020

OSTI Scopus

P1673R2: A free function linear algebra interface based on the BLAS

Hoemmen, Mark F.; Hollman, David S.; Trott, Christian R.; Liber, Nevin; Rajamanickam, Sivasankaran; Lo, Li-Ta; Lopez, Graham; Caday, Peter; Knepper, Sarah; Costa, Timothy

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

ExaGraph: Combinatorial Methods for Enabling Exascale Applications

Halappanavar, Mahantesh; Buluc, Aydin; Boman, Erik G.; Pothen, Alex; Tumeo, Antonino; Khan, Arif; Minutoli, Marco; Tallent, Nathan; Gawande, Nitin; Ekanayake, Saliya; Ghosh, Sayan; Acer, Seher; Rajamanickam, Sivasankaran; Ferdous, S.M.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Machine Learning in the Presence of Noise : Early Experiments

Berger-Vergiat, Luc; Hu, Jonathan J.; Rajamanickam, Sivasankaran; Yamazaki, Ichitaro

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

ExaWind: Exascale Predictive Wind Plant Flow Physics Modeling

Sprague, M.; Ananthan, S.; Brazell, M.; Glaws, A.; De Frahan, M.; King, R.; Natarajan, M.; Rood, J.; Sharma, A.; Sirydowicz, K.; Thomas, S.; Vijaykumar, G.; Yellapantula, S.; Crozier, Paul; Berger-Vergiat, Luc; Cheung, Lawrence; Glaze, David J.; Hu, Jonathan J.; Knaus, Robert C.; Lee, Dong H.; Okusanya, Tolulope O.; Overfelt, James R.; Rajamanickam, Sivasankaran; Sakievich, Philip; Smith, Timothy A.; Vo, Johnathan; Williams, Alan B.; Yamazaki, Ichitaro; Turner, J.; Prokopenko, A.; Wilson, R.; Moser, R.; Melvin, J.; Sitaraman, J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Trilinos linear solvers group update

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

How Robust Are Graph Neural Networks to Structural Noise?

Slota, George M.; Berry, Jonathan; Hammond, Simon; Olivier, Stephen L.; Phillips, Cynthia A.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Scalable generation of graphs for benchmarking HPC community-detection algorithms

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Community detection in graphs is a canonical social network analysis method. We consider the problem of generating suites of teras-cale synthetic social networks to compare the solution quality of parallel community-detection methods. The standard method, based on the graph generator of Lancichinetti, Fortunato, and Radicchi (LFR), has been used extensively for modest-scale graphs, but has inherent scalability limitations. We provide an alternative, based on the scalable Block Two-Level Erdos-Renyi (BTER) graph generator, that enables HPC-scale evaluation of solution quality in the style of LFR. Our approach varies community coherence, and retains other important properties. Our methods can scale real-world networks, e.g., to create a version of the Friendster network that is 512 times larger. With BTER's inherent scalability, we can generate a 15-terabyte graph (4.6B vertices, 925B edges) in just over one minute. We demonstrate our capability by showing that label-propagation community-detection algorithm can be strong-scaled with negligible solution-quality loss.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

Scalable Generation of Graphs for Benchmarking HPC Community-Detection Algorithms

Slota, George M.; Berry, Jonathan; Hammond, Simon; Olivier, Stephen L.; Phillips, Cynthia A.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI

Multi-Level Memory Algorithmics for Large, Sparse Problems

Berry, Jonathan; Butcher, Neil; Catalyurek, Umit; Kogge, Peter; Lin, Paul; Olivier, Stephen L.; Phillips, Cynthia A.; Rajamanickam, Sivasankaran; Slota, George M.; Voskuilen, Gwendolyn R.; Yasar, Abdurrahman; Young, Jeffrey S.

In this report, we abstract eleven papers published during the project and describe preliminary unpublished results that warrant follow-up work. The topic is multi-level memory algorithmics, or how to effectively use multiple layers of main memory. Modern compute nodes all have this feature in some form.

More Details

TYPE SAND Report YEAR 2019

DOI OSTI

Fast and Robust Linear Solvers based on Hierarchical Matrices (LDRD Final Report)

Boman, Erik G.; Darve, Eric; Lehoucq, Rich; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.; Yamazaki, Ichitaro

This report is the final report for the LDRD project "Fast and Robust Linear Solvers using Hierarchical Matrices". The project was a success. We developed two novel algorithms for solving sparse linear systems. We demonstrated their effectiveness on ill-conditioned linear systems from ice sheet simulations. We showed that in many cases, we can obtain near-linear scaling. We believe this approach has strong potential for difficult linear systems and should be considered for other Sandia and DOE applications. We also report on some related research activities in dense solvers and randomized linear algebra.

More Details

TYPE Other Report YEAR 2019

DOI OSTI

How Robust Are Graph Neural Networks to Structural Noise?

Wright, Catherine; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Predicting Molecular Toxicity from Structural Identity

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Kokkos Kernels: Library Based Approach for Performance Portable Sparse/Dense linear algebra and Graph Kernels

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

A Portable SIMD Primitive in Kokkos for Heterogeneous Architectures

Sahasrabudhe, Damodar; Phipps, Eric T.; Rajamanickam, Sivasankaran; Berzins, Martin

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

A robust hierarchical solver for ill-conditioned systems with applications to ice sheet modeling

Journal of Computational Physics

Chen, Chao; Cambier, Leopold; Boman, Erik G.; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.; Darve, Eric

A hierarchical solver is proposed for solving sparse ill-conditioned linear systems in parallel. The solver is based on a modification of the LoRaSp method, but employs a deferred-compression technique, which provably reduces the approximation error and significantly improves efficiency. Moreover, the deferred-compression technique introduces minimal overhead and does not affect parallelism. As a result, the new solver achieves linear computational complexity under mild assumptions and excellent parallel scalability. To demonstrate the performance of the new solver, we focus on applying it to solve sparse linear systems arising from ice sheet modeling. The strong anisotropic phenomena associated with the thin structure of ice sheets creates serious challenges for existing solvers. To address the anisotropy, we additionally developed a customized partitioning scheme for the solver, which captures the strong-coupling direction accurately. In general, the partitioning can be computed algebraically with existing software packages, and thus the new solver is generalizable for solving other sparse linear systems. Our results show that ice sheet problems of about 300 million degrees of freedom have been solved in just a few minutes using 1024 processors.

More Details

TYPE Journal Article YEAR 2019

DOI OSTI Scopus

Scalable inference for sparse deep neural networks using kokkos kernels

2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Bogle, Ian; Devine, Karen; Perego, Mauro; Rajamanickam, Sivasankaran; Slota, George M.

Over the last decade, hardware advances have led to the feasibility of training and inference for very large deep neural networks. Sparsified deep neural networks (DNNs) can greatly reduce memory costs and increase throughput of standard DNNs, if loss of accuracy can be controlled. The IEEE HPEC Sparse Deep Neural Network Graph Challenge serves as a testbed for algorithmic and implementation advances to maximize computational performance of sparse deep neural networks. We base our sparse network for DNNs, KK-SpDNN, on the sparse linear algebra kernels within the Kokkos Kernels library. Using the sparse matrix-matrix multiplication in Kokkos Kernels allows us to reuse a highly optimized kernel. We focus on reducing the single node and multi-node runtimes for 12 sparse networks. We test KK-SpDNN on Intel Skylake and Knights Landing architectures and see 120-500x improvement on single node performance over the serial reference implementation. We run in data-parallel mode with MPI to further speed up network inference, ultimately obtaining an edge processing rate of 1.16e+12 on 20 Skylake nodes. This translates to a 13x speed up on 20 nodes compared to our highly optimized multithreaded implementation on a single Skylake node.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

Scalable triangle counting on distributed-memory systems

2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Acer, Seher; Yasar, Abdurrahman; Rajamanickam, Sivasankaran; Wolf, Michael; Catalyurek, Umit V.

Triangle counting is a foundational graph-analysis kernel in network science. It has also been one of the challenge problems for the 'Static Graph Challenge'. In this work, we propose a novel, hybrid, parallel triangle counting algorithm based on its linear algebra formulation. Our framework uses MPI and Cilk to exploit the benefits of distributed-memory and shared-memory parallelism, respectively. The problem is partitioned among MPI processes using a two-dimensional (2D) Cartesian block partitioning. One-dimensional (1D) rowwise partitioning is used within the Cartesian blocks for shared-memory parallelism using the Cilk programming model. Besides exhibiting very good strong scaling behavior in almost all tested graphs, our algorithm achieves the fastest time on the 1.4B edge real-world twitter graph, which is 3.217 seconds, on 1,092 cores. In comparison to past distributed-memory parallel winners of the graph challenge, we demonstrate a speed up of 2.7× on this twitter graph. This is also the fastest time reported for parallel triangle counting on the twitter graph when the graph is not replicated.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

Geometric mapping of tasks to processors on parallel computers with mesh or torus networks

IEEE Transactions on Parallel and Distributed Systems

Deveci, Mehmet; Devine, Karen; Foulk, James W.; Taylor, Mark A.; Rajamanickam, Sivasankaran; Catalyurek, Umit V.

We present a new method for reducing parallel applications’ communication time by mapping their MPI tasks to processors in a way that lowers the distance messages travel and the amount of congestion in the network. Assuming geometric proximity among the tasks is a good approximation of their communication interdependence, we use a geometric partitioning algorithm to order both the tasks and the processors, assigning task parts to the corresponding processor parts. In this way, interdependent tasks are assigned to “nearby” cores in the network. We also present a number of algorithmic optimizations that exploit specific features of the network or application to further improve the quality of the mapping. We specifically address the case of sparse node allocation, where the nodes assigned to a job are not necessarily located in a contiguous block nor within close proximity to each other in the network. However, our methods generalize to contiguous allocations as well, and results are shown for both contiguous and non-contiguous allocations. We show that, for the structured finite difference mini-application MiniGhost, our mapping methods reduced communication time up to 75 percent relative to MiniGhost’s default mapping on 128K cores of a Cray XK7 with sparse allocation. For the atmospheric modeling code E3SM/HOMME, our methods reduced communication time up to 31% on 16K cores of an IBM BlueGene/Q with contiguous allocation.

More Details

TYPE Journal Article YEAR 2019

DOI OSTI Scopus

Linear algebra-based triangle counting via fine-grained tasking on heterogeneous environments : ((Update on Static Graph Challenge)

2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Yasar, Abdurrahman; Rajamanickam, Sivasankaran; Berry, Jonathan; Acer, Seher; Wolf, Michael; Young, Jeffrey S.; Catalyurek, Umit V.

Triangle counting is a representative graph problem that shows the challenges of improving graph algorithm performance using algorithmic techniques and adopting graph algorithms to new architectures. In this paper, we describe an update to the linear-algebraic formulation of the triangle counting problem. Our new approach relies on fine-grained tasking based on a tile layout. We adopt this task based algorithm to heterogeneous architectures (CPUs and GPUs) for up to 10.8x speed up over past year's graph challenge submission. This implementation also results in the fastest kernel time known at time of publication for real-world graphs like twitter (3.7 second) and friendster (1.8 seconds) on GPU accelerators when the graph is GPU resident. This is a 1.7 and 1.2 time improvement over previous state-of-the-art triangle counting on GPUs. We also improved end-to-end execution time by overlapping computation and communication of the graph to the GPUs. In terms of end-to-end execution time, our implementation also achieves the fastest end-to-end times due to very low overhead costs.

More Details

TYPE Conference Poster YEAR 2019

OSTI Scopus

A parallel graph algorithm for detecting mesh singularities in distributed memory ice sheet simulations

ACM International Conference Proceeding Series

We present a new, distributed-memory parallel algorithm for detection of degenerate mesh features that can cause singularities in ice sheet mesh simulations. Identifying and removing mesh features such as disconnected components (icebergs) or hinge vertices (peninsulas of ice detached from the land) can significantly improve the convergence of iterative solvers. Because the ice sheet evolves during the course of a simulation, it is important that the detection algorithm can run in situ with the simulation - - running in parallel and taking a negligible amount of computation time - - so that degenerate features (e.g., calving icebergs) can be detected as they develop. We present a distributed memory, BFS-based label-propagation approach to degenerate feature detection that is efficient enough to be called at each step of an ice sheet simulation, while correctly identifying all degenerate features of an ice sheet mesh. Our method finds all degenerate features in a mesh with 13 million vertices in 0.0561 seconds on 1536 cores in the MPAS Albany Land Ice (MALI) model. Compared to the previously used serial pre-processing approach, we observe a 46,000x speedup for our algorithm, and provide additional capability to do dynamic detection of degenerate features in the simulation.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

Scalable Triangle Counting on Distributed-Memory Systems

Acer, Seher; Yasar, Abdurrahman; Rajamanickam, Sivasankaran; Wolf, Michael; Catalyurek, Umit

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI

Math Libraries Workstream - Overview

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

Kokkos Kernels and Trilinos Solvers in FASTMath

Rajamanickam, Sivasankaran; Hu, Jonathan J.; Yang, Ulrike

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

SNL+DOE need/ask for NVIDIA Math Libraries

Rajamanickam, Sivasankaran; Berger-Vergiat, Luc; Dang, Vinh Q.; Ellingwood, Nathan D.; Kim, Kyungjoo; Mclendon, William; Trott, Christian R.; Wilke, Jeremiah

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

Batched Linear Algebra in Kokkos Kernels

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

FASTMath: Kokkos Kernels and Linear Solvers

Rajamanickam, Sivasankaran; Bogle, Ian; Hu, Jonathan J.; Devine, Karen; Slota, George M.; Perego, Mauro; Kim, Kyungjoo

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Scalable inference for sparse neural networks using Kokkos Kernels

Rajamanickam, Sivasankaran; Berger-Vergiat, Luc; Dang, Vinh Q.; Ellingwood, Nathan D.; Kim, Kyungjoo; Trott, Christian R.; Wilke, Jason; Mclendon, William

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI

Scalable Triangle Counting on Distributed-Memory Systems

Acer, Seher; Yasar, Abdurrahman; Rajamanickam, Sivasankaran; Wolf, Michael; Catalyurek, Umit

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI

A Parallel Graph Algorithm for Detecting Mesh Singularities in Distributed Memory Ice Sheet Simulations

Bogle, Ian; Devine, Karen; Perego, Mauro; Rajamanickam, Sivasankaran; Slota, George M.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI

SpaND: An Algebraic Sparsified Nested Dissection Algorithm Using Low-Rank Approximations

Cambier, Leopold; Chen, Chao; Boman, Erik G.; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.; Darve, Eric

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

SpaND: An Algebraic Sparsified Nested Dissection Algorithm Using Low-Rank Approximations

Boman, Erik G.; Cambier, Leopold; Chen, Chao; Darve, Eric; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Kokkos Kernels

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

ExaWind

Berger-Vergiat, Luc; Rajamanickam, Sivasankaran; Hu, Jonathan J.; Luchini, Christopher B.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

Large Scale Parallel Solution Methods for Electromagnetic Simulations

Hu, Jonathan J.; Glusa, Christian; Lin, Paul T.; Phillips, Edward; Foulk, James W.; Siefert, Christopher; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Understanding the Machine Learning Needs of ECP Applications

Hu, Jonathan J.; Rajamanickam, Sivasankaran

In order to support the codesign needs of ECP applications in current and future hardware in the area of machine learning, the ExaLearn team at Sandia studied the different machine learning use cases in three different ECP applications. This report is a summary of the needs of the three applications. The Sandia ExaLearn team will develop a proxy application representative of ECP application needs, specifically the ExaSky and EXAALT ECP projects. The proxy application will allow us to demonstrate performance portable kernels within machine learning codes. Furthermore, current training scalability of machine learning networks in these applications is negatively affected by large batch sizes. Training throughput of the network will increase as batch size increases, but network accuracy and generalization worsens. The proxy application will contain hybrid model- and data-parallelism to improve training efficiency while maintaining network accuracy. The proxy application will also target optimizing 3D convolutional layers, specific to scientific machine learning, which have not been as thoroughly explored by industry.

More Details

TYPE Other Report YEAR 2019

DOI OSTI

Trilinos Scalable Solvers and Kokkos Kernels

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

ExaGraph: Parallel Coloring and Partitioning for Exascale Applications

Boman, Erik G.; Rajamanickam, Sivasankaran; Acer, Seher

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Kokkos Kernels

Rajamanickam, Sivasankaran; Berger-Vergiat, Luc; Dang, Vinh Q.; Ellingwood, Nathan D.; Kim, Kyungjoo; Mclendon, William; Trott, Christian R.; Wilke, Jeremiah

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

ECP Panel on applications

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Asynchronous Iterative Solvers for Extreme Scale

Boman, Erik G.; Chow, Edmond; Dongarra, Jack; Glusa, Christian; Rajamanickam, Sivasankaran; Szyld, Daniel; Yamazaki, Ichitaro

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

An Algebraic Sparsified Nested Dissection Algorithm using Low-Rank Approximations

Cambier, Leopold; Chen, Chao; Darve, Eric; Boman, Erik G.; Rajamanickam, Sivasankaran; Tuminaro, Raymond S.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Using Graph Toolkits - Do interfaces matter