Center for Computing Research (CCR)

Many applications, such as PDE-based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout that enables cross-matrix vectorization may provide significant further speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14x, 45x, and 27x speedup over OpenMP loops around optimized DGEMM, DTRSM, and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present a detailed performance analysis of our kernels.

TYPE Conference Poster YEAR 2017
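
The compact layout described in the abstract above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the index formula, the function names (`pack_compact`, `unpack_compact`, `compact_gemm`), and the vector length of 4 are hypothetical and are not the KokkosKernels or Intel Math Kernel Library APIs, whose actual layouts may differ. Pure Python does not vectorize, so the sketch shows only the indexing: the same element (i, j) of `vl` consecutive matrices sits in adjacent memory, so the innermost loop over lanes corresponds to a single SIMD operation in a real kernel.

```python
VECTOR_LENGTH = 4  # assumed SIMD width in elements (hypothetical value)


def compact_index(blk, i, j, lane, rows, cols, vl):
    """Offset of element (i, j) of the matrix in lane `lane` of block `blk`."""
    return ((blk * rows + i) * cols + j) * vl + lane


def pack_compact(mats, rows, cols, vl=VECTOR_LENGTH):
    """Interleave a batch of rows x cols matrices in blocks of vl matrices."""
    nblocks = (len(mats) + vl - 1) // vl
    buf = [0.0] * (nblocks * rows * cols * vl)  # unused lanes stay zero
    for b, mat in enumerate(mats):
        blk, lane = divmod(b, vl)
        for i in range(rows):
            for j in range(cols):
                buf[compact_index(blk, i, j, lane, rows, cols, vl)] = mat[i][j]
    return buf


def unpack_compact(buf, count, rows, cols, vl=VECTOR_LENGTH):
    """Recover the list of matrices from a compact buffer."""
    mats = []
    for b in range(count):
        blk, lane = divmod(b, vl)
        mats.append([[buf[compact_index(blk, i, j, lane, rows, cols, vl)]
                      for j in range(cols)] for i in range(rows)])
    return mats


def compact_gemm(a, b, count, m, k, n, vl=VECTOR_LENGTH):
    """Compute C = A * B for every matrix in the batch, in compact layout."""
    nblocks = (count + vl - 1) // vl
    c = [0.0] * (nblocks * m * n * vl)
    for blk in range(nblocks):
        for i in range(m):
            for j in range(n):
                c_off = compact_index(blk, i, j, 0, m, n, vl)
                for p in range(k):
                    a_off = compact_index(blk, i, p, 0, m, k, vl)
                    b_off = compact_index(blk, p, j, 0, k, n, vl)
                    # Lanes are contiguous: this loop maps to one SIMD
                    # multiply-add across vl different matrices.
                    for lane in range(vl):
                        c[c_off + lane] += a[a_off + lane] * b[b_off + lane]
    return c
```

The design point is that vectorization runs across matrices rather than within one, so even very small matrices can fill the vector lanes.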

Applications of Compact Batched Kernels

Rajamanickam, Sivasankaran R.; Bradley, Andrew M.; Deveci, Mehmet D.; Kim, Kyungjoo K.; Trott, Christian R.

Abstract not provided.

TYPE Conference Poster YEAR 2017

KokkosKernels

Rajamanickam, Sivasankaran R.; Bradley, Andrew M.; Deveci, Mehmet D.; Kim, Kyungjoo K.; Trott, Christian R.

Abstract not provided.

TYPE Presentation YEAR 2017

KokkosKernels: Performance-Portable Sparse Dense and Graph Kernels

Rajamanickam, Sivasankaran R.; Bradley, Andrew M.; Deveci, Mehmet D.; Hoemmen, Mark F.; Hammond, Simon D.; Kim, Kyungjoo K.; Trott, Christian R.

Abstract not provided.

TYPE Presentation YEAR 2017

Designing Vector-Friendly Compact BLAS and LAPACK Kernels

Rajamanickam, Sivasankaran R.; Story, Shane S.; Knepper, Sarah K.; Guney, Murat G.; Hammond, Simon D.; Bradley, Andrew M.; Deveci, Mehmet D.; Costa, Tim C.; Kim, Kyungjoo K.

Abstract not provided.

TYPE Conference Poster YEAR 2017

SIMD Scalar Types for Outer-loop Vectorization

Phipps, Eric T.; Kim, Kyungjoo K.; Rajamanickam, Sivasankaran R.; Tupek, Michael R.

Abstract not provided.

TYPE Presentation YEAR 2017

Designing Vector-Friendly Compact BLAS and LAPACK Kernels

Kim, Kyungjoo K.; Costa, Timothy B.; Deveci, Mehmet D.; Bradley, Andrew M.; Hammond, Simon D.; Guney, Murat G.; Knepper, Sarah K.; Story, Shane S.; Rajamanickam, Sivasankaran R.

Abstract not provided.

TYPE Conference Poster YEAR 2017

Performance Portable Line Smoother for Multiphysics Problems using Compact Batched BLAS

Kim, Kyungjoo K.; Deveci, Mehmet D.; Bradley, Andrew M.; Hammond, Simon D.; Rajamanickam, Sivasankaran R.

Abstract not provided.

TYPE Conference Poster YEAR 2017

ShyLU: A Collection of Node-Scalable Sparse Linear Solvers

Rajamanickam, Sivasankaran R.; Bradley, Andrew M.; Kim, Kyungjoo K.; Boman, Erik G.; Deveci, Mehmet D.

Abstract not provided.

TYPE Conference Poster YEAR 2017

KokkosKernels: Compact Layouts for Batched BLAS and Sparse Matrix-Matrix Multiply

Rajamanickam, Sivasankaran R.; Bradley, Andrew M.; Kim, Kyungjoo K.; Deveci, Mehmet D.; Trott, Christian R.; Hammond, Simon D.

Abstract not provided.

TYPE Conference Poster YEAR 2017

Intrepid2: Performance Portable Finite Element Discretization Library

Kim, Kyungjoo K.; Perego, Mauro P.; Ellingwood, Nathan D.

Abstract not provided.

TYPE Conference Poster YEAR 2017

ECP 1.3.3.03a Develop General CS Components for ATDM Applications

Pawlowski, Roger P.; Bartlett, Roscoe B.; Bettencourt, Matthew T.; Carleton, James B.; Conde, Sidafa C.; Cyr, Eric C.; Kim, Kyungjoo K.; Mota, Alejandro M.; Perego, Mauro P.; Shadid, John N.; Sjaardema, Gregory D.; Toth, Alexander R.; Bradley, Andrew M.; Spotz, William S.; Ober, Curtis C.; Kalashnikova, Irina

Abstract not provided.

TYPE Presentation YEAR 2017

Kokkos Task API: A Use Case in Tacho

Kim, Kyungjoo K.; Rajamanickam, Sivasankaran R.; Edwards, Harold C.; Olivier, Stephen L.; Stelle, George

Abstract not provided.

TYPE Conference Poster YEAR 2016

Intrepid2: Towards Performance Portability

Kim, Kyungjoo K.; Perego, Mauro P.; Ellingwood, Nathan D.

Abstract not provided.

TYPE Conference Poster YEAR 2016

KokkosKernels Introduction: Design API and Performance

Deveci, Mehmet D.; Rajamanickam, Sivasankaran R.; Kim, Kyungjoo K.; Bradley, Andrew M.; Trott, Christian R.; Hoemmen, Mark F.; Boman, Erik G.

Abstract not provided.

TYPE Presentation YEAR 2016

Tacho: Two-level Task Parallel Cholesky Factorization

Kim, Kyungjoo K.; Rajamanickam, Sivasankaran R.; Edwards, Harold C.; Dohrmann, Clark R.

Abstract not provided.

TYPE Conference Poster YEAR 2016

Hierarchical Task-Data Parallelism using Kokkos and Qthreads

Edwards, Harold C.; Olivier, Stephen L.; Berry, Jonathan W.; Mackey, Greg; Rajamanickam, Sivasankaran R.; Wolf, Michael W.; Kim, Kyungjoo K.; Stelle, George

This report describes a new capability for hierarchical task-data parallelism using Sandia's Kokkos and Qthreads, and an evaluation of this capability with sparse matrix Cholesky factorization and social network triangle enumeration mini-applications. Hierarchical task-data parallelism consists of a collection of tasks with executes-after dependences, where each task contains data-parallel operations performed on a team of hardware threads. The collection of tasks and dependences forms a directed acyclic graph of tasks (a task DAG). Major challenges of this research and development effort include portability and performance across multicore CPU, manycore Intel Xeon Phi, and NVIDIA GPU architectures; scalability with respect to hardware concurrency and size of the task DAG; and usability of the application programmer interface (API).

TYPE SAND Report YEAR 2016
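
The task DAG idea in the abstract above can be sketched in a few lines. This is an illustration under stated assumptions, not the Kokkos or Qthreads API: the function name `run_task_dag` and its arguments are hypothetical, the DAG ordering comes from Python's standard `graphlib.TopologicalSorter`, and a `ThreadPoolExecutor` stands in for a team of hardware threads. For simplicity, ready tasks run one at a time, although independent tasks could run concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_task_dag(tasks, deps, team_size=4):
    """Run tasks in an order that respects executes-after dependences.

    tasks: dict mapping task name -> (function, list of work items).
           The function is applied to every item by a team of threads,
           which is the data-parallel part of each task.
    deps:  dict mapping task name -> set of task names that must finish first.
    Returns the execution order and the per-task results.
    """
    sorter = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    order, results = [], {}
    with ThreadPoolExecutor(max_workers=team_size) as team:
        while sorter.is_active():
            # get_ready() returns tasks whose predecessors are all done.
            for name in sorter.get_ready():
                func, items = tasks[name]
                results[name] = list(team.map(func, items))
                order.append(name)
                sorter.done(name)
    return order, results
```

A task here might be, for example, the factorization of one block of a sparse matrix, with the dependences recording which blocks must be factored before an update can proceed.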

A Massively Parallel Scalable Implicit SPH Solver

Trask, Nathaniel T.; Maxey, Martin M.; Kim, Kyungjoo K.; Perego, Mauro P.; Parks, Michael L.; Yang, Kai Y.; Xu, Jinchao X.; Pan, Wenxiao P.; Tartakovsky, Alex T.

Abstract not provided.

TYPE Conference Poster YEAR 2016

A comparison of high-level programming choices for incomplete sparse factorization across different architectures

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Booth, Joshua D.; Kim, Kyungjoo K.; Rajamanickam, Sivasankaran R.

All many-core systems require fine-grained shared-memory parallelism; however, the most efficient way to extract such parallelism is far from trivial. Fine-grained parallel algorithms face various performance trade-offs related to tasking, accesses to global data structures, and use of shared cache. While programming models provide high-level abstractions, such as data and task parallelism, algorithmic choices remain open on how best to implement irregular algorithms, such as sparse factorizations, while taking these trade-offs into account. In this paper, we compare these performance trade-offs for task and data parallelism on different hardware architectures such as Intel Sandy Bridge, Intel Xeon Phi, and IBM Power8. We do this by comparing the scaling of a new task-parallel incomplete sparse Cholesky factorization called Tacho and a new data-parallel incomplete sparse LU factorization called Basker. Both solvers use the Kokkos programming model and were developed within the ShyLU package of Trilinos. Using these two codes, we demonstrate how high-level programming changes affect performance and overhead costs on multiple multi/many-core systems. We find that Kokkos is able to provide comparable performance with both parallel-for and task/futures on traditional x86 multicores. However, the choice of which high-level abstraction to use on many-core systems depends on both the architecture and the input matrices.

TYPE Conference Poster YEAR 2016

Task and Data Parallelism Based Direct Solvers and Preconditioners in Manycore Architecture: Efforts in Trilinos/ShyLU

Booth, Joshua D.; Rajamanickam, Sivasankaran R.; Bradley, Andrew M.; Boman, Erik G.; Kim, Kyungjoo K.

Abstract not provided.

TYPE Conference Poster YEAR 2016
