Publications Search

SIMD Scalar Types for Outer-loop Vectorization

Phipps, Eric T.; Kim, Kyungjoo; Rajamanickam, Sivasankaran; Tupek, Michael R.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Applications of Compact Batched Kernels

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Deveci, Mehmet; Kim, Kyungjoo; Trott, Christian R.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Designing Vector-Friendly Compact BLAS and LAPACK Kernels

Rajamanickam, Sivasankaran; Story, Shane; Knepper, Sarah; Guney, Murat; Hammond, Simon; Bradley, Andrew M.; Deveci, Mehmet; Costa, Tim; Kim, Kyungjoo

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

KokkosKernels: Performance-Portable Sparse, Dense, and Graph Kernels

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Deveci, Mehmet; Hoemmen, Mark F.; Hammond, Simon; Kim, Kyungjoo; Trott, Christian R.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

KokkosKernels

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Deveci, Mehmet; Kim, Kyungjoo; Trott, Christian R.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Performance Portable Line Smoother for Multiphysics Problems using Compact Batched BLAS

Kim, Kyungjoo; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

ShyLU: A Collection of Node-Scalable Sparse Linear Solvers

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Kim, Kyungjoo; Boman, Erik G.; Deveci, Mehmet

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

KokkosKernels: Compact Layouts for Batched BLAS and Sparse Matrix-Matrix Multiply

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Kim, Kyungjoo; Deveci, Mehmet; Trott, Christian R.; Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Intrepid2: Performance Portable Finite Element Discretization Library

Kim, Kyungjoo; Perego, Mauro; Ellingwood, Nathan D.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Designing Vector-Friendly Compact BLAS and LAPACK Kernels

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Kim, Kyungjoo; Costa, Timothy B.; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran

Many applications, such as PDE-based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14×, 45×, and 27× speedup against OpenMP loops around optimized DGEMM, DTRSM, and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present a detailed performance analysis of our kernels.
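
The interleaving idea in the abstract can be illustrated with a minimal sketch. It is not the KokkosKernels or Intel MKL interface: the function names `compact_offset` and `pack_compact`, the column-major ordering within a pack, and the zero padding of the last pack are all assumptions made for illustration. The sketch stores element (i, j) of `vlen` consecutive matrices in adjacent memory locations, so that a single SIMD load could fetch the same element from `vlen` matrices at once.

```python
def compact_offset(m, i, j, rows, cols, vlen):
    """Flat offset of element (i, j) of matrix m in an interleaved layout.

    Hypothetical helper: matrices are grouped into packs of vlen, and within
    a pack the same (i, j) element of every matrix is stored contiguously.
    Elements are ordered column-major inside the pack (an assumption).
    """
    pack, lane = divmod(m, vlen)
    return pack * rows * cols * vlen + (j * rows + i) * vlen + lane


def pack_compact(matrices, vlen):
    """Interleave equally sized matrices (lists of rows) in packs of vlen.

    The last pack is padded with zeros when the batch size is not a
    multiple of vlen.
    """
    rows, cols = len(matrices[0]), len(matrices[0][0])
    npacks = -(-len(matrices) // vlen)  # ceiling division
    flat = [0.0] * (npacks * rows * cols * vlen)
    for m, a in enumerate(matrices):
        for i in range(rows):
            for j in range(cols):
                flat[compact_offset(m, i, j, rows, cols, vlen)] = a[i][j]
    return flat


# Four 2x2 matrices; matrix m holds the values 10*m + 1 .. 10*m + 4.
mats = [[[10.0 * m + 1, 10.0 * m + 2],
         [10.0 * m + 3, 10.0 * m + 4]] for m in range(4)]
flat = pack_compact(mats, vlen=4)
print(flat[:4])  # element (0, 0) of all four matrices, side by side
```

With `vlen=4`, the first four entries of `flat` hold element (0, 0) of matrices 0 through 3, which is the access pattern that cross-matrix vectorization relies on.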

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

ECP 1.3.3.03a Develop General CS Components for ATDM Applications

Pawlowski, Roger; Bartlett, Roscoe; Bettencourt, Matthew T.; Carleton, James B.; Conde, Sidafa; Cyr, Eric C.; Kim, Kyungjoo; Mota, Alejandro; Perego, Mauro; Shadid, John N.; Sjaardema, Gregory D.; Toth, Alexander R.; Bradley, Andrew M.; Spotz, William S.; Ober, Curtis C.; Tezaur, Irina K.

Abstract not provided.

TYPE Presentation YEAR 2016

OSTI

Tacho: Two-level Task Parallel Cholesky Factorization

Kim, Kyungjoo; Rajamanickam, Sivasankaran; Edwards, Harold C.; Dohrmann, Clark R.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Kokkos Task API: A Use Case in Tacho

Kim, Kyungjoo; Rajamanickam, Sivasankaran; Edwards, Harold C.; Olivier, Stephen L.; Stelle, George W.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

KokkosKernels Introduction: Design, API, and Performance

Deveci, Mehmet; Rajamanickam, Sivasankaran; Kim, Kyungjoo; Bradley, Andrew M.; Trott, Christian R.; Hoemmen, Mark F.; Boman, Erik G.

Abstract not provided.

TYPE Presentation YEAR 2016

OSTI

Intrepid2: Towards Performance Portability

Kim, Kyungjoo; Perego, Mauro; Ellingwood, Nathan D.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Hierarchical Task-Data Parallelism using Kokkos and Qthreads

Edwards, Harold C.; Olivier, Stephen L.; Berry, Jonathan; Mackey, Greg E.; Rajamanickam, Sivasankaran; Wolf, Michael; Kim, Kyungjoo; Stelle, George W.

This report describes a new capability for hierarchical task-data parallelism using Sandia's Kokkos and Qthreads, and evaluates this capability with sparse matrix Cholesky factorization and social network triangle enumeration mini-applications. Hierarchical task-data parallelism consists of a collection of tasks with executes-after dependences, where each task contains data-parallel operations performed on a team of hardware threads. The collection of tasks and dependences forms a directed acyclic graph of tasks, a task DAG. Major challenges of this research and development effort include portability and performance across multicore CPU, manycore Intel Xeon Phi, and NVIDIA GPU architectures; scalability with respect to hardware concurrency and size of the task DAG; and usability of the application programmer interface (API).
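
The task DAG described above can be sketched in a few lines. This is a serial stand-in and does not use the Kokkos or Qthreads APIs: the function `run_task_dag` and the three example tasks are hypothetical, and the list comprehensions only mimic the data-parallel work that a team of hardware threads would perform inside each task. The sketch relies on `graphlib.TopologicalSorter` from the Python standard library (3.9 or later) to honor the executes-after dependences.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_task_dag(tasks):
    """Run tasks in an order that respects executes-after dependences.

    tasks maps a task name to (set of predecessor names, callable body).
    Returns the order in which the tasks were executed.
    """
    sorter = TopologicalSorter({name: deps for name, (deps, _) in tasks.items()})
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    order = []
    while sorter.is_active():
        # Every task returned by get_ready() has all predecessors finished,
        # so a real runtime could execute these tasks concurrently.
        for name in sorter.get_ready():
            tasks[name][1]()
            order.append(name)
            sorter.done(name)
    return order


data = [1.0, 2.0, 3.0, 4.0]
results = {}


def scale():
    # Stand-in for a data-parallel loop over the entries of data.
    results["scaled"] = [2.0 * x for x in data]


def shift():
    results["shifted"] = [x + 1.0 for x in data]


def combine():
    # Executes after both scale and shift have completed.
    results["sum"] = sum(a + b for a, b in zip(results["scaled"], results["shifted"]))


tasks = {
    "scale": (set(), scale),
    "shift": (set(), shift),
    "combine": ({"scale", "shift"}, combine),
}
order = run_task_dag(tasks)
print(order[-1], results["sum"])
```

Here `scale` and `shift` have no predecessors and may run in either order, while `combine` always runs last because it depends on both.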

TYPE SAND Report YEAR 2016

DOI OSTI

A Massively Parallel Scalable Implicit SPH Solver

Trask, Nathaniel; Maxey, Martin; Kim, Kyungjoo; Perego, Mauro; Parks, Michael L.; Yang, Kai; Xu, Jinchao; Pan, Wenxiao; Tartakovsky, Alex

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI
