Publications / Conference Poster

Designing Vector-Friendly Compact BLAS and LAPACK Kernels

Kim, Kyungjoo; Costa, Timothy B.; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran

Many applications, such as PDE-based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout that enables cross-matrix vectorization may provide significant further speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14×, 45×, and 27× speedup against OpenMP loops around optimized DGEMM, DTRSM, and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present a detailed performance analysis of our kernels.
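The interleaving idea in the abstract can be illustrated with a minimal sketch. It is not the KokkosKernels or Intel MKL interface: the names `VL`, `compact_index`, `pack_compact`, and `compact_gemm` are hypothetical, the vector length of 8 doubles is an assumption (matching a 512-bit SIMD register), and the row-major ordering within a pack is one possible choice. Matrices are grouped into packs of `VL`, and within a pack the same (i, j) entry of every matrix is stored contiguously, so the innermost loop runs across matrices rather than within one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only; these names are not from any library.
// Assumed SIMD length: 8 doubles, as in a 512-bit vector register.
constexpr std::size_t VL = 8;

// Position of entry (i, j) of matrix `mat` in the compact buffer.
// Matrices are grouped into packs of VL; inside a pack, the (i, j) entries
// of all VL matrices sit next to each other.
inline std::size_t compact_index(std::size_t mat, std::size_t i, std::size_t j,
                                 std::size_t rows, std::size_t cols) {
  const std::size_t pack = mat / VL;
  const std::size_t lane = mat % VL;
  return ((pack * rows + i) * cols + j) * VL + lane;
}

// Convert `count` row-major matrices stored back to back into the compact
// layout. Unused lanes of the last pack are padded with zeros.
std::vector<double> pack_compact(const std::vector<double>& batch,
                                 std::size_t count, std::size_t rows,
                                 std::size_t cols) {
  assert(batch.size() == count * rows * cols);
  const std::size_t packs = (count + VL - 1) / VL;
  std::vector<double> out(packs * rows * cols * VL, 0.0);
  for (std::size_t mat = 0; mat < count; ++mat)
    for (std::size_t i = 0; i < rows; ++i)
      for (std::size_t j = 0; j < cols; ++j)
        out[compact_index(mat, i, j, rows, cols)] =
            batch[(mat * rows + i) * cols + j];
  return out;
}

// C += A * B for every matrix in the batch, with all operands in the compact
// layout. A is m x k, B is k x n, C is m x n. The innermost loop walks the
// VL lanes of one pack with unit stride, which a compiler can vectorize.
void compact_gemm(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, std::size_t count, std::size_t m,
                  std::size_t n, std::size_t k) {
  const std::size_t packs = (count + VL - 1) / VL;
  for (std::size_t p = 0; p < packs; ++p)
    for (std::size_t i = 0; i < m; ++i)
      for (std::size_t j = 0; j < n; ++j)
        for (std::size_t l = 0; l < k; ++l) {
          const double* a = &A[((p * m + i) * k + l) * VL];
          const double* b = &B[((p * k + l) * n + j) * VL];
          double* c = &C[((p * m + i) * n + j) * VL];
          for (std::size_t lane = 0; lane < VL; ++lane)
            c[lane] += a[lane] * b[lane];
        }
}
```

In this sketch each scalar multiply-add of a textbook matrix product becomes one vector operation over `VL` independent matrices, which is why the layout suits very small matrices whose own dimensions are too short to fill a SIMD register. The padding lanes are computed as well, but they hold zeros and do not affect the real matrices.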