Publications

Results 26–50 of 64

Search results

Jump to search filters

Designing vector-friendly compact BLAS and LAPACK kernels

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017

Kim, Kyungjoo K.; Costa, Timothy B.; Deveci, Mehmet D.; Bradley, Andrew M.; Hammond, Simon D.; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran R.

Many applications, such as PDE based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14x, 45x, and 27x speedup against OpenMP loops around optimized DGEMM, DTRSM and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.

More Details

Designing Vector-Friendly Compact BLAS and LAPACK Kernels

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Kim, Kyungjoo K.; Costa, Timothy B.; Deveci, Mehmet D.; Bradley, Andrew M.; Hammond, Simon D.; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran R.

Many applications, such as PDE based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14 ×, 45 ×, and 27 × speedup against OpenMP loops around optimized DGEMM, DTRSM and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.

More Details

Hierarchical Task-Data Parallelism using Kokkos and Qthreads

Edwards, Harold C.; Olivier, Stephen L.; Berry, Jonathan W.; Mackey, Greg; Rajamanickam, Sivasankaran R.; Wolf, Michael W.; Kim, Kyungjoo K.; Stelle, George

This report describes a new capability for hierarchical task-data parallelism using Sandia's Kokkos and Qthreads, and evaluation of this capability with sparse matrix Cholesky factorization and social network triangle enumeration mini-applications. Hierarchical task-data parallelism consists of a collection of tasks with executes-after dependences where each task contains data parallel operations performed on a team of hardware threads. The collection of tasks and dependences form a directed acyclic graph of tasks - a task DAG. Major challenges of this research and development effort include: portability and performance across multicore CPU; manycore Intel Xeon Phi, and NVIDIA GPU architectures; scalability with respect to hardware concurrency and size of the task DAG; and usability of the application programmer interface (API).

More Details

Intercomparison of 3D pore-scale flow and solute transport simulation methods

Advances in Water Resources

Mehmani, Yashar; Schonherr, Martin; Pasquali, Andrea; Clark, Colin; Perkins, William A.; Kim, Kyungjoo K.; Perego, Mauro P.; Parks, Michael L.; Balhoff, Matthew T.; Richmond, Marshall C.; Geier, Martin; Krafczyk, Manfred; Luo, Li S.; Tartakovsky, Alexandre M.; Winter, C.L.

Multiple numerical approaches have been developed to simulate porous media fluid flow and solute transport at the pore scale. These include 1) methods that explicitly model the three-dimensional geometry of pore spaces and 2) methods that conceptualize the pore space as a topologically consistent set of stylized pore bodies and pore throats. In previous work we validated a model of the first type, using computational fluid dynamics (CFD) codes employing a standard finite volume method (FVM), against magnetic resonance velocimetry (MRV) measurements of pore-scale velocities. Here we expand that validation to include additional models of the first type based on the lattice Boltzmann method (LBM) and smoothed particle hydrodynamics (SPH), as well as a model of the second type, a pore-network model (PNM). The PNM approach used in the current study was recently improved and demonstrated to accurately simulate solute transport in a two-dimensional experiment. While the PNM approach is computationally much less demanding than direct numerical simulation methods, the effect of conceptualizing complex three-dimensional pore geometries on solute transport in the manner of PNMs has not been fully determined. We apply all four approaches (FVM-based CFD, LBM, SPH and PNM) to simulate pore-scale velocity distributions and (for capable codes) nonreactive solute transport, and intercompare the model results. Comparisons are drawn both in terms of macroscopic variables (e.g., permeability, solute breakthrough curves) and microscopic variables (e.g., local velocities and concentrations). Generally good agreement was achieved among the various approaches, but some differences were observed depending on the model context. The intercomparison work was challenging because of variable capabilities of the codes, and inspired some code enhancements to allow consistent comparison of flow and transport simulations across the full suite of methods. This study provides support for confidence in a variety of pore-scale modeling methods and motivates further development and application of pore-scale simulation methods.

More Details
Results 26–50 of 64
Results 26–50 of 64