Publications

Results 151–175 of 272

Search results

Jump to search filters

Designing Vector-Friendly Compact BLAS and LAPACK Kernels

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Kim, Kyungjoo K.; Costa, Timothy B.; Deveci, Mehmet D.; Bradley, Andrew M.; Hammond, Simon D.; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran R.

Many applications, such as PDE based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14 ×, 45 ×, and 27 × speedup against OpenMP loops around optimized DGEMM, DTRSM and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.

More Details

Samba: A Detailed Memory Management Unit (MMU) for the SST Simulation Framework

Awad, Amro A.; Voskuilen, Gwendolyn R.; Hammond, Simon D.; Hoekstra, Robert J.

The Memory Management Unit (MMU) is one of the most important parts of any modern computing system. It can be thought of as the hardware support for virtual memory, which enforces access permissions, and manages the translation of processes' virtual addresses into real memory physical addresses. The operating system (OS) allocates the physical memory in a small granularity frames that can be accessed by processes with appropriate permissions. Each process has the illusion of having an entire memory space, however, the actual addresses issued by the process, (`virtual addresses'), are translated into the physical frame address allocated by the OS through virtual memory.

More Details

Analyzing allocation behavior for multi-level memory

ACM International Conference Proceeding Series

Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hammond, Simon D.

Managing multi-level memories will require different policies from those used for cache hierarchies, as memory technologies differ in latency, bandwidth, and volatility. To this end we analyze application data allocations and main memory accesses to determine whether an application-driven approach to managing a multi-level memory system comprising stacked and conventional DRAM is viable. Our early analysis shows that the approach is viable, but some applications may require dynamic allocations (i.e., migration) while others are amenable to static allocation.

More Details

Evaluating the Opportunities for Multi-Level Memory – An ASC 2016 L2 Milestone

Voskuilen, Gwendolyn R.; Rodrigues, Arun; Frank, Michael P.; Hammond, Simon D.

The next two Advanced Technology platforms for the ASC program will feature complex memory hierarchies – in the Trinity supercomputer being deployed in 2016, Intel’s Knights Landing processors will feature 16GB of on-package, high-bandwidth memory, combined with a larger capacity DDR4 memory and in 2018, the Sierra machine deployed at Lawrence Livermore National Laboratory will feature powerful compute nodes containing POWER9 processors with large capacity memories and an array of coherent GPU accelerators also with high bandwidth memories.

More Details
Results 151–175 of 272
Results 151–175 of 272