Publications Search

Profiling and Debugging Support for the Kokkos Programming Model

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Hammond, Simon; Trott, Christian R.; Ibanez-Granados, Daniel A.; Sunderland, Daniel

Supercomputing hardware is undergoing a period of significant change. In order to cope with the rapid pace of hardware and, in many cases, programming model innovation, we have developed the Kokkos Programming Model, a C++-based abstraction that permits performance portability across diverse architectures. Our experience has shown that the abstractions developed can significantly frustrate debugging and profiling activities because they break expected code proximity and layout assumptions. In this paper we present the Kokkos Profiling interface, a lightweight suite of hooks to which debugging and profiling tools can attach to gain deep insights into the execution and data structure behaviors of parallel programs written to the Kokkos interface.

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus

Towards a Scalable Integrated Simulation Framework for Extreme Heterogeneity in High Performance Computing

Hammond, Simon; Rodrigues, Arun; Hemmert, Karl S.; Voskuilen, Gwendolyn R.; Hughes, Clayton; Levenhagen, Michael; Hoekstra, Robert J.; Ang, James A.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

NALU Engineering Application Overview

Hammond, Simon; Hoekstra, Robert J.; Rodrigues, Arun; Ang, James A.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Designing vector-friendly compact BLAS and LAPACK kernels

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2017

Kim, Kyungjoo; Costa, Timothy B.; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran

Many applications, such as PDE-based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14x, 45x, and 27x speedup against OpenMP loops around optimized DGEMM, DTRSM and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus
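
The compact layout described in the abstract above, which interleaves matrices in blocks of the SIMD vector length, can be sketched in plain Python. This is an illustration under stated assumptions, not the KokkosKernels or Intel MKL interface: the function names and the exact index formula are hypothetical, and the batch is assumed to hold square matrices of equal size, zero-padded to a multiple of the vector length.

```python
def pack_compact(mats, vlen):
    """Interleave a batch of n x n matrices in blocks of `vlen` matrices.

    Within a block, entry (i, j) of all `vlen` matrices is stored
    contiguously, so a single vector load could fetch the same entry
    from `vlen` different matrices.
    """
    n = len(mats[0])
    nblocks = (len(mats) + vlen - 1) // vlen
    data = [0.0] * (nblocks * n * n * vlen)  # zero-padded last block
    for b, m in enumerate(mats):
        blk, lane = divmod(b, vlen)
        for i in range(n):
            for j in range(n):
                data[((blk * n + i) * n + j) * vlen + lane] = m[i][j]
    return data, n

def get(data, n, vlen, b, i, j):
    """Read entry (i, j) of matrix b from the compact layout."""
    blk, lane = divmod(b, vlen)
    return data[((blk * n + i) * n + j) * vlen + lane]

def compact_gemm(a, b, n, vlen):
    """Compute C = A * B for every matrix in the batch, in compact layout."""
    c = [0.0] * len(a)
    nblocks = len(a) // (n * n * vlen)
    for blk in range(nblocks):
        base = blk * n * n * vlen
        for i in range(n):
            for j in range(n):
                for k in range(n):
                    ia = base + (i * n + k) * vlen
                    ib = base + (k * n + j) * vlen
                    ic = base + (i * n + j) * vlen
                    # Innermost loop runs across matrices: unit stride,
                    # which is the axis a compiler could vectorize.
                    for lane in range(vlen):
                        c[ic + lane] += a[ia + lane] * b[ib + lane]
    return c

# Five 2 x 2 matrices with a vector length of 4 (two blocks, one padded).
A = [[[float(b + i + j) for j in range(2)] for i in range(2)] for b in range(5)]
B = [[[float(b * i - j) for j in range(2)] for i in range(2)] for b in range(5)]
VLEN = 4
a, n = pack_compact(A, VLEN)
b_, _ = pack_compact(B, VLEN)
c = compact_gemm(a, b_, n, VLEN)
```

The point of the layout is that the innermost loop walks across matrices rather than within one small matrix, so vector lanes stay full even when each matrix is tiny.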

DOE NNSA Vanguard Program

Foulk, James W.; Alvin, Kenneth F.; Foulk, James W.; Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Towards an Open Source Eco-System for Future HPC Designs

Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

KokkosKernels: Performance-Portable Sparse Dense and Graph Kernels

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Deveci, Mehmet; Hoemmen, Mark F.; Hammond, Simon; Kim, Kyungjoo; Trott, Christian R.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Designing Vector-Friendly Compact BLAS and LAPACK Kernels

Rajamanickam, Sivasankaran; Story, Shane; Knepper, Sarah; Guney, Murat; Hammond, Simon; Bradley, Andrew M.; Deveci, Mehmet; Costa, Tim; Kim, Kyungjoo

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Fast linear algebra-based triangle counting with KokkosKernels

2017 IEEE High Performance Extreme Computing Conference, HPEC 2017

Wolf, Michael; Deveci, Mehmet; Berry, Jonathan; Hammond, Simon; Rajamanickam, Sivasankaran

Triangle counting serves as a key building block for a set of important graph algorithms in network science. In this paper, we address the IEEE HPEC Static Graph Challenge problem of triangle counting, focusing on obtaining the best parallel performance on a single multicore node. Our implementation uses a linear algebra-based approach to triangle counting that has grown out of work related to our miniTri data analytics miniapplication [1] and our efforts to pose graph algorithms in the language of linear algebra. We leverage KokkosKernels to implement this approach efficiently on multicore architectures. Our performance results are competitive with the fastest known graph traversal-based approaches and are significantly faster than the Graph Challenge reference implementations, up to 670,000 times faster than the C++ reference and 10,000 times faster than the Python reference on a single Intel Haswell node.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus
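
The linear algebra-based approach mentioned in the abstract above can be illustrated with one common formulation: with L the strictly lower-triangular part of the adjacency matrix, the triangle count is the sum of the entries of (L·L) masked elementwise by L. The abstract does not spell out which variant the authors use, so treat this as an assumption. The sketch below stores each sparse row of L as a Python set and does not use KokkosKernels.

```python
def count_triangles(edges):
    """Count triangles in an undirected graph as sum((L @ L) * L).

    Row v of L holds the neighbors u of v with u < v, so L is the
    strictly lower-triangular part of the adjacency matrix.
    """
    lower = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        hi, lo = max(u, v), min(u, v)
        lower.setdefault(hi, set()).add(lo)

    total = 0
    for row_i in lower.values():
        for k in row_i:  # nonzero L[i][k]
            # Row k of L contributes to row i of L @ L; the mask L[i][j]
            # keeps only the columns j that are also present in row i.
            total += len(lower.get(k, set()) & row_i)
    return total

# Complete graph on 4 vertices: every 3-subset of vertices is a triangle.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
k4_triangles = count_triangles(k4)
```

Using only the lower triangle means each triangle with vertices i > k > j is found exactly once, so no division by a symmetry factor is needed.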

KKTri: Fast Linear Algebra-Based Triangle Counting with KokkosKernels

Wolf, Michael; Deveci, Mehmet; Berry, Jonathan; Hammond, Simon; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Performance Portable Line Smoother for Multiphysics Problems using Compact Batched BLAS

Kim, Kyungjoo; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Final Review of FY17 ASC CSSE L2 Milestone #6018 entitled "Analyzing Power Usage Characteristics of Workloads Running on Trinity"

Hoekstra, Robert J.; Hammond, Simon; Hemmert, Karl S.; Gentile, Ann C.; Oldfield, Ron; Lang, Mike; Martin, Steve

The presentation documented the technical approach of the team and a summary of the results in sufficient detail to demonstrate both the value and the completion of the milestone. A separate SAND report with more detail was also generated to supplement the presentation.

TYPE Other Report YEAR 2017

DOI OSTI

Tri-Lab Co-Design Milestone: In-Depth Performance Portability Analysis of Improved Integrated Codes on Advanced Architecture

Hoekstra, Robert J.; Hammond, Simon; Richards, David; Bergen, Ben

This milestone is a tri-lab deliverable supporting ongoing Co-Design efforts impacting applications in the Integrated Codes (IC) program element and the Advanced Technology Development and Mitigation (ATDM) program element. In FY14, the tri-labs looked at porting proxy applications to technologies of interest for ATS procurements. In FY15, a milestone was completed evaluating proxy applications in multiple programming models, and in FY16, a milestone was completed focusing on the migration of lessons learned back into production code development. This year, the co-design milestone focuses on carrying the knowledge gained and/or code revisions back into production applications.

TYPE Other Report YEAR 2017

DOI OSTI

Evaluating Production Load Balancing Functions for Adaptive Mesh Schemes using Mini-Applications

Vaughan, Courtenay T.; Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Vanguard: Maturing the ARM Software Ecosystem for U.S. DOE Supercomputing

Foulk, James W.; Foulk, James W.; Grant, Ryan; Hammond, Simon; Hemmert, Karl S.; Martinez, David; Noe, John P.; Foulk, James W.; Ward, Harry L.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Sandia's ARM-centric Co-Design Strategy: Introduction to the NNSA/ASC Vanguard Project

Ang, James A.; Brightwell, Ronald B.; Hammond, Simon; Hemmert, Karl S.; Hoekstra, Robert J.; Foulk, James W.; Foulk, James W.; Rodrigues, Arun

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

1541 L2 Milestone: Thread Scalable Expression Assembly in Aria

Clausen, Jonathan; Brunini, Victor; Forster, Christopher J.; Noble, David R.; Trott, Christian R.; Hammond, Simon; Hoemmen, Mark F.; Lin, Paul T.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Performance Portable Sparse Matrix Matrix Multiplication with Applications in Scientific Computing and Graph Analytics

Deveci, Mehmet; Trott, Christian R.; Hammond, Simon; Wolf, Michael; Berry, Jonathan; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Fast Linear Algebra-Based Triangle Counting with KokkosKernels

Wolf, Michael; Deveci, Mehmet; Berry, Jonathan; Hammond, Simon; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Conference Poster YEAR 2017

DOI OSTI

On the Importance of Faster Atomics

Hammond, Simon; Trott, Christian R.; Edwards, Harold C.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

The Impact of Increasing Memory System Diversity on Applications

Voskuilen, Gwendolyn R.; Rodrigues, Arun; Frank, Michael P.; Hammond, Simon

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Performance Analysis for Using Non-Volatile Memory DIMMs: Opportunities and Challenges

Awad, Amro; Hammond, Simon; Hughes, Clayton; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

Abstract not provided.

TYPE Conference Poster YEAR 2017

DOI OSTI

Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on Next-Generation Architectures

Garcia De Gonzalo, Simon; Hammond, Simon; Trott, Christian R.; Hwu, Wen-Mei

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Prototyping the Next Generation of Aria

Clausen, Jonathan; Brunini, Victor; Forster, Christopher J.; Noble, David R.; Trott, Christian R.; Hammond, Simon; Hoemmen, Mark F.; Lin, Paul T.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

SST Tutorial 2017 - Juno Example Processor

Hammond, Simon

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Sandia's ARM-centric Co-Design Strategy

Ang, James A.; Hammond, Simon; Hoekstra, Robert J.; Foulk, James W.; Rodrigues, Arun

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Double Buffering for MCDRAM on Second Generation Intel Xeon Phi Processors with OpenMP

Olivier, Stephen L.; Hammond, Simon; Duran, Alejandro

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Structural Simulation Toolkit (SST)

Rodrigues, Arun; Moore, Branden J.; Hammond, Simon; Hemmert, Karl S.; Voskuilen, Gwendolyn R.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Optimizing the Performance of Sparse-Matrix Vector Products on Next-Generation Processors

Hammond, Simon; Trott, Christian R.

Matrix-vector products are ubiquitous in high-performance scientific applications and have a growing set of occurrences in advanced data analysis activities. Achieving high performance for these kernels is therefore paramount, in part, because these operations can consume vast amounts of application execution time. In this report we document the development of several sparse-matrix vector product kernel implementations using a variety of programming models and approaches. Each kernel is run on a broad set of matrices selected to demonstrate the wide variety of matrix structure and sparsity that is possible with a single, generic kernel. For benchmarking and performance analysis, we utilize leading computing architectures for the NNSA/ASC program including Intel's Knights Landing processor and IBM's POWER8.

TYPE SAND Report YEAR 2017

DOI OSTI

Runtime collection and analysis of system metrics for production monitoring of Trinity Phase II

Deconinck, Adam; Nam, Hai A.; Mortin, Dave; Bonnie, Amanda; Lueninghoener, Cory; Brandt, James M.; Gentile, Ann C.; Foulk, James W.; Agelastos, Anthony M.; Vaughan, Courtenay T.; Hammond, Simon; Allan, Benjamin A.; Davis, Michael; Repik, Jason J.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

ASC PI Meeting Panel Co-design for Exascale

Hoekstra, Robert J.; Hammond, Simon; Richards, David; Mccormick, Patrick

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Runtime collection and analysis of system metrics for production monitoring of Trinity Phase II (Paper)

Deconinck, Adam; Nam, Hai A.; Morton, David; Bonnie, Amanda; Lueninghoener, Cory; Brandt, James M.; Gentile, Ann C.; Foulk, James W.; Agelastos, Anthony M.; Vaughan, Courtenay T.; Hammond, Simon; Allan, Benjamin A.; Davis, Mike; Repik, Jason J.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Next Generation Science Applications for the Next Generation of Supercomputing

Vaughan, Courtenay T.; Hammond, Simon; Dinge, Dennis; Lin, Paul T.; Pase, Douglas M.; Cook, Jeanine; Trott, Christian R.; Hughes, Clayton; Hoekstra, Robert J.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Next Generation Science Applications for the Next Generation of Supercomputing

Vaughan, Courtenay T.; Hammond, Simon; Dinge, Dennis; Lin, Paul T.; Pase, Douglas M.; Trott, Christian R.; Cook, Jeanine; Hughes, Clayton; Hoekstra, Robert J.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Two-level main memory co-design: Multi-threaded algorithmic primitives, analysis, and simulation

Journal of Parallel and Distributed Computing

Berry, Jonathan; Bender, Michael A.; Hammond, Simon; Hemmert, Karl S.; Mccauley, Samuel; Moore, Branden J.; Moseley, Benjamin; Phillips, Cynthia A.; Resnick, David R.; Rodrigues, Arun

A challenge in computer architecture is that processors often cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. With this motivation and the realization that traditional architectures (with all DRAM reachable only via bus) are insufficient to feed groups of modern processing units, vendors have introduced a variety of non-DDR 3D memory technologies (Hybrid Memory Cube (HMC), Wide I/O 2, High Bandwidth Memory (HBM)). These offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. We will call these solutions “near-memory,” and if user-addressable, “scratchpad.” High-performance systems on the market now offer two levels of main memory: near-memory on package and traditional DRAM further away. In the near term we expect the latencies of near-memory and DRAM to be similar. Thus, it is natural to think of near-memory as another module on the DRAM level of the memory hierarchy. Vendors are expected to offer modes in which the near-memory is used as cache, but we believe that this will be inefficient. In this paper, we explore the design space for a user-controlled multi-level main memory. Our work identifies situations in which rewriting application kernels can provide significant performance gains when using near-memory. We present algorithms designed for two-level main memory, using divide-and-conquer to partition computations and streaming to exploit data locality. We consider algorithms for the fundamental application of sorting and for the data analysis kernel k-means. Our algorithms asymptotically reduce memory-block transfers under certain architectural parameter settings. We use and extend Sandia National Laboratories’ SST simulation capability to demonstrate the relationship between increased bandwidth and improved algorithmic performance. Memory access counts from simulations corroborate predicted performance improvements for our sorting algorithm. In contrast, the k-means algorithm is generally CPU bound and does not improve when using near-memory except under extreme conditions. These conditions require large instances that rule out SST simulation, but we demonstrate improvements by running on a customized machine with high and low bandwidth memory. These case studies in co-design serve as positive and cautionary templates, respectively, for the major task of optimizing the computational kernels of many fundamental applications for two-level main memory systems.

TYPE Journal Article YEAR 2017

DOI OSTI Scopus
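
The divide-and-conquer plus streaming idea in the abstract above can be sketched for sorting. This is not the paper's algorithm, whose details the abstract does not give. It is a minimal chunk-then-merge illustration in which `scratchpad_capacity` is a hypothetical parameter standing in for the number of elements that fit in near-memory.

```python
import heapq

def two_level_sort(data, scratchpad_capacity):
    """Sort by staging scratchpad-sized chunks, then streaming a merge.

    Phase 1 (divide): each chunk is small enough to be sorted entirely
    inside the fast memory, so its random accesses never touch far memory.
    Phase 2 (stream): the sorted runs are merged with sequential reads.
    """
    if scratchpad_capacity < 1:
        raise ValueError("scratchpad_capacity must be at least 1")
    runs = []
    for start in range(0, len(data), scratchpad_capacity):
        chunk = list(data[start:start + scratchpad_capacity])  # stage in
        chunk.sort()                                           # work in fast memory
        runs.append(chunk)                                     # write the run back
    # heapq.merge consumes each sorted run front to back, one item at a time.
    return list(heapq.merge(*runs))

values = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
sorted_values = two_level_sort(values, scratchpad_capacity=4)
```

In Python the two memory levels are only notional. The sketch shows the access pattern, with random access confined to one chunk and sequential access everywhere else, rather than any measured speedup.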

Bowman and a Path to Trinity

Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

The Impact of Increasing Memory System Diversity on Applications

Voskuilen, Gwendolyn R.; Rodrigues, Arun; Frank, Michael P.; Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Revisiting Online Autotuning for Sparse-Matrix Vector Multiplication Kernels on High-Performance Accelerators

Garcia De Gonzalo, Simon; Hwu, Wen-Mei; Hammond, Simon; Trott, Christian R.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Messier: A Detailed NVM-Based DIMM Model for the SST Simulation Framework

Awad, Amro; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hammond, Simon; Hoekstra, Robert J.; Hughes, Clayton

DRAM technology is the main building block of main memory; however, DRAM scaling is becoming very challenging. The main issues for DRAM scaling are the increasing error rates with each new generation, the geometric and physical constraints of scaling the capacitor part of the DRAM cells, and the high power consumption caused by the continuous need for refreshing cell values. At the same time, Non-Volatile Memory (NVM) technologies, such as Phase-Change Memory (PCM), are emerging as promising replacements for DRAM. NVMs, when compared to current technologies, e.g., NAND-based flash, have latencies comparable to DRAM. Additionally, NVMs are non-volatile, which eliminates the need for refresh power and enables persistent memory applications. Finally, NVMs have promising densities and the potential for multi-level cell (MLC) storage.

TYPE SAND Report YEAR 2017

DOI OSTI

KokkosKernels: Compact Layouts for Batched BLAS and Sparse Matrix-Matrix Multiply

Rajamanickam, Sivasankaran; Bradley, Andrew M.; Kim, Kyungjoo; Deveci, Mehmet; Trott, Christian R.; Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Evaluating Production Engineering Application Performance on the NNSA Trinity Advanced Technology System

Vaughan, Courtenay T.; Dinge, Dennis; Lin, Paul T.; Hammond, Simon; Pase, Douglas M.; Benner, Douglas E.; Cook, Jeanine; Hoekstra, Robert J.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Preparing Sandia's Application Portfolio for the Future Using Kokkos

Trott, Christian R.; Edwards, Harold C.; Hammond, Simon; Sunderland, Daniel

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Codesign for Production Applications

Hammond, Simon; Trott, Christian R.; Vaughan, Courtenay T.; Dinge, Dennis; Lin, Paul T.; Pase, Douglas M.; Benner, Robert E.; Cook, Jeanine; Hoekstra, Robert J.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Double buffering for MCDRAM on second generation Intel® Xeon Phi™ processors with OpenMP

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Olivier, Stephen L.; Hammond, Simon; Duran, Alejandro

Emerging novel architectures for shared memory parallel computing are incorporating increasingly creative innovations to deliver higher memory performance. A notable exemplar of this phenomenon is the Multi-Channel DRAM (MCDRAM) that is included in the Intel® Xeon Phi™ processors. In this paper, we examine techniques to use OpenMP to exploit the high bandwidth of MCDRAM by staging data. In particular, we implement double buffering using OpenMP sections and tasks to explicitly manage movement of data into MCDRAM. We compare our double-buffered approach to a non-buffered implementation and to Intel’s cache mode, in which the system manages the MCDRAM as a transparent cache. We also demonstrate the sensitivity of performance to parameters such as dataset size and the distribution of threads between compute and copy operations.

TYPE Conference Poster YEAR 2017

OSTI Scopus
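
The double-buffering pattern described in the abstract above can be sketched as follows. The paper implements it with OpenMP sections and tasks. This sketch swaps in a Python worker thread purely to show the control flow, and the function `double_buffered_sum_of_squares` and its `stage` helper are hypothetical names. The staged copy stands in for a copy into MCDRAM; no real memory placement happens here.

```python
from concurrent.futures import ThreadPoolExecutor

def double_buffered_sum_of_squares(source, chunk_size):
    """Compute on one buffer while the next chunk is staged in the background."""

    def stage(start):
        # Stand-in for copying a chunk into the fast buffer.
        return list(source[start:start + chunk_size])

    total = 0
    with ThreadPoolExecutor(max_workers=1) as copier:
        # Prime the pipeline with the first chunk.
        pending = copier.submit(stage, 0) if len(source) else None
        for start in range(0, len(source), chunk_size):
            current = pending.result()        # wait for the staged buffer
            nxt = start + chunk_size
            # Start copying the next chunk before computing on this one.
            pending = copier.submit(stage, nxt) if nxt < len(source) else None
            total += sum(x * x for x in current)
    return total

result = double_buffered_sum_of_squares(list(range(10)), chunk_size=3)
```

The split between one copy worker and one compute loop mirrors the thread-distribution parameter the abstract mentions. In a real implementation that ratio would be tuned rather than fixed.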

Designing Vector-Friendly Compact BLAS and LAPACK Kernels

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Kim, Kyungjoo; Costa, Timothy B.; Deveci, Mehmet; Bradley, Andrew M.; Hammond, Simon; Guney, Murat E.; Knepper, Sarah; Story, Shane; Rajamanickam, Sivasankaran

Many applications, such as PDE-based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14x, 45x, and 27x speedup against OpenMP loops around optimized DGEMM, DTRSM and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus