Publications Search

The Portals 4.3 Network Programming Interface

Schonbein, William W.; Barrett, Brian W.; Brightwell, Ronald B.; Grant, Ryan E.; Hemmert, Karl S.; Foulk, James W.; Underwood, Keith; Riesen, Rolf; Hoefler, Torsten; Barbe, Mathieu; Suraty Filho, Luiz H.; Ratchov, Alexandre; Maccabe, Arthur B.

This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

More Details

TYPE SAND Report YEAR 2022

DOI OSTI

HIHE01-36: Evaluate how various topologies perform in the context of link failures [Slides]

Hemmert, Karl S.; Kenny, Joseph

This study looks at the effect that failed links have on the throughput of HPC systems: Which workloads are most affected? How many links must be down before the throughput of the machine is noticeably affected?

More Details

TYPE Other Report YEAR 2021

DOI OSTI

Computational Offload with BlueField Smart NICs

Karamati, Sara; Young, Jeffrey; Conte, Tom; Hemmert, Karl S.; Grant, Ryan; Hughes, Clayton; Vuduc, Rich

The recent introduction of a new generation of "smart NICs" has provided new accelerator platforms that include CPU cores or reconfigurable fabric in addition to traditional networking hardware and packet offloading capabilities. While there are currently several proposals for using these smart NICs for low-latency, in-line packet processing operations, there remains a gap in knowledge as to how they might be used as computational accelerators for traditional high-performance applications. This work aims to look at benchmarks and mini-applications to evaluate possible benefits of using a smart NIC as a compute accelerator for HPC applications. We investigate NVIDIA's current-generation BlueField-2 card, which includes eight Arm CPUs along with a small amount of storage, and we test the networking and data movement performance of these cards compared to a standard Intel server host. We then detail how two different applications, YASK and miniMD, can be modified to make more efficient use of the BlueField-2 device, with a focus on overlapping computation and communication for operations like neighbor building and halo exchanges. Our results show that while the overall compute performance of these devices is limited, using them with a modified miniMD algorithm allows for potential speedups of 5 to 20% over the host CPU baseline with no loss in simulation accuracy.

More Details

TYPE SAND Report YEAR 2021

DOI OSTI

A-SST Initial Specification

Rodrigues, Arun; Hammond, Simon; Hemmert, Karl S.; Hughes, Clayton; Kenny, Joseph; Voskuilen, Gwendolyn R.

The U.S. Army Research Office (ARO), in partnership with IARPA, is investigating innovative, efficient, and scalable computer architectures that are capable of executing next-generation large-scale data-analytic applications. These applications are increasingly sparse, unstructured, non-local, and heterogeneous. Under the Advanced Graphic Intelligence Logical computing Environment (AGILE) program, Performer teams will be asked to design computer architectures to meet the future needs of the DoD and the Intelligence Community (IC). This design effort will require flexible, scalable, and detailed simulation to assess the performance, efficiency, and validity of their designs. To support AGILE, Sandia National Labs will be providing the AGILE-enhanced Structural Simulation Toolkit (A-SST). This toolkit is a computer architecture simulation framework designed to support fast, parallel, and multi-scale simulation of novel architectures. This document describes the A-SST framework, some of its library of simulation models, and how it may be used by AGILE Performers.

More Details

TYPE SAND Report YEAR 2021

DOI OSTI

Towards an Extensible Framework for Accelerated System Simulation

Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hughes, Clayton; Hemmert, Karl S.; Hammond, Simon

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2021

DOI OSTI

SST-Explorer: Enabling System-level Performance and Reliability Analysis for Designs with Real-World IPs

Rodrigues, Arun; Awad, Amro; Hughes, Clayton; Agarwal, Sapan; Skoufis, Michael; Voskuilen, Gwendolyn R.; Nema, Shubham; Razdan, Rohin; Gardner, Alan; Hemmert, Karl S.; Hammond, Simon

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2021

DOI OSTI

ERAS: Enabling the Integration of Real-World Intellectual Properties (IPs) in Architectural Simulators

Nema, Shubham; Razdan, Rohin; Rodrigues, Arun; Hemmert, Karl S.; Voskuilen, Gwendolyn R.; Adak, Debratim; Hammond, Simon; Awad, Amro; Hughes, Clayton

Sandia National Laboratories is investigating scalable architectural simulation capabilities with a focus on simulating and evaluating highly scalable supercomputers for high performance computing applications. There is a growing demand for RTL model integration to provide the capability to simulate customized node architectures and heterogeneous systems. This report describes the first steps toward integrating the ESSENTial Signal Simulation Enabled by Netlist Transforms (ESSENT) tool with the Structural Simulation Toolkit (SST). ESSENT can emit C++ models from models written in FIRRTL to automatically generate components. The integration workflow will automatically generate the SST component and the necessary interfaces to 'plug' the ESSENT model into the SST framework.

More Details

TYPE SAND Report YEAR 2021

DOI OSTI

SST-Explorer: Enabling System-level Performance and Reliability Analysis for Designs with Real-World IPs

Rodrigues, Arun; Awad, Amro; Hughes, Clayton; Agarwal, Sapan; Skoufis, Michael; Voskuilen, Gwendolyn R.; Nema, Shubham; Razdan, Rohin; Gardner, Alan; Hemmert, Karl S.; Hammond, Simon

Abstract not provided.

More Details

TYPE Conference Presentation YEAR 2021

DOI OSTI

Evaluating Trade-offs in Potential Exascale Interconnect Technologies

Hemmert, Karl S.; Bair, Ray; Bhatele, Abhinav; Groves, Taylor; Jain, Nikhil; Lewis, Cannada; Mubarak, Misbah; Pakin, Scott D.; Ross, Robert; Wilke, Jeremiah

This report details work to study trade-offs in topology and network bandwidth for potential interconnects in the exascale (2021-2022) timeframe. The work was done using multiple interconnect models across two parallel discrete event simulators. Results from each independent simulator are shown and discussed and the areas of agreement and disagreement are explored.

More Details

TYPE Other Report YEAR 2020

DOI OSTI

Developing SST Element Libraries

Voskuilen, Gwendolyn R.; Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2020

OSTI

The Exascale Computing Project: Hardware Evaluation for Interconnects

Hemmert, Karl S.; Wilke, Jeremiah; Ross, Rob; Groves, Taylor; Karlin, Ian

Abstract not provided.

More Details

TYPE Presentation YEAR 2020

OSTI

Abstract Machine Models and Proxy Architectures for Exascale Computing

Ang, James A.; Barrett, Richard F.; Benner, Robert E.; Burke, Daniel; Chan, Cy; Cook, Jeanine; Daley, Christopher S.; Donofrio, David; Hammond, Simon; Hemmert, Karl S.; Hoekstra, Robert J.; Ibrahim, Khaled; Kelly, Suzanne M.; Le, Hoang; Leung, Vitus J.; Michelogiannakis, George; Resnick, David R.; Rodrigues, Arun; Shalf, John; Stark, Dylan; Unat, D.; Wright, Nick J.; Voskuilen, Gwendolyn R.

To achieve exascale computing, fundamental hardware architectures must change. The most significant consequence of this assertion is the impact on the scientific and engineering applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. In order to adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency into the future. While many details of the exascale architectures are undefined, an abstract machine model is designed to allow application developers to focus on the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. We use the term proxy architecture to describe a parameterized version of an abstract machine model, with the parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models are formulated to enable discussion between the developers of analytic models and simulators and computer hardware architects. They allow for application performance analysis and hardware optimization opportunities. In this report our goal is to provide the application development community with a set of models that can help software developers prepare for exascale. In addition, through the use of proxy architectures, we can enable a more concrete exploration of how well new and evolving application codes map onto future architectures. This second version of the document addresses system scale considerations and provides a system-level abstract machine model with proxy architecture information.

More Details

TYPE SAND Report YEAR 2019

DOI OSTI

ASC CSSE Milestone 6812: SST-GPGPU

Hughes, Clayton; Hammond, Simon; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

ASC CSSE Level 2 Milestone Briefing: SST-GPU

Hughes, Clayton; Hammond, Simon; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

Towards Lightweight and Scalable Simulation of Large-Scale OpenSHMEM Applications

Levenhagen, Michael; Hammond, Simon; Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

ECP HE Node Simulation - SNL

Hughes, Clayton; Rodrigues, Arun; Voskuilen, Gwendolyn R.; Hemmert, Karl S.; Hammond, Simon; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

The Portals 4.2 Network Programming Interface

Barrett, Brian W.; Brightwell, Ronald B.; Grant, Ryan; Hemmert, Karl S.; Foulk, James W.; Wheeler, Kyle; Riesen, Rolf; Hoefler, Torsten; Maccabe, Arthur B.; Hudson, Trammell

This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

More Details

TYPE SAND Report YEAR 2018

DOI OSTI

Towards Lightweight and Scalable Simulation of Large-Scale OpenSHMEM Applications

Levenhagen, Michael; Hammond, Simon; Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Structural Simulation Toolkit (SST) Tutorial

Hammond, Simon; Rodrigues, Arun; Voskuilen, Gwendolyn R.; Hemmert, Karl S.; Levenhagen, Michael; Hughes, Clayton; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Trinity: Opportunities and Challenges of a Heterogeneous System

Hemmert, Karl S.; Moore, Stan G.; Gallis, Michael A.; Davis, Mike E.; Levesque, John; Hjelm, Nathan; Lujan, James; Morton, David; Nam, Hai A.; Parga, Alex; Peltz Jr., Paul; Shipman, Galen; Torrez, Alfred

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Merlin Element Library Deep Dive

Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Trinity Update: Open Science Burst Buffers Intel Xeon Phi Processor Plans

Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Interconnect Working Group

Hemmert, Karl S.; Bair, Ray; Bhatele, Abhinav; Groves, Taylor; Hammond, Simon; Jain, Nikhil; Levenhagen, Michael; Mubarak, Misbah; Pakin, Scott; Ross, Rob; Wilke, Jeremiah

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

SST Simulation Framework (and Complex Memory)

Hammond, Simon; Hughes, Clayton; Awad, Amro; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hemmert, Karl S.; Levenhagen, Michael; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Analyzing Exascale Memory Architectures Using the SST Toolkit

Hughes, Clayton; Awad, Amro; Hammond, Simon; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Towards a Scalable Integrated Simulation Framework for Extreme Heterogeneity in High Performance Computing

Hammond, Simon; Rodrigues, Arun; Hemmert, Karl S.; Voskuilen, Gwendolyn R.; Hughes, Clayton; Levenhagen, Michael; Hoekstra, Robert J.; Ang, James A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Final Review of FY17 ASC CSSE L2 Milestone #6018 entitled "Analyzing Power Usage Characteristics of Workloads Running on Trinity"

Hoekstra, Robert J.; Hammond, Simon; Hemmert, Karl S.; Gentile, Ann C.; Oldfield, Ron; Lang, Mike; Martin, Steve

The presentation documented the technical approach of the team and summary of the results with sufficient detail to demonstrate both the value and the completion of the milestone. A separate SAND report was also generated with more detail to supplement the presentation.

More Details

TYPE Other Report YEAR 2017

DOI OSTI

Vanguard: Maturing the ARM Software Ecosystem for U.S. DOE Supercomputing

Foulk, James W.; Foulk, James W.; Grant, Ryan; Hammond, Simon; Hemmert, Karl S.; Martinez, David; Noe, John P.; Foulk, James W.; Ward, Harry L.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Sandia's ARM-centric Co-Design Strategy: Introduction to the NNSA/ASC Vanguard Project

Ang, James A.; Brightwell, Ronald B.; Hammond, Simon; Hemmert, Karl S.; Hoekstra, Robert J.; Foulk, James W.; Foulk, James W.; Rodrigues, Arun

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Trinity Architecture

Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2017

OSTI

Performance Analysis for Using Non-Volatile Memory DIMMs: Opportunities and Challenges

Awad, Amro; Hammond, Simon; Hughes, Clayton; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI

Unveiling the Interplay between Global Link Arrangements and Network Management Algorithms on Dragonfly Networks

Proceedings - 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017

Kaplan, Fulya; Tuncer, Ozan; Leung, Vitus J.; Hemmert, Karl S.; Coskun, Ayse K.

Network messaging delay historically constitutes a large portion of the wall-clock time for High Performance Computing (HPC) applications, as these applications run on many nodes and involve intensive communication among their tasks. Dragonfly network topology has emerged as a promising solution for building exascale HPC systems owing to its low network diameter and large bisection bandwidth. Dragonfly includes local links that form groups and global links that connect these groups via high bandwidth optical links. Many aspects of the dragonfly network design are yet to be explored, such as the performance impact of the connectivity of the global links, i.e., global link arrangements, the bandwidth of the local and global links, or the job allocation algorithm. This paper first introduces a packet-level simulation framework to model the performance of HPC applications in detail. The proposed framework is able to simulate known MPI (message passing interface) routines as well as applications with custom-defined communication patterns for a given job placement algorithm and network topology. Using this simulation framework, we investigate the coupling between global link bandwidth and arrangements, communication pattern and intensity, job allocation and task mapping algorithms, and routing mechanisms in dragonfly topologies. We demonstrate that by choosing the right combination of system settings and workload allocation algorithms, communication overhead can be decreased by up to 44%. We also show that circulant arrangement provides up to 15% higher bisection bandwidth compared to the other arrangements, but for realistic workloads, the performance impact of link arrangements is less than 3%.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Structural Simulation Toolkit (SST)

Rodrigues, Arun; Moore, Branden J.; Hammond, Simon; Hemmert, Karl S.; Voskuilen, Gwendolyn R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Two-level main memory co-design: Multi-threaded algorithmic primitives, analysis, and simulation

Journal of Parallel and Distributed Computing

Berry, Jonathan; Bender, Michael A.; Hammond, Simon; Hemmert, Karl S.; Mccauley, Samuel; Moore, Branden J.; Moseley, Benjamin; Phillips, Cynthia A.; Resnick, David R.; Rodrigues, Arun

A challenge in computer architecture is that processors often cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. With this motivation and the realization that traditional architectures (with all DRAM reachable only via bus) are insufficient to feed groups of modern processing units, vendors have introduced a variety of non-DDR 3D memory technologies (Hybrid Memory Cube (HMC), Wide I/O 2, High Bandwidth Memory (HBM)). These offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. We will call these solutions “near-memory,” and if user-addressable, “scratchpad.” High-performance systems on the market now offer two levels of main memory: near-memory on package and traditional DRAM further away. In the near term we expect the latencies of near-memory and DRAM to be similar. Thus, it is natural to think of near-memory as another module on the DRAM level of the memory hierarchy. Vendors are expected to offer modes in which the near memory is used as cache, but we believe that this will be inefficient. In this paper, we explore the design space for a user-controlled multi-level main memory. Our work identifies situations in which rewriting application kernels can provide significant performance gains when using near-memory. We present algorithms designed for two-level main memory, using divide-and-conquer to partition computations and streaming to exploit data locality. We consider algorithms for the fundamental application of sorting and for the data analysis kernel k-means. Our algorithms asymptotically reduce memory-block transfers under certain architectural parameter settings. We use and extend Sandia National Laboratories’ SST simulation capability to demonstrate the relationship between increased bandwidth and improved algorithmic performance. Memory access counts from simulations corroborate predicted performance improvements for our sorting algorithm. In contrast, the k-means algorithm is generally CPU bound and does not improve when using near-memory except under extreme conditions. These conditions require large instances that rule out SST simulation, but we demonstrate improvements by running on a customized machine with high and low bandwidth memory. These case studies in co-design serve as positive and cautionary templates, respectively, for the major task of optimizing the computational kernels of many fundamental applications for two-level main memory systems.

More Details

TYPE Journal Article YEAR 2017

DOI OSTI Scopus

The Portals 4.1 Network Programming Interface

Barrett, Brian; Brightwell, Ronald B.; Grant, Ryan; Hemmert, Karl S.; Foulk, James W.; Wheeler, Kyle; Underwood, Keith D.; Riesen, Rolf; Maccabe, Arthur B.; Hudson, Trammell

This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

More Details

TYPE SAND Report YEAR 2017

DOI OSTI

SST Update

Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

HT Breakout Discussions Modeling and Performance Tools

Hemmert, Karl S.; Schulz, Martin

Abstract not provided.

More Details

TYPE Presentation YEAR 2017

OSTI

Unveiling the Interplay Between Global Link Arrangements and Network Management Algorithms on Dragonfly Networks

Kaplan, Fulya; Tuncer, Ozan; Leung, Vitus J.; Hemmert, Karl S.; Coskun, Ayse K.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI

Stalled Active and Idle (SAI): Characterizing Large-scale Dragonfly Networks

Groves, Taylor L.; Hammond, Simon; Hemmert, Karl S.; Grant, Ryan; Levenhagen, Michael; Arnold, Dorian

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

(SAI) Stalled Active and Idle: Characterizing Power and Performance of Large-Scale Dragonfly Networks

Groves, Taylor L.; Grant, Ryan; Hemmert, Karl S.; Hammond, Simon; Levenhagen, Michael; Arnold, Dorian

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI

Trinity: Architecture and Early Experience

Hemmert, Karl S.; Rajan, Mahesh; Hoekstra, Robert J.; Dawson, Shawn; Vigil, Manuel; Grunau, Daryl; Lujan, James; Morton, David; Nam, Hai A.; Peltz Jr., Paul; Torrez, Alfred; Wright, Cornell; Glass, Micheal W.; Hammond, Simon

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Trinity: Architecture and Early Experience

Hemmert, Karl S.; Rajan, Mahesh; Hoekstra, Robert J.; Dawson, Shawn; Vigil, Manuel; Grunau, Daryl; Lujan, James; Morton, David; Nam, Hai A.; Peltz Jr., Paul; Torrez, Alfred; Wright, Cornell

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Miranda: a lightweight processor

Voskuilen, Gwendolyn R.; Moore, Branden J.; Rodrigues, Arun; Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Structural Simulation Toolkit (SST)

Rodrigues, Arun; Voskuilen, Gwendolyn R.; Hammond, Simon; Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Multi-Level Memory: The Next Opportunity for Performance?

Hammond, Simon; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hemmert, Karl S.; Trott, Christian R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Performance and Modeling Tools (SST & PerfMiner)

Rodrigues, Arun; Cook, Jeanine; Hammond, Simon; Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Two-level main memory co-design: Multi-threaded algorithmic primitives, analysis, and simulation

Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2015

Bender, Michael A.; Berry, Jonathan; Hammond, Simon; Hemmert, Karl S.; Mccauley, Samuel; Moore, Branden J.; Moseley, Benjamin; Phillips, Cynthia A.; Resnick, David R.; Rodrigues, Arun

A fundamental challenge for supercomputer architecture is that processors cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. As the number of cores per chip increases, and traditional DDR DRAM speeds stagnate, the problem is only getting worse. A variety of non-DDR 3D memory technologies (Wide I/O 2, HBM) offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. However, such a packaging scheme cannot contain sufficient memory capacity for a node. It seems likely that future systems will require at least two levels of main memory: high-bandwidth, low-power memory near the processor and low-bandwidth high-capacity memory further away. This near memory will probably not have significantly faster latency than the far memory. This, combined with the large size of the near memory (multiple GB) and power constraints, may make it difficult to treat it as a standard cache. In this paper, we explore some of the design space for a user-controlled multi-level main memory. We present algorithms designed for the heterogeneous bandwidth, using streaming to exploit data locality. We consider algorithms for the fundamental application of sorting. Our algorithms asymptotically reduce memory-block transfers under certain architectural parameter settings. We use and extend Sandia National Laboratories' SST simulation capability to demonstrate the relationship between increased bandwidth and improved algorithmic performance. Memory access counts from simulations corroborate predicted performance. This co-design effort suggests implementing two-level main memory systems may improve memory performance in fundamental applications.

More Details

TYPE Conference Poster YEAR 2015

DOI OSTI Scopus

Structural Simulation Toolkit. Lunch & Learn

Moore, Branden J.; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hammond, Simon; Hemmert, Karl S.

This is a presentation outlining a lunch and learn lecture for the Structural Simulation Toolkit, supported by Sandia National Laboratories.

More Details

TYPE Other Report YEAR 2015

DOI OSTI

ASCR Computer Architecture Laboratory

Hammond, Simon; Ang, James A.; Rodrigues, Arun; Hemmert, Karl S.; Voskuilen, Gwendolyn R.; Cook, Jeanine

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Trinity Platform Overview

Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Ember: Reference Communication Patterns for Exascale

Hammond, Simon; Hemmert, Karl S.; Levenhagen, Michael; Rodrigues, Arun; Voskuilen, Gwendolyn R.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Structural Simulation Toolkit

Voskuilen, Gwendolyn R.; Hammond, Simon; Rodrigues, Arun; Moore, Branden J.; Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Sandia's Open Source Co-Design Capabilities

Ang, James A.; Foulk, James W.; Hemmert, Karl S.; Hammond, Simon; Rodrigues, Arun

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

DOE's Fast Forward and Design Forward R&D Projects: Influence Exascale Hardware

Ang, James A.; Hammond, Simon; Hemmert, Karl S.; Foulk, James W.

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

An evaluation of MPI message rate on hybrid-core processors

International Journal of High Performance Computing Applications

Brightwell, Ronald B.; Barrett, Brian W.; Grant, Ryan; Hammond, Simon; Hemmert, Karl S.

Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that may combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way in which compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insight into how MPI implementations for future hybrid-core processors should be designed.

More Details

TYPE Journal Article YEAR 2014

DOI OSTI Scopus

The Structural Simulation Toolkit

Rodrigues, Arun; Moore, Branden J.; Hammond, Simon; Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

DOI OSTI

Abstract machine models and proxy architectures for exascale computing

Ang, James A.; Barrett, Richard F.; Benner, Robert E.; Burke, D.; Chan, C.; Donofrio, David; Hammond, Simon; Hemmert, Karl S.; Kelly, Suzanne M.; Le, H.; Leung, Vitus J.; Resnick, David R.; Rodrigues, Arun; Shalf, John; Stark, Dylan T.; Unat, Didem; Wright, N.J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2014

DOI OSTI

The Sandia Co-design Ecosystem

Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

Using a complementary emulation-simulation co-design approach to assess application readiness for Processing-in-Memory systems

Proceedings of Co-HPC 2014: 1st International Workshop on Hardware-Software Co-Design for High Performance Computing - Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis

Stelle, George W.; Olivier, Stephen L.; Stark, Dylan T.; Rodrigues, Arun; Hemmert, Karl S.

Disruptive changes to computer architecture are paving the way toward extreme scale computing. The co-design strategy of collaborative research and development among computer architects, system software designers, and application teams can help to ensure that applications not only cope but thrive with these changes. In this paper, we present a novel combined co-design approach of emulation and simulation in the context of investigating future Processing in Memory (PIM) architectures. PIM enables co-location of data and computation to decrease data movement, to provide increases in memory speed and capacity compared to existing technologies and, perhaps most importantly for extreme scale, to improve energy efficiency. Our evaluation of PIM focuses on three mini-applications representing important production applications. The emulation and simulation studies examine the effects of locality-aware versus locality-oblivious data distribution and computation, and they compare PIM to conventional architectures. Both studies contribute in their own way to the overall understanding of the application-architecture interactions, and our results suggest that PIM technology shows great potential for efficient computation without negatively impacting productivity.

TYPE Conference Poster YEAR 2014

DOI OSTI Scopus

XGC Overview

Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2013

OSTI

Extreme-scale Computing Grand Challenge LDRD (XGC)

Hemmert, Karl S.; Barrett, Brian; Barrett, Richard F.; Lentine, Anthony L.; Rodrigues, Arun; Denton-Hill, Kim M.

Abstract not provided.

TYPE Presentation YEAR 2013

OSTI

The Impact of Hybrid-Core Processors on MPI Message Rate

Barrett, Brian; Brightwell, Ronald B.; Hemmert, Karl S.; Hammond, Simon

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

The portals 4.0.1 network programming interface

Barrett, Brian; Brightwell, Ronald B.; Pedretti, Kevin; Hemmert, Karl S.

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

TYPE SAND Report YEAR 2013

DOI OSTI

Interconnect Challenges at Exascale

Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2013

OSTI

Using the Cray Gemini Performance Counters

Pedretti, Kevin; Vaughan, Courtenay T.; Barrett, Richard F.; Devine, Karen; Hemmert, Karl S.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

The impact of hybrid-core processors on MPI message rate

ACM International Conference Proceeding Series

Barrett, Brian; Brightwell, Ronald B.; Hammond, Simon; Hemmert, Karl S.

Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insight into how MPI implementations for future hybrid-core processors should be designed.

TYPE Conference YEAR 2013

OSTI Scopus

Using the Cray Gemini Performance Counters

Pedretti, Kevin; Vaughan, Courtenay T.; Hemmert, Karl S.; Barrett, Richard F.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

The Structural Simulation Toolkit

Proposed for publication in SIGMETRICS Performance Evaluation Review.

Rodrigues, Arun; Hemmert, Karl S.; Barrett, Brian; Oldfield, Ron

Abstract not provided.

TYPE Journal Article YEAR 2012

OSTI

The Portals 4.0 Network Programming Interface

Brightwell, Ronald B.; Pedretti, Kevin; Wheeler, Kyle B.; Hemmert, Karl S.; Barrett, Brian

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia’s Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

TYPE SAND Report YEAR 2012

DOI OSTI

Portals 4 Network Programming Interface

Barrett, Brian; Brightwell, Ronald B.; Hemmert, Karl S.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Tyranny of Benchmarks. Past Present and Future Challenges in HPC

Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

XGC Status Overview

Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Extreme-Scale Computing Grand Challenge (XGC) Energy-Efficient Data Movement for Next-Generation Computing

Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Exascale Design Space Exploration and Co-design

Proposed for publication in Future Generation Computer Systems.

Barrett, Richard F.; Trucano, Timothy G.; Doerfler, Douglas W.; Dosanjh, Sudip S.; Hammond, Simon; Hemmert, Karl S.; Heroux, Michael A.; Lin, Paul T.; Pedretti, Kevin P.; Rodrigues, Arun

Abstract not provided.

TYPE Journal Article YEAR 2012

OSTI

Overview of the XGC project

Shinde, Subhash L.; Ang, James A.; Barrett, Brian; Barrett, Richard F.; Denton-Hill, Kim M.; Lentine, Anthony L.; Murphy, Richard C.; Rodrigues, Arun; Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Co-design in the Tri-lab Networking Environment

Hemmert, Karl S.; Naegle, John H.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Early Results from the ACES Interconnection Network Project

Hemmert, Karl S.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Demonstration of a Legacy Application's Path to Exascale - ASC L2 Milestone 4467

Barrett, Brian; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Improvements to the Structural Simulation Toolkit

Rodrigues, Arun; Leung, Vitus J.; Levenhagen, Michael; Ferreira, Kurt; Hemmert, Karl S.; Barrett, Brian

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Report of experiments and evidence for ASC L2 milestone 4467: demonstration of a legacy application's path to exascale

Barrett, Brian; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron; Pedretti, Kevin T.T.; Rodrigues, Arun; Barrett, Richard F.; Thompson, David; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels, allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467, Demonstration of a Legacy Application's Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE R&D in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight kernels (LWKs), virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications, Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.

TYPE SAND Report YEAR 2012

DOI OSTI

Exascale Computing and the Role of Co-Design

Proposed for publication in Advances in Parallel Computing.

Ang, James A.; Brightwell, Ronald B.; Dosanjh, Sudip S.; Hemmert, Karl S.; Rodrigues, Arun

Abstract not provided.

TYPE Journal Article YEAR 2012

OSTI

Trinity Architecture & Design

Doerfler, Douglas W.; Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2011

OSTI

Enabling flexible collective communication offload with triggered operations

Proceedings - Symposium on the High Performance Interconnects, Hot Interconnects

Underwood, Keith D.; Coffman, Jerrie; Larsen, Roy; Hemmert, Karl S.; Barrett, Brian W.; Brightwell, Ronald B.; Levenhagen, Michael

Low-latency collective communications are key to application scalability. As systems grow larger, minimizing collective communication time becomes increasingly challenging. Offload is an effective technique for accelerating collective operations; however, algorithms for collective communication constantly evolve such that flexible implementations are critical. This paper presents triggered operations, a semantic building block that allows the key components of collective communications to be offloaded while allowing the host-side software to define the algorithm. Simulations are used to demonstrate the performance improvements achievable through the offload of MPI_Allreduce using these building blocks. © 2011 IEEE.

TYPE Conference YEAR 2011

Scopus OSTI

Using triggered operations to offload rendezvous messages

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Barrett, Brian; Brightwell, Ronald B.; Hemmert, Karl S.; Wheeler, Kyle B.

Historically, MPI implementations have had to choose between eager messaging protocols that require buffering and rendezvous protocols that sacrifice overlap and strong independent progress in some scenarios. The typical choice is to use an eager protocol for short messages and switch to a rendezvous protocol for long messages. If overlap and progress are desired, some implementations offer the option of using a thread. We propose an approach that leverages triggered operations to implement a long-message rendezvous protocol that provides strong progress guarantees. The results indicate that a triggered-operation-based rendezvous can achieve better overlap than a traditional rendezvous implementation and less wasted bandwidth than an eager long protocol. © 2011 Springer-Verlag Berlin Heidelberg.

TYPE Conference YEAR 2011

OSTI Scopus