A Study on the Integration between SIMT and Scalar Cores: Loosely Coupled to Tightly Coupled

Ramshanker, Abinands; Chetput Venkataraghaven, Sooraj; Hughes, Clayton; Foulk, James W.; Rogers, Timothy

Abstract not provided.

TYPE Conference Proceeding YEAR 2024

OSTI

SST Tutorial

Lavin, Patrick R.; Hemmert, Karl S.; Hughes, Clayton

Abstract not provided.

TYPE Conference Presentation YEAR 2024

DOI OSTI

“Smarter” NICs for faster algorithms [Slides]

Karamati, Sara; Young, Jeffrey L.; Vuduc, Rich; Hemmert, Karl S.; Schonbein, William W.; Siefert, Christopher; Levy, Scott L.N.; Hughes, Clayton

The basic building block of a distributed-memory cluster or supercomputer is a node. Each node includes a host, which is a processor (xPU) + memory hierarchy. The host can communicate with other hosts via its NIC (network interface controller). A network connects the nodes. The nodes may be arranged in some topology, which determines the network’s carrying capacity and cost.

TYPE Other Report YEAR 2023

DOI OSTI

Evaluation of HPC Workloads Running on Open-Source RISC-V Hardware

Foulk, James W.; Berger-Vergiat, Luc; Feinberg, Benjamin; Hughes, Clayton; Levenhagen, Michael

Abstract not provided.

TYPE Conference Paper YEAR 2023

OSTI

ERAS: Enabling Integration of Real-World Intellectual Properties in Architectural Simulators -- SST Introduction

Hughes, Clayton; Hemmert, Karl S.; Voskuilen, Gwendolyn R.; Nema, Shubham; Awad, Amro; Kaushik Chunduru, Shiva

Abstract not provided.

TYPE Conference Presentation YEAR 2023

DOI OSTI

ERAS: Enabling Integration of Real-World Intellectual Properties in Architectural Simulators -- Osseus Introduction

Hughes, Clayton; Hemmert, Karl S.; Voskuilen, Gwendolyn R.; Feinberg, Benjamin; Nema, Shubham; Awad, Amro; Kirschner, Justin; Adak, Debpratim

Abstract not provided.

TYPE Conference Presentation YEAR 2023

DOI OSTI

Co-Designing Open-Source Hardware With The Structural Simulation Toolkit

Hughes, Clayton; Voskuilen, Gwendolyn R.; Hemmert, Karl S.

Abstract not provided.

TYPE Conference Poster YEAR 2023

DOI OSTI

ATHENA: Enabling Codesign for Next-Generation AI/ML Architectures

Plagge, Mark; Feinberg, Benjamin; Rothganger, Fredrick R.; Agarwal, Sapan; Hughes, Clayton; Cardwell, Suma G.

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

ATHENA: An Analytical Analog Neuromorphic Hardware Estimation Tool

Plagge, Mark; Cardwell, Suma G.; Hughes, Clayton; Agarwal, Sapan

Abstract not provided.

TYPE Conference Paper YEAR 2022

OSTI

ATHENA: Analytical Tool for Heterogeneous Neuromorphic Architectures

Cardwell, Suma G.; Plagge, Mark; Hughes, Clayton; Rothganger, Fredrick R.; Agarwal, Sapan; Feinberg, Benjamin; Awad, Amro; Mcfarland, John; Parker, Luke

The ASC program seeks to use machine learning to improve efficiencies in its stockpile stewardship mission. Moreover, there is a growing market for technologies dedicated to accelerating AI workloads. Many of these emerging architectures promise to provide savings in energy efficiency, area, and latency when compared to traditional CPUs for these types of applications — neuromorphic analog and digital technologies provide both low-power and configurable acceleration of challenging artificial intelligence (AI) algorithms. If designed into a heterogeneous system with other accelerators and conventional compute nodes, these technologies have the potential to augment the capabilities of traditional High Performance Computing (HPC) platforms [5]. This expanded computation space requires not only a new approach to physics simulation, but the ability to evaluate and analyze next-generation architectures specialized for AI/ML workloads in both traditional HPC and embedded ND applications. Developing this capability will enable ASC to understand how this hardware performs in both HPC and ND environments, improve our ability to port our applications, guide the development of computing hardware, and inform vendor interactions, leading them toward solutions that address ASC’s unique requirements.

TYPE SAND Report YEAR 2022

DOI OSTI

Data Transfers and Host/Device Communication using OneAPI for FPGA

Lane, Phillip A.; Siefert, Christopher; Olivier, Stephen L.; Hughes, Clayton; Foulk, James W.; Voskuilen, Gwendolyn R.; Foulk, James W.

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Modeling Analog Tile-Based Accelerators Using SST

Feinberg, Benjamin; Agarwal, Sapan; Plagge, Mark; Rothganger, Fredrick R.; Cardwell, Suma G.; Hughes, Clayton

Analog computing has been widely proposed to improve the energy efficiency of multiple important workloads, including neural network operations and other linear algebra kernels. To properly evaluate analog computing and explore more complex workloads, such as systems consisting of multiple analog data paths, system-level simulations are required. Moreover, prior work on system architectures for analog computing often relies on custom simulators, creating significant additional design effort and complicating comparisons between different systems. To remedy these issues, this report describes the design and implementation of a flexible tile-based analog accelerator element for the Structural Simulation Toolkit (SST). The element focuses heavily on the tile controller (an often neglected aspect of prior work), which is sufficiently versatile to simulate a wide range of different tile operations, including neural network layers, signal processing kernels, and generic linear algebra operations, without major constraints. The tile model also interoperates with existing SST memory and network models to reduce the overall development load and enable future simulation of heterogeneous systems with both conventional digital logic and analog compute tiles. Finally, both the tile and array models are designed to easily support future extensions as new analog operations and applications that can benefit from analog computing are developed.

TYPE SAND Report YEAR 2022

DOI OSTI

Unified Memory: GPGPU-Sim/UVM Smart Integration

Liu, Yechen; Rogers, Timothy; Hughes, Clayton

CPU/GPU heterogeneous compute platforms are a ubiquitous element in computing, and a programming model specified for this heterogeneous computing model is important for both performance and programmability. A programming model that exposes the shared, unified address space between the heterogeneous units is a necessary step in this direction, as it removes the burden of explicit data movement from the programmer while maintaining performance. GPU vendors, such as AMD and NVIDIA, have released software-managed runtimes that can provide programmers the illusion of unified CPU and GPU memory by automatically migrating data in and out of the GPU memory. However, this runtime support is not included in GPGPU-Sim, a commonly used framework that models the features of a modern graphics processor that are relevant to non-graphics applications. UVM Smart was developed to extend GPGPU-Sim 3.x with modeling of on-demand paging and data migration through the runtime. This report discusses the integration of UVM Smart and GPGPU-Sim 4.0 and the modifications to improve simulation performance and accuracy.

TYPE SAND Report YEAR 2022

DOI OSTI

Simulating Next-Gen Dataflow Architectures for HPC

Hughes, Clayton; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hammond, Simon

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Solving Sparse Linear Systems on FPGAs using oneAPI

Siefert, Christopher; Hughes, Clayton; Miller, Nicholas; Olivier, Stephen L.; Voskuilen, Gwendolyn R.

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Minerva: Rethinking Secure Architectures for the Era of Fabric-Attached Memory Architectures

Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Alwadi, Mazen; Wang, Rujia; Mohaisen, David; Hughes, Clayton; Hammond, Simon; Awad, Amro

Fabric-attached memory (FAM) is proposed to enable the seamless integration of directly accessible memory modules attached to the shared system fabric, which will provide future systems with flexible memory integration options, mitigate underutilization, and facilitate data sharing. Recently proposed interconnects, such as Gen-Z and Compute Express Link (CXL), define security, correctness, and performance requirements of fabric-attached devices, including memory. These initiatives are supported by most major system and processor vendors, bringing widespread adoption of FAM-enabled systems one step closer to reality and security concerns to the forefront. This paper discusses, for the first time in the literature, the challenges of adapting secure memory implementations to FAM-enabled systems. Specifically, we observe that handling the security metadata used to protect fabric-attached memories needs to be done deliberately to eliminate unintentional integrity check failures and/or security vulnerabilities caused by an inconsistent view of the shared security metadata across nodes. Our scheme, Minerva, elegantly adapts secure memory implementations to support FAM-enabled systems with negligible performance overheads (3.8% of an ideal scheme), compared to the performance overhead (99.5% of an ideal scheme) for a scheme that uses conventional invalidation-based cache coherence to ensure the consistency of security metadata across nodes.

TYPE Conference Proceeding YEAR 2022

DOI OSTI Scopus

'Smarter' NICs for faster molecular dynamics: a case study

Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Karamati, Sara; Hughes, Clayton; Hemmert, Karl S.; Grant, Ryan E.; Schonbein, William W.; Levy, Scott L.N.; Conte, Thomas M.; Young, Jeffrey; Vuduc, Richard W.

This work evaluates the benefits of using a 'smart' network interface card (SmartNIC) as a compute accelerator for the example of the MiniMD molecular dynamics proxy application. The accelerator is NVIDIA's BlueField-2 card, which includes an 8-core Arm processor along with a small amount of DRAM and storage. We test the networking and data movement performance of these cards compared to a standard Intel server host using microbenchmarks and MiniMD. In MiniMD, we identify two distinct classes of computation, namely core computation and maintenance computation, which are executed in sequence. We restructure the algorithm and code to weaken this dependence and increase task parallelism, thereby making it possible to increase utilization of the BlueField-2 concurrently with the host. We evaluate our implementation on a cluster consisting of 16 dual-socket Intel Broadwell host nodes with one BlueField-2 per host-node. Our results show that while the overall compute performance of BlueField-2 is limited, using them with a modified MiniMD algorithm allows for up to 20% speedup over the host CPU baseline with no loss in simulation accuracy.

TYPE Conference Proceeding YEAR 2022

DOI OSTI Scopus

ARIAA Update -- SST

Hughes, Clayton; Ashraf, Rizwan; Gioiosa, Roberto; Phillips, Cynthia A.; Berry, Jonathan; Hart, William E.; Laird, Carl; Rajamanickam, Sivasankaran

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Computational Offload with BlueField Smart NICs

Karamati, Sara; Young, Jeffrey; Conte, Tom; Hemmert, Karl S.; Grant, Ryan; Hughes, Clayton; Vuduc, Rich

The recent introduction of a new generation of "smart NICs" has provided new accelerator platforms that include CPU cores or reconfigurable fabric in addition to traditional networking hardware and packet offloading capabilities. While there are currently several proposals for using these smartNICs for low-latency, in-line packet processing operations, there remains a gap in knowledge as to how they might be used as computational accelerators for traditional high-performance applications. This work aims to look at benchmarks and mini-applications to evaluate possible benefits of using a smartNIC as a compute accelerator for HPC applications. We investigate NVIDIA's current-generation BlueField-2 card, which includes eight Arm CPUs along with a small amount of storage, and we test the networking and data movement performance of these cards compared to a standard Intel server host. We then detail how two different applications, YASK and miniMD, can be modified to make more efficient use of the BlueField-2 device with a focus on overlapping computation and communication for operations like neighbor building and halo exchanges. Our results show that while the overall compute performance of these devices is limited, using them with a modified miniMD algorithm allows for potential speedups of 5 to 20% over the host CPU baseline with no loss in simulation accuracy.

TYPE SAND Report YEAR 2021

DOI OSTI

A-SST Initial Specification

Rodrigues, Arun; Hammond, Simon; Hemmert, Karl S.; Hughes, Clayton; Kenny, Joseph; Voskuilen, Gwendolyn R.

The U.S. Army Research Office (ARO), in partnership with IARPA, is investigating innovative, efficient, and scalable computer architectures that are capable of executing next-generation large-scale data-analytic applications. These applications are increasingly sparse, unstructured, non-local, and heterogeneous. Under the Advanced Graphic Intelligence Logical computing Environment (AGILE) program, Performer teams will be asked to design computer architectures to meet the future needs of the DoD and the Intelligence Community (IC). This design effort will require flexible, scalable, and detailed simulation to assess the performance, efficiency, and validity of their designs. To support AGILE, Sandia National Labs will be providing the AGILE-enhanced Structural Simulation Toolkit (A-SST). This toolkit is a computer architecture simulation framework designed to support fast, parallel, and multi-scale simulation of novel architectures. This document describes the A-SST framework, some of its library of simulation models, and how it may be used by AGILE Performers.

TYPE SAND Report YEAR 2021

DOI OSTI

Towards an Extensible Framework for Accelerated System Simulation

Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hughes, Clayton; Hemmert, Karl S.; Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2021

DOI OSTI

The ASC Advanced Machine Learning Initiative at Sandia National Laboratories: FY21 Accomplishments and FY22 Plans

Oldfield, Ron; Kramer, S.L.B.; Rushdi, Ahmad; Foulk, James W.; Emery, John M.; Kuberry, Paul; Ray, Jaideep; Ackerman, Sarah; Cyr, Eric C.; Saavedra, Gary; Hughes, Clayton; Cardwell, Suma G.; Smith, J.D.

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

SST-Explorer: Enabling System-level Performance and Reliability Analysis for Designs with Real-World IPs

Rodrigues, Arun; Awad, Amro; Hughes, Clayton; Agarwal, Sapan; Skoufis, Michael; Voskuilen, Gwendolyn R.; Nema, Shubham; Razdan, Rohin; Gardner, Alan; Hemmert, Karl S.; Hammond, Simon

Abstract not provided.

TYPE Conference Poster YEAR 2021

DOI OSTI

ERAS: Enabling the Integration of Real-World Intellectual Properties (IPs) in Architectural Simulators

Nema, Shubham; Razdan, Rohin; Rodrigues, Arun; Hemmert, Karl S.; Voskuilen, Gwendolyn R.; Adak, Debpratim; Hammond, Simon; Awad, Amro; Hughes, Clayton

Sandia National Laboratories is investigating scalable architectural simulation capabilities with a focus on simulating and evaluating highly scalable supercomputers for high performance computing applications. There is a growing demand for RTL model integration to provide the capability to simulate customized node architectures and heterogeneous systems. This report describes the first steps toward integrating the ESSENTial Signal Simulation Enabled by Netlist Transforms (ESSENT) tool with the Structural Simulation Toolkit (SST). ESSENT can emit C++ models from models written in FIRRTL to automatically generate components. The integration workflow will automatically generate the SST component and the necessary interfaces to 'plug' the ESSENT model into the SST framework.

TYPE SAND Report YEAR 2021

DOI OSTI

ATHENA: Enabling high speed performance estimates for novel hardware design space exploration

Plagge, Mark; Cardwell, Suma G.; Hughes, Clayton

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

ATHENA: Enabling high speed performance estimates for novel hardware design space exploration

Plagge, Mark; Cardwell, Suma G.; Hughes, Clayton

Abstract not provided.

TYPE Conference Poster YEAR 2021

DOI OSTI

SST-Explorer: Enabling System-level Performance and Reliability Analysis for Designs with Real-World IPs

Rodrigues, Arun; Awad, Amro; Hughes, Clayton; Agarwal, Sapan; Skoufis, Michael; Voskuilen, Gwendolyn R.; Nema, Shubham; Razdan, Rohin; Gardner, Alan; Hemmert, Karl S.; Hammond, Simon

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

HPC Architectures Beyond Exascale

Hughes, Clayton

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

SST -- llyr SpMM

Hughes, Clayton

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Comparing Intel Compilers Poster

Miller, Nicholas; Hughes, Clayton

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Evaluation of oneAPI for FPGAs

Miller, Nicholas; Cook, Jeanine; Hughes, Clayton

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

Evaluation of oneAPI for FPGAs

Miller, Nicholas; Cook, Jeanine; Hughes, Clayton

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

SST - llyr Development Update

Hughes, Clayton

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Using MLIR Framework for Codesign of ML Architectures Algorithms and Simulation Tools

Lewis, Cannada; Hughes, Clayton; Hammond, Simon; Rajamanickam, Sivasankaran

MLIR (Multi-Level Intermediate Representation) is an extensible compiler framework that supports high-level data structures and operation constructs. These higher-level code representations are particularly applicable to the artificial intelligence and machine learning (AI/ML) domain, allowing developers to more easily support upcoming heterogeneous AI/ML accelerators and develop flexible domain-specific compilers/frameworks with higher-level intermediate representations (IRs) and advanced compiler optimizations. Using MLIR within the LLVM compiler framework is expected to yield significant improvement in the quality of generated machine code, which in turn will result in improved performance and hardware efficiency.

TYPE Other Report YEAR 2021

DOI OSTI

DeACT: Architecture-Aware Virtual Memory Support for Fabric Attached Memory Systems

Proceedings - International Symposium on High-Performance Computer Architecture

Kommareddy, Vamsee R.; Hughes, Clayton; Hammond, Simon; Awad, Amro

The exponential growth of data has driven technology providers to develop new protocols, such as cache coherent interconnects and memory semantic fabrics, to help users and facilities leverage advances in memory technologies to satisfy these growing memory and storage demands. Using these new protocols, fabric-attached memories (FAM) can be directly attached to a system interconnect and be easily integrated with a variety of processing elements (PEs). Moreover, systems that support FAM can be smoothly upgraded and allow multiple PEs to share the FAM memory pools using well-defined protocols. The sharing of FAM between PEs allows efficient data sharing, improves memory utilization, reduces cost by allowing flexible integration of different PEs and memory modules from several vendors, and makes it easier to upgrade the system. One promising use case for FAMs is in High-Performance Computing (HPC) systems, where the underutilization of memory is a major challenge. However, adopting FAMs in HPC systems brings new challenges. In addition to cost, flexibility, and efficiency, one particular problem that requires rethinking is virtual memory support for security and performance. To address these challenges, this paper presents decoupled access control and address translation (DeACT), a novel virtual memory implementation that supports HPC systems equipped with FAM. Compared to the state-of-the-art two-level translation approach, DeACT achieves speedup of up to 4.59x (1.8x on average) without compromising security. (Part of this work was done when Vamsee was working under the supervision of Amro Awad at UCF. Amro Awad is now with the ECE Department at NC State.)

TYPE Conference Paper YEAR 2021

DOI OSTI Scopus

SST-GPU: A Scalable SST GPU Component for Performance Modeling and Profiling

Hughes, Clayton; Hammond, Simon; Zhang, Mengchi; Liu, Yechen; Rogers, Tim; Hoekstra, Robert J.

Programmable accelerators have become commonplace in modern computing systems. Advances in programming models and the availability of unprecedented amounts of data have created a space for massively parallel accelerators capable of maintaining context for thousands of concurrent threads resident on-chip. These threads are grouped and interleaved on a cycle-by-cycle basis among several massively parallel computing cores. One path for the design of future supercomputers relies on an ability to model the performance of these massively parallel cores at scale. The SST framework has been proven to scale up to run simulations containing tens of thousands of nodes. A previous report described the initial integration of the open-source, execution-driven GPU simulator, GPGPU-Sim, into the SST framework. This report discusses the results of the integration and how to use the new GPU component in SST. It also provides examples of what the component can be used to analyze and a correlation study showing how closely the execution matches that of an NVIDIA V100 GPU when running kernels and mini-apps.

TYPE SAND Report YEAR 2021

DOI OSTI

Stealth-Persist: Architectural Support for Persistent Applications in Hybrid Memory Systems

Alwadi, Mazen; Kommareddy, Vamsee R.; Hughes, Clayton; Hammond, Simon; Awad, Amro

Abstract not provided.

TYPE Conference Paper YEAR 2020

DOI OSTI

miniAMR port using oneAPI

Miller, Nicholas; Hughes, Clayton; Cook, Jeanine

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

Evaluation of oneAPI for FPGAs

Miller, Nicholas; Hughes, Clayton; Cook, Jeanine

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

PreFAM: Understanding the Impact of Prefetching in Fabric-Attached Memory Architectures

ACM International Conference Proceeding Series

Kommareddy, Vamsee R.; Hughes, Clayton; Hammond, Simon; Awad, Amro

With many recent advances in interconnect technologies and memory interfaces, disaggregated memory systems are approaching industrial adoption. For instance, the recent Gen-Z consortium focuses on a new memory semantic protocol that enables fabric-attached memories (FAM), where the memory and other compute units can be directly attached to fabric interconnects. Decoupling of memory from compute units becomes a feasible option as the rate of data transfer increases due to the emergence of novel interconnect technologies, such as Silicon Photonic Interconnects. Disaggregated memories not only enable more efficient use of capacity (minimizing under-utilization) but also allow easy integration of evolving technologies. Additionally, they simplify the programming model while allowing efficient sharing of data. However, the latency of accessing the data in these fabric-attached disaggregated memories (FAMs) depends on the latency imposed by the fabric interfaces. To reduce memory access latency and to improve the performance of FAM systems, in this paper, we explore techniques to prefetch data from FAMs to the local memory present in the node (PreFAM). We observe that, since the memory access latency is high in FAMs, prefetching a cache block (64 bytes) from FAM can be inefficient, because demand requests to the same FAM locations are likely to be issued before the prefetch requests complete. Hence, we explore predicting and prefetching FAM blocks at a distance, that is, prefetching blocks that will be accessed in the future but not immediately. We show that, with prefetching, the performance of FAM architectures increases by 38.84%, while memory access latency is improved by 39.6%, with only a 17.65% increase in the number of accesses to the FAM, on average. Further, by prefetching at a distance we show a performance improvement of 72.23%.

TYPE Conference Poster YEAR 2020

OSTI Scopus

SST Paths for ARIAA

Hughes, Clayton

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

Vanguard II Technical Advisory Team (TAT)

Hughes, Clayton; Hammond, Simon; Foulk, James W.; Foulk, James W.

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

CDFG Extraction Tool for LLVM

Hughes, Clayton; Hammond, Simon; Hoekstra, Robert J.

With the dawn of the exascale era, computer scientists and engineers are faced with tremendous challenges across all facets of the HPC system - scalability, performance, reliability, and power consumption. In particular, the power-performance benefit from one processor generation to the next is seeing ever-diminishing returns and will require fundamental changes in the way we approach computation. In fact, it is likely that different applications will require different types of accelerators in order to meet power, performance, and reliability requirements at scale. One potential type of accelerator, a dataflow architecture, diverges from the traditional sequentially executed instruction model into one that reflects the inherent instruction-level parallelism in a program. This work presents the initial steps toward a tool that can extract the control-dataflow graph from an application.

TYPE SAND Report YEAR 2020

DOI OSTI

Artificial Intelligence Focused Architectures and Algorithms (ARIAA) Kickoff Meeting

Hughes, Clayton

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

The Structural Simulation Toolkit

Hughes, Clayton

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

The Structural Simulation Toolkit and GPGPU-Sim

Hughes, Clayton; Voskuilen, Gwendolyn R.; Zhang, Mengchi; Rogers, Tim

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

The Structural Simulation Toolkit

Hughes, Clayton; Voskuilen, Gwendolyn R.; Zhang, Mengchi; Rogers, Tim

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

GPGPU-Sim Overview

Hughes, Clayton; Green, Roland; Voskuilen, Gwendolyn R.; Zhang, Mengchi; Rogers, Tim

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

The Structural Simulation Toolkit and GPGPU-Sim

Hughes, Clayton; Green, Roland; Voskuilen, Gwendolyn R.; Zhang, Mengchi; Rogers, Tim

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

The Structural Simulation Toolkit

Hughes, Clayton; Green, Roland; Voskuilen, Gwendolyn R.; Zhang, Mengchi; Rogers, Tim

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Page migration support for disaggregated non-volatile memories

ACM International Conference Proceeding Series

Kommareddy, Vamsee R.; Hammond, Simon; Hughes, Clayton; Samih, Ahmad; Awad, Amro

As demands for memory-intensive applications continue to grow, the memory capacity of each computing node is expected to grow at a similar pace. In high-performance computing (HPC) systems, the memory capacity per compute node is determined by the most demanding application that would likely run on such a system, and hence the average capacity per node in future HPC systems is expected to grow significantly. However, since HPC systems run many applications with different capacity demands, a large percentage of the overall memory capacity will likely be underutilized; each memory module can be thought of as private memory for its corresponding computing node. Thus, as HPC systems move towards the exascale era, better utilization of memory is strongly desired. Moreover, upgrading the memory system requires significant effort. Fortunately, disaggregated memory systems promise better utilization by defining regions of global memory, typically referred to as memory blades, which can be accessed by all computing nodes in the system. Disaggregated memory systems are expected to be built using dense, power-efficient memory technologies. Thus, emerging nonvolatile memories (NVMs) are positioning themselves as the main building blocks for such systems. However, NVMs are slower than DRAM. Therefore, it is expected that each computing node would have a small local memory based on either HBM or DRAM, whereas a large shared NVM memory would be accessible by all nodes. Managing such a system with global and local memory requires a novel hardware/software co-design to initiate page migration between global and local memory to maximize performance while enabling access to huge shared memory. In this paper, we provide support for page migration and investigate such memory management aspects and the major system-level aspects that can affect design decisions in disaggregated NVM systems.

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

Balar: A SST GPU Component for Performance Modeling and Profiling

Hughes, Clayton; Hammond, Simon; Khairy, Mahmoud; Zhang, Mengchi; Green, Roland; Rogers, Timothy; Hoekstra, Robert J.

Programmable accelerators have become commonplace in modern computing systems. Advances in programming models and the availability of massive amounts of data have created a space for massively parallel accelerators capable of maintaining context for thousands of concurrent threads resident on-chip. These threads are grouped and interleaved on a cycle-by-cycle basis among several massively parallel computing cores. One path for the design of future supercomputers relies on an ability to model the performance of these massively parallel cores at scale. The SST framework has been proven to scale up to run simulations containing tens of thousands of nodes. A previous report described the initial integration of the open-source, execution-driven GPU simulator, GPGPU-Sim, into the SST framework. This report discusses the results of the integration and how to use the new GPU component in SST. It also provides examples of what the component can be used to analyze and a correlation study showing how closely the execution matches that of an NVIDIA V100 GPU when running kernels and mini-apps.

TYPE SAND Report YEAR 2019

DOI OSTI

The Structural Simulation Toolkit and GPGPU-Sim

Voskuilen, Gwendolyn R.; Hughes, Clayton

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

SST Tutorial - Part 03

Hughes, Clayton; Voskuilen, Gwendolyn R.; Zhang, Mengchi

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

SST Tutorial - Part 02

Zhang, Mengchi; Hughes, Clayton; Voskuilen, Gwendolyn R.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

ASC CSSE Milestone 6812: SST-GPGPU

Hughes, Clayton; Hammond, Simon; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

Investigating Fairness in Disaggregated Non-Volatile Memories

Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI

Kommareddy, Vamsee R.; Hughes, Clayton; Hammond, Simon; Awad, Amro

Many applications have growing demands for memory, particularly in the HPC space, making the memory system a potential bottleneck of next-generation computing systems. Sharing the memory system across processor sockets and nodes becomes a compelling option given that memory technology is scaling at a slower rate than processor technology. Moreover, as many applications rely on shared data, e.g., graph applications and database workloads, having a large number of nodes accessing shared memory allows for efficient use of resources and avoids duplicating huge files, which can be infeasible for large graphs or scientific data. As new memory technologies come on the market, the flexibility of upgrading memory and updating systems becomes a major concern. Disaggregated memory systems, in which memory is shared across different computing nodes, e.g., Systems-on-Chip (SoCs), are expected to become the most common design/architecture for memory-centric systems, e.g., The Machine project from HP Labs. However, due to the nature of such systems, different users and applications compete for the available memory bandwidth, which can lead to severe contention due to memory traffic from different SoCs. In this paper, we discuss the contention problem in disaggregated memory systems and suggest mechanisms to ensure memory fairness and enforce QoS. Our simulation results show that employing our proposed QoS techniques can speed up memory response time by up to 55%.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

Evaluating the Marvell ThunderX2 Server Processor for HPC Workloads

Hammond, Simon; Hughes, Clayton; Levenhagen, Michael; Vaughan, Courtenay T.; Younge, Andrew J.; Schwaller, Benjamin; Aguilar, Michael J.; Foulk, James W.; Foulk, James W.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Evaluating the Marvell ThunderX2 Server Processor for HPC Workloads

Hammond, Simon; Hughes, Clayton; Levenhagen, Michael; Vaughan, Courtenay T.; Younge, Andrew J.; Schwaller, Benjamin; Aguilar, Michael J.; Foulk, James W.; Foulk, James W.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Evaluating the Marvell ThunderX2 Server Processor for HPC Workloads

Hammond, Simon; Hughes, Clayton; Levenhagen, Michael; Vaughan, Courtenay T.; Younge, Andrew J.; Schwaller, Benjamin; Aguilar, Michael J.; Foulk, James W.; Foulk, James W.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

ASC CSSE Level 2 Milestone Briefing: SST-GPU

Hughes, Clayton; Hammond, Simon; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

ECP HE Node Simulation - SNL

Hughes, Clayton; Rodrigues, Arun; Voskuilen, Gwendolyn R.; Hemmert, Karl S.; Hammond, Simon; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

SST-GPU: An Execution-Driven CUDA Kernel Scheduler and Streaming-Multiprocessor Compute Model

Khairy, Mahmoud; Zhang, Mengchi; Green, Roland; Hammond, Simon; Hoekstra, Robert J.; Rogers, Timothy; Hughes, Clayton

Programmable accelerators have become commonplace in modern computing systems. Advances in programming models and the availability of massive amounts of data have created a space for massively parallel acceleration where the context for thousands of concurrent threads is resident on-chip. These threads are grouped and interleaved on a cycle-by-cycle basis among several massively parallel computing cores. The design of future supercomputers relies on an ability to model the performance of these massively parallel cores at scale. To address the need for a scalable, decentralized GPU model that can model large GPUs, chiplet-based GPUs, and multi-node GPUs, this report details the first steps in integrating the open-source, execution-driven GPGPU-Sim into the SST framework. The first stage of this project creates two elements: a kernel scheduler SST element that accepts work from SST CPU models, and an SM-collection element to which that work is scheduled. The SM-collection element performs cycle-by-cycle timing and uses SST's memHierarchy to model a flexible memory system.

More Details

TYPE SAND Report YEAR 2019

DOI OSTI

Vanguard Astra: A Prototype Petascale Arm Supercomputer

Hughes, Clayton; Foulk, James W.; Foulk, James W.; Hammond, Simon; Younge, Andrew J.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

Vanguard Astra: A Prototype Petascale Arm Supercomputer

Hughes, Clayton; Foulk, James W.; Foulk, James W.; Hammond, Simon; Younge, Andrew J.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Sandia ATDM DevOps and Performance Analysis

Hoekstra, Robert J.; Bartlett, Roscoe; Hammond, Simon; Cook, Jeanine; Dinge, Dennis; Frye, Joseph R.; Hughes, Clayton; Lin, Paul T.; Vaughan, Courtenay T.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Enforcing Fairness in Disaggregated Non-Volatile Memory Systems

Kommareddy, Vamsee R.; Awad, Amro; Hughes, Clayton; Hammond, Simon

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Osiris: A low-cost mechanism to enable restoration of secure non-volatile memories

Proceedings of the Annual International Symposium on Microarchitecture, MICRO

Ye, Mao; Hughes, Clayton; Awad, Amro

With Non-Volatile Memories (NVMs) beginning to enter the mainstream computing market, it is time to consider how to secure NVM-equipped computing systems. Recent Meltdown and Spectre attacks are evidence that security must be intrinsic to computing systems and not added as an afterthought. Processor vendors are taking the first steps and are beginning to build security primitives into commodity processors. One security primitive that is associated with the use of emerging NVMs is memory encryption. Memory encryption, while necessary, is very challenging when used with NVMs because it exacerbates the write endurance problem. Secure architectures use cryptographic metadata that must be persisted and restored to allow secure recovery of data in the event of power loss. Specifically, encryption counters must be persistent to enable secure and functional recovery of an interrupted system. However, the cost of ensuring and maintaining persistence for these counters can be significant. In this paper, we propose a novel scheme to maintain encryption counters without the need for frequent updates. Our new memory controller design, Osiris, repurposes memory Error-Correction Codes (ECCs) to enable fast restoration and recovery of encryption counters. To evaluate our design, we use gem5 to run eight memory-intensive workloads selected from SPEC2006 and U.S. Department of Energy (DoE) proxy applications. Compared to a write-through counter-cache scheme, on average, Osiris can reduce memory writes by 48.7% (increasing lifetime by 1.95x) and reduce the performance overhead from 51.5% (for write-through) to only 5.8%. Furthermore, without the need for a backup battery or extra power-supply hold-up time, Osiris performs better than a battery-backed write-back scheme (5.8% vs. 6.6% overhead) and has less write traffic (2.6% vs. 5.9% overhead).

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus

Exploring Allocation Policies in Disaggregated Non-Volatile Memories

Kommareddy, Vamsee R.; Awad, Amro; Hughes, Clayton; Hammond, Simon

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Opal: A Centralized Memory Manager for Investigating Disaggregated Memory Systems

Kommareddy, Vamsee R.; Hughes, Clayton; Hammond, Simon; Awad, Amro

Many modern applications have increasingly large memory footprints, driving system memory capacities higher and higher. Moreover, these systems are often organized so that the bulk of the memory is collocated with the compute capability, which necessitates message-passing APIs to facilitate information sharing between compute nodes. Due to the diversity of applications that must run on High-Performance Computing (HPC) systems, memory utilization can fluctuate wildly from one application to another. And, because memory is located in the node, maintenance can become problematic because each node must be taken offline and upgraded individually. To address these issues, vendors are exploring disaggregated, memory-centric systems. In this type of organization, there are discrete nodes, reserved solely for memory, which are shared across many compute nodes. Due to their capacity, low power, and non-volatility, Non-Volatile Memories (NVMs) are ideal candidates for these memory nodes. This report discusses a new component for the Structural Simulation Toolkit (SST), Opal, that can be used to study the impact of using NVMs in a disaggregated system in terms of performance, security, and memory management.

More Details

TYPE SAND Report YEAR 2018

DOI OSTI

On the Use of Vectorization in Production Engineering Workloads

Vaughan, Courtenay T.; Cook, Jeanine; Benner, Robert E.; Dinge, Dennis; Lin, Paul T.; Hughes, Clayton; Hoekstra, Robert J.; Hammond, Simon

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Evaluating the Intel Skylake Xeon Processor for HPC Workloads

Hughes, Clayton; Vaughan, Courtenay T.; Hammond, Simon

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

On the Use of Vectorization in Production Engineering Workloads

Vaughan, Courtenay T.; Hammond, Simon; Dinge, Dennis; Lin, Paul T.; Hughes, Clayton; Benner, Robert E.; Cook, Jeanine; Pase, Douglas M.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Evaluating the Intel Skylake Xeon Processor for HPC Workloads

Hammond, Simon; Vaughan, Courtenay T.; Hughes, Clayton

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI

Structural Simulation Toolkit (SST) Tutorial

Hammond, Simon; Rodrigues, Arun; Voskuilen, Gwendolyn R.; Hemmert, Karl S.; Levenhagen, Michael; Hughes, Clayton; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Analyzing Exascale Memory Architectures Using the SST Toolkit

Hughes, Clayton; Awad, Amro; Hammond, Simon; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

SST Simulation Framework (and Complex Memory)

Hammond, Simon; Hughes, Clayton; Awad, Amro; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hemmert, Karl S.; Levenhagen, Michael; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Sandia ATDM Performance Execution Tools & Analysis

Hammond, Simon; Vaughan, Courtenay T.; Dinge, Dennis; Lin, Paul T.; Benner, Robert E.; Hughes, Clayton; Trott, Christian R.; Cook, Jeanine; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Towards a Scalable Integrated Simulation Framework for Extreme Heterogeneity in High Performance Computing

Hammond, Simon; Rodrigues, Arun; Hemmert, Karl S.; Voskuilen, Gwendolyn R.; Hughes, Clayton; Levenhagen, Michael; Hoekstra, Robert J.; Ang, James A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Performance Analysis for Using Non-Volatile Memory DIMMs: Opportunities and Challenges

Awad, Amro; Hammond, Simon; Hughes, Clayton; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI

Performance Analysis for Using Non-Volatile Memory DIMMs: Opportunities and Challenges

Awad, Amro; Hammond, Simon; Hughes, Clayton; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI

Next Generation Science Applications for the Next Generation of Supercomputing

Vaughan, Courtenay T.; Hammond, Simon; Dinge, Dennis; Lin, Paul T.; Pase, Douglas M.; Cook, Jeanine; Trott, Christian R.; Hughes, Clayton; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Next Generation Science Applications for the Next Generation of Supercomputing

Vaughan, Courtenay T.; Hammond, Simon; Dinge, Dennis; Lin, Paul T.; Pase, Douglas M.; Trott, Christian R.; Cook, Jeanine; Hughes, Clayton; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Messier: A Detailed NVM-Based DIMM Model for the SST Simulation Framework

Awad, Amro; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hammond, Simon; Hoekstra, Robert J.; Hughes, Clayton

DRAM technology is the main building block of main memory; however, DRAM scaling is becoming very challenging. The main issues for DRAM scaling are the increasing error rates with each new generation, the geometric and physical constraints of scaling the capacitor part of the DRAM cells, and the high power consumption caused by the continuous need for refreshing cell values. At the same time, Non-Volatile Memory (NVM) technologies, such as Phase-Change Memory (PCM), are emerging as promising replacements for DRAM. NVMs, when compared to current technologies, e.g., NAND-based flash, have latencies comparable to DRAM. Additionally, NVMs are non-volatile, which eliminates the need for refresh power and enables persistent memory applications. Finally, NVMs have promising densities and the potential for multi-level cell (MLC) storage.

More Details

TYPE SAND Report YEAR 2017

DOI OSTI
