Publications

Results 1–25 of 69

A-SST Initial Specification

Rodrigues, Arun; Hammond, Simon D.; Hemmert, Karl S.; Hughes, Clayton H.; Kenny, Joseph P.; Voskuilen, Gwendolyn R.

The U.S. Army Research Office (ARO), in partnership with IARPA, is investigating innovative, efficient, and scalable computer architectures that are capable of executing next-generation large-scale data-analytic applications. These applications are increasingly sparse, unstructured, non-local, and heterogeneous. Under the Advanced Graphic Intelligence Logical Computing Environment (AGILE) program, Performer teams will be asked to design computer architectures to meet the future needs of the DoD and the Intelligence Community (IC). This design effort will require flexible, scalable, and detailed simulation to assess the performance, efficiency, and validity of their designs. To support AGILE, Sandia National Labs will be providing the AGILE-enhanced Structural Simulation Toolkit (A-SST). This toolkit is a computer architecture simulation framework designed to support fast, parallel, and multi-scale simulation of novel architectures. This document describes the A-SST framework, some of its library of simulation models, and how it may be used by AGILE Performers.
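
SST experiments are driven by a Python configuration script that instantiates components, sets their parameters, and wires them together with links. The sketch below illustrates that pattern only; the element library, component, port, and parameter names are hypothetical placeholders rather than actual A-SST model names.

    import sst

    # Two placeholder components; "exampleElement.exampleCPU" and
    # "exampleElement.exampleMemory" stand in for whichever A-SST models
    # a Performer actually instantiates.
    cpu = sst.Component("cpu0", "exampleElement.exampleCPU")
    cpu.addParams({"clock": "2GHz", "verbose": 0})

    mem = sst.Component("mem0", "exampleElement.exampleMemory")
    mem.addParams({"access_time": "100ns", "mem_size": "4GiB"})

    # Wire the components together; each endpoint names a port and a link latency.
    link = sst.Link("cpu_mem_link")
    link.connect((cpu, "mem_port", "1ns"), (mem, "cpu_port", "1ns"))

    # End the simulation after a fixed amount of simulated time.
    sst.setProgramOption("stop-at", "100us")

A script like this is passed to the simulator on the command line (for example, "sst config.py"), and SST builds and runs the described machine.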


ATHENA: Analytical Tool for Heterogeneous Neuromorphic Architectures

Cardwell, Suma G.; Plagge, Mark P.; Hughes, Clayton H.; Rothganger, Fredrick R.; Agarwal, Sapan A.; Feinberg, Benjamin F.; Awad, Amro A.; McFarland, John M.; Parker, Luke G.

The ASC program seeks to use machine learning to improve efficiencies in its stockpile stewardship mission. Moreover, there is a growing market for technologies dedicated to accelerating AI workloads. Many of these emerging architectures promise to provide savings in energy efficiency, area, and latency when compared to traditional CPUs for these types of applications — neuromorphic analog and digital technologies provide both low-power and configurable acceleration of challenging artificial intelligence (AI) algorithms. If designed into a heterogeneous system with other accelerators and conventional compute nodes, these technologies have the potential to augment the capabilities of traditional High Performance Computing (HPC) platforms [5]. This expanded computation space requires not only a new approach to physics simulation, but the ability to evaluate and analyze next-generation architectures specialized for AI/ML workloads in both traditional HPC and embedded ND applications. Developing this capability will enable ASC to understand how this hardware performs in both HPC and ND environments, improve our ability to port our applications, guide the development of computing hardware, and inform vendor interactions, leading them toward solutions that address ASC’s unique requirements.
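
As a rough illustration of the kind of first-order analysis such a tool enables, a roofline-style estimate bounds a kernel's runtime by its compute and memory demands against an accelerator's peak throughput and bandwidth. The sketch below is a generic textbook calculation, not ATHENA's actual model, and every device number in it is a made-up placeholder.

    def roofline_time_s(flops, bytes_moved, peak_flops, peak_bw):
        """First-order runtime bound: the kernel is limited by whichever of
        compute throughput or memory bandwidth it saturates first."""
        return max(flops / peak_flops, bytes_moved / peak_bw)

    # Placeholder device parameters (illustrative only, not any real accelerator).
    peak_flops = 10e12   # 10 TFLOP/s
    peak_bw = 500e9      # 500 GB/s

    # A hypothetical AI kernel: 2 GFLOP of work touching 8 GB of data.
    t = roofline_time_s(2e9, 8e9, peak_flops, peak_bw)
    print(f"estimated lower-bound runtime: {t * 1e3:.2f} ms")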


Balar: A SST GPU Component for Performance Modeling and Profiling

Hughes, Clayton H.; Hammond, Simon D.; Khairy, Mahmoud K.; Zhang, Mengchi Z.; Green, Roland G.; Rogers, Timothy R.; Hoekstra, Robert J.

Programmable accelerators have become commonplace in modern computing systems. Advances in programming models and the availability of massive amounts of data have created a space for massively parallel accelerators capable of maintaining context for thousands of concurrent threads resident on-chip. These threads are grouped and interleaved on a cycle-by-cycle basis among several massively parallel computing cores. One path for the design of future supercomputers relies on an ability to model the performance of these massively parallel cores at scale. The SST framework has been proven to scale up to run simulations containing tens of thousands of nodes. A previous report described the initial integration of the open-source, execution-driven GPU simulator, GPGPU-Sim, into the SST framework. This report discusses the results of the integration and how to use the new GPU component in SST. It also provides examples of what it can be used to analyze and a correlation study showing how closely the execution matches that of an Nvidia V100 GPU when running kernels and mini-apps.
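
One way to quantify how closely the simulated GPU tracks the real hardware in such a correlation study is to compare per-kernel runtimes from the simulator against measurements from the physical V100 and report relative error and correlation. The sketch below uses invented numbers purely to show the calculation; it is not data from the report.

    import numpy as np

    # Placeholder per-kernel runtimes in microseconds (illustrative values only).
    simulated = np.array([120.0, 340.0, 55.0, 980.0, 410.0])
    measured = np.array([115.0, 360.0, 60.0, 1010.0, 395.0])

    rel_err = np.abs(simulated - measured) / measured
    corr = np.corrcoef(simulated, measured)[0, 1]

    print(f"mean relative error: {rel_err.mean():.1%}")
    print(f"Pearson correlation: {corr:.3f}")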


Computational Offload with BlueField Smart NICs

Karamati, Sara K.; Young, Jeffrey L.; Conte, Tom C.; Hemmert, Karl S.; Grant, Ryan G.; Hughes, Clayton H.; Vuduc, Rich V.

The recent introduction of a new generation of "smart NICs" has provided new accelerator platforms that include CPU cores or reconfigurable fabric in addition to traditional networking hardware and packet offloading capabilities. While there are currently several proposals for using these smart NICs for low-latency, in-line packet processing operations, there remains a gap in knowledge as to how they might be used as computational accelerators for traditional high-performance applications. This work uses benchmarks and mini-applications to evaluate the possible benefits of using a smart NIC as a compute accelerator for HPC applications. We investigate NVIDIA's current-generation BlueField-2 card, which includes eight Arm CPUs along with a small amount of storage, and we test the networking and data movement performance of these cards compared to a standard Intel server host. We then detail how two different applications, YASK and miniMD, can be modified to make more efficient use of the BlueField-2 device, with a focus on overlapping computation and communication for operations like neighbor building and halo exchanges. Our results show that while the overall compute performance of these devices is limited, using them with a modified miniMD algorithm allows for potential speedups of 5 to 20% over the host CPU baseline with no loss in simulation accuracy.
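
The overlap pattern described here can be sketched generically with non-blocking MPI: post the sends and receives for boundary data, do local work that does not depend on the halo, and only wait when the halo is actually needed. The mpi4py fragment below is a minimal illustration of that idea under a toy 1-D decomposition, not the authors' modified miniMD code.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Neighbors in a simple 1-D ring decomposition (illustrative only).
    left, right = (rank - 1) % size, (rank + 1) % size

    send_halo = np.full(1024, rank, dtype=np.float64)
    recv_halo = np.empty(1024, dtype=np.float64)

    # Post the halo exchange without blocking...
    reqs = [comm.Isend(send_halo, dest=right),
            comm.Irecv(recv_halo, source=left)]

    # ...and overlap it with local computation that does not need the halo.
    local = np.arange(1_000_000, dtype=np.float64)
    interior_result = np.sum(local * local)

    # Only wait once the halo data is actually required.
    MPI.Request.Waitall(reqs)
    boundary_result = recv_halo.sum()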


DeACT: Architecture-Aware Virtual Memory Support for Fabric Attached Memory Systems

Proceedings - International Symposium on High-Performance Computer Architecture

Kommareddy, Vamsee R.; Hughes, Clayton H.; Hammond, Simon D.; Awad, Amro

The exponential growth of data has driven technology providers to develop new protocols, such as cache coherent interconnects and memory semantic fabrics, to help users and facilities leverage advances in memory technologies to satisfy these growing memory and storage demands. Using these new protocols, fabric-attached memories (FAM) can be directly attached to a system interconnect and be easily integrated with a variety of processing elements (PEs). Moreover, systems that support FAM can be smoothly upgraded and allow multiple PEs to share the FAM memory pools using well-defined protocols. The sharing of FAM between PEs allows efficient data sharing, improves memory utilization, reduces cost by allowing flexible integration of different PEs and memory modules from several vendors, and makes it easier to upgrade the system. One promising use case for FAMs is in High-Performance Computing (HPC) systems, where the underutilization of memory is a major challenge. However, adopting FAMs in HPC systems brings new challenges. In addition to cost, flexibility, and efficiency, one particular problem that requires rethinking is virtual memory support for security and performance. To address these challenges, this paper presents decoupled access control and address translation (DeACT), a novel virtual memory implementation that supports HPC systems equipped with FAM. Compared to the state-of-the-art two-level translation approach, DeACT achieves a speedup of up to 4.59x (1.8x on average) without compromising security. (Part of this work was done when Vamsee was working under the supervision of Amro Awad at UCF. Amro Awad is now with the ECE Department at NC State.)
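
The central idea, taking the access-control check off the serialized translation path so that a lightweight permission check can gate the request while the rest of translation proceeds off the critical path, can be illustrated with a toy latency model. The sketch below is an illustrative abstraction with made-up latencies; it is not DeACT's actual mechanism or its reported numbers.

    # Made-up latencies (ns) for a toy model of fabric-attached memory accesses.
    LOCAL_WALK_NS = 50     # first-level translation at the processing element
    FABRIC_WALK_NS = 150   # second-level translation near the memory pool
    ACCESS_CHECK_NS = 20   # lightweight permission/ownership check
    FAM_ACCESS_NS = 300    # the fabric-attached memory access itself

    def two_level_serialized():
        # Baseline: both translation levels and the check finish before the access.
        return LOCAL_WALK_NS + FABRIC_WALK_NS + ACCESS_CHECK_NS + FAM_ACCESS_NS

    def decoupled():
        # Decoupled: the cheap check gates the request while the remaining
        # translation work is overlapped with the access in this toy model.
        return LOCAL_WALK_NS + ACCESS_CHECK_NS + FAM_ACCESS_NS

    print(f"serialized: {two_level_serialized()} ns, decoupled: {decoupled()} ns")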


ERAS: Enabling the Integration of Real-World Intellectual Properties (IPs) in Architectural Simulators

Nema, Shubham N.; Razdan, Rohin R.; Rodrigues, Arun; Hemmert, Karl S.; Voskuilen, Gwendolyn R.; Adak, Debratim A.; Hammond, Simon D.; Awad, Amro A.; Hughes, Clayton H.

Sandia National Laboratories is investigating scalable architectural simulation capabilities with a focus on simulating and evaluating highly scalable supercomputers for high performance computing applications. There is a growing demand for RTL model integration to provide the capability to simulate customized node architectures and heterogeneous systems. This report describes the first steps in integrating the ESSENTial Signal Simulation Enabled by Netlist Transforms (ESSENT) tool with the Structural Simulation Toolkit (SST). ESSENT emits C++ models from designs written in FIRRTL, which can be used to automatically generate components. The integration workflow will automatically generate the SST component and the necessary interfaces to ’plug’ the ESSENT model into the SST framework.
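
Once ESSENT has emitted a C++ model and the workflow has wrapped it as an SST element, the generated component would be instantiated from an SST Python configuration like any other model. The fragment below only sketches that end state; the element, component, port, and parameter names are hypothetical placeholders for whatever the generator actually produces.

    import sst

    # Hypothetical component generated from a FIRRTL design by the ESSENT workflow.
    dut = sst.Component("dut0", "generatedElement.MyFirrtlDesign")
    dut.addParams({"clock": "1GHz", "verbose": 0})

    # A placeholder driver component that feeds stimulus to the generated model.
    driver = sst.Component("driver0", "generatedElement.TestDriver")

    link = sst.Link("driver_dut")
    link.connect((driver, "dut_port", "1ns"), (dut, "input_port", "1ns"))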


Evaluating the Intel Skylake Xeon Processor for HPC Workloads

Proceedings - 2018 International Conference on High Performance Computing and Simulation, HPCS 2018

Hammond, Simon D.; Vaughan, Courtenay T.; Hughes, Clayton H.

Despite significant advances in the porting of scientific applications to novel architectures such as compute-optimized graphics processors, many-core processors/accelerators, and even special-purpose function units, the vast majority of scientific calculations are still performed on high-performance, commodity server processors. Even in the cases of applications which have been ported to new architectures, frequent serial sections still require strong server-class processor cores to compute as fast as possible. In this paper we report on a set of benchmark studies which evaluate Intel's latest Skylake Xeon server processor. Skylake represents a significant change in the Xeon product line, with wider SIMD vector units, a redesigned cache architecture, and an increased number of memory channels. The wider vector units provide a 2x improvement for some compute-intensive applications, and the combined memory changes can provide close to 2x the memory bandwidth. We evaluate these new hardware features on several HPC-relevant mini-applications and benchmarks, including STREAM, LULESH, XSBench, HPCG, and SW4Lite. Together, the new hardware functions provide up to 1.8x speedup on HPC benchmark codes when compared with the previous generation Haswell processor core, providing much greater utility to a broader range of HPC applications that rely on this class of compute node.
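
The memory-bandwidth side of this evaluation rests on measurements like STREAM's triad kernel, which streams three large arrays through memory and reports sustained bandwidth. The numpy sketch below is a rough stand-in to show what is being measured; it is not the tuned C/OpenMP STREAM code behind the published figures.

    import time
    import numpy as np

    n = 50_000_000                # large enough that the arrays spill out of cache
    a = np.zeros(n)
    b = np.random.rand(n)
    c = np.random.rand(n)
    scalar = 3.0

    best = float("inf")
    for _ in range(5):            # report the best of several trials
        t0 = time.perf_counter()
        a[:] = b + scalar * c     # STREAM "triad" kernel: a = b + s*c
        best = min(best, time.perf_counter() - t0)

    # Triad counts three 8-byte-double arrays per pass (two reads plus one write).
    bytes_moved = 3 * n * 8
    print(f"triad bandwidth: {bytes_moved / best / 1e9:.1f} GB/s")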


Evaluating the Marvell ThunderX2 Server Processor for HPC Workloads

2019 International Conference on High Performance Computing and Simulation, HPCS 2019

Hammond, Simon D.; Hughes, Clayton H.; Levenhagen, Michael J.; Vaughan, Courtenay T.; Younge, Andrew J.; Schwaller, Benjamin S.; Aguilar, Michael J.; Pedretti, Kevin P.; Laros, James H.

The high-performance computing industry is undergoing a period of substantial change, not least because of fabrication and lithographic challenges in the manufacturing of next-generation processors. As such challenges mount, the industry is looking to generate higher performance from additional functionality in the micro-architecture space as well as a greater emphasis on efficiency in the design of network-on-chip resources and memory subsystems. Such variation in design opens opportunities for new entrants in the data center and server markets, where varying compute-to-memory ratios can present end users with more efficient node designs for particular workloads. In this paper we evaluate the recently released Marvell ThunderX2 Arm processor, arguably the first high-performance-computing-capable Arm design available in the marketplace. We perform a set of micro-benchmark and mini-application evaluations on the ThunderX2, comparing it with Intel's Haswell and Skylake Xeon server parts commonly used in contemporary HPC designs. Our findings show that no one processor performs the best across all benchmarks, but that the ThunderX2 excels in areas demanding high memory bandwidth due to the provisioning of more memory channels in its design. We conclude that the ThunderX2 is a serious contender in the HPC server segment and has the potential to offer supercomputing sites a viable high-performance alternative to existing designs from established industry players.
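
The bandwidth advantage attributed to the ThunderX2's extra memory channels follows directly from peak-bandwidth arithmetic: channels times transfer rate times bus width. The figures below assume DDR4-2666 and the per-socket channel counts commonly cited for these parts (eight for ThunderX2, six for Skylake); they are theoretical peaks, not the measured results from the paper.

    def peak_bw_gbs(channels, mega_transfers, bus_bytes=8):
        """Theoretical peak DRAM bandwidth: channels * transfers/s * bytes/transfer."""
        return channels * mega_transfers * 1e6 * bus_bytes / 1e9

    # Assumed configurations (theoretical peaks, not measured STREAM results).
    tx2 = peak_bw_gbs(channels=8, mega_transfers=2666)
    skx = peak_bw_gbs(channels=6, mega_transfers=2666)
    print(f"ThunderX2: {tx2:.0f} GB/s  Skylake: {skx:.0f} GB/s  ratio: {tx2 / skx:.2f}x")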
