Publications Search

Chronicles of astra: Challenges and lessons from the first petascale arm supercomputer

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Foulk, James W.; Younge, Andrew J.; Hammond, Simon; Foulk, James W.; Curry, Matthew; Aguilar, Michael J.; Hoekstra, Robert J.; Brightwell, Ronald B.

Arm processors have been explored in HPC for several years, however there has not yet been a demonstration of viability for supporting large-scale production workloads. In this paper, we offer a retrospective on the process of bringing up Astra, the first Petascale supercomputer based on 64-bit Arm processors, and validating its ability to run production HPC applications. Through this process several immature technology gaps were addressed, including software stack enablement, Linux bugs at scale, thermal management issues, power management capabilities, and advanced container support. From this experience, several lessons learned are formulated that contributed to the successful deployment of Astra. These insights can be helpful to accelerate deploying and maturing other first-seen HPC technologies. With Astra now supporting many users running a diverse set of production applications at multi-thousand node scales, we believe this constitutes strong supporting evidence that Arm is a viable technology for even the largest-scale supercomputer deployments.

More Details

TYPE Conference Poster YEAR 2020

OSTI Scopus

Early Experiences with A64FX

Hammond, Simon; Younge, Andrew J.; Foulk, James W.; Foulk, James W.

Abstract not provided.

More Details

TYPE Conference Presentation YEAR 2020

DOI OSTI

PreFAM: Understanding the Impact of Prefetching in Fabric-Attached Memory Architectures

ACM International Conference Proceeding Series

Kommareddy, Vamsee R.; Hughes, Clayton; Hammond, Simon; Awad, Amro

With many recent advances in interconnect technologies and memory interfaces, disaggregated memory systems are approaching industrial adoption. For instance, the recent Gen-Z consortium focuses on a new memory semantic protocol that enables fabric-attached memories (FAM), where the memory and other compute units can be directly attached to fabric interconnects. Decoupling of memory from compute units becomes a feasible option as the rate of data transfer increases due to the emergence of novel interconnect technologies, such as Silicon Photonic Interconnects. Disaggregated memories not only enable more efficient use of capacity (minimizes under-utilization) they also allow easy integration of evolving technologies. Additionally, they simplify the programming model at the same time allowing efficient sharing of data. However, the latency of accessing the data in these Fabric Attached disaggregated Memories (FAMs) is dependent on the latency imposed by the fabric interfaces. To reduce memory access latency and to improve the performance of FAM systems, in this paper, we explore techniques to prefetch data from FAMs to the local memory present in the node (PreFAM). We realize that since the memory access latency is high in FAMs, prefetching a cache block (64 bytes) from FAM can be inefficient, since the possibility of issuing demand requests before the completion of prefetch requests, to the same FAM locations, is high. Hence, we explore predicting and prefetching FAM blocks at a distance; prefetching blocks which are going to be accessed in future but not immediately. We show that, with prefetching, the performance of FAM architectures increases by 38.84%, while memory access latency is improved by 39.6%, with only 17.65% increase in the number of accesses to the FAM, on average. Further, by prefetching at a distance we show a performance improvement of 72.23%.

More Details

TYPE Conference Poster YEAR 2020

OSTI Scopus

Vanguard II Technical Advisory Team (TAT)

Hughes, Clayton; Hammond, Simon; Foulk, James W.; Foulk, James W.

Abstract not provided.

More Details

TYPE Presentation YEAR 2020

OSTI

Experiences Scaling a Production Arm Supercomputer to Petaflops and Beyond

Foulk, James W.; Foulk, James W.; Hammond, Simon

Abstract not provided.

More Details

TYPE Presentation YEAR 2020

OSTI

Scalable generation of graphs for benchmarking HPC community-detection algorithms

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Slota, George M.; Berry, Jonathan; Hammond, Simon; Olivier, Stephen L.; Phillips, Cynthia A.; Rajamanickam, Sivasankaran

Community detection in graphs is a canonical social network analysis method. We consider the problem of generating suites of teras-cale synthetic social networks to compare the solution quality of parallel community-detection methods. The standard method, based on the graph generator of Lancichinetti, Fortunato, and Radicchi (LFR), has been used extensively for modest-scale graphs, but has inherent scalability limitations. We provide an alternative, based on the scalable Block Two-Level Erdos-Renyi (BTER) graph generator, that enables HPC-scale evaluation of solution quality in the style of LFR. Our approach varies community coherence, and retains other important properties. Our methods can scale real-world networks, e.g., to create a version of the Friendster network that is 512 times larger. With BTER's inherent scalability, we can generate a 15-terabyte graph (4.6B vertices, 925B edges) in just over one minute. We demonstrate our capability by showing that label-propagation community-detection algorithm can be strong-scaled with negligible solution-quality loss.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

Scalable Generation of Graphs for Benchmarking HPC Community-Detection Algorithms

Slota, George M.; Berry, Jonathan; Hammond, Simon; Olivier, Stephen L.; Phillips, Cynthia A.; Rajamanickam, Sivasankaran

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI

Page migration support for disaggregated non-volatile memories

ACM International Conference Proceeding Series

Kommareddy, Vamsee R.; Hammond, Simon; Hughes, Clayton; Samih, Ahmad; Awad, Amro

As demands for memory-intensive applications continue to grow, the memory capacity of each computing node is expected to grow at a similar pace. In high-performance computing (HPC) systems, the memory capacity per compute node is decided upon the most demanding application that would likely run on such system, and hence the average capacity per node in future HPC systems is expected to grow significantly. However, since HPC systems run many applications with different capacity demands, a large percentage of the overall memory capacity will likely be underutilized; memory modules can be thought of as private memory for its corresponding computing node. Thus, as HPC systems are moving towards the exascale era, a better utilization of memory is strongly desired. Moreover, upgrading memory system requires significant efforts. Fortunately, disaggregated memory systems promise better utilization by defining regions of global memory, typically referred to as memory blades, which can be accessed by all computing nodes in the system, thus achieving much better utilization. Disaggregated memory systems are expected to be built using dense, power-efficient memory technologies. Thus, emerging nonvolatile memories (NVMs) are placing themselves as the main building blocks for such systems. However, NVMs are slower than DRAM. Therefore, it is expected that each computing node would have a small local memory that is based on either HBM or DRAM, whereas a large shared NVM memory would be accessible by all nodes. Managing such system with global and local memory requires a novel hardware/software co-design to initiate page migration between global and local memory to maximize performance while enabling access to huge shared memory. In this paper we provide support to migrate pages, investigate such memory management aspects and the major system-level aspects that can affect design decisions in disaggregated NVM systems

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

Evaluating the Opportunities for Multi-Level Memory - An ASC 2016 L2 Milestone

Voskuilen, Gwendolyn R.; Frank, Michael P.; Hammond, Simon; Rodrigues, Arun

As new memory technologies appear on the market, there is a growing push to incorporate them into future architectures. Compared to traditional DDR DRAM, these technologies provide appealing advantages such as increased bandwidth or non-volatility. However, the technologies have significant downsides as well including higher cost, manufacturing complexity, and for non-volatile memories, higher latency and wear-out limitations. As such, no technology has emerged as a clear technological and economic winner. As a result, systems are turning to the concept of multi-level memory, or mixing multiple memory technologies in a single system to balance cost, performance, and reliability.

More Details

TYPE Other Report YEAR 2019

DOI OSTI

Abstract Machine Models and Proxy Architectures for Exascale Computing

Ang, James A.; Barrett, Richard F.; Benner, Robert E.; Burke, Daniel; Chan, Cy; Cook, Jeanine; Daley, Christopher S.; Donofrio, David; Hammond, Simon; Hemmert, Karl S.; Hoekstra, Robert J.; Ibrahim, Khaled; Kelly, Suzanne M.; Le, Hoang; Leung, Vitus J.; Michelogiannakis, George; Resnick, David R.; Rodrigues, Arun; Shalf, John; Stark, Dylan; Unat, D.; Wright, Nick J.; Voskuilen, Gwendolyn R.

To achieve exascale computing, fundamental hardware architectures must change. The most significant consequence of this assertion is the impact on the scientific and engineering applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. In order to adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency into the future. While many details of the exascale architectures are undefined, an abstract machine model is designed to allow application developers to focus on the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. We use the term proxy architecture to describe a parameterized version of an abstract machine model, with the parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models are formulated to enable discussion between the developers of analytic models and simulators and computer hardware architects. They allow for application performance analysis and hardware optimization opportunities. In this report our goal is to provide the application development community with a set of models that can help software developers prepare for exascale. In addition, through the use of proxy architectures, we can enable a more concrete exploration of how well new and evolving application codes map onto future architectures. This second version of the document addresses system scale considerations and provides a system-level abstract machine model with proxy architecture information.

More Details

TYPE SAND Report YEAR 2019

DOI OSTI

Balar: A SST GPU Component for Performance Modeling and Profiling

Hughes, Clayton; Hammond, Simon; Khairy, Mahmoud; Zhang, Mengchi; Green, Roland; Rogers, Timothy; Hoekstra, Robert J.

Programmable accelerators have become commonplace in modern computing systems. Advances in programming models and the availability of massive amounts of data have created a space for massively parallel accelerators capable of maintaining context for thousands of concurrent threads resident on-chip. These threads are grouped and interleaved on a cycle-by-cycle basis among several massively parallel computing cores. One path for the design of future supercomputers relies on an ability to model the performance of these massively parallel cores at scale. The SST framework has been proven to scale up to run simulations containing tens of thousands of nodes. A previous report described the initial integration of the open-source, execution-driven GPU simulator, GPGPU-Sim, into the SST framework. This report discusses the results of the integration and how to use the new GPU component in SST. It also provides examples of what it can be used to analyze and a correlation study showing how closely the execution matches that of a Nvidia V100 GPU when running kernels and mini-apps.

More Details

TYPE SAND Report YEAR 2019

DOI OSTI

Design Installation and Operation of the Vortex ART Platform

Gauntt, Nathan E.; Davis, Kevin; Repik, Jason J.; Brandt, James M.; Gentile, Ann C.; Hammond, Simon

ATS platforms are some of the largest, most complex, and most expensive computer systems installed in the United States at just a few major national laboratories. This milestone describes our recent efforts to procure, install, and test a machine called Vortex at Sandia National Laboratories that is compatible with the larger ATS platform Sierra at LLNL. In this milestone, we have 1) configured and procured a machine with similar hardware characteristics as Sierra ATS, 2) installed the machine, verified its physical hardware, and measured its baseline performance, and 3) demonstrated the machine's compatibility with Sierra ATS, and capacity for useful development and testing of Sandia computer codes (such as SPARC), including uses such as nightly regression testing workloads.

More Details

TYPE Other Report YEAR 2019

DOI OSTI

An Update on Astra: Experiences Running the first Petascale ARM Supercomputer

Younge, Andrew J.; Foulk, James W.; Foulk, James W.; Hammond, Simon

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Full-System Modeling and Simulation: Contributions Towards Coupling Contention and I/O

Chester, D.G.; Wright, S.A.; Hammond, Simon; Law, T.; Smedley-Stevenson, R.; Maheswaran, S.; Jarvis, S.A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

ASC CSSE Milestone 6812: SST-GPGPU

Hughes, Clayton; Hammond, Simon; Voskuilen, Gwendolyn R.; Rodrigues, Arun; Hemmert, Karl S.; Hoekstra, Robert J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

NNSA Explorations - Arm for Supercomputing

Hammond, Simon; Pritchard, Howard

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Investigating Fairness in Disaggregated Non-Volatile Memories

Proceedings of IEEE Computer Society Annual Symposium on VLSI, ISVLSI

Kommareddy, Vamsee R.; Hughes, Clayton; Hammond, Simon; Awad, Amro

Many applications have growing demands for memory, particularly in the HPC space, making the memory system a potential bottleneck of next-generation computing systems. Sharing the memory system across processor sockets and nodes becomes a compelling argument given that memory technology is scaling at a slower rate than processor technology. Moreover, as many applications rely on shared data, e.g., graph applications and database workloads, having a large number of nodes accessing shared memory allows for efficient use of resources and avoids duplicating huge files, which can be infeasible for large graphs or scientific data. As new memory technologies come on the market, the flexibility of upgrading memory and system updates become major a concern, disaggregated memory systems where memory is shared across different computing nodes, e.g., System-on-Chip (SoC), is expected to become the most common design/architecture on memory-centric systems, e.g., The Machine project from HP Labs. However, due to the nature of such systems, different users and applications compete for the available memory bandwidth, which can lead to severe contention due to memory traffic from different SoCs. In this paper, we discuss the contention problem in disaggregated memory systems and suggest mechanisms to ensure memory fairness and enforce QoS. Our simulation results show that employing our proposed QoS techniques can speed up memory response time by up to 55%.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus