Publications

Results 1–25 of 142

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine C.; Devine, Karen D.; Elliott, James E.; Gentile, Ann C.; Hammond, Simon D.; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Laros, James H.; Poliakoff, David Z.; Pawlowski, Roger P.; Regier, Phillip A.; Schmitz, Mark E.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Swan, Matthew S.; Tucker, Nick; Tucker, Thomas; Vaughan, Courtenay T.; Walton, Sara P.

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post-run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments (EMPIRE) modeling and design tool and resulted in 1TB of application data and 50TB of system data. EMPIRE developers remarked that this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for an expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase the total scientific output of Sandia's HPC users.

More Details

ALAMO: Autonomous lightweight allocation, management, and optimization

Communications in Computer and Information Science

Brightwell, Ronald B.; Ferreira, Kurt B.; Grant, Ryan E.; Levy, Scott L.; Lofstead, Gerald F.; Olivier, Stephen L.; Laros, James H.; Younge, Andrew J.; Gentile, Ann C.

Several recent workshops conducted by the DOE Advanced Scientific Computing Research program have established that the complexity of developing applications and executing them on high-performance computing (HPC) systems is rising at a rate that will make it nearly impossible to continue to achieve higher levels of performance and scalability. Absent an alternative approach to managing this ever-growing complexity, HPC systems will become increasingly difficult to use. A more holistic approach to designing and developing applications and managing system resources is required. This paper outlines a research strategy for managing this increasing complexity by providing the programming environment, software stack, and hardware capabilities needed for autonomous resource management of HPC systems. Developing portable applications for a variety of HPC systems of varying scale requires a paradigm shift from the current approach, where applications are painstakingly mapped to individual machine resources, to an approach where machine resources are automatically mapped and optimized to applications as they execute. Achieving such automated resource management for HPC systems is a daunting challenge that requires significant sustained investment in exploring new approaches and novel capabilities in software and hardware that span the spectrum from programming systems to device-level mechanisms. This paper provides an overview of the functionality needed to enable autonomous resource management and optimization and describes the components currently being explored at Sandia National Laboratories to help support this capability.

More Details

Design, Installation, and Operation of the Vortex ART Platform

Gauntt, Nathan E.; Davis, Kevin D.; Repik, Jason; Brandt, James M.; Gentile, Ann C.; Hammond, Simon D.

ATS platforms are some of the largest, most complex, and most expensive computer systems installed in the United States at just a few major national laboratories. This milestone describes our recent efforts to procure, install, and test a machine called Vortex at Sandia National Laboratories that is compatible with the larger ATS platform Sierra at LLNL. In this milestone, we have 1) configured and procured a machine with similar hardware characteristics as Sierra ATS, 2) installed the machine, verified its physical hardware, and measured its baseline performance, and 3) demonstrated the machine's compatibility with Sierra ATS, and capacity for useful development and testing of Sandia computer codes (such as SPARC), including uses such as nightly regression testing workloads.

More Details

A study of network congestion in two supercomputing high-speed interconnects

Proceedings - 2019 IEEE Symposium on High-Performance Interconnects, HOTI 2019

Jha, Saurabh; Patke, Archit; Brandt, James M.; Gentile, Ann C.; Showerman, Mike; Roman, Eric; Kalbarczyk, Zbigniew T.; Kramer, Bill; Iyer, Ravishankar K.

Network congestion in high-speed interconnects is a major source of application runtime performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system level. However, these studies are based on proxy applications and benchmarks that are not representative of field-congestion characteristics of high-speed interconnects. To address this gap, we present (a) an end-to-end framework for monitoring and analysis to support long-term field-congestion characterization studies, and (b) an empirical study of network congestion in petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D torus topology, and (ii) Cray Aries, which uses the DragonFly topology.

More Details

Large-Scale System Monitoring Experiences and Recommendations

Ahlgren, V.; Andersson, S.; Brandt, James M.; Cardo, N.; Chunduri, S.; Enos, J.; Fields, P.; Gentile, Ann C.; Gerber, R.; Gienger, M.; Greenseid, J.; Greiner, A.; Hadri, B.; He, Y.; Hoppe, D.; Kaila, U.; Kelly, K.; Klein, M.; Kristiansen, A.; Leak, S.; Mason, M.; Laros, James H.; Piccinali, J-G; Repik, Jason; Rogers, J.; Salminen, S.; Showerman, M.; Whitney, C.; Williams, J.

Abstract not provided.
