Publications

Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Boito, Francieli; Brandt, James M.; Cardellini, Valeria; Carns, Philip; Ciorba, Florina M.; Egan, Hilary; Eleliemy, Ahmed; Gentile, Ann C.; Gruber, Thomas; Hanson, Jeff; Haus, Utz U.; Huck, Kevin; Ilsche, Thomas; Jakobsche, Thomas; Jones, Terry; Karlsson, Sven; Mueen, Abdullah; Ott, Michael; Patki, Tapasya; Peng, Ivy; Raghavan, Krishnan; Simms, Stephen; Shoga, Kathleen; Showerman, Michael; Tiwari, Devesh; Wilde, Torsten; Yamamoto, Keiji

Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more effective than current human-in-the-loop approaches, which are laborious and error-prone. Progress has been limited, however, by factors such as the lack of infrastructure and feedback hooks, and successful deployment is often site- and case-specific. In this position paper we report on the outcomes and plans from a recent Dagstuhl Seminar, seeking to carve a path for community progress in the development of autonomous feedback loops for MODA, based on the established formalism of similar (MAPE-K) loops in autonomic computing and self-adaptive systems. By defining and developing such loops for significant cases experienced across HPC sites, we seek to extract commonalities and develop conventions that will facilitate interoperability and interchangeability with system hardware, software, and applications across different sites, and will motivate vendors and others to provide telemetry interfaces and feedback hooks to enable community development and pervasive deployment of MODA autonomy loops.
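
The MAPE-K formalism referenced above organizes a feedback loop into Monitor, Analyze, Plan, and Execute stages operating over a shared Knowledge base. As a rough illustration only, and not the seminar's prescribed design, the Python sketch below shows one way such a loop could be wired around HPC telemetry; the metric names, threshold, and response action are hypothetical.

```python
import time

class Knowledge:
    """Shared state for the loop: recent samples plus site policy (here, one threshold)."""
    def __init__(self):
        self.history = []                 # recent telemetry samples
        self.write_bw_limit_gbs = 5.0     # hypothetical per-job I/O threshold

def monitor(knowledge):
    """Collect one telemetry sample (stubbed; a real site would query its MODA stack)."""
    sample = {"job": "1234", "fs_write_gbs": 7.2, "time": time.time()}
    knowledge.history.append(sample)
    return sample

def analyze(sample, knowledge):
    """Flag conditions that may warrant a response."""
    if sample["fs_write_gbs"] > knowledge.write_bw_limit_gbs:
        return {"event": "io_contention", "job": sample["job"]}
    return None

def plan(event):
    """Choose a mitigation for the detected event."""
    if event and event["event"] == "io_contention":
        return {"action": "throttle_io", "job": event["job"]}
    return None

def execute(action):
    """Apply the response through a site-specific feedback hook (stubbed here)."""
    if action:
        print(f"would apply {action['action']} to job {action['job']}")

def mape_k_loop(iterations=3, period_s=1.0):
    knowledge = Knowledge()
    for _ in range(iterations):
        sample = monitor(knowledge)
        event = analyze(sample, knowledge)
        execute(plan(event))
        time.sleep(period_s)

if __name__ == "__main__":
    mape_k_loop()
```

The interoperability goal in the abstract amounts to agreeing on the interfaces between these stages, so that a site could swap in its own monitoring source or execution hook without rewriting the rest of the loop.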

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine; Devine, Karen; Elliott, James E.; Gentile, Ann C.; Hammond, Simon; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Foulk, James W.; Poliakoff, David; Pawlowski, Roger; Regier, Phillip; Schmitz, Mark E.; Schwaller, Benjamin; Surjadidjaja, Vanessa; Swan, Matthew S.; Tucker, Nick; Tucker, Thomas; Vaughan, Courtenay T.; Walton, Sara P.

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments (EMPIRE) modeling and design tool, which resulted in 1 TB of application data and 50 TB of system data. EMPIRE developers remarked that this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for an expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase the total scientific output of Sandia's HPC users.
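
The milestone combines application metrics emitted at run time (via the Kokkos extension) with system metrics collected by LDMS along a shared timeline. The sketch below is a simplified, hypothetical Python illustration of that idea: joining the two streams on time so derived views can show application behavior against concurrent system state. The field names and nearest-timestamp join are illustrative assumptions, not the milestone's implementation.

```python
from bisect import bisect_left

# Hypothetical application samples (e.g., per-timestep kernel timings emitted by the application)
app_samples = [
    {"t": 100.0, "kernel_time_s": 1.8},
    {"t": 110.0, "kernel_time_s": 2.4},
]

# Hypothetical system samples (e.g., node memory bandwidth sampled out-of-band)
sys_samples = [
    {"t": 99.5, "mem_bw_gbs": 120.0},
    {"t": 109.8, "mem_bw_gbs": 95.0},
]

def nearest_system_sample(t, samples):
    """Return the system sample whose timestamp is closest to t (samples sorted by time)."""
    times = [s["t"] for s in samples]
    i = bisect_left(times, t)
    candidates = samples[max(0, i - 1): i + 1]
    return min(candidates, key=lambda s: abs(s["t"] - t))

def join_streams(app, system):
    """Align application and system metrics on time for combined analysis or visualization."""
    for a in app:
        s = nearest_system_sample(a["t"], system)
        yield {"t": a["t"],
               "kernel_time_s": a["kernel_time_s"],
               "mem_bw_gbs": s["mem_bw_gbs"]}

for row in join_streams(app_samples, sys_samples):
    print(row)
```

At the scale described in the abstract (terabytes of data from week-long runs), such alignment would have to happen within the storage and analysis pipeline rather than in memory, but the join-on-time idea is the same.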

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Brandt, James M.; Cook, Jeanine; Aaziz, Omar R.; Allan, Benjamin A.; Devine, Karen; Gentile, Ann C.; Hammond, Simon; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Foulk, James W.; Poliakoff, David; Pawlowski, Roger; Regier, Phillip; Schmitz, Mark E.; Schwaller, Benjamin; Surjadidjaja, Vanessa; Swan, Matthew S.; Tucker, Tom; Tucker, Nick; Vaughan, Courtenay T.; Walton, Sara P.

Abstract not provided.

ALAMO: Autonomous lightweight allocation, management, and optimization

Communications in Computer and Information Science

Brightwell, Ronald B.; Ferreira, Kurt; Grant, Ryan; Levy, Scott L.N.; Lofstead, Gerald F.; Olivier, Stephen L.; Foulk, James W.; Younge, Andrew J.; Gentile, Ann C.

Several recent workshops conducted by the DOE Advanced Scientific Computing Research program have established that the complexity of developing applications and executing them on high-performance computing (HPC) systems is rising at a rate that will make it nearly impossible to continue to achieve higher levels of performance and scalability. Absent an alternative approach to managing this ever-growing complexity, HPC systems will become increasingly difficult to use. A more holistic approach to designing and developing applications and managing system resources is required. This paper outlines a research strategy for managing this increasing complexity by providing the programming environment, software stack, and hardware capabilities needed for autonomous resource management of HPC systems. Developing portable applications for a variety of HPC systems of varying scale requires a paradigm shift from the current approach, where applications are painstakingly mapped to individual machine resources, to an approach where machine resources are automatically mapped and optimized to applications as they execute. Achieving such automated resource management for HPC systems is a daunting challenge that requires significant sustained investment in exploring new approaches and novel capabilities in software and hardware that span the spectrum from programming systems to device-level mechanisms. This paper provides an overview of the functionality needed to enable autonomous resource management and optimization and describes the components currently being explored at Sandia National Laboratories to help support this capability.
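
The paradigm shift described above, from hand-mapping applications onto resources to mapping resources onto applications as they execute, implies a control policy that adjusts allocations based on observed behavior. The following Python fragment is a deliberately minimal sketch of such a policy; the efficiency metric, thresholds, and doubling/halving rule are assumptions for illustration and are not drawn from the paper.

```python
def adjust_allocation(current_nodes, observed_efficiency,
                      low=0.4, high=0.85, max_nodes=256):
    """Grow or shrink a job's node count based on observed parallel efficiency.

    A real autonomous resource manager would draw on much richer telemetry and
    coordinate with the scheduler; this only shows the shape of the decision.
    """
    if observed_efficiency > high and current_nodes < max_nodes:
        return min(max_nodes, current_nodes * 2)   # scaling well: offer more resources
    if observed_efficiency < low and current_nodes > 1:
        return max(1, current_nodes // 2)          # scaling poorly: reclaim resources
    return current_nodes

print(adjust_allocation(64, 0.9))   # -> 128
print(adjust_allocation(64, 0.3))   # -> 32
```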

Design, Installation, and Operation of the Vortex ART Platform

Gauntt, Nathan E.; Davis, Kevin; Repik, Jason J.; Brandt, James M.; Gentile, Ann C.; Hammond, Simon

ATS platforms are some of the largest, most complex, and most expensive computer systems installed in the United States at just a few major national laboratories. This milestone describes our recent efforts to procure, install, and test a machine called Vortex at Sandia National Laboratories that is compatible with the larger ATS platform Sierra at LLNL. In this milestone, we have 1) configured and procured a machine with hardware characteristics similar to the Sierra ATS, 2) installed the machine, verified its physical hardware, and measured its baseline performance, and 3) demonstrated the machine's compatibility with the Sierra ATS and its capacity for useful development and testing of Sandia computer codes (such as SPARC), including uses such as nightly regression testing workloads.

A study of network congestion in two supercomputing high-speed interconnects

Proceedings - 2019 IEEE Symposium on High-Performance Interconnects, HOTI 2019

Jha, Saurabh; Patke, Archit; Brandt, James M.; Gentile, Ann C.; Showerman, Mike; Roman, Eric; Kalbarczyk, Zbigniew T.; Kramer, Bill; Iyer, Ravishankar K.

Network congestion in high-speed interconnects is a major source of application runtime performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system level. However, these studies are based on proxy applications and benchmarks that are not representative of field-congestion characteristics of high-speed interconnects. To address this gap, we present (a) an end-to-end framework for monitoring and analysis to support long-term field-congestion characterization studies, and (b) an empirical study of network congestion in petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D torus topology, and (ii) Cray Aries, which uses the Dragonfly topology.
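
Field-congestion characterization of this kind typically derives a congestion indicator from per-link network counters sampled over time. The Python sketch below shows one generic form of such an indicator, the fraction of sampled cycles a link spends stalled; the counter names, threshold, and formula are assumptions for illustration rather than the paper's exact methodology.

```python
def stall_ratio(stalled_cycles, total_cycles):
    """Fraction of sampled cycles a link spent stalled (0.0 = unimpeded, 1.0 = fully blocked)."""
    if total_cycles == 0:
        return 0.0
    return stalled_cycles / total_cycles

def congested_links(link_counters, threshold=0.25):
    """Return the IDs of links whose stall ratio over the sampling interval exceeds a threshold."""
    return [link for link, (stalled, total) in link_counters.items()
            if stall_ratio(stalled, total) > threshold]

# Hypothetical counters for three links over one sampling interval: (stalled_cycles, total_cycles)
counters = {"l0": (10, 1000), "l1": (400, 1000), "l2": (900, 1000)}
print(congested_links(counters))   # -> ['l1', 'l2']
```

Collected continuously across all links of a Gemini torus or an Aries Dragonfly network, indicators like this are what allow long-term field studies to distinguish transient hot spots from persistent congestion.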

Large-Scale System Monitoring Experiences and Recommendations

Ahlgren, V.; Andersson, S.; Brandt, James M.; Cardo, N.; Chunduri, S.; Enos, J.; Fields, P.; Gentile, Ann C.; Gerber, R.; Gienger, M.; Greenseid, J.; Greiner, A.; Hadri, B.; He, Y.; Hoppe, D.; Kaila, U.; Kelly, K.; Klein, M.; Kristiansen, A.; Leak, S.; Mason, M.; Foulk, James W.; Piccinali, J-G; Repik, Jason J.; Rogers, J.; Salminen, S.; Showerman, M.; Whitney, C.; Williams, J.

Abstract not provided.

Integrating low-latency analysis into HPC system monitoring

ACM International Conference Proceeding Series

Izadpanah, Ramin; Naksinehaboon, Nichamon; Brandt, James M.; Gentile, Ann C.; Dechev, Damian

The growth of High Performance Computing (HPC) systems increases the complexity of understanding resource utilization, system management, and performance issues. While raw performance data is increasingly exposed at the component level, the usefulness of the data depends on the ability to perform meaningful analysis on actionable timescales. However, current system monitoring infrastructures largely focus on data collection, with analysis performed off-system in post-processing mode. This increases the time required to provide analysis and feedback to a variety of consumers. In this work, we enhance the architecture of a monitoring system used on large-scale computational platforms to integrate streaming analysis capabilities at arbitrary locations within its data collection, transport, and aggregation facilities. We leverage the flexible communication topology of the monitoring system to enable placement of transformations based on overhead concerns, while still enabling low-latency exposure on node. Our design internally supports and exposes the raw and transformed data uniformly for both node-level and off-system consumers. We show the viability of our implementation for a case with production relevance: run-time determination of relative per-node file system demands.
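
The production-relevant case mentioned above, run-time determination of relative per-node file system demand, is a natural example of a transform that can run in-stream as samples move through the monitoring fabric. The Python class below is a simplified sketch of that idea, assuming cumulative write counters in each sample; field names and the rate computation are illustrative, not the paper's implementation.

```python
class FsDemandTransform:
    """Streaming transform: turn cumulative per-node write counters into write rates.

    In the architecture described above, such a transform could be placed at a
    collection, transport, or aggregation point; here it is simply fed one
    sample at a time.
    """
    def __init__(self):
        self.last = {}   # node -> (timestamp, cumulative_bytes)

    def process(self, sample):
        node, t, total = sample["node"], sample["time"], sample["bytes_written"]
        rate = None
        if node in self.last:
            t0, b0 = self.last[node]
            if t > t0:
                rate = (total - b0) / (t - t0)   # bytes per second since the last sample
        self.last[node] = (t, total)
        # Pass through the raw counters and add the derived value, mirroring the
        # uniform exposure of raw and transformed data described in the abstract.
        return {**sample, "write_rate_Bps": rate}

transform = FsDemandTransform()
print(transform.process({"node": "n01", "time": 0.0, "bytes_written": 0}))
print(transform.process({"node": "n01", "time": 10.0, "bytes_written": 5_000_000}))
```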

Cray System Monitoring: Successes, Requirements, and Priorities

Ahlgren, Ville; Andersson, Stefan; Brandt, James M.; Cardo, Nicholas; Chunduri, Sudheer; Enos, Jeremy; Fields, Parks; Gentile, Ann C.; Gerber, Richard; Greenseid, Joe; Greiner, Annette; Hadri, Bilel; He, Yun; Hoppe, Dennis; Kaila, Urpo; Kelly, Kaki; Klein, Mark; Kristiansen, Alex; Leak, Steve; Mason, Mike; Pedretti, Kevin; Piccinali, Jean-Guillaume; Repik, Jason J.; Rogers, Jim; Salminen, Susanna; Showerman, Mike; Whitney, Cary; Williams, Jim

Abstract not provided.

Holistic measurement-driven system assessment

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Jha, Saurabh; Brandt, James M.; Gentile, Ann C.; Kalbarczyk, Zbigniew; Bauer, Greg; Enos, Jeremy; Showerman, Michael; Kaplan, Larry; Bode, Brett; Greiner, Annette; Bonnie, Amanda; Mason, Mike; Iyer, Ravishankar K.; Kramer, William

In high-performance computing systems, application performance and throughput depend on a complex interplay of hardware and software subsystems and variable workloads with competing resource demands. Data-driven insights into the potentially widespread scope and propagation of impact of events, such as faults and contention for shared resources, can be used to drive more effective use of resources, improve root cause diagnosis, and predict performance impacts. We present work developing integrated capabilities for holistic monitoring and analysis to understand and characterize the propagation of performance-degrading events. These characterizations can be used to determine and invoke mitigating responses by system administrators, applications, and system software.
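
Characterizing the propagation of a performance-degrading event generally begins with an overlap question: which nodes or jobs showed degraded behavior while the event (a fault, or contention on a shared resource) was active? The Python sketch below expresses that test in its simplest form; the event and sample fields and the slowdown threshold are hypothetical, not the paper's measured quantities.

```python
def overlaps(event, sample):
    """True if the telemetry sample falls inside the event's time window."""
    return event["start"] <= sample["time"] <= event["end"]

def impacted_nodes(event, samples, slowdown_threshold=1.5):
    """Nodes whose observed slowdown during the event exceeds a threshold."""
    return sorted({s["node"] for s in samples
                   if overlaps(event, s) and s["slowdown"] > slowdown_threshold})

event = {"name": "shared_fs_contention", "start": 100.0, "end": 200.0}
samples = [
    {"node": "n01", "time": 120.0, "slowdown": 2.1},
    {"node": "n02", "time": 150.0, "slowdown": 1.1},
    {"node": "n03", "time": 250.0, "slowdown": 3.0},   # outside the event window
]
print(impacted_nodes(event, samples))   # -> ['n01']
```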

Final Review of FY17 ASC CSSE L2 Milestone #6018 entitled "Analyzing Power Usage Characteristics of Workloads Running on Trinity"

Hoekstra, Robert J.; Hammond, Simon; Hemmert, Karl S.; Gentile, Ann C.; Oldfield, Ron; Lang, Mike; Martin, Steve

The presentation documented the team's technical approach and a summary of the results with sufficient detail to demonstrate both the value and the completion of the milestone. A separate SAND report was also generated with more detail to supplement the presentation.
