Publications

Results 1–25 of 171

Search results

Jump to search filters

Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Boito, Francieli; Brandt, James M.; Cardellini, Valeria; Carns, Philip; Ciorba, Florina M.; Egan, Hilary; Eleliemy, Ahmed; Gentile, Ann C.; Gruber, Thomas; Hanson, Jeff; Haus, Utz U.; Huck, Kevin; Ilsche, Thomas; Jakobsche, Thomas; Jones, Terry; Karlsson, Sven; Mueen, Abdullah; Ott, Michael; Patki, Tapasya; Peng, Ivy; Raghavan, Krishnan; Simms, Stephen; Shoga, Kathleen; Showerman, Michael; Tiwari, Devesh; Wilde, Torsten; Yamamoto, Keiji

Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more effective than current human-in-the-loop approaches which are laborious and error prone. Progress has been limited, however, by factors such as the lack of infrastructure and feedback hooks, and successful deployment is often site- and case-specific. In this position paper we report on the outcomes and plans from a recent Dagstuhl Seminar, seeking to carve a path for community progress in the development of autonomous feedback loops for MODA, based on the established formalism of similar (MAPE-K) loops in autonomous computing and self-adaptive systems. By defining and developing such loops for significant cases experienced across HPC sites, we seek to extract commonalities and develop conventions that will facilitate interoperability and interchangeability with system hardware, software, and applications across different sites, and will motivate vendors and others to provide telemetry interfaces and feedback hooks to enable community development and pervasive deployment of MODA autonomy loops.

More Details

Metrics for Packing Efficiency and Fairness of HPC Cluster Batch Job Scheduling

Proceedings - Symposium on Computer Architecture and High Performance Computing

Goponenko, Alexander V.; Lamar, Kenneth; Peterson, Christina; Allan, Benjamin A.; Brandt, James M.; Dechev, Damian

Development of job scheduling algorithms, which directly influence High-Performance Computing (HPC) clusters performance, is hindered because popular scheduling quality metrics, such as Bounded Slowdown, poorly correlate with global scheduling objectives that include job packing efficiency and fairness. This report proposes Area Weighted Response Time, a metric that offers an unbiased representation of job packing efficiency, and presents a class of new metrics, Priority Weighted Specific Response Time, that assess both packing efficiency and fairness of schedules. The provided examples of simulation of scheduling of real workload traces and analysis of the resulting schedules with the help of these metrics and conventional metrics, demonstrate that although Bounded Slowdown can be readily improved by modifying the standard First Come First Served backfilling algorithm and by using existing techniques of estimating job runtime, these improvements are accompanied by significant degradation of job packing efficiency and fairness. In contrast, improving job packing efficiency and fairness over the standard backfilling algorithm, which is designed to target those objectives, is difficult. It requires further algorithm development and more accurate runtime estimation techniques that reduce frequency of underpredictions.

More Details

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine; Devine, Karen; Elliott, James E.; Gentile, Ann C.; Hammond, Simon; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Foulk, James W.; Poliakoff, David; Pawlowski, Roger; Regier, Phillip; Schmitz, Mark E.; Schwaller, Benjamin; Surjadidjaja, Vanessa; Swan, Matthew S.; Tucker, Nick; Tucker, Thomas; Vaughan, Courtenay T.; Walton, Sara P.

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments ( EMPIRE ) modeling and design tool and resulted in 1TB of application data and 50TB of system data. EMPIRE developers remarked this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase total scientific output of Sandia's HPC users.

More Details

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Brandt, James M.; Cook, Jeanine; Aaziz, Omar R.; Allan, Benjamin A.; Devine, Karen; Foulk, James W.; Gentile, Ann C.; Hammond, Simon; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Foulk, James W.; Poliakoff, David; Pawlowski, Roger; Regier, Phillip; Schmitz, Mark E.; Schwaller, Benjamin; Surjadidjaja, Vanessa; Swan, Matthew S.; Tucker, Tom; Tucker, Nick; Vaughan, Courtenay T.; Walton, Sara P.

Abstract not provided.

Results 1–25 of 171
Results 1–25 of 171