Publications

Results 1–25 of 171

Job Scheduling for HPC Clusters: Constraint Programming vs. Backfilling Approaches

Goponenko, Alexander V.; Lamar, Kenneth; Dechev, Damian; Allan, Benjamin A.; Brandt, James M.

Abstract not provided.

TYPE Conference Paper YEAR 2024

DOI OSTI

Building LDMS Samplers for Slingshot Switches

Brandt, James M.; Stroup, Kevin D.; Gentile, Ann C.; Lueninghoener, Cory D.; Donato, Evan

Abstract not provided.

TYPE Conference Presentation YEAR 2024

DOI OSTI

DP-HPC: Bringing Differential Privacy to HPC Systems Log Sharing and Analysis

Solorzano, Ana; Basu Roy, Rohan; Tiwari, Devesh; Schwaller, Benjamin; Walton, Sara P.; Brandt, James M.

TYPE Conference Paper YEAR 2024

OSTI

Multivariate Time Series Clustering for HPC Performance Monitoring

Schwaller, Benjamin; Brandt, James M.

Uses dynamic time warping (DTW) to cluster HPC monitoring data collected by LDMS.

TYPE Conference Paper YEAR 2023

OSTI

Towards Practical Machine Learning Frameworks for Performance Diagnostics in Supercomputers

Schwaller, Benjamin; Brandt, James M.; Leung, Vitus J.

Abstract not provided.

TYPE Conference Paper YEAR 2023

OSTI

LDMS Version 4.3+ Basics Tutorial

Walton, Sara P.; Allan, Benjamin A.; Brandt, James M.

Abstract not provided.

TYPE Conference Presentation YEAR 2023

DOI OSTI

Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Boito, Francieli; Brandt, James M.; Cardellini, Valeria; Carns, Philip; Ciorba, Florina M.; Egan, Hilary; Eleliemy, Ahmed; Gentile, Ann C.; Gruber, Thomas; Hanson, Jeff; Haus, Utz U.; Huck, Kevin; Ilsche, Thomas; Jakobsche, Thomas; Jones, Terry; Karlsson, Sven; Mueen, Abdullah; Ott, Michael; Patki, Tapasya; Peng, Ivy; Raghavan, Krishnan; Simms, Stephen; Shoga, Kathleen; Showerman, Michael; Tiwari, Devesh; Wilde, Torsten; Yamamoto, Keiji

Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more effective than current human-in-the-loop approaches, which are laborious and error-prone. Progress has been limited, however, by factors such as the lack of infrastructure and feedback hooks, and successful deployment is often site- and case-specific. In this position paper we report on the outcomes and plans from a recent Dagstuhl Seminar, seeking to carve a path for community progress in the development of autonomous feedback loops for MODA, based on the established formalism of similar (MAPE-K) loops in autonomous computing and self-adaptive systems. By defining and developing such loops for significant cases experienced across HPC sites, we seek to extract commonalities and develop conventions that will facilitate interoperability and interchangeability with system hardware, software, and applications across different sites, and will motivate vendors and others to provide telemetry interfaces and feedback hooks to enable community development and pervasive deployment of MODA autonomy loops.

TYPE Conference Paper YEAR 2023

DOI OSTI Scopus

Darshan I/O Runtime Monitoring

Brandt, James M.

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

AppSysFusion: CoMingling of appropriate data to drive Codesign of Applications, HPC Platforms, and Monitoring, Analysis, and Feedback Infrastructure

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Metrics for Packing Efficiency and Fairness of HPC Cluster Batch Job Scheduling

Goponenko, Alexander; Lamar, Kenneth; Peterson, Christina; Allan, Benjamin A.; Brandt, James M.; Dechev, Damian

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Moving Towards Autonomous HPC Facilities

Brandt, James M.

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Fallout: A Monitoring Infrastructure Supporting Informed System Acceptance

Brandt, James M.; Showerman, Mike; Roman, Eric; Greenseid, Joe; Tucker, Tom

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

ALBADross: Active Learning Based Anomaly Diagnosis for Production HPC Systems

Aksar, Burak; Sencan, Efe; Schwaller, Benjamin; Aaziz, Omar R.; Kulis, Brian; Coskun, Ayse K.; Leung, Vitus J.; Brandt, James M.

Abstract not provided.

TYPE Conference Proceeding YEAR 2022

DOI OSTI

Metrics for Packing Efficiency and Fairness of HPC Cluster Batch Job Scheduling

Proceedings - Symposium on Computer Architecture and High Performance Computing

Goponenko, Alexander V.; Lamar, Kenneth; Peterson, Christina; Allan, Benjamin A.; Brandt, James M.; Dechev, Damian

Development of job scheduling algorithms, which directly influence the performance of High-Performance Computing (HPC) clusters, is hindered because popular scheduling quality metrics, such as Bounded Slowdown, correlate poorly with global scheduling objectives that include job packing efficiency and fairness. This report proposes Area Weighted Response Time, a metric that offers an unbiased representation of job packing efficiency, and presents a class of new metrics, Priority Weighted Specific Response Time, that assess both the packing efficiency and the fairness of schedules. Simulations of scheduling real workload traces, with the resulting schedules analyzed using both these metrics and conventional ones, demonstrate that although Bounded Slowdown can be readily improved by modifying the standard First Come First Served backfilling algorithm and by using existing techniques for estimating job runtime, these improvements are accompanied by significant degradation of job packing efficiency and fairness. In contrast, improving job packing efficiency and fairness over the standard backfilling algorithm, which is designed to target those objectives, is difficult. It requires further algorithm development and more accurate runtime estimation techniques that reduce the frequency of underpredictions.

TYPE Conference Paper YEAR 2022

DOI OSTI Scopus

LDMS Version 4.3.8 Advanced Tutorial: Part 1

Brandt, James M.; Gentile, Ann C.; Tucker, Tom

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

LDMS Version 4.3 Tutorial Part 1: Basics

Brandt, James M.; Gentile, Ann C.; Walton, Sara P.; Allan, Benjamin A.; Tucker, Thomas

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

LDMS Version 4.3.8 Advanced Tutorial: Part 2

Brandt, James M.; Gentile, Ann C.; Tucker, Tom

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine; Devine, Karen; Elliott, James E.; Gentile, Ann C.; Hammond, Simon; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Foulk, James W.; Poliakoff, David; Pawlowski, Roger; Regier, Phillip; Schmitz, Mark E.; Schwaller, Benjamin; Surjadidjaja, Vanessa; Swan, Matthew S.; Tucker, Nick; Tucker, Thomas; Vaughan, Courtenay T.; Walton, Sara P.

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments (EMPIRE) modeling and design tool and resulted in 1 TB of application data and 50 TB of system data. EMPIRE developers remarked that this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for an expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase the total scientific output of Sandia's HPC users.

TYPE SAND Report YEAR 2021

DOI OSTI

E2EWatch: End-to-end Anomaly Diagnosis Framework for Production HPC Systems

Aksar, Burak; Zhang, Yijia; Ates, Emre; Aaziz, Omar R.; Schwaller, Benjamin; Brandt, James M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Brandt, James M.; Cook, Jeanine; Aaziz, Omar R.; Allan, Benjamin A.; Devine, Karen; Foulk, James W.; Gentile, Ann C.; Hammond, Simon; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Foulk, James W.; Poliakoff, David; Pawlowski, Roger; Regier, Phillip; Schmitz, Mark E.; Schwaller, Benjamin; Surjadidjaja, Vanessa; Swan, Matthew S.; Tucker, Tom; Tucker, Nick; Vaughan, Courtenay T.; Walton, Sara P.

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Lessons From Examining Repetitive Job Behavior and I/O Performance Variability on a Production HPC System

Costa, Emily; Patel, Tirthak; Schwaller, Benjamin; Brandt, James M.; Tiwari, Devesh

Abstract not provided.

TYPE Conference Paper YEAR 2021

OSTI

Integrating Systems Operations into CoDesign

Gentile, Ann C.; Brandt, James M.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

Integrating System State and Application Performance Monitoring: Network Contention Impact

Brandt, James M.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

E2EWatch: End-to-end Anomaly Diagnosis Framework for Production HPC Systems

Aksar, Burak; Zhang, Yijia; Ates, Emre; Aaziz, Omar R.; Schwaller, Benjamin; Brandt, James M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

Abstract not provided.

TYPE Conference Paper YEAR 2021

DOI OSTI

Enabling Application and System Data Fusion

Gentile, Ann C.; Brandt, James M.; Cook, Jeanine; Hammond, Simon; Poliakoff, David; Schwaller, Benjamin; Surjadidjaja, Vanessa; Tucker, Thomas O.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI
