Job Scheduling for HPC Clusters: Constraint Programming vs. Backfilling Approaches

Goponenko, Alexander V.; Lamar, Kenneth; Dechev, Damian; Allan, Benjamin A.; Brandt, James M.

Abstract not provided.

TYPE Conference Paper YEAR 2024

Assessment of Data-Management Infrastructure Needs for Production Use of Advanced Machine Learning and Artificial Intelligence: Tri-Lab Level II Milestone (8554)

Oldfield, Ron; Allan, Benjamin A.; Doutriaux, Charles; Lewis, Katherine; Ahrens, James; Sims, Benjamin; Sweeney, Christine; Banesh, Divya; Wofford, Quincy

A robust data-management infrastructure is a key enabler for National Security Enterprise (NSE) capabilities in artificial intelligence and machine learning. This document describes efforts from a team of researchers at Sandia National Laboratories, Los Alamos National Laboratory, and Lawrence Livermore National Laboratory to complete ASC Level II milestone #8854 “Assessment of Data-Management Infrastructure Needs for Production Use of Advanced Machine Learning and Artificial Intelligence.”

TYPE SAND Report YEAR 2023

William Seawright

Seawright, William J.; Allan, Benjamin A.

Abstract not provided.

TYPE Conference Poster YEAR 2023

LDMS Version 4.3+ Basics Tutorial

Walton, Sara P.; Allan, Benjamin A.; Brandt, James M.

Abstract not provided.

TYPE Conference Presentation YEAR 2023

Process Tracking

Allan, Benjamin A.

Abstract not provided.

TYPE Conference Presentation YEAR 2023

Metrics for Packing Efficiency and Fairness of HPC Cluster Batch Job Scheduling

Goponenko, Alexander; Lamar, Kenneth; Peterson, Christina; Allan, Benjamin A.; Brandt, James M.; Dechev, Damian

Abstract not provided.

TYPE Conference Presentation YEAR 2022

Metrics for Packing Efficiency and Fairness of HPC Cluster Batch Job Scheduling

Proceedings - Symposium on Computer Architecture and High Performance Computing

Goponenko, Alexander V.; Lamar, Kenneth; Peterson, Christina; Allan, Benjamin A.; Brandt, James M.; Dechev, Damian

Development of job scheduling algorithms, which directly influence the performance of High-Performance Computing (HPC) clusters, is hindered because popular scheduling quality metrics, such as Bounded Slowdown, correlate poorly with global scheduling objectives that include job packing efficiency and fairness. This report proposes Area Weighted Response Time, a metric that offers an unbiased representation of job packing efficiency, and presents a class of new metrics, Priority Weighted Specific Response Time, that assess both packing efficiency and fairness of schedules. Simulations of scheduling real workload traces, with the resulting schedules analyzed using both these metrics and conventional ones, demonstrate that although Bounded Slowdown can be readily improved by modifying the standard First Come First Served backfilling algorithm and by using existing techniques for estimating job runtime, these improvements are accompanied by significant degradation of job packing efficiency and fairness. In contrast, improving job packing efficiency and fairness over the standard backfilling algorithm, which is designed to target those objectives, is difficult. It requires further algorithm development and more accurate runtime estimation techniques that reduce the frequency of underpredictions.
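As a rough illustration of why the two kinds of metric can disagree, the sketch below computes Bounded Slowdown and an area-weighted response time for a toy schedule. The formulas are the ones commonly used in the scheduling literature (Bounded Slowdown with a runtime threshold, and response time weighted by job area, taken here as nodes times runtime); they are assumptions, not definitions taken from this paper, and the `Job` record, the function names, and the 10-second threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    submit: float   # submission time, seconds
    start: float    # start time chosen by the scheduler, seconds
    runtime: float  # actual runtime, seconds
    nodes: int      # number of nodes used


def bounded_slowdown(job, tau=10.0):
    """Commonly used form: response time over runtime, with runtime floored at tau."""
    wait = job.start - job.submit
    return max((wait + job.runtime) / max(job.runtime, tau), 1.0)


def area_weighted_response_time(jobs):
    """Mean response time weighted by job area (nodes * runtime); assumed definition."""
    total_area = sum(j.nodes * j.runtime for j in jobs)
    weighted = sum(
        j.nodes * j.runtime * (j.start - j.submit + j.runtime) for j in jobs
    )
    return weighted / total_area


# Toy schedule: one tiny job that waited, one large job that started at once.
small = Job(submit=0.0, start=100.0, runtime=10.0, nodes=1)
large = Job(submit=0.0, start=0.0, runtime=1000.0, nodes=100)

mean_bsld = (bounded_slowdown(small) + bounded_slowdown(large)) / 2
awrt = area_weighted_response_time([small, large])
```

In this toy case the mean Bounded Slowdown is dominated by the tiny job's wait, while the area-weighted value is dominated by the large job, which is the kind of divergence the abstract describes.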

TYPE Conference Paper YEAR 2022

LDMS Version 4.3 Tutorial Part 1: Basics

Brandt, James M.; Gentile, Ann C.; Walton, Sara P.; Allan, Benjamin A.; Tucker, Thomas

Abstract not provided.

TYPE Conference Presentation YEAR 2021

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Brandt, James M.; Cook, Jeanine; Aaziz, Omar R.; Allan, Benjamin A.; Devine, Karen; Foulk, James W.; Gentile, Ann C.; Hammond, Simon; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Foulk, James W.; Poliakoff, David; Pawlowski, Roger; Regier, Phillip; Schmitz, Mark E.; Schwaller, Benjamin; Surjadidjaja, Vanessa; Swan, Matthew S.; Tucker, Tom; Tucker, Nick; Vaughan, Courtenay T.; Walton, Sara P.

Abstract not provided.

TYPE Presentation YEAR 2021

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine; Devine, Karen; Elliott, James E.; Gentile, Ann C.; Hammond, Simon; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Foulk, James W.; Poliakoff, David; Pawlowski, Roger; Regier, Phillip; Schmitz, Mark E.; Schwaller, Benjamin; Surjadidjaja, Vanessa; Swan, Matthew S.; Tucker, Nick; Tucker, Thomas; Vaughan, Courtenay T.; Walton, Sara P.

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments (EMPIRE) modeling and design tool and resulted in 1 TB of application data and 50 TB of system data. EMPIRE developers remarked that this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for an expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase the total scientific output of Sandia's HPC users.

TYPE SAND Report YEAR 2021

Broader and simpler LDMS data sampling: lightning update

Allan, Benjamin A.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

An LDMS v3 function store configuration example from the NCSA analysis pipeline

Allan, Benjamin A.

Abstract not provided.

TYPE Presentation YEAR 2021

LDMS Testing on Target Systems

Allan, Benjamin A.

Abstract not provided.

TYPE Presentation YEAR 2021

Backfilling HPC Jobs with a Multimodal-Aware Predictor

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Lamar, Kenneth; Goponenko, Alexander; Peterson, Christina; Allan, Benjamin A.; Brandt, James M.; Dechev, Damian

Job scheduling aims to minimize the turnaround time of submitted jobs while catering to the resource constraints of High Performance Computing (HPC) systems. The challenge with scheduling is that it must honor job requirements and priorities while actual job run times are unknown. Although approaches have been proposed that use classification techniques or machine learning to predict job run times for scheduling purposes, these approaches do not provide a technique for reducing underprediction, which has a negative impact on scheduling quality. A common cause of underprediction is that the distribution of the duration for a job class is multimodal, causing the average job duration to fall below the expected duration of longer jobs. In this work, we propose the Top Percent predictor, which uses a hierarchical classification scheme to provide better accuracy for job run time predictions than the user-requested time. Our predictor addresses multimodal job distributions by making a prediction that is higher than a specified percentage of the observed job run times. We integrate the Top Percent predictor into scheduling algorithms and evaluate the performance using schedule quality metrics found in the literature. To accommodate the user policies of HPC systems, we propose priority metrics that account for job flow time, job resource requirements, and job priority. The experiments demonstrate that the Top Percent predictor outperforms the related approaches when evaluated using our proposed priority metrics.
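The percentile idea in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a nearest-rank percentile over the observed run times of a single job class, omits the hierarchical classification scheme entirely, and caps the estimate at the user-requested time (an assumption made here because jobs normally cannot run past their request). The function name and the sample data are hypothetical.

```python
import math


def top_percent_prediction(observed_runtimes, percent, requested_time):
    """Estimate a runtime at least as large as `percent` % of past runs.

    Falls back to the user-requested time when no history exists, and never
    exceeds it (assumption: the request is a hard upper limit).
    """
    if not observed_runtimes:
        return requested_time
    ordered = sorted(observed_runtimes)
    # Nearest-rank percentile: the smallest observation that is greater than
    # or equal to `percent` % of all observations.
    rank = max(1, math.ceil(percent / 100 * len(ordered)))
    return min(ordered[rank - 1], requested_time)


# Hypothetical bimodal job class: short runs near 10 s, long runs near 300 s.
runs = [10, 12, 11, 300, 310, 290, 11, 305]

mean_estimate = sum(runs) / len(runs)
top90_estimate = top_percent_prediction(runs, percent=90, requested_time=360)

# Count how many past runs each estimate would have underpredicted.
under_mean = sum(1 for r in runs if r > mean_estimate)
under_top90 = sum(1 for r in runs if r > top90_estimate)
```

With this bimodal sample the mean falls between the two modes and underpredicts every long run, whereas the 90th-percentile estimate underpredicts none of them, which is the effect the abstract attributes to multimodal distributions.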

TYPE Conference Paper YEAR 2021

LDMS Loadavg Sampler

Allan, Benjamin A.

Abstract not provided.

TYPE Presentation YEAR 2020

LDMS Monitoring of EDR InfiniBand Networks

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Allan, Benjamin A.; Aguilar, Michael J.; Schwaller, Benjamin; Langer, Steven

We introduce a new HPC system high-speed network fabric production monitoring tool, the ibnet sampler plugin for LDMS version 4. Large-scale testing of this tool is our work in progress. When deployed appropriately, the ibnet sampler plugin can provide extensive counter data, at frequencies up to 1 Hz. This allows the LDMS monitoring system to be useful for tracking the impact of new network features on production systems. We present preliminary results concerning reliability, performance impact, and usability of the sampler.
