Oldfield, Ron A.; Allan, Benjamin A.; Doutriaux, Charles; Lewis, Katherine; Ahrens, James; Sims, Benjamin; Sweeney, Christine; Banesh, Divya; Wofford, Quincy
A robust data-management infrastructure is a key enabler for National Security Enterprise (NSE) capabilities in artificial intelligence and machine learning. This document describes efforts from a team of researchers at Sandia National Laboratories, Los Alamos National Laboratory, and Lawrence Livermore National Laboratory to complete ASC Level II milestone #8854, “Assessment of Data-Management Infrastructure Needs for Production Use of Advanced Machine Learning and Artificial Intelligence.”
Development of job scheduling algorithms, which directly influence the performance of High-Performance Computing (HPC) clusters, is hindered because popular scheduling quality metrics, such as Bounded Slowdown, correlate poorly with global scheduling objectives that include job packing efficiency and fairness. This report proposes Area Weighted Response Time, a metric that offers an unbiased representation of job packing efficiency, and presents a class of new metrics, Priority Weighted Specific Response Time, that assess both the packing efficiency and the fairness of schedules. Simulations of scheduling real workload traces, with the resulting schedules analyzed using these metrics alongside conventional ones, demonstrate that although Bounded Slowdown can be readily improved by modifying the standard First-Come-First-Served backfilling algorithm and by using existing job run-time estimation techniques, these improvements are accompanied by significant degradation of job packing efficiency and fairness. In contrast, improving job packing efficiency and fairness over the standard backfilling algorithm, which is designed to target those objectives, is difficult: it requires further algorithm development and more accurate run-time estimation techniques that reduce the frequency of underprediction.
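As a rough illustration of the metric family above, here is a minimal sketch of an area-weighted mean response time. The assumption that a job's "area" is its node count times its run time is a plausible reading for illustration only; the abstract does not give the metric's actual definition, and the function name and job-tuple layout are hypothetical.

```python
# Hedged sketch: an area-weighted mean response time, under the assumption
# (not confirmed by the abstract) that a job's "area" is node count times
# run time, so large jobs are not hidden behind many tiny ones.
def area_weighted_response_time(jobs):
    """jobs: iterable of (nodes, runtime, response_time) tuples."""
    total_area = sum(n * r for n, r, _ in jobs)
    if total_area == 0:
        return 0.0
    return sum(n * r * resp for n, r, resp in jobs) / total_area

# A tiny job with a long wait barely moves the weighted mean, which is
# dominated by the large job's short response time.
jobs = [
    (1, 1.0, 5.0),     # 1 node-hour of area, response time 5
    (100, 10.0, 2.0),  # 1000 node-hours of area, response time 2
]
print(area_weighted_response_time(jobs))  # ~2.003
```

Weighting by resource-time area is one natural way to make a response-time average reflect packing efficiency: under an unweighted mean, a scheduler could look good by rushing many small jobs through while a few large jobs languish.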
Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments (EMPIRE) modeling and design tool and resulted in 1TB of application data and 50TB of system data. EMPIRE developers remarked that this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for an expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase the total scientific output of Sandia's HPC users.
Job scheduling aims to minimize the turnaround time of submitted jobs while catering to the resource constraints of High Performance Computing (HPC) systems. The challenge with scheduling is that it must honor job requirements and priorities while actual job run times are unknown. Although approaches have been proposed that use classification techniques or machine learning to predict job run times for scheduling purposes, these approaches do not provide a technique for reducing underprediction, which has a negative impact on scheduling quality. A common cause of underprediction is that the distribution of the duration for a job class is multimodal, causing the average job duration to fall below the expected duration of longer jobs. In this work, we propose the Top Percent predictor, which uses a hierarchical classification scheme to provide better accuracy for job run time predictions than the user-requested time. Our predictor addresses multimodal job distributions by making a prediction that is higher than a specified percentage of the observed job run times. We integrate the Top Percent predictor into scheduling algorithms and evaluate the performance using schedule quality metrics found in the literature. To accommodate the user policies of HPC systems, we propose priority metrics that account for job flow time, job resource requirements, and job priority. The experiments demonstrate that the Top Percent predictor outperforms the related approaches when evaluated using our proposed priority metrics.
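The percentile-style prediction described above (a value higher than a specified percentage of the observed run times) can be sketched as follows. This is a simplified illustration, not the milestone's implementation: the real predictor uses a hierarchical job-classification scheme, whereas this sketch assumes a flat list of historical run times for one job class, and the function name and parameters are hypothetical.

```python
# Hedged sketch in the spirit of the Top Percent predictor: predict a
# run time that exceeds `percent`% of the observed run times, so that
# the longer mode of a multimodal distribution is not underpredicted
# the way a simple average would be.
import math

def top_percent_predict(observed_runtimes, percent=90, requested_time=None):
    """Return a run-time estimate exceeding `percent`% of observed values.

    Falls back to the user-requested time when no history exists, and
    caps the prediction at the requested time, since jobs are killed
    at their requested limit anyway.
    """
    if not observed_runtimes:
        return requested_time
    ordered = sorted(observed_runtimes)
    # Smallest sample that is >= `percent`% of the observations.
    idx = min(len(ordered) - 1, math.ceil(percent / 100 * len(ordered)) - 1)
    prediction = ordered[idx]
    if requested_time is not None:
        prediction = min(prediction, requested_time)
    return prediction

# Bimodal history: short jobs near 10 minutes, long jobs near 120.
# The mean (~58) would badly underpredict the long jobs; the 90th-
# percentile prediction covers them.
history = [10, 11, 9, 12, 118, 122, 125]
print(top_percent_predict(history, percent=90, requested_time=180))  # 125
```

The cap at the requested time reflects a common scheduling constraint: predicting beyond the user's requested limit gains nothing, since the scheduler must reserve at most that much time regardless.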
We introduce a new HPC system high-speed network fabric production monitoring tool, the ibnet sampler plugin for LDMS version 4. Large-scale testing of this tool is a work in progress. When deployed appropriately, the ibnet sampler plugin can provide extensive counter data at frequencies up to 1 Hz. This allows the LDMS monitoring system to be useful for tracking the impact of new network features on production systems. We present preliminary results concerning the reliability, performance impact, and usability of the sampler.
This report summarizes a set of figures of merit of interest in monitoring the hardware and hardware usage in a Sandia high performance computing (HPC) center. These figures are computable from high frequency monitoring data and other non-metric data and may aid administrators and customer support personnel in their decision processes. The figures are derived from interviews of the HPC center staff. The figures are in many cases simplistic data reductions, but they are our initial targets in creating dashboards that turn voluminous monitoring data into actionable information. Because simplistic reductions may obscure as well as reveal the situation under study, we also document the necessary 'drill-down' and 'exploration' views needed to make the data better understood quickly. These figures of merit may be compared to dashboarding tools documented by other HPC centers.