Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation

2021 IEEE High Performance Extreme Computing Conference, HPEC 2021

Zhang, Yijia; Aksar, Burak; Aaziz, Omar R.; Schwaller, Benjamin; Brandt, James M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

On high-performance computing (HPC) systems, job allocation strategies control the placement of a job among available nodes. Because placement changes a job's communication performance, allocation can significantly affect the execution times of many HPC applications. Existing allocation strategies typically make decisions based on resource limits, network topology, communication patterns, etc. However, system network performance at runtime is seldom consulted in allocation, even though it significantly affects job execution times. In this work, we demonstrate using monitoring data to improve HPC systems' performance by proposing a Network-Data-Driven (NeDD) job allocation framework, which monitors the network performance of an HPC system at runtime and allocates resources based on both network performance and job characteristics. NeDD characterizes system network performance by collecting the network traffic statistics on each router link, and it characterizes a job's sensitivity to network congestion by collecting Message Passing Interface (MPI) statistics. During allocation, NeDD pairs network-sensitive (network-insensitive) jobs with nodes whose parent routers have low (high) network traffic. Through experiments on a large HPC system, we demonstrate that NeDD reduces the execution time of parallel applications by 11% on average and up to 34%.

TYPE Conference Presentation YEAR 2021

DOI OSTI Scopus
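
The pairing rule in the NeDD abstract above (network-sensitive jobs go to nodes under lightly loaded routers, network-insensitive jobs to nodes under heavily loaded ones) can be illustrated with a minimal sketch. This is not the authors' implementation. The function name `allocate`, the boolean sensitivity flag, and the single traffic number per router are assumptions made for illustration; the paper derives sensitivity from MPI statistics and traffic from per-link router counters.

```python
from collections import deque


def allocate(jobs, node_router, router_traffic):
    """Greedy sketch of a network-data-driven pairing rule.

    jobs:           list of (job_id, is_network_sensitive, nodes_needed)
    node_router:    maps each free node to its parent router (hypothetical layout)
    router_traffic: maps each router to a traffic measurement (hypothetical metric)
    """
    # Order free nodes from least to most congested parent router.
    # The node name is a tie-breaker so the order is deterministic.
    free = deque(
        sorted(node_router, key=lambda n: (router_traffic[node_router[n]], n))
    )
    placement = {}
    for job_id, sensitive, count in jobs:
        if count > len(free):
            raise ValueError(f"not enough free nodes for job {job_id}")
        # Sensitive jobs take nodes from the low-traffic end,
        # insensitive jobs take nodes from the high-traffic end.
        take = free.popleft if sensitive else free.pop
        placement[job_id] = [take() for _ in range(count)]
    return placement


node_router = {"n0": "r0", "n1": "r0", "n2": "r1", "n3": "r1"}
router_traffic = {"r0": 10.0, "r1": 90.0}
jobs = [("fft", True, 2), ("ep", False, 2)]
placement = allocate(jobs, node_router, router_traffic)
```

In this toy input the sensitive job `fft` lands on the two nodes under the quiet router `r0`, and the insensitive job `ep` lands on the nodes under the busy router `r1`.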

Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Aksar, Burak; Zhang, Yijia; Ates, Emre; Schwaller, Benjamin; Aaziz, Omar R.; Leung, Vitus J.; Brandt, James M.; Egele, Manuel; Coskun, Ayse K.

Performance variation diagnosis in High-Performance Computing (HPC) systems is a challenging problem due to the size and complexity of the systems. Application performance variation leads to premature termination of jobs, decreased energy efficiency, or wasted computing resources. Manual root-cause analysis of performance variation based on system telemetry has become an increasingly time-intensive process as it relies on human experts and the size of telemetry data has grown. Recent methods use supervised machine learning models to automatically diagnose previously encountered performance anomalies in compute nodes. However, supervised machine learning models require large labeled data sets for training. This labeled data requirement is restrictive for many real-world application domains, including HPC systems, because collecting labeled data is challenging and time-consuming, especially considering anomalies that sparsely occur. This paper proposes a novel semi-supervised framework that diagnoses previously encountered performance anomalies in HPC systems using a limited number of labeled data points, which is more suitable for production system deployment. Our framework first learns performance anomalies’ characteristics by using historical telemetry data in an unsupervised fashion. In the following process, we leverage supervised classifiers to identify anomaly types. While most semi-supervised approaches do not typically use anomalous samples, our framework takes advantage of a few labeled anomalous samples to classify anomaly types. We evaluate our framework on a production HPC system and on a testbed HPC cluster. We show that our proposed framework achieves 60% F1-score on average, outperforming state-of-the-art supervised methods by 11%, and maintains an average 0.06% anomaly miss rate.

TYPE Conference Proceeding YEAR 2021

DOI OSTI Scopus

Backfilling HPC Jobs with a Multimodal-Aware Predictor

Proceedings - IEEE International Conference on Cluster Computing, ICCC

HPC Monitoring & Analysis + Power 9 Specifics

Brandt, James M.

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

Lightweight Distributed Metric Service: Deployments Enhancements Roadmap and Activities

Brandt, James M.

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

Taxonomist: Application Detection through Rich Monitoring Data

Ates, Emre; Tuncer, Ozan; Turk, Ata; Leung, Vitus J.; Brandt, James M.; Egele, Manuel; Coskun, Ayse K.

Abstract not provided.

TYPE Presentation YEAR 2019

DOI OSTI

Exploring New Monitoring and Analysis Capabilities on Cray's Software Preview System

Brandt, James M.; Brown, Connor J.; Foulk, James W.; Gentile, Ann C.; Greenseid, Joe; Kramer, William; Langer, Patti; Rashid, Aamir; Rhem, Kevin; Showerman, Michael

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Kramer, Bill; Bauer, Greg; Bode, Brett; Showerman, Mike; Enos, Jeremy; Saxton, Aaron; Jha, Saurabh; Kalbarczyk, Zbigniew; Iyer, Ravi; Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Application and System Performance Metrics

Gentile, Ann C.; Brandt, James M.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Large-Scale System Monitoring Experiences and Recommendations

Ahlgren, V.; Andersson, S.; Brandt, James M.; Cardo, N.; Chunduri, S.; Enos, J.; Fields, P.; Gentile, Ann C.; Gerber, R.; Gienger, M.; Greenseid, J.; Greiner, A.; Hadri, B.; He, Y.; Hoppe, D.; Kaila, U.; Kelly, K.; Klein, M.; Kristiansen, A.; Leak, S.; Mason, M.; Foulk, James W.; Piccinali, J-G; Repik, Jason J.; Rogers, J.; Salminen, S.; Showerman, M.; Whitney, C.; Williams, J.

Abstract not provided.

TYPE Conference Poster YEAR 2018

DOI OSTI

Integrating low-latency analysis into HPC system monitoring

ACM International Conference Proceeding Series

Live feed Sandia CAPVIZ HPC cluster performance analysis & visualization demonstration

Allan, Benjamin A.; Schmitz, Mark E.; Walsh, Edward J.; Aguilar, Michael J.; Brandt, James M.; Gentile, Ann C.; Ogden, Jeffry B.; Monk, Stephen T.; Noe, John P.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Holistic measurement-driven system assessment

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Defining Metrics to Distill Large-Scale HPC Platform and Application Performance Data into Actionable Quantities - Resource Contention of File System and Aries Interconnect

Agelastos, Anthony M.; Brandt, James M.; Gentile, Ann C.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Abstract not provided.

TYPE Presentation YEAR 2016

OSTI

Discovery interpretation and communication of meaningful information in HPC monitoring data

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2016

OSTI

Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems

Parallel Computing

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh; Stevenson, Joel O.

A detailed understanding of HPC applications’ resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

TYPE Journal Article YEAR 2016

DOI OSTI Scopus

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Proceedings of the International Conference on Dependable Systems and Networks

Pratt, Thomas J.; Tarman, Thomas D.; Martinez, Luis M.; Miller, Marc M.; Adams, Roger L.; Chen, Helen Y.; Brandt, James M.; Wyckoff, Peter S.

This document highlights the activities of the DISCOM² Distance Computing and Communication team at the 1999 Supercomputing conference in Portland, Oregon. This conference is sponsored by the IEEE and ACM. Sandia, Lawrence Livermore, and Los Alamos National Laboratories have participated in this conference for eleven years. For the last four years the three laboratories have come together at the conference under the rubric of the DOE's Accelerated Strategic Computing Initiative (ASCI). Communication support for the ASCI exhibit is provided by the ASCI DISCOM² project. The DISCOM² communication team uses this forum to demonstrate and focus communication and networking developments within the community. At SC 99, DISCOM² built a prototype of the next-generation ASCI network, demonstrated remote clustering techniques, demonstrated the capabilities of emerging terabit router products, demonstrated the latest technologies for delivering visualization data to scientific users, and demonstrated the latest in encryption methods, including IP VPN technologies and ATM encryption research. The authors also coordinated the other production networking activities within the booth and between their demonstration partners on the exhibit floor. This paper documents those accomplishments, discusses the details of their implementation, and describes how these demonstrations support Sandia's overall strategies in ASCI networking.

TYPE Report YEAR 2000

DOI OSTI
