Center for Computing Research (CCR)

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine C.; Devine, Karen D.; Elliott, James E.; Gentile, Ann C.; Hammond, Simon D.; Kelley, Brian M.; Lopatina, Lena L.; Moore, Stan G.; Olivier, Stephen L.; Pedretti, Kevin P.; Poliakoff, David Z.; Pawlowski, Roger P.; Regier, Phillip A.; Schmitz, Mark E.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Swan, Matthew S.; Tucker, Nick T.; Tucker, Tom T.; Vaughan, Courtenay T.; Walton, Sara P.

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments (EMPIRE) modeling and design tool and resulted in 1 TB of application data and 50 TB of system data. EMPIRE developers remarked that this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for an expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase the total scientific output of Sandia's HPC users.

TYPE SAND Report YEAR 2021

OSTI DOI

Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation

Zhang, Yijia Z.; Aksar, Burak A.; Aaziz, Omar R.; Schwaller, Benjamin S.; Brandt, James M.; Leung, Vitus J.; Egele, Manuel E.; Coskun, Ayse K.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

OSTI DOI

E2EWatch: End-to-end Anomaly Diagnosis Framework for Production HPC Systems

Aksar, Burak A.; Zhang, Yijia Z.; Ates, Emre A.; Aaziz, Omar R.; Schwaller, Benjamin S.; Brandt, James M.; Leung, Vitus J.; Egele, Manuel E.; Coskun, Ayse K.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

OSTI DOI

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Brandt, James M.; Cook, Jeanine C.; Aaziz, Omar R.; Allan, Benjamin A.; Devine, Karen D.; Elliott, James J.; Gentile, Ann C.; Hammond, Simon D.; Kelley, Brian M.; Lopatina, Lena L.; Moore, Stan G.; Olivier, Stephen L.; Pedretti, Kevin P.; Poliakoff, David Z.; Pawlowski, Roger P.; Regier, Phillip A.; Schmitz, Mark E.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Swan, Matthew S.; Tucker, Tom T.; Tucker, Nick T.; Vaughan, Courtenay T.; Walton, Sara P.

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

E2EWatch: End-to-end Anomaly Diagnosis Framework for Production HPC Systems

Aksar, Burak A.; Zhang, Yijia Z.; Ates, Emre A.; Aaziz, Omar R.; Schwaller, Benjamin S.; Brandt, James M.; Leung, Vitus J.; Egele, Manuel E.; Coskun, Ayse K.

Abstract not provided.

TYPE Conference Paper YEAR 2021

OSTI DOI

Enabling Application and System Data Fusion

Gentile, Ann C.; Brandt, James M.; Cook, Jeanine C.; Hammond, Simon D.; Poliakoff, David Z.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Tucker, Tom

Abstract not provided.

TYPE Conference Presentation YEAR 2021

OSTI DOI

Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Aksar, Burak; Zhang, Yijia; Ates, Emre; Schwaller, Benjamin S.; Aaziz, Omar R.; Leung, Vitus J.; Brandt, James M.; Egele, Manuel; Coskun, Ayse K.

Performance variation diagnosis in High-Performance Computing (HPC) systems is a challenging problem due to the size and complexity of the systems. Application performance variation leads to premature termination of jobs, decreased energy efficiency, or wasted computing resources. Manual root-cause analysis of performance variation based on system telemetry has become an increasingly time-intensive process as it relies on human experts and the size of telemetry data has grown. Recent methods use supervised machine learning models to automatically diagnose previously encountered performance anomalies in compute nodes. However, supervised machine learning models require large labeled data sets for training. This labeled data requirement is restrictive for many real-world application domains, including HPC systems, because collecting labeled data is challenging and time-consuming, especially considering anomalies that sparsely occur. This paper proposes a novel semi-supervised framework that diagnoses previously encountered performance anomalies in HPC systems using a limited number of labeled data points, which is more suitable for production system deployment. Our framework first learns performance anomalies’ characteristics by using historical telemetry data in an unsupervised fashion. In the following process, we leverage supervised classifiers to identify anomaly types. While most semi-supervised approaches do not typically use anomalous samples, our framework takes advantage of a few labeled anomalous samples to classify anomaly types. We evaluate our framework on a production HPC system and on a testbed HPC cluster. We show that our proposed framework achieves 60% F1-score on average, outperforming state-of-the-art supervised methods by 11%, and maintains an average 0.06% anomaly miss rate.

TYPE Conference Proceeding YEAR 2021

Scopus OSTI DOI

Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation

2021 IEEE High Performance Extreme Computing Conference, HPEC 2021

Zhang, Yijia; Aksar, Burak; Aaziz, Omar R.; Schwaller, Benjamin S.; Brandt, James M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

On high-performance computing (HPC) systems, job allocation strategies control the placement of a job among available nodes. As the placement changes a job's communication performance, allocation can significantly affect the execution times of many HPC applications. Existing allocation strategies typically make decisions based on resource limit, network topology, communication patterns, etc. However, system network performance at runtime is seldom consulted in allocation, even though it significantly affects job execution times. In this work, we demonstrate using monitoring data to improve HPC systems' performance by proposing a Network-Data-Driven (NeDD) job allocation framework, which monitors the network performance of an HPC system at runtime and allocates resources based on both network performance and job characteristics. NeDD characterizes system network performance by collecting the network traffic statistics on each router link, and it characterizes a job's sensitivity to network congestion by collecting Message Passing Interface (MPI) statistics. During allocation, NeDD pairs network-sensitive (network-insensitive) jobs with nodes whose parent routers have low (high) network traffic. Through experiments on a large HPC system, we demonstrate that NeDD reduces the execution time of parallel applications by 11% on average and up to 34%.
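
The pairing rule described in this abstract can be illustrated with a minimal sketch. This is not the NeDD implementation: the function name `allocate_nodes`, the dictionary layout, and the single per-router traffic number are all assumptions made for illustration, and the real framework derives traffic and sensitivity from router-link and MPI statistics that are not modeled here.

```python
from typing import Dict, List


def allocate_nodes(
    free_nodes: List[str],
    node_router: Dict[str, str],
    router_traffic: Dict[str, float],
    nodes_needed: int,
    network_sensitive: bool,
) -> List[str]:
    """Pick nodes for a job by the traffic on each node's parent router.

    Hypothetical helper: network-sensitive jobs get the nodes whose parent
    routers carry the least traffic; network-insensitive jobs get the nodes
    whose parent routers carry the most.
    """
    if nodes_needed > len(free_nodes):
        raise ValueError("not enough free nodes for this job")
    # Sort ascending by router traffic for sensitive jobs, descending otherwise.
    ranked = sorted(
        free_nodes,
        key=lambda node: router_traffic[node_router[node]],
        reverse=not network_sensitive,
    )
    return ranked[:nodes_needed]
```

Sending insensitive jobs to the busiest routers keeps the quiet routers free for the jobs that benefit from them, which is the intuition the abstract gives for the low/high pairing.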

TYPE Conference Proceeding YEAR 2021

Scopus OSTI DOI

Attributing Performance Variation from Integrated Application and System Data

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine C.; Devine, Karen D.; Elliott, James J.; Gentile, Ann C.; Olivier, Stephen L.; Pedretti, Kevin P.; Tucker, Tom T.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

A Machine Learning Approach to Understanding HPC Application Performance Variation

Schwaller, Benjamin S.; Aksar, Burak A.; Aaziz, Omar R.; Ates, Emre A.; Brandt, James M.; Coskun, Ayse K.; Egele, Manuel E.; Leung, Vitus J.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

AD for Machine Learning Approach to Understanding HPC Application Performance Variation Poster

Aksar, Burak A.; Schwaller, Benjamin S.; Aaziz, Omar R.; Ates, Emre A.; Brandt, James M.; Coskun, Ayse K.; Egele, Manuel E.; Leung, Vitus J.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Design Installation and Operation of the Vortex ART Platform

Gauntt, Nathan E.; Davis, Kevin D.; Repik, Jason; Brandt, James M.; Gentile, Ann C.; Hammond, Simon D.

Abstract not provided.

TYPE Other Report YEAR 2019

OSTI DOI

Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning

IEEE Transactions on Parallel and Distributed Systems

Tuncer, Ozan; Ates, Emre; Zhang, Yijia; Turk, Ata; Brandt, James M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

As the size and complexity of high performance computing (HPC) systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variations due to shared resource contention as well as software- and hardware-related problems. Such performance variations can lead to failures and inefficiencies, which impact the cost and resilience of HPC systems. To minimize the impact of performance variations, one must quickly and accurately detect and diagnose the anomalies that cause the variations and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine learning based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously-observed anomalies. We first convert collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. We evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98 percent of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.
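
The first step described in this abstract, converting a window of time series samples into easy-to-compute statistical features, might look roughly like the sketch below. The abstract does not list the statistics used, so the five chosen here (min, max, mean, standard deviation, median), the function names, and the `metric.stat` key format are assumptions for illustration only.

```python
import statistics
from typing import Dict, Sequence


def window_features(samples: Sequence[float]) -> Dict[str, float]:
    """Summarize one metric's samples over a time window (assumed statistics)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples per window")
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": statistics.fmean(samples),
        "std": statistics.pstdev(samples),  # population standard deviation
        "median": statistics.median(samples),
    }


def feature_vector(window: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Flatten per-metric statistics into one feature dictionary per window.

    Keys look like "cpu_user.mean"; the metric names are hypothetical.
    """
    features: Dict[str, float] = {}
    for metric, samples in window.items():
        for stat, value in window_features(samples).items():
            features[f"{metric}.{stat}"] = value
    return features
```

A vector like this would then be the input to feature selection and to whatever classifier learns the anomaly signatures; those later steps are not sketched here.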

TYPE Journal Article YEAR 2019

Scopus OSTI DOI

Large-Scale System Monitoring Experiences and Recommendations

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Ahlgren, Ville; Andersson, Stefan; Brandt, James M.; Cardo, Nicholas; Chunduri, Sudheer; Enos, Jeremy; Fields, Parks; Gentile, Ann C.; Gerber, Richard; Gienger, Michael; Greenseid, Joe; Greiner, Annette; Hadri, Bilel; He, Yun; Hoppe, Dennis; Kaila, Urpo; Kelly, Kaki; Klein, Mark; Kristiansen, Alex; Leak, Steve; Mason, Mike; Pedretti, Kevin P.; Piccinali, Jean G.; Repik, Jason; Rogers, Jim; Salminen, Susanna; Showerman, Mike; Whitney, Cary; Williams, Jim

Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are not generally considered core capabilities in system requirements specifications nor in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites towards developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on the experiences presented, on needs and requirements for enabling development by vendors or users of effective sharable end-to-end monitoring capabilities.

TYPE Conference Poster YEAR 2018

Scopus OSTI DOI

Large-Scale System Monitoring Experiences and Recommendations

Ahlgren, V.A.; Andersson, S.A.; Brandt, James M.; Cardo, N.C.; Chunduri, S.C.; Enos, J.E.; Fields, P.F.; Gentile, Ann C.; Gerber, R.B.; Gienger, M.G.; Greenseid, J.G.; Greiner, A.G.; Hadri, B.H.; He, Y.H.; Hoppe, D.H.; Kaila, U.K.; Kelly, K.K.; Klein, M.K.; Kristiansen, A.K.; Leak, S.L.; Mason, M.M.; Pedretti, Kevin P.; Piccinali, J-G.P.; Repik, Jason; Rogers, J.R.; Salminen, S.S.; Showerman, M.S.; Whitney, C.W.; Williams, J.W.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI DOI

Application Performance Insights via System Monitoring

Brandt, James M.; Gentile, Ann C.; Hammond, Simon D.; Cook, Jeanine C.; Allan, Benjamin A.; Tucker, Thomas T.; Naksinehaboon, Nichamon N.; Taerat, Narate T.; Cook, Jonathan C.; Aaziz, Omar R.; Ates, Emre A.; Tuncer, Ozan T.; Egele, Manuel E.; Turk, Ata T.; Coskun, Ayse K.; Izadpanah, Ramin I.; Dechev, Damian D.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Taxonomist: Application Detection through Rich Monitoring Data

Ates, Emre; Tuncer, Ozan; Turk, Ata T.; Leung, Vitus J.; Brandt, James M.; Egele, Manuel E.; Coskun, Ayse K.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Runtime HPC System and Application Performance Assessment and Diagnostics

Brandt, James M.; Gentile, Ann C.; Cook, Jonathan E.; Allan, Benjamin A.; Cook, Jeanine C.; Aaziz, Omar R.; Tucker, Thomas T.; Naksinehaboon, Nichamon N.; Taerat, Narate T.; Ates, Emre A.; Tuncer, Ozan T.; Egele, Manuel E.; Turk, Ata T.; Coskun, Ayse K.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Enhanced Profiling for Kokkos Applications

Hammond, Simon D.; Trott, Christian R.; Ibanez-Granados, Daniel A.; Edwards, Harold C.; Sunderland, Daniel S.; Ellingwood, Nathan D.; Brandt, James M.; Gentile, Ann C.; Cook, Jeanine C.; Hoekstra, Robert J.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Continuous Performance Tracking for Kokkos Applications Using LDMS

Brandt, James M.; Hammond, Simon D.; Tucker, Thomas T.; Gentile, Ann C.; Cook, Jeanine C.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Detection and Diagnosis of Performance Variations

Tuncer, Ozan; Ates, Emre; Zhang, Yijia Z.; Turk, Ata T.; Brandt, James M.; Leung, Vitus J.; Egele, Manuel E.; Coskun, Ayse K.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Taxonomist: Application Detection Through Rich Monitoring Data

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Ates, Emre; Tuncer, Ozan; Turk, Ata; Leung, Vitus J.; Brandt, James M.; Egele, Manuel; Coskun, Ayse K.

Modern supercomputers are shared among thousands of users running a variety of applications. Knowing which applications are running in the system can bring substantial benefits: knowledge of applications that intensively use shared resources can aid scheduling; unwanted applications such as cryptocurrency mining or password cracking can be blocked; system architects can make design decisions based on system usage. However, identifying applications on supercomputers is challenging because applications are executed using esoteric scripts along with binaries that are compiled and named by users. This paper introduces a novel technique to identify applications running on supercomputers. Our technique, Taxonomist, is based on the empirical evidence that applications have different and characteristic resource utilization patterns. Taxonomist uses machine learning to classify known applications and also detect unknown applications. We test our technique with a variety of benchmarks and cryptocurrency miners, and also with applications that users of a production supercomputer ran during a 6-month period. We show that our technique achieves nearly perfect classification for this challenging data set.
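
The abstract does not say how unknown applications are separated from known ones. One common approach, shown here purely as an assumed illustration and not as the Taxonomist method, is to reject a prediction when the classifier's best per-class confidence falls below a threshold. The function name `label_or_unknown`, the score dictionary, and the default threshold of 0.75 are all hypothetical.

```python
from typing import Dict


def label_or_unknown(class_scores: Dict[str, float], threshold: float = 0.75) -> str:
    """Return the best-scoring known application, or "unknown" if no class
    reaches the confidence threshold (assumed rejection rule)."""
    if not class_scores:
        return "unknown"
    best_label = max(class_scores, key=class_scores.get)
    if class_scores[best_label] >= threshold:
        return best_label
    return "unknown"
```

Under this rule, a run whose resource utilization pattern resembles no training application closely enough is flagged rather than forced into the nearest known class; the threshold trades missed unknowns against false rejections of known applications.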

TYPE Presentation YEAR 2018

Scopus OSTI DOI

Task Placement to Reduce Application Communication Costs

Devine, Karen D.; Brandt, James M.; Deveci, Mehmet D.; Gentile, Ann C.; Leung, Vitus J.; Olivier, Stephen L.; Pedretti, Kevin P.; Rajamanickam, Sivasankaran R.; Taylor, Mark A.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Runtime collection and analysis of system metrics for production monitoring of Trinity Phase II

DeConinck, Adam D.; Nam, Hai A.; Mortin, Dave M.; Bonnie, Amanda B.; Lueninghoener, Cory L.; Brandt, James M.; Gentile, Ann C.; Pedretti, Kevin P.; Agelastos, Anthony M.; Vaughan, Courtenay T.; Hammond, Simon D.; Allan, Benjamin A.; Davis, Michael C.; Repik, Jason

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Runtime collection and analysis of system metrics for production monitoring of Trinity Phase II (Paper)

DeConinck, Adam D.; Nam, Hai A.; Morton, David P.; Bonnie, Amanda B.; Lueninghoener, Cory L.; Brandt, James M.; Gentile, Ann C.; Pedretti, Kevin P.; Agelastos, Anthony M.; Vaughan, Courtenay T.; Hammond, Simon D.; Allan, Benjamin A.; Davis, Mike D.; Repik, Jason

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Diagnosing Performance Variations in HPC Architectures Using Machine Learning

Tuncer, Ozan T.; Ates, Emre A.; Zhang, Yijia Z.; Turk, Ata T.; Brandt, James M.; Leung, Vitus J.; Egele, Manuel E.; Coskun, Ayse K.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Diagnosing performance variations in HPC applications using machine learning

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Tuncer, Ozan; Ates, Emre; Zhang, Yijia; Turk, Ata; Brandt, James M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

With the growing complexity and scale of high performance computing (HPC) systems, application performance variation has become a significant challenge in efficient and resilient system management. Application performance variation can be caused by resource contention as well as software- and firmware-related problems, and can lead to premature job termination, reduced performance, and wasted compute platform resources. To effectively alleviate this problem, system administrators must detect and identify the anomalies that are responsible for performance variation and take preventive actions. However, diagnosing anomalies is often a difficult task given the vast amount of noisy and high-dimensional data being collected via a variety of system monitoring infrastructures. In this paper, we present a novel framework that uses machine learning to automatically diagnose previously encountered performance anomalies in HPC systems. Our framework leverages resource usage and performance counter data collected during application runs. We first convert the collected time series data into statistical features that retain application characteristics to significantly reduce the computational overhead of our technique. We then use machine learning algorithms to learn anomaly characteristics from this historical data and to identify the types of anomalies observed while running applications. We evaluate our framework both on an HPC cluster and on a public cloud, and demonstrate that our approach outperforms current state-of-the-art techniques in detecting anomalies, reaching an F-score over 0.97.

TYPE Presentation YEAR 2017

Scopus OSTI DOI

Smart HPC Centers: data analysis feedback and response

Brandt, James M.; Gentile, Ann C.; Martin, C. M.; Allan, Benjamin A.; Devine, Karen D.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI