Center for Computing Research (CCR)

A Machine Learning Approach to Understanding HPC Application Performance Variation

Schwaller, Benjamin S.; Aksar, Burak A.; Aaziz, Omar R.; Ates, Emre A.; Brandt, James M.; Coskun, Ayse K.; Egele, Manuel E.; Leung, Vitus J.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

AD for Machine Learning Approach to Understanding HPC Application Performance Variation Poster

Aksar, Burak A.; Schwaller, Benjamin S.; Aaziz, Omar R.; Ates, Emre A.; Brandt, James M.; Coskun, Ayse K.; Egele, Manuel E.; Leung, Vitus J.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Application Performance Insights via System Monitoring

Brandt, James M.; Gentile, Ann C.; Hammond, Simon D.; Cook, Jeanine C.; Allan, Benjamin A.; Tucker, Thomas T.; Naksinehaboon, Nichamon N.; Taerat, Narate T.; Cook, Jonathan C.; Aaziz, Omar R.; Ates, Emre A.; Tuncer, Ozan T.; Egele, Manuel E.; Turk, Ata T.; Coskun, Ayse K.; Izadpanah, Ramin I.; Dechev, Damian D.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Attributing Performance Variation from Integrated Application and System Data

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine C.; Devine, Karen D.; Elliott, James J.; Gentile, Ann C.; Olivier, Stephen L.; Pedretti, Kevin P.; Tucker, Tom T.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Continuous Performance Tracking for Kokkos Applications Using LDMS

Brandt, James M.; Hammond, Simon D.; Tucker, Thomas T.; Gentile, Ann C.; Cook, Jeanine C.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.; Pedretti, Kevin P.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.; Pedretti, Kevin P.

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI DOI

Demonstrating improved application performance using dynamic monitoring and task mapping

2014 IEEE International Conference on Cluster Computing, CLUSTER 2014

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.; Pedretti, Kevin P.

This work demonstrates the integration of monitoring, analysis, and feedback to perform application-to-resource mapping that adapts to both static architecture features and dynamic resource state. In particular, we present a framework for mapping MPI tasks to compute resources based on run-time analysis of system-wide network data, architecture-specific routing algorithms, and application communication patterns. We address several challenges. Within each node, we collect local utilization data. We consolidate that information to form a global view of system performance, accounting for system-wide factors including competing applications. We provide an interface for applications to query the global information. Then we exploit the system information to change the mapping of tasks to nodes so that system bottlenecks are avoided. We demonstrate the benefit of this monitoring and feedback by remapping MPI tasks based on route-length, bandwidth, and credit-stalls metrics for a parallel sparse matrix-vector multiplication kernel. In the best case, remapping based on dynamic network information in a congested environment recovered 48.9% of the time lost to congestion, reducing matrix-vector multiplication time by 7.8%. Our experiments focus on the Cray XE/XK platform, but the integration concepts are generally applicable to any platform for which applicable metrics and route knowledge can be obtained.

TYPE Conference Poster YEAR 2014

Scopus OSTI DOI

Demonstration of a Legacy Application's Path to Exascale - ASC L2 Milestone 4467

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Design Installation and Operation of the Vortex ART Platform

Gauntt, Nathan E.; Davis, Kevin D.; Repik, Jason; Brandt, James M.; Gentile, Ann C.; Hammond, Simon D.

Abstract not provided.

TYPE Other Report YEAR 2019

OSTI DOI

Detection and Diagnosis of Performance Variations

Tuncer, Ozan; Ates, Emre; Zhang, Yijia Z.; Turk, Ata T.; Brandt, James M.; Leung, Vitus J.; Egele, Manuel E.; Coskun, Ayse K.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Diagnosing performance variations in HPC applications using machine learning

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Tuncer, Ozan; Ates, Emre; Zhang, Yijia; Turk, Ata; Brandt, James M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

With the growing complexity and scale of high performance computing (HPC) systems, application performance variation has become a significant challenge in efficient and resilient system management. Application performance variation can be caused by resource contention as well as software- and firmware-related problems, and can lead to premature job termination, reduced performance, and wasted compute platform resources. To effectively alleviate this problem, system administrators must detect and identify the anomalies that are responsible for performance variation and take preventive actions. However, diagnosing anomalies is often a difficult task given the vast amount of noisy and high-dimensional data being collected via a variety of system monitoring infrastructures. In this paper, we present a novel framework that uses machine learning to automatically diagnose previously encountered performance anomalies in HPC systems. Our framework leverages resource usage and performance counter data collected during application runs. We first convert the collected time series data into statistical features that retain application characteristics to significantly reduce the computational overhead of our technique. We then use machine learning algorithms to learn anomaly characteristics from this historical data and to identify the types of anomalies observed while running applications. We evaluate our framework both on an HPC cluster and on a public cloud, and demonstrate that our approach outperforms current state-of-the-art techniques in detecting anomalies, reaching an F-score over 0.97.

TYPE Presentation YEAR 2017

Scopus OSTI DOI

Diagnosing Performance Variations in HPC Architectures Using Machine Learning

Tuncer, Ozan T.; Ates, Emre A.; Zhang, Yijia Z.; Turk, Ata T.; Brandt, James M.; Leung, Vitus J.; Egele, Manuel E.; Coskun, Ayse K.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

E2EWatch: End-to-end Anomaly Diagnosis Framework for Production HPC Systems

Aksar, Burak A.; Zhang, Yijia Z.; Ates, Emre A.; Aaziz, Omar R.; Schwaller, Benjamin S.; Brandt, James M.; Leung, Vitus J.; Egele, Manuel E.; Coskun, Ayse K.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

OSTI DOI

E2EWatch: End-to-end Anomaly Diagnosis Framework for Production HPC Systems

Aksar, Burak A.; Zhang, Yijia Z.; Ates, Emre A.; Aaziz, Omar R.; Schwaller, Benjamin S.; Brandt, James M.; Leung, Vitus J.; Egele, Manuel E.; Coskun, Ayse K.

Abstract not provided.

TYPE Conference Paper YEAR 2021

OSTI DOI

Enabling Advanced Operational Analysis Through Multi-subsystem Data Integration on Trinity

Brandt, James M.; DeBonis, David D.; Gentile, Ann C.; Lujan, Jim L.; Martin, Cindy M.; Martinez, David J.; Olivier, Stephen L.; Pedretti, Kevin P.; Taerat, Narate T.; Velarde, Ron V.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Enabling Advanced Operational Analysis Through Multi-Subsystem Data Integration on Trinity

Brandt, James M.; DeBonis, David D.; Gentile, Ann C.; Lujan, James L.; Martin, Cindy M.; Martinez, David J.; Olivier, Stephen L.; Pedretti, Kevin P.; Taerat, Narate T.; Velarde, Ron V.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Enabling Application and System Data Fusion

Gentile, Ann C.; Brandt, James M.; Cook, Jeanine C.; Hammond, Simon D.; Poliakoff, David Z.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Tucker, Tom

Abstract not provided.

TYPE Conference Presentation YEAR 2021

OSTI DOI

Enhanced Profiling for Kokkos Applications

Hammond, Simon D.; Trott, Christian R.; Ibanez-Granados, Daniel A.; Edwards, Harold C.; Sunderland, Daniel S.; Ellingwood, Nathan D.; Brandt, James M.; Gentile, Ann C.; Cook, Jeanine C.; Hoekstra, Robert J.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Infrastructure for In Situ System Monitoring and Application Data Analysis

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI DOI

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine C.; Devine, Karen D.; Elliott, James E.; Gentile, Ann C.; Hammond, Simon D.; Kelley, Brian M.; Lopatina, Lena L.; Moore, Stan G.; Olivier, Stephen L.; Pedretti, Kevin P.; Poliakoff, David Z.; Pawlowski, Roger P.; Regier, Phillip A.; Schmitz, Mark E.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Swan, Matthew S.; Tucker, Nick T.; Tucker, Tom T.; Vaughan, Courtenay T.; Walton, Sara P.

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments (EMPIRE) modeling and design tool and resulted in 1TB of application data and 50TB of system data. EMPIRE developers remarked this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for an expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase total scientific output of Sandia's HPC users.

TYPE SAND Report YEAR 2021

OSTI DOI

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Brandt, James M.; Cook, Jeanine C.; Aaziz, Omar R.; Allan, Benjamin A.; Devine, Karen D.; Elliott, James J.; Gentile, Ann C.; Hammond, Simon D.; Kelley, Brian M.; Lopatina, Lena L.; Moore, Stan G.; Olivier, Stephen L.; Pedretti, Kevin P.; Poliakoff, David Z.; Pawlowski, Roger P.; Regier, Phillip A.; Schmitz, Mark E.; Schwaller, Benjamin S.; Surjadidjaja, Vanessa S.; Swan, Matthew S.; Tucker, Tom T.; Tucker, Nick T.; Vaughan, Courtenay T.; Walton, Sara P.

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Large-Scale System Monitoring Experiences and Recommendations

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Ahlgren, Ville; Andersson, Stefan; Brandt, James M.; Cardo, Nicholas; Chunduri, Sudheer; Enos, Jeremy; Fields, Parks; Gentile, Ann C.; Gerber, Richard; Gienger, Michael; Greenseid, Joe; Greiner, Annette; Hadri, Bilel; He, Yun; Hoppe, Dennis; Kaila, Urpo; Kelly, Kaki; Klein, Mark; Kristiansen, Alex; Leak, Steve; Mason, Mike; Pedretti, Kevin P.; Piccinali, Jean G.; Repik, Jason; Rogers, Jim; Salminen, Susanna; Showerman, Mike; Whitney, Cary; Williams, Jim

Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are not generally considered core capabilities in system requirements specifications nor in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites towards developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on the experiences presented, on needs and requirements for enabling development by vendors or users of effective sharable end-to-end monitoring capabilities.

TYPE Conference Poster YEAR 2018

Scopus OSTI DOI

Large-Scale System Monitoring Experiences and Recommendations

Ahlgren, V.A.; Andersson, S.A.; Brandt, James M.; Cardo, N.C.; Chunduri, S.C.; Enos, J.E.; Fields, P.F.; Gentile, Ann C.; Gerber, R.B.; Gienger, M.G.; Greenseid, J.G.; Greiner, A.G.; Hadri, B.H.; He, Y.H.; Hoppe, D.H.; Kaila, U.K.; Kelly, K.K.; Klein, M.K.; Kristiansen, A.K.; Leak, S.L.; Mason, M.M.; Pedretti, Kevin P.; Piccinali, J-G.P.; Repik, Jason; Rogers, J.R.; Salminen, S.S.; Showerman, M.S.; Whitney, C.W.; Williams, J.W.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI DOI
