Publications Search

Large-Scale System Monitoring Experiences and Recommendations

Ahlgren, V.; Andersson, S.; Brandt, James M.; Cardo, N.; Chunduri, S.; Enos, J.; Fields, P.; Gentile, Ann C.; Gerber, R.; Gienger, M.; Greenseid, J.; Greiner, A.; Hadri, B.; He, Y.; Hoppe, D.; Kaila, U.; Kelly, K.; Klein, M.; Kristiansen, A.; Leak, S.; Mason, M.; Foulk, James W.; Piccinali, J-G; Repik, Jason J.; Rogers, J.; Salminen, S.; Showerman, M.; Whitney, C.; Williams, J.

Abstract not provided.

TYPE Conference Poster YEAR 2018

DOI OSTI

Integrating low-latency analysis into HPC system monitoring

ACM International Conference Proceeding Series

Izadpanah, Ramin; Naksinehaboon, Nichamon; Brandt, James M.; Gentile, Ann C.; Dechev, Damian

The growth of High Performance Computing (HPC) systems increases the complexity with respect to understanding resource utilization, system management, and performance issues. While raw performance data is increasingly exposed at the component level, the usefulness of the data is dependent on the ability to do meaningful analysis on actionable timescales. However, current system monitoring infrastructures largely focus on data collection, with analysis performed off-system in post-processing mode. This increases the time required to provide analysis and feedback to a variety of consumers. In this work, we enhance the architecture of a monitoring system used on large-scale computational platforms, to integrate streaming analysis capabilities at arbitrary locations within its data collection, transport, and aggregation facilities. We leverage the flexible communication topology of the monitoring system to enable placement of transformations based on overhead concerns, while still enabling low-latency exposure on node. Our design internally supports and exposes the raw and transformed data uniformly for both node level and off-system consumers. We show the viability of our implementation for a case with production relevance: run-time determination of the relative per-node file system demands.

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus

OVIS Update 08/24/18

Brandt, James M.; Tucker, Thomas; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Characterizing Supercomputer Traffic Networks Through Link-Level Analysis

Jha, Saurabh; Brandt, James M.; Gentile, Ann C.; Kalbarczyk, Zbigniew; Iyer, Ravishankar

Abstract not provided.

TYPE Conference Poster YEAR 2018

DOI OSTI

Application Performance Insights via System Monitoring

Brandt, James M.; Gentile, Ann C.; Hammond, Simon; Cook, Jeanine; Allan, Benjamin A.; Tucker, Thomas; Naksinehaboon, Nichamon; Taerat, Narate; Cook, Jeanine; Aaziz, Omar R.; Ates, Emre; Tuncer, Ozan; Egele, Manuel; Turk, Ata; Coskun, Ayse; Izadpanah, Ramin; Dechev, Damian

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Supporting Failure Analysis with Discoverable Annotated Log Datasets

Leak, Stephen; Greiner, Annette; Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Application Performance Insights via System Monitoring

Brandt, James M.; Enos, Jeremy; Gentile, Ann C.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Cray System Monitoring: Successes, Requirements, and Priorities

Ahlgren, Ville; Andersson, Stefan; Brandt, James M.; Cardo, Nicholas; Chunduri, Sudheer; Enos, Jeremy; Fields, Parks; Gentile, Ann C.; Gerber, Richard; Greenseid, Joe; Greiner, Annette; Hadri, Bilel; He, Yun; Hoppe, Dennis; Kaila, Urpo; Kelly, Kaki; Klein, Mark; Kristiansen, Alex; Leak, Steve; Mason, Mike; Pedretti, Kevin; Piccinali, Jean-Guillaume; Repik, Jason J.; Rogers, Jim; Salminen, Susanna; Showerman, Mike; Whitney, Cary; Williams, Jim

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Runtime HPC System and Application Performance Assessment and Diagnostics

Brandt, James M.; Gentile, Ann C.; Cook, Jonathan; Allan, Benjamin A.; Cook, Jeanine; Aaziz, Omar R.; Tucker, Thomas; Naksinehaboon, Nichamon; Taerat, Narate; Ates, Emre; Tuncer, Ozan; Egele, Manuel; Turk, Ata; Coskun, Ayse

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Detection and Diagnosis of Performance Variations

Tuncer, Ozan; Ates, Emre; Zhang, Yijia; Turk, Ata; Brandt, James M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Continuous Performance Tracking for Kokkos Applications Using LDMS

Brandt, James M.; Hammond, Simon; Tucker, Thomas; Gentile, Ann C.; Cook, Jeanine

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Enhanced Profiling for Kokkos Applications

Hammond, Simon; Trott, Christian R.; Ibanez-Granados, Daniel A.; Edwards, Harold C.; Sunderland, Daniel; Ellingwood, Nathan D.; Brandt, James M.; Gentile, Ann C.; Cook, Jeanine; Hoekstra, Robert J.

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Taxonomist: Application Detection Through Rich Monitoring Data

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Ates, Emre; Tuncer, Ozan; Turk, Ata; Leung, Vitus J.; Brandt, James M.; Egele, Manuel; Coskun, Ayse K.

Modern supercomputers are shared among thousands of users running a variety of applications. Knowing which applications are running in the system can bring substantial benefits: knowledge of applications that intensively use shared resources can aid scheduling; unwanted applications such as cryptocurrency mining or password cracking can be blocked; system architects can make design decisions based on system usage. However, identifying applications on supercomputers is challenging because applications are executed using esoteric scripts along with binaries that are compiled and named by users. This paper introduces a novel technique to identify applications running on supercomputers. Our technique, Taxonomist, is based on the empirical evidence that applications have different and characteristic resource utilization patterns. Taxonomist uses machine learning to classify known applications and also detect unknown applications. We test our technique with a variety of benchmarks and cryptocurrency miners, and also with applications that users of a production supercomputer ran during a 6 month period. We show that our technique achieves nearly perfect classification for this challenging data set.

TYPE Conference Poster YEAR 2018

OSTI Scopus

Live feed Sandia CAPVIZ HPC cluster performance analysis & visualization demonstration

Allan, Benjamin A.; Schmitz, Mark E.; Walsh, Edward J.; Aguilar, Michael J.; Brandt, James M.; Gentile, Ann C.; Ogden, Jeffry B.; Monk, Stephen T.; Noe, John P.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Holistic measurement-driven system assessment

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Jha, Saurabh; Brandt, James M.; Gentile, Ann C.; Kalbarczyk, Zbigniew; Bauer, Greg; Enos, Jeremy; Showerman, Michael; Kaplan, Larry; Bode, Brett; Greiner, Annette; Bonnie, Amanda; Mason, Mike; Iyer, Ravishankar K.; Kramer, William

In high-performance computing systems, application performance and throughput are dependent on a complex interplay of hardware and software subsystems and variable workloads with competing resource demands. Data-driven insights into the potentially widespread scope and propagation of impact of events, such as faults and contention for shared resources, can be used to drive more effective use of resources, for improved root cause diagnosis, and for predicting performance impacts. We present work developing integrated capabilities for holistic monitoring and analysis to understand and characterize propagation of performance-degrading events. These characterizations can be used to determine and invoke mitigating responses by system administrators, applications, and system software.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Task Placement to Reduce Application Communication Costs

Devine, Karen; Brandt, James M.; Deveci, Mehmet; Gentile, Ann C.; Leung, Vitus J.; Olivier, Stephen L.; Foulk, James W.; Rajamanickam, Sivasankaran; Taylor, Mark A.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Holistic Measurement Driven System Assessment

Jha, Saurabh; Brandt, James M.; Gentile, Ann C.; Kalbarczyk, Zbigniew; Bauer, Greg; Enos, Jeremy; Showerman, Michael; Kaplan, Larry; Bode, Brett; Greiner, Annette; Bonnie, Amanda; Mason, Mike; Iyer, Ravishankar; Kramer, William

Abstract not provided.

TYPE Conference Poster YEAR 2017

DOI OSTI

Discovering Metrics of Network Contention

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Runtime collection and analysis of system metrics for production monitoring of Trinity Phase II (Paper)

Deconinck, Adam; Nam, Hai A.; Morton, David; Bonnie, Amanda; Lueninghoener, Cory; Brandt, James M.; Gentile, Ann C.; Foulk, James W.; Agelastos, Anthony M.; Vaughan, Courtenay T.; Hammond, Simon; Allan, Benjamin A.; Davis, Mike; Repik, Jason J.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo

Formicola, Valerio; Jha, Saurabh; Chen, Daniel; Dong, Wen; Bonnie, Amanda; Mason, Mike; Brandt, James M.; Gentile, Ann C.; Kaplan, Larry; Repik, Jason J.; Enos, Jeremy; Showerman, Mike; Greiner, Annette; Kalbarczyk, Zbigniew; Iyer, Ravishankar; Kramer, Bill

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI