Publications

Results 26–50 of 157
Skip to search filters

A study of network congestion in two supercomputing high-speed interconnects

Proceedings - 2019 IEEE Symposium on High-Performance Interconnects, HOTI 2019

Jha, Saurabh; Patke, Archit; Brandt, James M.; Gentile, Ann C.; Showerman, Mike; Roman, Eric; Kalbarczyk, Zbigniew T.; Kramer, Bill; Iyer, Ravishankar K.

Network congestion in high-speed interconnects is a major source of application runtime performance variation. Recent years have witnessed a surge of interest from both academia and industry in the development of novel approaches for congestion control at the network level and in application placement, mapping, and scheduling at the system-level. However, these studies are based on proxy applications and benchmarks that are not representative of field-congestion characteristics of high-speed interconnects. To address this gap, we present (a) an end-to-end framework for monitoring and analysis to support long-term field-congestion characterization studies, and (b) an empirical study of network congestion in petascale systems across two different interconnect technologies: (i) Cray Gemini, which uses a 3-D torus topology, and (ii) Cray Aries, which uses the DragonFly topology.

More Details

Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning

IEEE Transactions on Parallel and Distributed Systems

Tuncer, Ozan; Ates, Emre; Zhang, Yijia; Turk, Ata; Brandt, James M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

As the size and complexity of high performance computing (HPC) systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variations due to shared resource contention as well as software-and hardware-related problems. Such performance variations can lead to failures and inefficiencies, which impact the cost and resilience of HPC systems. To minimize the impact of performance variations, one must quickly and accurately detect and diagnose the anomalies that cause the variations and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine learning based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously-observed anomalies. We first convert collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. We evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98 percent of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.

More Details

Production application performance data streaming for system monitoring

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Izadpanah, Ramin; Allan, Benjamin A.; Dechev, Damian; Brandt, James M.

In this article, we present an approach to streaming collection of application performance data. Practical application performance tuning and troubleshooting in production high-performance computing (HPC) environments requires an understanding of how applications interact with the platform, including (but not limited to) parallel programming libraries such as Message Passing Interface (MPI). Several profiling and tracing tools exist that collect heavy runtime data traces either in memory (released only at application exit) or on a file system (imposing an I/O load that may interfere with the performance being measured). Although these approaches are beneficial in development stages and post-run analysis, a systemwide and low-overhead method is required to monitor deployed applications continuously. This method must be able to collect information at both the application and system levels to yield a complete performance picture. In our approach, an application profiler collects application event counters. A sampler uses an efficient inter-process communication method to periodically extract the application counters and stream them into an infrastructure for performance data collection. We implement a tool-set based on our approach and integrate it with the Lightweight Distributed Metric Service (LDMS) system, a monitoring system used on large-scale computational platforms. LDMS provides the infrastructure to create and gather streams of performance data in a low overhead manner. We demonstrate our approach using applications implemented with MPI, as it is one of the most common standards for the development of large-scale scientific applications. We utilize our tool-set to study the impact of our approach on an open source HPC application, Nalu. Our tool-set enables us to efficiently identify patterns in the behavior of the application without source-level knowledge. We leverage LDMS to collect system-level performance data and explore the correlation between the system and application events. Also, we demonstrate how our tool-set can help detect anomalies with a low latency. We run tests on two different architectures: a system enabled with Intel Xeon Phi and another system equipped with Intel Xeon processor. Our overhead study shows our method imposes at most 0.5% CPU usage overhead on the application in realistic deployment scenarios.

More Details

Large-Scale System Monitoring Experiences and Recommendations

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Ahlgren, Ville; Andersson, Stefan; Brandt, James M.; Cardo, Nicholas; Chunduri, Sudheer; Enos, Jeremy; Fields, Parks; Gentile, Ann C.; Gerber, Richard; Gienger, Michael; Greenseid, Joe; Greiner, Annette; Hadri, Bilel; He, Yun; Hoppe, Dennis; Kaila, Urpo; Kelly, Kaki; Klein, Mark; Kristiansen, Alex; Leak, Steve; Mason, Mike; Pedretti, Kevin P.; Piccinali, Jean G.; Repik, Jason; Rogers, Jim; Salminen, Susanna; Showerman, Mike; Whitney, Cary; Williams, Jim

Monitoring of High Performance Computing (HPC) platforms is critical to successful operations, can provide insights into performance-impacting conditions, and can inform methodologies for improving science throughput. However, monitoring systems are not generally considered core capabilities in system requirements specifications nor in vendor development strategies. In this paper we present work performed at a number of large-scale HPC sites towards developing monitoring capabilities that fill current gaps in ease of problem identification and root cause discovery. We also present our collective views, based on the experiences presented, on needs and requirements for enabling development by vendors or users of effective sharable end-to-end monitoring capabilities.

More Details

Characterizing Supercomputer Traffic Networks Through Link-Level Analysis

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Jha, Saurabh; Brandt, James M.; Gentile, Ann C.; Kalbarczyk, Zbigniew; Iyer, Ravishankar

More Details

Large-Scale System Monitoring Experiences and Recommendations

Ahlgren, V.A.; Andersson, S.A.; Brandt, James M.; Cardo, N.C.; Chunduri, S.C.; Enos, J.E.; Fields, P.F.; Gentile, Ann C.; Gerber, R.B.; Gienger, M.G.; Greenseid, J.G.; Greiner, A.G.; Hadri, B.H.; He, Y.H.; Hoppe, D.H.; Kaila, U.K.; Kelly, K.K.; Klein, M.K.; Kristiansen, A.K.; Leak, S.L.; Mason, M.M.; Pedretti, Kevin P.; Piccinali, J-G.P.; Repik, Jason; Rogers, J.R.; Salminen, S.S.; showerman, m.s.; Whitney, C.W.; Williams, J.W.

Abstract not provided.

Integrating low-latency analysis into HPC system monitoring

ACM International Conference Proceeding Series

Izadpanah, Ramin; Naksinehaboon, Nichamon; Brandt, James M.; Gentile, Ann C.; Dechev, Damian

The growth of High Performance Computer (HPC) systems increases the complexity with respect to understanding resource utilization, system management, and performance issues. While raw performance data is increasingly exposed at the component level, the usefulness of the data is dependent on the ability to do meaningful analysis on actionable timescales. However, current system monitoring infrastructures largely focus on data collection, with analysis performed off-system in post-processing mode. This increases the time required to provide analysis and feedback to a variety of consumers. In this work, we enhance the architecture of a monitoring system used on large-scale computational platforms, to integrate streaming analysis capabilities at arbitrary locations within its data collection, transport, and aggregation facilities. We leverage the flexible communication topology of the monitoring system to enable placement of transformations based on overhead concerns, while still enabling low-latency exposure on node. Our design internally supports and exposes the raw and transformed data uniformly for both node level and off-system consumers. We show the viability of our implementation for a case with production-relevance: run-time determination of the relative per-node files system demands.

More Details

Cray System Monitoring: Successes Requirements and Priorities

Ahlgren, Ville A.; Andersson, Stefan A.; Brandt, James M.; Cardo, Nicholas C.; Chunduri, Sudheer C.; Enos, Jeremy E.; Fields, Parks F.; Gentile, Ann C.; Gerber, Richard G.; Greenseid, Joe G.; Greiner, Annette G.; Hadri, Bilel H.; He, Yun H.; Hoppe, Dennis H.; Kaila, Urpo K.; Kelly, Kaki K.; Klein, Mark K.; Kristiansen, Alex K.; Leak, Steve L.; Mason, Mike M.; Pedretti, Kevin P.; Piccinali, Jean-Guillaume P.; Repik, Jason; Rogers, Jim R.; Salminen, Susanna S.; Showerman, Mike S.; Whitney, Cary W.; Williams, Jim W.

Abstract not provided.

Cray System Monitoring: Successes Requirements and Priorities

Ahlgren, Ville A.; Andersson, Stefan A.; Brandt, James M.; Cardo, Nicholas C.; Chunduri, Sudheer C.; Enos, Jeremy E.; Fields, Parks F.; Gentile, Ann C.; Gerber, Richard G.; Greenseid, Joe G.; Greiner, Annette G.; Hadri, Bilel H.; He, Yun H.; Hoppe, Dennis H.; Kaila, Urpo K.; Kelly, Kaki K.; Klein, Mark K.; Kristiansen, Alex K.; Leak, Steve L.; Mason, Mike M.; Pedretti, Kevin P.; Piccinali, Jean-Guillaume P.; Repik, Jason; Rogers, Jim R.; Salminen, Susanna S.; Showerman, Mike S.; Whitney, Cary W.; Williams, Jim W.

Abstract not provided.

Results 26–50 of 157
Results 26–50 of 157