Publications

Results 51–75 of 142

Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems

Parallel Computing

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia L.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh R.; Stevenson, Joel O.

A detailed understanding of HPC applications’ resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

High Performance Computing Metrics to Enable Application-Platform Communication

Agelastos, Anthony M.; Brandt, James M.; Gentile, Ann C.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Sandia has invested heavily in scientific/engineering application development and in the research, development, and deployment of large scale HPC platforms to support the computational needs of these applications. As application developers continually expand the capabilities of their software and spend more time on performance tuning of applications for these platforms, HPC platform resources are at a premium as they are a heavily shared resource serving the varied needs of many users. To ensure that the HPC platform resources are being used efficiently and perform as designed, it is necessary to obtain reliable data on resource utilization that will allow us to investigate the occurrence, severity, and causes of performance-affecting contention between applications. The work presented in this paper was an initial step to determine if resource contention can be understood and minimized through monitoring, modeling, planning and infrastructure. This paper describes the set of metric definitions, identified in this research, that can be used as meaningful and potentially actionable indicators of performance-affecting contention between applications. These metrics were verified using the observed slowdown of IOR, IMB, and CTH in operating scenarios that forced contention. This paper also describes system/application monitoring activities that are critical to distilling vast amounts of data into quantities that hold the key to understanding an application's performance under production conditions and that will ultimately aid in Sandia's efforts to succeed in extreme-scale computing.

Large-scale persistent numerical data source monitoring system experiences

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Brandt, James M.; Gentile, Ann C.; Showerman, M.; Enos, J.; Fullop, J.; Bauer, G.

Issues of High Performance Computer (HPC) system diagnosis, automated system management, and resource-aware computing are all dependent on high fidelity, system wide, persistent monitoring. Development and deployment of an effective persistent system wide monitoring service at large-scale presents a number of challenges, particularly when collecting data at the granularities needed to resolve features of interest and obtain early indication of significant events on the system. In this paper we provide experiences from our developments on and two-year deployment of our Lightweight Distributed Metric Service (LDMS) monitoring system on NCSA's 27,648 node Blue Waters system. We present monitoring related challenges and issues and their effects on the major functional components of general monitoring infrastructures and deployments: Data Sampling, Data Aggregation, Data Storage, Analysis Support, Operations, and Data Stewardship. Based on these experiences, we provide recommendations for effective development and deployment of HPC monitoring systems.

Infrastructure for in situ system monitoring and application data analysis

Proceedings of ISAV 2015: 1st International Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis

Brandt, James M.; Devine, Karen D.; Gentile, Ann C.

We present an architecture for high-performance computers that integrates in situ analysis of hardware and system monitoring data with application-specific data to reduce application runtimes and improve overall platform utilization. Large-scale high-performance computing systems typically use monitoring as a tool unrelated to application execution. Monitoring data flows from sampling points to a centralized off-system machine for storage and post-processing when root-cause analysis is required. Along the way, it may also be used for instantaneous threshold-based error detection. Applications can know their application state and possibly allocated resource state, but typically, they have no insight into globally shared resource state that may affect their execution. By analyzing performance data in situ rather than off-line, we enable applications to make real-time decisions about their resource utilization. We address the particular case of in situ network congestion analysis and its potential to improve task placement and data partitioning. We present several design and analysis considerations.

New systems, new behaviors, new patterns: Monitoring insights from system standup

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Brandt, James M.; Gentile, Ann C.; Martin, Cindy; Repik, Jason; Taerat, Narate

Disentangling significant and important log messages from those that are routine and unimportant can be a difficult task. Further, on a new system, understanding correlations between significant and possibly new types of messages and conditions that cause them can require significant effort and time. The initial standup of a machine can provide opportunities for investigating the parameter space of events and operations and thus for gaining insight into the events of interest. In particular, failure inducement and investigation of corner case conditions can provide knowledge of system behavior for significant issues that will enable easier diagnosis and mitigation of such issues when they actually occur during the platform lifetime. In this work, we describe the testing process and monitoring results from a testbed system in preparation for the ACES Trinity system. We describe how events in the initial standup, including changes in configuration and software and corner case testing, have provided insights that can inform future monitoring and operating conditions, both of our test systems and the eventual large-scale Trinity system.
