Publications Search

As computer systems grow in both size and complexity, the need for applications and run-time systems to adjust to their dynamic environment also grows. The goal of the RAAMP LDRD was to combine static architecture information and real-time system state with algorithms to conserve power, reduce communication costs, and avoid network contention. We developed new data collection and aggregation tools to extract static hardware information (e.g., node/core hierarchy, network routing) as well as real-time performance data (e.g., CPU utilization, power consumption, memory bandwidth saturation, percentage of used bandwidth, number of network stalls). We created application interfaces that allowed this data to be used easily by algorithms. Finally, we demonstrated the benefit of integrating system and application information for two use cases. The first used real-time power consumption and memory bandwidth saturation data to throttle concurrency to save power without increasing application execution time. The second used static or real-time network traffic information to reduce or avoid network congestion by remapping MPI tasks to allocated processors. Results from our work are summarized in this report; more details are available in our publications [2, 6, 14, 16, 22, 29, 38, 44, 51, 54].

TYPE SAND Report YEAR 2014

Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping

Brandt, James M.; Devine, Karen; Gentile, Ann C.; Bays, Nathan R.

Abstract not provided.

TYPE Presentation YEAR 2014

Lightweight Distributed Metric Service (LDMS)

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2014

SNL-Monitoring-Overview_talk

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Presentation YEAR 2014

Demonstrating Improved Application Performance Using Dynamic Monitoring and Task Mapping

Brandt, James M.; Devine, Karen; Gentile, Ann C.; Bays, Nathan R.

Abstract not provided.

TYPE Conference Poster YEAR 2014

Large Scale System Monitoring and Analysis on Blue Waters using OVIS

Brandt, James M.; Gentile, Ann C.; Allan, Benjamin A.

Abstract not provided.

TYPE Conference YEAR 2014

The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh; Stevenson, Joel O.

Abstract not provided.

TYPE Conference YEAR 2014

Toward Rapid Understanding of Production HPC Applications and Systems

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh; Stevenson, Joel O.

Abstract not provided.

TYPE Conference YEAR 2014

Large Scale HPC Monitoring

Brandt, James M.; Gentile, Ann C.

Abstract not provided.

TYPE Report YEAR 2014

Large Scale System Monitoring and Analysis on Blue Waters using OVIS (presentation)

Brandt, James M.; Gentile, Ann C.; Allan, Benjamin A.

Abstract not provided.

TYPE Conference YEAR 2014

The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Cassella, Paul; Enos, Jeremy; Fullop, Joshi; Gentile, Ann C.; Monk, Stephen T.; Naksinehaboon, Nichamon; Ogden, Jeffry B.; Rajan, Mahesh; Showerman, Michael; Stevenson, Joel O.; Taerat, Narate; Tucker, Thomas O.

Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.

TYPE Conference Poster YEAR 2014
