Publications Search

This milestone 1) exercised a broad set of performance profiling and analysis tools, including tools whose development has been promoted by the ASC program; 2) exercised the tools on two different SNL ASC codes, one Sierra code (Sierra/Aria, a C++ codebase) and one RAMSES code (ITS, a Fortran codebase); and 3) exercised the tools on multiple platforms, including the CTS-1 (e.g., Serrano) and ATS-1 Trinity (e.g., Mutrino) platforms. The milestone generated extensive strong- and weak-scaling, trend, and profile data for multiple versions and problem cases of each of the two codes. Experience gained with the tools included identification of problems, an improved understanding of feature sets, enhanced usage documentation, and insights for future tool development. Results are provided from a large number and variety of performance analysis runs with the target codes, together with instructions for using the tools with the codes.

TYPE SAND Report YEAR 2017

Contention and Congestion: Challenges and Approaches to Understanding Application Impact

Gentile, Ann C.; Brandt, James M.; Agelastos, Anthony M.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Abstract not provided.

TYPE Conference Poster YEAR 2017

Defining Metrics to Distill Large-Scale HPC Platform and Application Performance Data into Actionable Quantities: Resource Contention of File System and Aries Interconnect

Agelastos, Anthony M.; Brandt, James M.; Gentile, Ann C.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Abstract not provided.

TYPE Presentation YEAR 2016

Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems

Parallel Computing

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh; Stevenson, Joel O.

A detailed understanding of HPC applications’ resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

TYPE Journal Article YEAR 2016

High Performance Computing Metrics to Enable Application-Platform Communication

Agelastos, Anthony M.; Brandt, James M.; Gentile, Ann C.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Sandia has invested heavily in scientific/engineering application development and in the research, development, and deployment of large scale HPC platforms to support the computational needs of these applications. As application developers continually expand the capabilities of their software and spend more time on performance tuning of applications for these platforms, HPC platform resources are at a premium as they are a heavily shared resource serving the varied needs of many users. To ensure that the HPC platform resources are being used efficiently and perform as designed, it is necessary to obtain reliable data on resource utilization that will allow us to investigate the occurrence, severity, and causes of performance-affecting contention between applications. The work presented in this paper was an initial step to determine if resource contention can be understood and minimized through monitoring, modeling, planning and infrastructure. This paper describes the set of metric definitions, identified in this research, that can be used as meaningful and potentially actionable indicators of performance-affecting contention between applications. These metrics were verified using the observed slowdown of IOR, IMB, and CTH in operating scenarios that forced contention. This paper also describes system/application monitoring activities that are critical to distilling vast amounts of data into quantities that hold the key to understanding an application's performance under production conditions and that will ultimately aid in Sandia's efforts to succeed in extreme-scale computing.

TYPE SAND Report YEAR 2016

Preliminary Assessment of Tecplot Chorus for Analyzing Ensemble of CTH Simulations

Agelastos, Anthony M.; Stevenson, Joel O.; Attaway, Stephen W.; Peterson, David

The exploration of large parameter spaces in search of problem solutions and uncertainty quantification produces very large ensembles of data. Processing ensemble data will continue to require more resources as simulation complexity and HPC platform throughput increase. More tools are needed to help provide rapid insight into these data sets to decrease manual processing time by the analyst and to increase the knowledge the data can provide. One such tool is Tecplot Chorus, whose strengths are visualizing ensemble metadata and linked images. This report contains the analysis and conclusions from evaluating Tecplot Chorus with an example problem that is relevant to Sandia National Laboratories. This report documents a preliminary evaluation of Tecplot Chorus for analyzing ensemble data from CTH simulations. The project that funded this report and evaluation is also evaluating and guiding development with SNL’s Slycat. Slycat and Tecplot Chorus each have their strengths, weaknesses, and overlapping capabilities. It is quite likely that, as the scale of ensemble data increases, both of these tools (and possibly others) will be needed for different processing goals. This report will focus on Tecplot Chorus and its application to an example ensemble of data supplied by David J. Peterson and John P. Korbin; this example is of a flyer plate impact and weld study henceforth referred to as CTH Impact Example. This evaluation also defines a workflow for analysts that can help reduce the time and resources for processing ensemble data.

TYPE SAND Report YEAR 2015

Toward Rapid Understanding of Production HPC Applications and Systems

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh; Stevenson, Joel O.

Abstract not provided.

TYPE Conference Poster YEAR 2015

Experiences with Sandia National Laboratories HPC applications and MPI performance

Rajan, Mahesh; Doerfler, Douglas W.; Barrett, Richard F.; Stevenson, Joel O.; Agelastos, Anthony M.; Shaw, Ryan; Meyer, Harold E.

Abstract not provided.

TYPE Conference Poster YEAR 2014

Toward Rapid Understanding of Production HPC Applications and Systems

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh; Stevenson, Joel O.

Abstract not provided.

TYPE Conference YEAR 2014

The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh; Stevenson, Joel O.

Abstract not provided.

TYPE Conference YEAR 2014

An Introduction to DevSim

Agelastos, Anthony M.; Shaw, Ryan; Stevenson, Joel O.

Abstract not provided.

TYPE Report YEAR 2014

The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Cassella, Paul; Enos, Jeremy; Fullop, Joshi; Gentile, Ann C.; Monk, Stephen T.; Naksinehaboon, Nichamon; Ogden, Jeffry B.; Rajan, Mahesh; Showerman, Michael; Stevenson, Joel O.; Taerat, Narate; Tucker, Thomas O.

Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.

TYPE Conference Poster YEAR 2014

Lustre Experience on Cielo

Stevenson, Joel O.

Abstract not provided.

TYPE Report YEAR 2013

7X performance results - final report : ASCI Red vs Red Storm

Ballance, Robert A.; Gardiner, Thomas A.; Haskell, Karen; Noe, John P.; Stevenson, Joel O.

The goal of the 7X performance testing was to assure Sandia National Laboratories, Cray Inc., and the Department of Energy that Red Storm would achieve its performance requirements, which were defined as a comparison between ASCI Red and Red Storm. Our approach was to identify one or more problems for each application in the 7X suite, run those problems at multiple processor sizes in the capability computing range, and compare the results between ASCI Red and Red Storm. The first part of this report describes the two computer systems, the applications in the 7X suite, the test problems, and the results of the performance tests on ASCI Red and Red Storm. During the course of the testing on Red Storm, we had the opportunity to run the test problems in both single-core mode and dual-core mode, and the second part of this report describes those results. Finally, we reflect on lessons learned in undertaking a major head-to-head benchmark comparison.

TYPE SAND Report YEAR 2011
