Publications

Results 26–49 of 49
Guide to Using Sierra

Shaw, Ryan P.; Agelastos, Anthony M.; Miller, Joel D.

Sierra is an engineering mechanics simulation code suite supporting the Nation's Nuclear Weapons mission as well as other customers. It has explicit ties to Sandia National Labs' workflow, including geometry and meshing, design and optimization, and visualization. Distinguishing strengths include "application aware" development, scalability, SQA and V&V, multiple scales, and multi-physics coupling. This document is intended to help new and existing users of Sierra as a user manual and troubleshooting guide.

Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems

Parallel Computing

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia L.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh R.; Stevenson, Joel O.

A detailed understanding of HPC applications’ resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

High Performance Computing Metrics to Enable Application-Platform Communication

Agelastos, Anthony M.; Brandt, James M.; Gentile, Ann C.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Sandia has invested heavily in scientific/engineering application development and in the research, development, and deployment of large scale HPC platforms to support the computational needs of these applications. As application developers continually expand the capabilities of their software and spend more time on performance tuning of applications for these platforms, HPC platform resources are at a premium as they are a heavily shared resource serving the varied needs of many users. To ensure that the HPC platform resources are being used efficiently and perform as designed, it is necessary to obtain reliable data on resource utilization that will allow us to investigate the occurrence, severity, and causes of performance-affecting contention between applications. The work presented in this paper was an initial step to determine if resource contention can be understood and minimized through monitoring, modeling, planning and infrastructure. This paper describes the set of metric definitions, identified in this research, that can be used as meaningful and potentially actionable indicators of performance-affecting contention between applications. These metrics were verified using the observed slowdown of IOR, IMB, and CTH in operating scenarios that forced contention. This paper also describes system/application monitoring activities that are critical to distilling vast amounts of data into quantities that hold the key to understanding an application's performance under production conditions and that will ultimately aid in Sandia's efforts to succeed in extreme-scale computing.

Defining metrics to distill large-scale HPC platform and application performance data into actionable quantities

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Agelastos, Anthony M.

Application performance data accounting for resource contention and other external influences is highly coveted and extremely difficult to obtain. «Why did my application's performance change from the last time it ran?» is a question shared by application developers, program analysts, and system administrators. The answer to this question impacts nearly all programmatic and R&D efforts related to high-performance computing (HPC). Lightweight, right-fidelity monitoring infrastructures that can gather relevant application and resource performance data across the entire HPC platform can help address this research topic. This short technical paper will formally define an ongoing research effort to define the needed metrics and methods that distill the vast quantities of available data to a minimum set of actionable and interpretable quantities that can be used by application developers, system administrators, production analysts, and HPC platform designers for their respective production and R&D focus areas.

Toward rapid understanding of production HPC applications and systems

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia L.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh R.; Stevenson, Joel O.

A detailed understanding of HPC applications' resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Cassella, Paul; Enos, Jeremy; Fullop, Joshi; Gentile, Ann C.; Monk, Stephen T.; Naksinehaboon, Nichamon; Ogden, Jeffry B.; Rajan, Mahesh R.; Showerman, Michael; Stevenson, Joel O.; Taerat, Narate; Tucker, Tom

Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.

Simulation information regarding Sandia National Laboratories Trinity capability improvement metric

Agelastos, Anthony M.; Lin, Paul L.

Sandia National Laboratories, Los Alamos National Laboratory, and Lawrence Livermore National Laboratory each selected a representative simulation code to be used as a performance benchmark for the Trinity Capability Improvement Metric. Sandia selected SIERRA Low Mach Module: Nalu, which is a fluid dynamics code that solves many variable-density, acoustically incompressible problems of interest spanning from laminar to turbulent flow regimes, since it is fairly representative of implicit codes that have been developed under ASC. The simulations for this metric were performed on the Cielo Cray XE6 platform during dedicated application time and the chosen case utilized 131,072 Cielo cores to perform a canonical turbulent open jet simulation within an approximately 9-billion-element unstructured hexahedral computational mesh. This report will document some of the results from these simulations as well as provide instructions to perform these simulations for comparison.
