Application of Performance Analysis Tools on SNL ASC Codes
This milestone 1) exercised a broad set of performance profiling and analysis tools, including tools whose development has been promoted by the ASC program; 2) exercised the tools on two different SNL ASC codes, one Sierra code (Sierra/Aria, a C++ codebase) and one RAMSES code (ITS, a Fortran codebase); and 3) exercised the tools on multiple platforms, including the CTS-1 (e.g., Serrano) and ATS-1 Trinity (e.g., Mutrino) platforms. The milestone generated a large body of strong- and weak-scaling, trend, and profile data for multiple versions and problem cases of each of the two codes. A wealth of experience was gained with the various tools, including identification of problems, an improved understanding of feature sets, enhanced usage documentation, and insights for future tool development. Results are provided from a large number and variety of performance-analysis runs with the target codes, together with instructions for using the tools with those codes.
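As a concrete illustration of the kind of scaling data the milestone produced, the following Python sketch shows how strong- and weak-scaling efficiencies can be computed from wall-clock timings collected at different MPI rank counts. The rank counts and timings below are placeholders for illustration only, not measurements from Sierra/Aria or ITS.

```python
# Hypothetical sketch: computing strong- and weak-scaling efficiency from
# wall-clock timings of the kind gathered in the milestone runs. All numbers
# are illustrative placeholders, not measured data.

def strong_scaling_efficiency(base_ranks, base_time, ranks, runtime):
    """Ideal strong scaling halves the runtime when the rank count doubles."""
    ideal_speedup = ranks / base_ranks
    actual_speedup = base_time / runtime
    return actual_speedup / ideal_speedup

def weak_scaling_efficiency(base_time, runtime):
    """Ideal weak scaling keeps runtime constant as work grows with rank count."""
    return base_time / runtime

if __name__ == "__main__":
    # Placeholder timings (seconds) at increasing MPI ranks, fixed problem size.
    strong_runs = [(32, 1200.0), (64, 650.0), (128, 380.0), (256, 240.0)]
    base_r, base_t = strong_runs[0]
    for r, t in strong_runs:
        eff = strong_scaling_efficiency(base_r, base_t, r, t)
        print(f"strong  {r:4d} ranks: {t:7.1f} s  efficiency {eff:5.2f}")

    # Placeholder timings where the problem size grows with the rank count.
    weak_runs = [(32, 300.0), (64, 310.0), (128, 335.0), (256, 372.0)]
    base_t = weak_runs[0][1]
    for r, t in weak_runs:
        eff = weak_scaling_efficiency(base_t, t)
        print(f"weak    {r:4d} ranks: {t:7.1f} s  efficiency {eff:5.2f}")
```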
Sierra is an engineering mechanics simulation code suite supporting the Nation's Nuclear Weapons mission as well as other customers. It has explicit ties to Sandia National Labs' workflow, including geometry and meshing, design and optimization, and visualization. Distinguishing strengths include "application aware" development, scalability, SQA and V&V, multiple scales, and multi-physics coupling. This document is intended to serve new and existing users of Sierra as a user manual and troubleshooting guide.
Parallel Computing
A detailed understanding of HPC applications' resource needs, and of their complex interactions with each other and with HPC platform resources, is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions, and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized, system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.
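To illustrate the flavor of this analysis, the sketch below aggregates synchronized per-node monitoring samples into per-node summary statistics that can be compared to spot imbalance or resource hot spots. The CSV layout and column names (node, cpu_pct, mem_gb, net_mbps) are hypothetical and do not reflect the actual monitoring schema used on the SNL cluster.

```python
# Minimal sketch, assuming per-node monitoring samples have been exported to a
# CSV file with hypothetical columns: timestamp, node, cpu_pct, mem_gb, net_mbps.
import csv
from collections import defaultdict
from statistics import mean

def summarize(samples_csv):
    per_node = defaultdict(lambda: {"cpu": [], "mem": [], "net": []})
    with open(samples_csv, newline="") as f:
        for row in csv.DictReader(f):
            series = per_node[row["node"]]
            series["cpu"].append(float(row["cpu_pct"]))
            series["mem"].append(float(row["mem_gb"]))
            series["net"].append(float(row["net_mbps"]))
    # Reduce each node's time series to summary statistics that can be
    # compared across nodes of a job to spot imbalance or contention.
    return {
        node: {
            "cpu_mean": mean(s["cpu"]),
            "cpu_max": max(s["cpu"]),
            "mem_max_gb": max(s["mem"]),
            "net_mean_mbps": mean(s["net"]),
        }
        for node, s in per_node.items()
    }

if __name__ == "__main__":
    for node, stats in sorted(summarize("node_samples.csv").items()):
        print(node, stats)
```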
Sandia has invested heavily in scientific/engineering application development and in the research, development, and deployment of large-scale HPC platforms to support the computational needs of these applications. As application developers continually expand the capabilities of their software and spend more time on performance tuning of applications for these platforms, HPC platform resources are at a premium: they are a heavily shared resource serving the varied needs of many users. To ensure that the HPC platform resources are being used efficiently and perform as designed, it is necessary to obtain reliable data on resource utilization that will allow us to investigate the occurrence, severity, and causes of performance-affecting contention between applications. The work presented in this paper was an initial step toward determining whether resource contention can be understood and minimized through monitoring, modeling, planning, and infrastructure. This paper describes the set of metric definitions, identified in this research, that can be used as meaningful and potentially actionable indicators of performance-affecting contention between applications. These metrics were verified using the observed slowdown of IOR, IMB, and CTH in operating scenarios that forced contention. This paper also describes system/application monitoring activities that are critical to distilling vast amounts of data into quantities that hold the key to understanding an application's performance under production conditions and that will ultimately aid Sandia's efforts to succeed in extreme-scale computing.
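The slowdown comparison used to verify such metrics can be illustrated with a small sketch: benchmark runtimes measured in isolation are compared against runtimes under forced contention, and runs whose slowdown exceeds a chosen threshold are flagged. The runtime values and threshold below are placeholders, not measured IOR, IMB, or CTH results.

```python
# Illustrative sketch of a slowdown comparison. The runtimes are placeholders.
THRESHOLD = 1.10  # flag runs more than 10% slower than their isolated baseline

def slowdown(baseline_s, contended_s):
    """Slowdown factor: values > 1.0 mean the contended run took longer."""
    return contended_s / baseline_s

if __name__ == "__main__":
    runs = {
        # benchmark: (isolated runtime s, runtime s with competing workload)
        "IOR": (410.0, 980.0),
        "IMB": (95.0, 140.0),
        "CTH": (1800.0, 2150.0),
    }
    for name, (iso, cont) in runs.items():
        s = slowdown(iso, cont)
        flag = "contention-affected" if s > THRESHOLD else "ok"
        print(f"{name}: slowdown {s:.2f}x  ({flag})")
```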
Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
Application performance data accounting for resource contention and other external influences is highly coveted and extremely difficult to obtain. "Why did my application's performance change from the last time it ran?" is a question shared by application developers, program analysts, and system administrators. The answer to this question impacts nearly all programmatic and R&D efforts related to high-performance computing (HPC). Lightweight, right-fidelity monitoring infrastructures that can gather relevant application and resource performance data across the entire HPC platform can help address this question. This short technical paper formally describes an ongoing research effort to define the metrics and methods that distill the vast quantities of available data to a minimum set of actionable and interpretable quantities usable by application developers, system administrators, production analysts, and HPC platform designers in their respective production and R&D focus areas.
The exploration of large parameter spaces in search of problem solutions and uncertainty quantification produces very large ensembles of data. Processing ensemble data will continue to require more resources as simulation complexity and HPC platform throughput increase. More tools are needed to help provide rapid insight into these data sets, to decrease manual processing time by the analyst, and to increase the knowledge the data can provide. One such tool is Tecplot Chorus, whose strengths are visualizing ensemble metadata and linked images. This report contains the analysis and conclusions from evaluating Tecplot Chorus with an example problem that is relevant to Sandia National Laboratories. It documents a preliminary evaluation of Tecplot Chorus for analyzing ensemble data from CTH simulations. The project that funded this report and evaluation is also evaluating and guiding development of SNL's Slycat. Slycat and Tecplot Chorus each have their strengths, weaknesses, and overlapping capabilities. It is quite likely that, as the scale of ensemble data increases, both of these tools (and possibly others) will be needed for different processing goals. This report focuses on Tecplot Chorus and its application to an example ensemble of data supplied by David J. Peterson and John P. Korbin; this example is of a flyer-plate impact and weld study, henceforth referred to as the CTH Impact Example. This evaluation also defines a workflow for analysts that can help reduce the time and resources required to process ensemble data.
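A typical starting point for ensemble tools of this kind is a table of per-case metadata with linked images: one row per simulation case, carrying its input parameters, scalar results, and the path to a rendered image. As a hedged illustration, the Python sketch below assembles such a table as a CSV file; the directory layout, file names, and column names are assumptions for illustration and do not reflect the actual CTH Impact Example data or the tools' ingest requirements.

```python
# Hypothetical sketch of assembling an ensemble metadata table: one row per
# simulation case, with input parameters and a path to a linked image. The
# directory layout ("case_*"), file names, and columns are illustrative only.
import csv
from pathlib import Path

def build_ensemble_table(ensemble_dir, out_csv):
    rows = []
    for case_dir in sorted(Path(ensemble_dir).glob("case_*")):
        params = {}
        # Assume each case directory holds a simple key=value parameter file.
        for line in (case_dir / "params.txt").read_text().splitlines():
            key, _, value = line.partition("=")
            if key.strip():
                params[key.strip()] = value.strip()
        rows.append({
            "case": case_dir.name,
            **params,
            # Path to a pre-rendered image the ensemble tool can link to.
            "image": str(case_dir / "final_state.png"),
        })
    fieldnames = sorted({k for r in rows for k in r})
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    build_ensemble_table("ensemble_runs", "ensemble_metadata.csv")
```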