Publications

FY19 ASC L2 Milestone 6813 Place Astra (Vanguard1) in operation and Configured for OHPC and SRN, Advanced Prototype Work. Executive Summary

This milestone was created to ensure the Sandia FOUS program has the needed levels of project direction, programmatic information and an escalation path (if needed) for their role within the deployment and operations of the Astra cluster.

TYPE Other Report YEAR 2019

DOI OSTI

Vanguard Astra and ATSE – an ARM-based Advanced Architecture Prototype System and Software Environment (FY18 L2 Milestone #8759 Report)

Bays, Nathan R.; Hammond, Simon; Aguilar, Michael J.; Curry, Matthew L.; Grant, Ryan; Hoekstra, Robert J.; Klundt, Ruth A.; Monk, Stephen T.; Ogden, Jeffry B.; Olivier, Stephen L.; Scott, Randall D.; Ward, Harry L.; Younge, Andrew J.

The Vanguard program informally began in January 2017 with the submission of a white paper entitled "Sandia's Vision for a 2019 Arm Testbed" to NNSA headquarters. The program proceeded in earnest in May 2017 with an announcement by Doug Wade (Director, Office of Advanced Simulation and Computing and Institutional R&D at NNSA) that Sandia National Laboratories (Sandia) would host the first Advanced Architecture Prototype platform based on the Arm architecture. In August 2017, Sandia formed a Tri-lab team chartered to develop a robust HPC software stack for Astra to support the Vanguard program goal of demonstrating the viability of Arm in supporting ASC production computing workloads.

TYPE SAND Report YEAR 2018

DOI OSTI

FY18 L2 Milestone #6360 Report: Initial Capability of an Arm-based Advanced Architecture Prototype System and Software Environment

Bays, Nathan R.; Hammond, Simon; Aguilar, Michael J.; Curry, Matthew L.; Grant, Ryan; Hoekstra, Robert J.; Klundt, Ruth A.; Monk, Stephen T.; Ogden, Jeffry B.; Olivier, Stephen L.; Scott, Randall D.; Ward, Harry L.; Younge, Andrew J.

The Vanguard program informally began in January 2017 with the submission of a white paper entitled "Sandia's Vision for a 2019 Arm Testbed" to NNSA headquarters. The program proceeded in earnest in May 2017 with an announcement by Doug Wade (Director, Office of Advanced Simulation and Computing and Institutional R&D at NNSA) that Sandia National Laboratories (Sandia) would host the first Advanced Architecture Prototype platform based on the Arm architecture. In August 2017, Sandia formed a Tri-lab team chartered to develop a robust HPC software stack for Astra to support the Vanguard program goal of demonstrating the viability of Arm in supporting ASC production computing workloads. This document describes the high-level Vanguard program goals, the Vanguard-Astra project acquisition plan and procurement up to contract placement, the initial software stack environment planned for the Vanguard-Astra platform (Astra), a description of how the communities of users will utilize the platform during the transition from the open network to the classified network, and initial performance results.

TYPE SAND Report YEAR 2018

DOI OSTI

Application of Performance Analysis Tools on SNL ASC Codes

Agelastos, Anthony M.; Pase, Douglas M.; Amspaugh, Kathleen A.; Dinge, Dennis; Haskell, Karen; Ice, Lisa; Lamb, Justin M.; Shaw, Ryan; Stevenson, Joel O.; Brunini, Victor; Clausen, Jonathan; Crawford, Martin J.; Valdez, Greg D.; Klundt, Ruth A.; Monk, Stephen T.; Ogden, Jeffry B.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Preliminary Assessment of Impact from Patch for Meltdown and Spectre (variants 1 & 2) on Sandia National Laboratories' HPC Production Operations Using ASC Integrated Codes

Agelastos, Anthony M.; Pase, Douglas M.; Klitsner, Tom; Monk, Stephen T.; Noe, John P.; Pavlakos, Constantine; Klundt, Ruth A.; Stevenson, Joel O.; Lamb, Justin M.; Ogden, Jeffry B.

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Live feed Sandia CAPVIZ HPC cluster performance analysis & visualization demonstration

Allan, Benjamin A.; Schmitz, Mark E.; Walsh, Edward J.; Aguilar, Michael J.; Brandt, James M.; Gentile, Ann C.; Ogden, Jeffry B.; Monk, Stephen T.; Noe, John P.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems

Parallel Computing

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh; Stevenson, Joel O.

A detailed understanding of HPC applications’ resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

TYPE Journal Article YEAR 2016

DOI OSTI Scopus

New platform provides innovation

Monk, Stephen T.

Abstract not provided.

TYPE Other Report YEAR 2015

DOI OSTI

Toward Rapid Understanding of Production HPC Applications and Systems

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh; Stevenson, Joel O.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh; Stevenson, Joel O.

Abstract not provided.

TYPE Conference YEAR 2014

OSTI DOI

Toward Rapid Understanding of Production HPC Applications and Systems

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh; Stevenson, Joel O.

Abstract not provided.

TYPE Conference YEAR 2014

DOI OSTI

The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Cassella, Paul; Enos, Jeremy; Fullop, Joshi; Gentile, Ann C.; Monk, Stephen T.; Naksinehaboon, Nichamon; Ogden, Jeffry B.; Rajan, Mahesh; Showerman, Michael; Stevenson, Joel O.; Taerat, Narate; Tucker, Thomas O.

Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.

TYPE Conference Poster YEAR 2014

DOI OSTI Scopus

Uno: One Cluster - Many Roles

Monk, Stephen T.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Red sky/Red mesa : an innovative, energy efficient supercomputer

Monk, Stephen T.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

SNL capacity file systems

Monk, Stephen T.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Lustre on Red Sky

Monk, Stephen T.

The goals of Lustre on Red Sky are: (1) provide home/projects/scratch Lustre file systems; (2) adhere to the Sun HPC stack; (3) implement software RAID on Sun-provided JBODs; and (4) design for easy administration. Conclusions are: (1) software RAID includes additional risks and administration vs. hardware RAID solutions; (2) limited testing of hardware in these configurations makes it ill-suited for rapid deployment in a production environment; and (3) Lustre has been a shining star on this machine, and Red Sky users are pleased with its performance.

TYPE Conference YEAR 2010

OSTI

HPC top 10 InfiniBand Machine : a 3D Torus IB interconnect on Red Sky

Naegle, John H.; Monk, Stephen T.; Schutt, James A.; Doerfler, Douglas W.; Rajan, Mahesh

This presentation discusses the following topics: (1) Red Sky Background; (2) 3D Torus Interconnect Concepts; (3) Difficulties of Torus in IB; (4) New Routing Code for an IB 3D Torus; (5) Red Sky 3D Torus Implementation; and (6) Managing a Large IB Machine. Computing at Sandia: (1) Capability Computing - Designed for scaling of single large runs, Usually proprietary for maximum performance, and Red Storm is Sandia's current capability machine; (2) Capacity Computing - Computing for the masses, 100s of jobs and 100s of users, Extreme reliability required, Flexibility for changing workload, Thunderbird will be decommissioned this quarter, Red Sky is our future capacity computing platform, and Red Mesa machine for National Renewable Energy Lab. Red Sky main themes are: (1) Cheaper - 5X capacity of Tbird at 2/3 the cost, Substantially cheaper per flop than our last large capacity machine purchase; (2) Leaner - Lower operational costs, Three security environments via modular fabric, Expandable, upgradeable, extensible, and Designed for 6yr. life cycle; and (3) Greener - 15% less power (1/6th power per flop), 40% less water (5M gallons saved annually), 10X better cooling efficiency, and 4X denser footprint.

TYPE Conference YEAR 2010

OSTI