Page 4 – Center for Computing Research (CCR)

Trusting simulation output is crucial for Sandia's mission objectives. We rely on these simulations to perform our high-consequence mission tasks given our treaty obligations. Other science and modelling needs, while they may not be high-consequence, still require the strongest levels of trust to enable using the result as the foundation for both practical applications and future research. To this end, the computing community has developed work- flow and provenance systems to aid in both automating simulation and modelling execution, but to also aid in determining exactly how was some output created so that conclusions can be drawn from the data. Current approaches for workflows and provenance systems are all at the user level and have little to no system level support making them fragile, difficult to use, and incomplete solutions. The introduction of container technology is a first step towards encapsulating and tracking artifacts used in creating data and resulting insights, but their current implementation is focused solely on making it easy to deploy an application in an isolated "sandbox" and maintaining a strictly read-only mode to avoid any potential changes to the application. All storage activities are still using the system-level shared storage. This project was an initial exploration into extending the container concept to also include storage and to use writable containers, auto generated by the system, as a way to link the contained data back to the simulation and input deck used to create it.

More Details

TYPE SAND Report YEAR 2018

OSTI DOI

Quantifying Metrics to Evaluate Containers for Deployment and Usage of NNSA Production Applications

Younge, Andrew J.; Agelastos, Anthony M.; Lofstead, Gerald F.; Warren, Aron W.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Vanguard Astra - Petascale ARM Platform for U.S. DOE/ASC Supercomputing

Younge, Andrew J.; Laros, James H.; Hammond, Simon D.; Pedretti, Kevin P.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

FY18 L2 Milestone #8759 Report: Vanguard Astra and ATSE ? an ARM-based Advanced Architecture Prototype System and Software Environment

Laros, James H.; Pedretti, Kevin P.; Hammond, Simon D.; Aguilar, Michael J.; Curry, Matthew L.; Grant, Ryan E.; Hoekstra, Robert J.; Klundt, Ruth A.; Monk, Stephen T.; Ogden, Jeffry B.; Olivier, Stephen L.; Scott, Randall D.; Ward, Harry L.; Younge, Andrew J.

The Vanguard program informally began in January 2017 with the submission of a white pa- per entitled "Sandia's Vision for a 2019 Arm Testbed" to NNSA headquarters. The program proceeded in earnest in May 2017 with an announcement by Doug Wade (Director, Office of Advanced Simulation and Computing and Institutional R&D at NNSA) that Sandia Na- tional Laboratories (Sandia) would host the first Advanced Architecture Prototype platform based on the Arm architecture. In August 2017, Sandia formed a Tri-lab team chartered to develop a robust HPC software stack for Astra to support the Vanguard program goal of demonstrating the viability of Arm in supporting ASC production computing workloads. This document describes the high-level Vanguard program goals, the Vanguard-Astra project acquisition plan and procurement up to contract placement, the initial software stack environment planned for the Vanguard-Astra platform (Astra), a description of how the communities of users will utilize the platform during the transition from the open network to the classified network, and initial performance results.

More Details

TYPE SAND Report YEAR 2018

OSTI DOI

FY18 L2 Milestone #6360 Report: Initial Capability of an Arm-based Advanced Architecture Prototype System and Software Environment

Laros, James H.; Pedretti, Kevin P.; Hammond, Simon D.; Aguilar, Michael J.; Curry, Matthew L.; Grant, Ryan E.; Hoekstra, Robert J.; Klundt, Ruth A.; Monk, Stephen T.; Ogden, Jeffry B.; Olivier, Stephen L.; Scott, Randall D.; Ward, Harry L.; Younge, Andrew J.

The Vanguard program informally began in January 2017 with the submission of a white pa- per entitled "Sandia's Vision for a 2019 Arm Testbed" to NNSA headquarters. The program proceeded in earnest in May 2017 with an announcement by Doug Wade (Director, Office of Advanced Simulation and Computing and Institutional R&D at NNSA) that Sandia Na- tional Laboratories (Sandia) would host the first Advanced Architecture Prototype platform based on the Arm architecture. In August 2017, Sandia formed a Tri-lab team chartered to develop a robust HPC software stack for Astra to support the Vanguard program goal of demonstrating the viability of Arm in supporting ASC production computing workloads. This document describes the high-level Vanguard program goals, the Vanguard-Astra project acquisition plan and procurement up to contract placement, the initial software stack environment planned for the Vanguard-Astra platform (Astra), a description of how the communities of users will utilize the platform during the transition from the open network to the classified network, and initial performance results.

More Details

TYPE SAND Report YEAR 2018

OSTI DOI

A comparison of power management mechanisms: P-States vs. node-level power cap control

Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018

Pedretti, Kevin P.; Grant, Ryan E.; Laros, James H.; Levenhagen, Michael J.; Olivier, Stephen L.; Ward, Harry L.; Younge, Andrew J.

Large-scale HPC systems increasingly incorporate sophisticated power management control mechanisms. While these mechanisms are potentially useful for performing energy and/or power-aware job scheduling and resource management (EPA JSRM), greater understanding of their operation and performance impact on real-world applications is required before they can be applied effectively in practice. In this paper, we compare static p-state control to static node-level power cap control on a Cray XC system. Empirical experiments are performed to evaluate node-to-node performance and power usage variability for the two mechanisms. We find that static p-state control produces more predictable and higher performance characteristics than static node-level power cap control at a given power level. However, this performance benefit is at the cost of less predictable power usage. Static node-level power cap control produces predictable power usage but with more variable performance characteristics. Our results are not intended to show that one mechanism is better than the other. Rather, our results demonstrate that the mechanisms are complementary to one another and highlight their potential for combined use in achieving effective EPA JSRM solutions.

More Details

TYPE Conference Poster YEAR 2018

Scopus OSTI DOI

HPC at Sandia: Exploring the Virtualization and Containerization of ARM64 Processors for Future HPC Workloads

Younge, Andrew J.; Pedretti, Kevin P.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

HPC Containerization with Singularity

Younge, Andrew J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Leveraging Containerization for DevOps with Sandia's HPC Workloads

Younge, Andrew J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Vanguard Astra - Petascale ARM Platform for U.S. DOE/ASC Supercomputing

Younge, Andrew J.; Pedretti, Kevin P.; Laros, James H.; Hammond, Simon D.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Portals 4: Status of Specification and Implementation

Younge, Andrew J.; Grant, Ryan E.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Containers in HPC: Scaling form Laptops to Supercomputers

Younge, Andrew J.; Pedretti, Kevin P.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Advanced Power Measurement and Control for the Trinity Supercomputer

Younge, Andrew J.; Grant, Ryan E.; Laros, James H.; Levenhagen, Michael J.; Olivier, Stephen L.; Pedretti, Kevin P.; Ward, Harry L.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Supporting High Performance Analytics with System Software for Virtualized Supercomputing

Younge, Andrew J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

A Comparison of Power Management Mechanisms: P-states vs. Node-Level Power Cap Control

Pedretti, Kevin P.; Grant, Ryan E.; Laros, James H.; Levenhagen, Michael J.; Olivier, Stephen L.; Ward, Harry L.; Younge, Andrew J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI DOI

Evaluating energy and power profiling techniques for HPC workloads

2017 8th International Green and Sustainable Computing Conference, IGSC 2017

Grant, Ryan E.; Laros, James H.; Levenhagen, Michael J.; Olivier, Stephen L.; Pedretti, Kevin P.; Ward, Harry L.; Younge, Andrew J.

Advanced power measurement capabilities are becoming available on large scale High Performance Computing (HPC) deployments. There exist several approaches to providing power measurements today, primarily through in-band (e.g. RAPL) and out-of-band measurements (e.g. power meters). Both types of measurement can be augmented with application-level profiling, however it can be difficult to assess the type and detail of measurement needed to obtain insight from the application power profile. This paper presents a taxonomy for classifying power profiling techniques on modern HPC platforms. Three HPC mini-applications are analyzed across three production HPC systems to examine the level of detail, scope, and complexity of these power profiles. We demonstrate that a combination of out-of-band measurement with in-band application region profiling can provide an accurate, detailed view of power usage without introducing overhead. This work also provides a set of recommendations for how to best profile HPC workloads.

More Details

TYPE Conference Poster YEAR 2018

Scopus OSTI

A Tale of Two Systems: Using Containers to Deploy HPC Applications on Supercomputers and Clouds

Younge, Andrew J.; Pedretti, Kevin P.; Grant, Ryan E.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI DOI

A Tale of Two Systems: Using Containers to Deploy HPC Applications on Supercomputers and Clouds

Younge, Andrew J.; Pedretti, Kevin P.; Grant, Ryan E.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI DOI

Evaluating Energy and Power Profiling Techniques for HPC Workloads

Younge, Andrew J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Practice and Experience using Containers in HPC

Younge, Andrew J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Enabling Diverse Software Stacks on Supercomputers Using High Performance Virtual Clusters

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Younge, Andrew J.; Pedretti, Kevin P.; Grant, Ryan E.; Gaines, Brian G.; Brightwell, Ronald B.

While large-scale simulations have been the hallmark of the High Performance Computing (HPC) community for decades, Large Scale Data Analytics (LSDA) workloads are gaining attention within the scientific community not only as a processing component to large HPC simulations, but also as standalone scientific tools for knowledge discovery. With the path towards Exascale, new HPC runtime systems are also emerging in a way that differs from classical distributed computing models. However, system software for such capabilities on the latest extreme-scale DOE supercomputing needs to be enhanced to more appropriately support these types of emerging software ecosystems.In this paper, we propose the use of Virtual Clusters on advanced supercomputing resources to enable systems to support not only HPC workloads, but also emerging big data stacks. Specifically, we have deployed the KVM hypervisor within Cray's Compute Node Linux on a XC-series supercomputer testbed. We also use libvirt and QEMU to manage and provision VMs directly on compute nodes, leveraging Ethernet-over-Aries network emulation. To our knowledge, this is the first known use of KVM on a true MPP supercomputer. We investigate the overhead our solution using HPC benchmarks, both evaluating single-node performance as well as weak scaling of a 32-node virtual cluster. Overall, we find single node performance of our solution using KVM on a Cray is very efficient with near-native performance. However overhead increases by up to 20% as virtual cluster size increases, due to limitations of the Ethernet-over-Aries bridged network. Furthermore, we deploy Apache Spark with large data analysis workloads in a Virtual Cluster, effectively demonstrating how diverse software ecosystems can be supported by High Performance Virtual Clusters.

More Details

TYPE Conference Poster YEAR 2017

Scopus OSTI

Project Vanguard - Prototyping a large-scale ARM-based HPC platform at Sandia

Younge, Andrew J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Publications