Publications Search

Containers and the Truth between HPC & Cloud System Software Convergence

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

ALAMO: Autonomous lightweight allocation, management, and optimization

Communications in Computer and Information Science

Brightwell, Ronald B.; Ferreira, Kurt B.; Grant, Ryan; Levy, Scott; Lofstead, Gerald (Jay) F.; Olivier, Stephen L.; Bays, Nathan R.; Younge, Andrew J.; Gentile, Ann C.; Bays, Nathan R.

Several recent workshops conducted by the DOE Advanced Scientific Computing Research program have established that the complexity of developing applications and executing them on high-performance computing (HPC) systems is rising at a rate that will make it nearly impossible to continue to achieve higher levels of performance and scalability. Absent an alternative approach to managing this ever-growing complexity, HPC systems will become increasingly difficult to use. A more holistic approach to designing and developing applications and managing system resources is required. This paper outlines a research strategy for managing the increasing complexity by providing the programming environment, software stack, and hardware capabilities needed for autonomous resource management of HPC systems. Developing portable applications for a variety of HPC systems of varying scale requires a paradigm shift from the current approach, where applications are painstakingly mapped to individual machine resources, to an approach where machine resources are automatically mapped and optimized to applications as they execute. Achieving such automated resource management for HPC systems is a daunting challenge that requires significant sustained investment in exploring new approaches and novel capabilities in software and hardware that span the spectrum from programming systems to device-level mechanisms. This paper provides an overview of the functionality needed to enable autonomous resource management and optimization and describes the components currently being explored at Sandia National Laboratories to help support this capability.

TYPE Conference Poster YEAR 2021

OSTI Scopus

HPC Operating System Research Areas and Challenges

Bays, Nathan R.; Brightwell, Ronald B.; Younge, Andrew J.; Lange, Jack

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

Containers for the Modernization of HPC Software Deployment

Younge, Andrew J.

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

Chronicles of Astra: Challenges and lessons from the first petascale Arm supercomputer

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Bays, Nathan R.; Younge, Andrew J.; Hammond, Simon; Bays, Nathan R.; Curry, Matthew; Aguilar, Michael J.; Hoekstra, Robert J.; Brightwell, Ronald B.

Arm processors have been explored in HPC for several years; however, there has not yet been a demonstration of viability for supporting large-scale production workloads. In this paper, we offer a retrospective on the process of bringing up Astra, the first Petascale supercomputer based on 64-bit Arm processors, and validating its ability to run production HPC applications. Through this process several immature technology gaps were addressed, including software stack enablement, Linux bugs at scale, thermal management issues, power management capabilities, and advanced container support. From this experience, several lessons learned are formulated that contributed to the successful deployment of Astra. These insights can be helpful to accelerate deploying and maturing other first-seen HPC technologies. With Astra now supporting many users running a diverse set of production applications at multi-thousand node scales, we believe this constitutes strong supporting evidence that Arm is a viable technology for even the largest-scale supercomputer deployments.

TYPE Conference Poster YEAR 2020

OSTI Scopus

Early Experiences with A64FX

Hammond, Simon; Younge, Andrew J.; Bays, Nathan R.; Bays, Nathan R.

Abstract not provided.

TYPE Conference Presentation YEAR 2020

DOI OSTI

Enabling Power Measurement and Control on Astra: The First Petascale Arm Supercomputer

Grant, Ryan; Hammond, Simon; Bays, Nathan R.; Levenhagen, Michael; Olivier, Stephen L.; Bays, Nathan R.; Ward, Harry L.; Younge, Andrew J.

Abstract not provided.

TYPE Conference Paper YEAR 2020

OSTI

CANOPIE-HPC Workshop at SC20

Younge, Andrew J.

Abstract not provided.

TYPE Conference Presentation YEAR 2020

DOI OSTI

Modern Container Runtimes for Exascale computing era

Younge, Andrew J.

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

Job Modeling for Power Forecasting and Analysis on the Astra Supercomputer

Wang, Felix W.; Bays, Nathan R.; Vineyard, Craig M.; Younge, Andrew J.

Abstract not provided.

TYPE Conference Paper YEAR 2020

OSTI

Chronicles of Astra: Challenges and Lessons from the First Petascale Arm Supercomputer

Younge, Andrew J.

Abstract not provided.

TYPE Conference Presentation YEAR 2020

DOI OSTI

ECP Container Status 2020

Younge, Andrew J.

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

Towards Containerized HPC Applications at Exascale

Younge, Andrew J.

Abstract not provided.

TYPE Conference Presentation YEAR 2020

DOI OSTI

Containers in HPC: Testbeds Production and Towards Exascale

Younge, Andrew J.

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

Containers and the Future of Supercomputing

Younge, Andrew J.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

HPC Container Runtimes: A Quick Primer

Younge, Andrew J.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Supercomputing with Containers: Practice Experiences and Tupperware

Younge, Andrew J.

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

Container Utilization at DOE Compute Facilities

Younge, Andrew J.; Agelastos, Anthony M.; Bays, Nathan R.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Machines Learning about Machines - ML for Analysis and Control of HPC Infrastructure

Wang, Felix W.; Green, Sam; Bays, Nathan R.; Vineyard, Craig M.; Younge, Andrew J.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

ECP Tutorial: Getting Started with Containers on HPC

Younge, Andrew J.; Canon, Shane; Shende, Sameer

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

HPC Containers Usage Sandia National Laboratories

Younge, Andrew J.; Agelastos, Anthony M.; Lawson, Gary; Bays, Nathan R.

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

ECP Supercontainers: 2.3.5.09 Packaging Technologies

Younge, Andrew J.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Data Pallets: Containerizing Storage For Reproducibility and Traceability

Lecture Notes in Computer Science

Lofstead, Gerald (Jay) F.; Baker, Joshua; Younge, Andrew J.

Trusting simulation output is crucial for Sandia’s mission objectives. Here, we rely on these simulations to perform our high-consequence mission tasks given national treaty obligations. Other science and modeling applications, while they may have high-consequence results, still require the strongest levels of trust to enable using the result as the foundation for both practical applications and future research. To this end, the computing community has developed workflow and provenance systems to aid in both automating simulation and modeling execution and determining exactly how some output was created so that conclusions can be drawn from the data. Current approaches for workflows and provenance systems are all at the user level and have little to no system-level support, making them fragile, difficult to use, and incomplete solutions. The introduction of container technology is a first step towards encapsulating and tracking artifacts used in creating data and resulting insights, but its current implementation is focused solely on making it easy to deploy an application in an isolated “sandbox” and maintaining a strictly read-only mode to avoid any potential changes to the application. All storage activities still use the system-level shared storage. This project explores extending the container concept to include storage as a new container type we call data pallets. Data Pallets are potentially writeable, auto-generated by the system based on IO activities, and usable as a way to link the contained data back to the application and input deck used to create it.

TYPE Journal Article YEAR 2019

DOI OSTI

Enabling HPC workloads on Cloud Infrastructure using Kubernetes Container Orchestration Mechanisms

Younge, Andrew J.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

A Case for Portability and Reproducibility of HPC Containers

Proceedings of CANOPIE-HPC 2019: 1st International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Canon, Richard S.; Younge, Andrew J.

Containerized computing is quickly changing the landscape for the development and deployment of many HPC applications. Containers are able to lower the barrier of entry for emerging workloads to leverage supercomputing resources. However, containers are no silver bullet for deploying HPC software, and there are several challenges ahead that the community must address to ensure container workloads can be reproducible and interoperable. In this paper, we discuss several challenges in utilizing containers for HPC applications and the current approaches used in many HPC container runtimes. These approaches have been proven to enable high-performance execution of containers at scale with the appropriate runtimes. However, the use of these techniques is still ad hoc and tests the limits of container workload portability, and several gaps likely remain. We discuss those remaining gaps and propose several potential solutions, including custom container label tagging and runtime hooks as a first step in managing HPC system library complexity.

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus
