Publications

7 Results
Skip to search filters

FY18 L2 Milestone #8759 Report: Vanguard Astra and ATSE ? an ARM-based Advanced Architecture Prototype System and Software Environment

Laros, James H.; Pedretti, Kevin P.; Hammond, Simon D.; Aguilar, Michael J.; Curry, Matthew L.; Grant, Ryan E.; Hoekstra, Robert J.; Klundt, Ruth A.; Monk, Stephen T.; Ogden, Jeffry B.; Olivier, Stephen L.; Scott, Randall D.; Ward, Harry L.; Younge, Andrew J.

The Vanguard program informally began in January 2017 with the submission of a white pa- per entitled "Sandia's Vision for a 2019 Arm Testbed" to NNSA headquarters. The program proceeded in earnest in May 2017 with an announcement by Doug Wade (Director, Office of Advanced Simulation and Computing and Institutional R&D at NNSA) that Sandia Na- tional Laboratories (Sandia) would host the first Advanced Architecture Prototype platform based on the Arm architecture. In August 2017, Sandia formed a Tri-lab team chartered to develop a robust HPC software stack for Astra to support the Vanguard program goal of demonstrating the viability of Arm in supporting ASC production computing workloads. This document describes the high-level Vanguard program goals, the Vanguard-Astra project acquisition plan and procurement up to contract placement, the initial software stack environment planned for the Vanguard-Astra platform (Astra), a description of how the communities of users will utilize the platform during the transition from the open network to the classified network, and initial performance results.

More Details

FY18 L2 Milestone #6360 Report: Initial Capability of an Arm-based Advanced Architecture Prototype System and Software Environment

Laros, James H.; Pedretti, Kevin P.; Hammond, Simon D.; Aguilar, Michael J.; Curry, Matthew L.; Grant, Ryan E.; Hoekstra, Robert J.; Klundt, Ruth A.; Monk, Stephen T.; Ogden, Jeffry B.; Olivier, Stephen L.; Scott, Randall D.; Ward, Harry L.; Younge, Andrew J.

The Vanguard program informally began in January 2017 with the submission of a white pa- per entitled "Sandia's Vision for a 2019 Arm Testbed" to NNSA headquarters. The program proceeded in earnest in May 2017 with an announcement by Doug Wade (Director, Office of Advanced Simulation and Computing and Institutional R&D at NNSA) that Sandia Na- tional Laboratories (Sandia) would host the first Advanced Architecture Prototype platform based on the Arm architecture. In August 2017, Sandia formed a Tri-lab team chartered to develop a robust HPC software stack for Astra to support the Vanguard program goal of demonstrating the viability of Arm in supporting ASC production computing workloads. This document describes the high-level Vanguard program goals, the Vanguard-Astra project acquisition plan and procurement up to contract placement, the initial software stack environment planned for the Vanguard-Astra platform (Astra), a description of how the communities of users will utilize the platform during the transition from the open network to the classified network, and initial performance results.

More Details

Demonstration of a Legacy Application's Path to Exascale - ASC L2 Milestone 4467

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

Abstract not provided.

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467-Demonstration of a Legacy Application's Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE RD in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight LWKs, virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications-Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.

More Details

Using the Sirocco File System for high-bandwidth checkpoints

Klundt, Ruth A.; Ward, Harry L.

The Sirocco File System, a file system for exascale under active development, is designed to allow the storage software to maximize quality of service through increased flexibility and local decision-making. By allowing the storage system to manage a range of storage targets that have varying speeds and capacities, the system can increase the speed and surety of storage to the application. We instrument CTH to use a group of RAM-based Sirocco storage servers allocated within the job as a high-performance storage tier to accept checkpoints, allowing computation to potentially continue asynchronously of checkpoint migration to slower, more permanent storage. The result is a 10-60x speedup in constructing and moving checkpoint data from the compute nodes. This demonstration of early Sirocco functionality shows a significant benefit for a real I/O workload, checkpointing, in a real application, CTH. By running Sirocco storage servers within a job as RAM-only stores, CTH was able to store checkpoints 10-60x faster than storing to PanFS, allowing the job to continue computing sooner. While this prototype did not include automatic data migration, the checkpoint was available to be pushed or pulled to disk-based storage as needed after the compute nodes continued computing. Future developments include the ability to dynamically spawn Sirocco nodes to absorb checkpoints, expanding this mechanism to other fast tiers of storage like flash memory, and sharing of dynamic Sirocco nodes between multiple jobs as needed.

More Details
7 Results
7 Results