Publications Search

Assessment of Data-Management Infrastructure Needs for Production Use of Advanced Machine Learning and Artificial Intelligence: Tri-Lab Level II Milestone (8554)

Oldfield, Ron A.; Allan, Benjamin A.; Doutriaux, Charles; Lewis, Katherine; Ahrens, James; Sims, Benjamin; Sweeney, Christine; Banesh, Divya; Wofford, Quincy

A robust data-management infrastructure is a key enabler for National Security Enterprise (NSE) capabilities in artificial intelligence and machine learning. This document describes efforts from a team of researchers at Sandia National Laboratories, Los Alamos National Laboratory, and Livermore National Laboratory to complete ASC Level II milestone #8854 “Assessment of Data-Management Infrastructure Needs for Production use of Advanced Machine learning and Artificial Intelligence.”

More Details

TYPE SAND Report YEAR 2023

DOI OSTI

AI for Security: NNSA's Role in AI4SES

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference Presenation YEAR 2022

DOI OSTI

The ASC Advanced Machine Learning Initiative at Sandia National Laboratories: FY21 Accomplishments and FY22 Plans

Oldfield, Ron A.; Kramer, Sharlotte L.; Rushdi, Ahmad R.; Laros, James H.; Emery, John M.; Kuberry, Paul A.; Ray, Jaideep R.; Ackerman, Sarah A.; Cyr, Eric C.; Saavedra, Gary J.; Hughes, Clayton H.; Cardwell, Suma G.; Smith, John D.

Abstract not provided.

More Details

TYPE Presentation YEAR 2021

OSTI

Memo regarding the Final Review of FY21 ASC L2 Milestone 7840: Neural Mini-Apps for Future Heterogeneous HPC Systems

Oldfield, Ron A.; Plimpton, Steven J.; Laros, James H.; Poliakoff, David Z.; Sornborger, Andrew

The final review for the FY21 Advanced Simulation and Computing (ASC) Computational Systems and Software Environments (CSSE) L2 Milestone #7840 was conducted on August 25th, 2021 at Sandia National Laboratories in Albuquerque, New Mexico. The review committee/panel unanimously agreed that the milestone has been successfully completed, exceeding expectations on several of the key deliverables.

More Details

TYPE Other Report YEAR 2021

DOI OSTI

The ASC Advanced Machine Learning Initiative at Sandia National Laboratories

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2021

OSTI

SNL ATDM Software Technologies. ECP Capability Assessment Report for Software Technologies

Oldfield, Ron A.; Wolf, Michael W.; Brightwell, Ronald B.

The Exascale Computing Project (ECP) Capability Assessment Report for Software Technologies at Sandia National Laboratories is provided. The projects are now aggregated to include Kokkos, Kokkos Kernels, VTK-m Operating Systems, and On-Node Runtime efforts. Key challenges and solution strategies are presented for each.

More Details

TYPE Other Report YEAR 2020

DOI OSTI

FY20 CSSE L2 Milestone 7186

Templet Jr., Gary J.; Glickman, Matthew R.; Kordenbrock, Todd H.; Levy, Scott L.; Lofstead, Gerald F.; Mauldin, Jeff; Otahal, Thomas J.; Ulmer, Craig D.; Widener, Patrick W.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2020

OSTI

Data Services for Visualization and Analysis - ASC Level II Milestone (7186)

Templet Jr., Gary J.; Glickman, Matthew R.; Kordenbrock, Todd H.; Levy, Scott L.; Lofstead, Gerald F.; Mauldin, Jeff; Otahal, Thomas J.; Ulmer, Craig D.; Widener, Patrick W.; Oldfield, Ron A.

A new in transit Data Service is presented and compared to the traditional file-based workflow and the newly refactored in situ Catalyst workflow. Each workflow is enabled by the IOSS mesh interface equipped with data management layers for Exodus and CGNS (file-based), Catalyst (in situ), and FAODEL (in transit). FAODEL is a distributed object store that can transmit data across MPI allocations. Catalyst is a Para View-based visualization capability developed as part of the CSSE Data Services effort. The workflows considered here take SPARC data into Catalyst for visualization post-processing. Although still in unoptimized form, we show that the in transit approach is a viable alternative to file-based and in situ workflows and offers several advantages to both simulation and post-processing developers. Since IOSS is a mature interface with wide adoption across Sandia and externally, each workflow can be reconfigured to use different simulations that generate mesh data and post-processing tools that consume it.

More Details

TYPE SAND Report YEAR 2020

DOI OSTI

September 2019 ECP ST Project Review

Trujillo, Gabrielle T.; Turner, Daniel Z.; Brightwell, Ronald B.; Oldfield, Ron A.; Clay, Robert L.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

Data Science and Computer Science Research at Sandia National Laboratories

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

ASCR Workshop on In Situ Data Management

Peterka, Tom; Bard, Deborah; Bennett, Janine C.; Bethel, E.W.; Oldfield, Ron A.; Pouchard, Line; Sweeney, Christine; Wolf, Matthew

In January 2019, the U.S. Department of Energy, Office of Science program in Advanced Scientific Computing Research, convened a workshop to identify priority research directions for in situ data management (ISDM). The workshop defined ISDM as the practices, capabilities, and procedures to control the organization of data and enable the coordination and communication among heterogeneous tasks, executing simultaneously in a high-performance computing system, cooperating toward a common objective. The workshop revealed two primary, interdependent motivations for processing and managing data in situ. The first motivation is that the in situ methodology enables scientific discovery from a broad range of data sources over a wide scale of computing platforms: leadership-class systems, clusters, clouds, workstations, and embedded devices at the edge. The successful development of ISDM capabilities will benefit real-time decision-making, design optimization, and data-driven scientific discovery. The second motivation is the need to decrease data volumes. ISDM can make critical contributions to managing large data volumes from computations and experiments to minimize data movement, save storage space, and boost resource efficiency, often while simultaneously increasing scientific precision.

More Details

TYPE Other Report YEAR 2019

DOI OSTI

SNL Data and Visualization: ML Projects at Sandia

Owen, Steven J.; Siefert, Christopher S.; Vineyard, Craig M.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Sandia Day @ Georgia Tech Technical Breakout Sessions

Weaver, Karla W.; Oldfield, Ron A.; Fang, H.E.; Gehl, M.; Muller, Richard P.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

Data Science and Computer Science R&D at SNL

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

SNL ATDM: In-situ Compression with ParaView/TuckerMPI

Kolla, Hemanth K.; Oldfield, Ron A.; Otahal, Thomas J.; Baker, Gavin M.; Mauldin, Jeffrey A.; Kolda, Tamara G.; Moreland, Kenneth D.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

SNL ATDM Data and Visualization: IOSS and FAODEL

Oldfield, Ron A.; Ulmer, Craig D.; Sjaardema, Gregory D.

The SNL ATDM Data and Visualization project is developing data management software to improve how applications store and exchange large datasets efficiently on Exascale platforms. The data portion of this project is composed of two related efforts: (1) production work focused on improving Sandia's IOSS library for mesh datasets and (2) research work focused on developing new communication software named FAODEL that enables applications in a workflow to exchange data more efficiently.

More Details

TYPE Other Report YEAR 2018

DOI OSTI

SPARC: Demonstrate burst-buffer-based checkpoint/restart on ATS-1

Oldfield, Ron A.; Ulmer, Craig D.; Widener, Patrick W.; Ward, Harry L.

Recent high-performance computing (HPC) platforms such as the Trinity Advanced Technology System (ATS-1) feature burst buffer resources that can have a dramatic impact on an application’s I/O performance. While these non-volatile memory (NVM) resources provide a new tier in the storage hierarchy, developers must find the right way to incorporate the technology into their applications in order to reap the benefits. Similar to other laboratories, Sandia is actively investigating ways in which these resources can be incorporated into our existing libraries and workflows without burdening our application developers with excessive, platform-specific details. This FY18Q1 milestone summaries our progress in adapting the Sandia Parallel Aerodynamics and Reentry Code (SPARC) in Sandia’s ATDM program to leverage Trinity’s burst buffers for checkpoint/restart operations. We investigated four different approaches with varying tradeoffs in this work: (1) simply updating job script to use stage-in/stage out burst buffer directives, (2) modifying SPARC to use LANL’s hierarchical I/O (HIO) library to store/retrieve checkpoints, (3) updating Sandia’s IOSS library to incorporate the burst buffer in all meshing I/O operations, and (4) modifying SPARC to use our Kelpie distributed memory library to store/retrieve checkpoints. Team members were successful in generating initial implementation for all four approaches, but were unable to obtain performance numbers in time for this report (reasons: initial problem sizes were not large enough to stress I/O, and SPARC refactor will require changes to our code). When we presented our work to the SPARC team, they expressed the most interest in the second and third approaches. The HIO work was favored because it is lightweight, unobtrusive, and should be portable to ATS-2. The IOSS work is seen as a long-term solution, and is favored because all I/O work (including checkpoints) can be deferred to a single library.

More Details

TYPE Other Report YEAR 2018

DOI DOI OSTI OSTI

Final Review of FY17 ASC CSSE L2 Milestone #6018 entitled "Analyzing Power Usage Characteristics of Workloads Running on Trinity"

Hoekstra, Robert J.; Hammond, Simon D.; Hemmert, Karl S.; Gentile, Ann C.; Oldfield, Ron A.; Lang, Mike; Martin, Steve

The presentation documented the technical approach of the team and summary of the results with sufficient detail to demonstrate both the value and the completion of the milestone. A separate SAND report was also generated with more detail to supplement the presentation.

More Details

TYPE Other Report YEAR 2017

DOI OSTI

ATDM Data Warehouse: Data Management Services for Exascale Computing

Ulmer, Craig D.; Oldfield, Ron A.; Kordenbrock, Todd H.; Levy, Scott L.; Lofstead, Gerald F.; Mukherjee, Shyamali M.; Templet, Gary J.; Widener, Patrick W.

Abstract not provided.

More Details

TYPE Presentation YEAR 2017

OSTI

Reducing I/O variability using dynamic I/O path characterization in petascale storage systems

Journal of Supercomputing

Son, Seung W.; Sehrish, Saba; Liao, Wei K.; Oldfield, Ron A.; Choudhary, Alok

In petascale systems with a million CPU cores, scalable and consistent I/O performance is becoming increasingly difficult to sustain mainly because of I/O variability. The I/O variability is caused by concurrently running processes/jobs competing for I/O or a RAID rebuild when a disk drive fails. We present a mechanism that stripes across a selected subset of I/O nodes with the lightest workload at runtime to achieve the highest I/O bandwidth available in the system. In this paper, we propose a probing mechanism to enable application-level dynamic file striping to mitigate I/O variability. We implement the proposed mechanism in the high-level I/O library that enables memory-to-file data layout transformation and allows transparent file partitioning using subfiling. Subfiling is a technique that partitions data into a set of files of smaller size and manages file access to them, making data to be treated as a single, normal file to users. We demonstrate that our bandwidth probing mechanism can successfully identify temporally slower I/O nodes without noticeable runtime overhead. Experimental results on NERSC’s systems also show that our approach isolates I/O variability effectively on shared systems and improves overall collective I/O performance with less variation.

More Details

TYPE Journal Article YEAR 2017

DOI OSTI Scopus

ATDM Data Warehouse

Ulmer, Craig D.; Kordenbrock, Todd H.; Levy, Scott L.; Lofstead, Gerald F.; Mukherjee, Shyamali M.; Sjaardema, Gregory D.; Templet, Gary J.; Widener, Patrick W.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI OSTI

Demonstrate and Evaluate Advanced Analysis Visualization and I/O Capabilities for the SIERRA Toolkit

Oldfield, Ron A.; Crossno, Patricia J.; Otahal, Thomas J.; Fabian, Nathan D.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

ATDM Data Management FY2015: Data Warehouse Progress Report

Ulmer, Craig D.; Fabian, Nathan D.; Kordenbrock, Todd H.; Mukherjee, Shyamali M.; Oldfield, Ron A.; Templet, Gary J.

The Advanced Technology Development and Mitigation (ATDM) program at Sandia National Laboratories is a new effort to build next-generation simulation codes that will map well to upcoming exascale computing platforms. Rather than follow traditional single- program, multiple data (SPMD) programming techniques, ATDM is developing applications in an asynchronous many task (AMT) form that describes work as a graph of tasks that have data dependencies. The data management team is focused on developing a data warehouse for ATDM that will enable tasks to store and exchange data objects efficiently. This report summarizes the data management teams efforts during FY15, and documents: (1) an initial API and implementation for the data warehouses key/value store, (2) API requirements for use with ATDMs runtime, (3) initial requirements for storing ATDM-specific data, and (4) the current organization of software components that will be used by the data warehouse.

More Details

TYPE SAND Report YEAR 2015

DOI OSTI

Delta: Data Reduction for Integrated Application Workflows

Lofstead, Gerald F.; Jean-Baptiste, Gregory; Oldfield, Ron A.

Integrated Application Workflows (IAWs) run multiple simulation workflow components con- currently on an HPC resource connecting these components using compute area resources and compensating for any performance or data processing rate mismatches. These IAWs require high frequency and high volume data transfers between compute nodes and staging area nodes during the lifetime of a large parallel computation. The available network band- width between the two areas may not be enough to efficiently support the data movement. As the processing power available to compute resources increases, the requirements for this data transfer will become more difficult to satisfy and perhaps will not be satisfiable at all since network capabilities are not expanding at a comparable rate. Furthermore, energy consumption in HPC environments is expected to grow by an order of magnitude as exas- cale systems become a reality. The energy cost of moving large amounts of data frequently will contribute to this issue. It is necessary to reduce the volume of data without reducing the quality of data when it is being processed and analyzed. Delta resolves the issue by addressing the lifetime data transfer operations. Delta removes subsequent identical copies of already transmitted data during transfers and restores those copies once the data has reached the destination. Delta is able to identify duplicated information and determine the most space efficient way to represent it. Initial tests show about 50% reduction in data movement while maintaining the same data quality and transmission frequency.

More Details

TYPE SAND Report YEAR 2015

DOI OSTI

Storage Systems and Input/Output to Support Extreme Scale Science

Ross, Robert; Grider, Gary; Felix, Evan; Gary, Mark; Klasky, Scott; Oldfield, Ron A.; Shipman, Galen; Wu, John

Storage systems are a foundational component of computational, experimental, and observational science today. The success of Department of Energy (DOE) activities in these areas is inextricably tied to the usability, performance, and reliability of storage and input/output (I/O) technologies.

More Details

TYPE Other Report YEAR 2015

DOI OSTI

Enabling Capabilities for Intergrated Application Workflows

Lofstead, Gerald F.; Curry, Matthew L.; Fabian, Nathan D.; Kordenbrock, Todd H.; Mukherjee, Shyamali M.; Oldfield, Ron A.; Sjaardema, Gregory D.; Templet, Gary J.; Ulmer, Craig D.; Widener, Patrick W.

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Formal Metrics for Large-Scale Parallel Performance

Moreland, Kenneth D.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

You Got Visualization in my Simulation: Integration of Simulation Analysis and Visualization at Extreme Scales

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Extreme-Scale Challenges for Integrating Simulation Analysis and Visualization

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Enabling Capabilities for Analysis at Extreme Scale

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Formal metrics for large-scale parallel performance

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Moreland, Kenneth D.; Oldfield, Ron A.

Performance measurement of parallel algorithms is well studied and well understood. However, a flaw in traditional performance metrics is that they rely on comparisons to serial performance with the same input. This comparison is convenient for theoretical complexity analysis but impossible to perform in large-scale empirical studies with data sizes far too large to run on a single serial computer. Consequently, scaling studies currently rely on ad hoc methods that, although effective, have no grounded mathematical models. In this position paper we advocate using a rate-based model that has a concrete meaning relative to speedup and efficiency and that can be used to unify strong and weak scaling studies.

More Details

TYPE Conference Poster YEAR 2015

OSTI Scopus

Addressing Scientific I/O Needs for Current and Future Architectures

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

Delta: Data Reduction for Integrated Application Workflows}

Jean-Baptiste, Gregory; Lofstead, Gerald F.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2014

OSTI

Formal Metrics for Large-Scale Parallel Performance

Moreland, Kenneth D.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2014

DOI OSTI

Evaluation of Methods to Integrate Analysis into a Large-Scale Shock Physics Code

Oldfield, Ron A.; Fabian, Nathan D.; Moreland, Kenneth D.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI DOI

Approaching Production In Situ Visualization for Extreme Scale Analysis (SIAM PP Minisymposium)

Moreland, Kenneth D.; Oldfield, Ron A.; Fabian, Nathan D.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

Data Co-Processing for Extreme Scale Analysis

Fabian, Nathan D.; Oldfield, Ron A.; Moreland, Kenneth D.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

Investigating the integration of supercomputers and data-Warehouse appliances

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Oldfield, Ron A.; Davidson, George; Ulmer, Craig D.; Wilson, Andrew T.

Two decades of experience with massively parallel supercomputing has given insight into the problem domains where these architectures are cost effective. Likewise experience with database machines and more recently massively parallel database appliances has shown where these architectures are valuable. Combining both architectures to simultaneously solve problems has received much less attention. In this paper, we describe a motivating application for economic modeling that requires both HPC and database capabilities. Then we discuss hardware and software integration issues related to a direct integration of a Cray XT supercomputer and a Netezza database appliance. © 2014 Springer-Verlag Berlin Heidelberg.

More Details

TYPE Conference YEAR 2014

Scopus OSTI

Productivity Challenges for Code Coupling in an HPC Environment

Oldfield, Ron A.; Moreland, Kenneth D.; Fabian, Nathan D.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI OSTI

Fourier-Assisted Modeling of Hard Disk Drive Access Times

Oldfield, Ron A.; Ward, Harry L.; Widener, Patrick W.; Kroeger, Thomas M.; Curry, Matthew L.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Productivity Challenges for Integrating Simulation and Analysis

Oldfield, Ron A.; Fabian, Nathan D.; Moreland, Kenneth D.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Dynamic File Striping and Data Layout Transformation on Parallel System with Fluctuating I/O Workload

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Extreme-Scale CoProcessing: An Evaluation of In Situ and In Transit Analysis

Oldfield, Ron A.; Fabian, Nathan D.; Moreland, Kenneth D.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Hobbes: Global Information Bus

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2013

OSTI

Behavior-Based Simulation of Storage Devices

Ward, Harry L.; Oldfield, Ron A.; Widener, Patrick W.; Curry, Matthew L.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

D2T: Doubly Distributed Transactions for High Performance and Distributed Computing

Lofstead, Gerald F.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2013

OSTI

Trilinos I/O Support: Update

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2013

OSTI

Benefits and Challenges of Integration Simulation and Analysis

Oldfield, Ron A.; Moreland, Kenneth D.; Fabian, Nathan D.; Rogers, David R.

Abstract not provided.

More Details

TYPE Presentation YEAR 2013

OSTI

Composition and Virtualization as the Foundations of an Extreme-scale OS/R

Brightwell, Ronald B.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Hey You got your DWA in my HPC: Experiences Integrating Analytics and HPC

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Data co-processing for extreme scale analysis level II ASC milestone (4745)

Rogers, David R.; Moreland, Kenneth D.; Oldfield, Ron A.; Fabian, Nathan D.

Exascale supercomputing will embody many revolutionary changes in the hardware and software of high-performance computing. A particularly pressing issue is gaining insight into the science behind the exascale computations. Power and I/O speed con- straints will fundamentally change current visualization and analysis work ows. A traditional post-processing work ow involves storing simulation results to disk and later retrieving them for visualization and data analysis. However, at exascale, scien- tists and analysts will need a range of options for moving data to persistent storage, as the current o ine or post-processing pipelines will not be able to capture the data necessary for data analysis of these extreme scale simulations. This Milestone explores two alternate work ows, characterized as in situ and in transit, and compares them. We nd each to have its own merits and faults, and we provide information to help pick the best option for a particular use.

More Details

TYPE SAND Report YEAR 2013

DOI OSTI

Integrating Analysis and Computation: Experiences with In-Situ and In-Transit Approaches

Oldfield, Ron A.; Rogers, David R.; Moreland, Kenneth D.; Fabian, Nathan D.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Scalability and Capability Improvements for In-Situ and In-Transit Analysis

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2013

OSTI

CSSE L2 Milestone - Data Co-Processing for Extreme Scale Analysis - Midterm Review

Moreland, Kenneth D.; Rogers, David R.; Fabian, Nathan D.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2012

OSTI

The Structural Simulation Toolkit

Proposed for publication in SIGMETRICS Performance Evaluation Review.

Rodrigues, Arun; Hemmert, Karl S.; Barrett, Brian B.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Journal Article YEAR 2012

OSTI

Experiences Applying Data Staging Technology in Unconventional Ways

Lofstead, Gerald F.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

Hello ADIOS: The Challenges and Lessons of Developing Leadership Class I/O Frameworks

Proposed for publication in Concurrency and Computation: Practice and Experience.

Lofstead, Gerald F.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Journal Article YEAR 2012

OSTI

Integrating Analysis and Computation with Trios Services

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2012

OSTI

In-Situ Visualization with Catalyst

Fabian, Nathan D.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

D2T: Doubly Distributed Transactions for High Performance and Distributed Computing

Lofstead, Gerald F.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

D2T: Doubly Distributed Transactions for High Performance and Distributed Computing

Lofstead, Gerald F.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

D2T: Doubly Distributed Transactions for High Performance and Distributed Computing

Lofstead, Gerald F.; Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

Developing Integrated Data Services for Cray Systems with a Gemini Interconnect

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

Developing Integrated Data Services for Cray Systems with a Gemini Interconnect

Oldfield, Ron A.; Lofstead, Gerald F.

Abstract not provided.

More Details

TYPE Conference YEAR 2012

OSTI

Demonstration of a Legacy Application's Path to Exascale - ASC L2 Milestone 4467

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2012

OSTI

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin T.T.; Rodrigues, Arun; Barrett, Richard F.; Thompson, David C.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467-Demonstration of a Legacy Application's Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE RD in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight LWKs, virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications-Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.

More Details

TYPE SAND Report YEAR 2012

DOI OSTI

Extreme-Scale Analytics: Controlling and Provisioning Online Analytics for Dynamic End User Requirements

Lofstead, Gerald F.; Oldfield, Ron A.; Moreland, Kenneth D.

Abstract not provided.

More Details

TYPE Presentation YEAR 2012

OSTI

Valuing and Managing Data Based on Embodied Energy

Lofstead, Gerald F.; Oldfield, Ron A.; Curry, Matthew L.; Laros, James H.

Abstract not provided.

More Details

TYPE Presentation YEAR 2012

OSTI

Supporting Integrated Data Services: A New Challenge for HPC Computing

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2012

OSTI

Extending scalability of collective IO through nessie and staging

PDSW'11 - Proceedings of the 6th Parallel Data Storage Workshop, Co-located with SC'11

Lofstead, Jay; Oldfield, Ron A.; Kordenbrock, Todd; Reiss, Charles

The increasing fidelity of scientific simulations as they scale towards exascale sizes is straining the proven IO techniques championed throughout terascale computing. Chief among the successful IO techniques is the idea of collective IO where processes coordinate and exchange data prior to writing to storage in an effort to reduce the number of small, independent IO operations. As well as collective IO works for efficiently creating a data set in the canonical order, 3-D domain decompositions prove troublesome due to the amount of data exchanged prior to writing to storage. When each process has a tiny piece of a 3-D simulation space rather than a complete 'pencil' or 'plane', 2-D or 1-D domain decompositions respectively, the communication overhead to rearrange the data can dwarf the time spent actually writing to storage [27]. Our approach seeks to transparently increase scalability and performance while maintaining both the IO routines in the application and the final data format in the storage system. Accomplishing this leverages both the Nessie [23] RPC framework and a staging area with staging services. Through these tools, we employ a variety of data processing operations prior to invoking the native API to write data to storage yielding as much as a 3X performance improvement over the native calls. © 2011 ACM.

More Details

TYPE Conference YEAR 2011

Scopus OSTI

Data Services and Trilinos: A Brief Introduction to Trios Data Services

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2011

OSTI

Extending Scalability of Collective IO Through Nessie and Staging

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference YEAR 2011

OSTI

Examples of In Transit Visualization (Presentation Slides)

Moreland, Kenneth D.; Oldfield, Ron A.; Fabian, Nathan D.

Abstract not provided.

More Details

TYPE Presentation YEAR 2011

OSTI

Resilient data staging through MxN distributed transactions

Lofstead, Gerald F.; Oldfield, Ron A.

Scientific computing-driven discoveries are frequently driven from workflows that use persistent storage as a staging area for data between operations. With the bad and progressively worse bandwidth vs. data size issues as we continue towards exascale, eliminating persistent storage through techniques like data staging will both enable these workflows to continue online, but also enable more interactive workflows reducing the time to scientific discoveries. Data staging has shown to be an effective way for applications running on high-end computing platforms to offload expensive I/O operations and to manage the tremendous amounts of data they produce. This data staging approach, however, lacks the ACID style guarantees traditional straight-to-disk methods provide. Distributed transactions are a proven way to add ACID properties to data movements, however distributed transactions follow 1xN data movement semantics, where our highly parallel HPC environments employ MxN data movement semantics. In this paper we present a novel protocol that extends distributed transaction terminology to include MxN semantics which allows our data staging areas to benefit from ACID properties. We show that with our protocol we can provide resilient data staging with a limited performance penalty over current data staging implementations.

More Details

TYPE SAND Report YEAR 2011

OSTI DOI

Data Services and Trilinos: Addressing I/O Challenges for Exascale Applications

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference YEAR 2011

OSTI

Examples of In Transit Visualization

Moreland, Kenneth D.; Oldfield, Ron A.; Fabian, Nathan D.

Abstract not provided.

More Details

TYPE Conference YEAR 2011

OSTI

Keeping checkpoint/restart viable for exascale systems

Ferreira, Kurt; Oldfield, Ron A.; Stearley, Jon S.; Laros, James H.; Pedretti, Kevin T.T.; Brightwell, Ronald B.

Next-generation exascale systems, those capable of performing a quintillion (10{sup 18}) operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults like checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoint) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches on a wide range of parameters. These results, which cover of number of high-performance computing capability workloads, different failure distributions, hardware mean time to failures, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.

More Details

TYPE SAND Report YEAR 2011

DOI OSTI

Addressing Scalable I/O Challenges for Exascale

Oldfield, Ron A.; Ferreira, Kurt; Ward, Harry L.; Curry, Matthew L.

Abstract not provided.

More Details

TYPE Conference YEAR 2011

OSTI

rMPI : increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Oldfield, Ron A.; Stearley, Jon S.; Laros, James H.; Pedretti, Kevin T.T.; Brightwell, Ronald B.

As High-End Computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are unsuitable at these scale due to excessive overheads predicted to more than double an applications time to solution. Redundant computation, long used in distributed and mission critical systems, has been suggested as an alternative to checkpoint-restart on its own. In this paper we describe the rMPI library which enables portable and transparent redundant computation for MPI applications. We detail the design of the library as well as two replica consistency protocols, outline the overheads of this library at scale on a number of real-world applications, and finally outline the significant increase in an applications time to solution at extreme scale as well as show the scenarios in which redundant computation makes sense.

More Details

TYPE SAND Report YEAR 2011

DOI OSTI

Evaluating the Viability of Process Replication Reliability for Exascale Systems

Ferreira, Kurt; Stearley, Jon S.; Laros, James H.; Oldfield, Ron A.; Pedretti, Kevin P.; Brightwell, Ronald B.; Bridges, Patrick G.

Abstract not provided.

More Details

TYPE Conference YEAR 2011

OSTI

Final report for %22High performance computing for advanced national electric power grid modeling and integration of solar generation resources%22, LDRD Project No. 149016

Schoenwald, David A.; Richardson, Bryan T.; Riehm, Andrew C.; Wolfenbarger, Paul W.; Adams, Brian M.; Reno, Matthew J.; Hansen, Clifford H.; Oldfield, Ron A.; Stamp, Jason E.; Stein, Joshua S.; Hoekstra, Robert J.; Nelson, Jeffrey S.; Munoz-Ramos, Karina M.; McLendon, William C.; Russo, Thomas V.; Phillips, Laurence R.

Design and operation of the electric power grid (EPG) relies heavily on computational models. High-fidelity, full-order models are used to study transient phenomena on only a small part of the network. Reduced-order dynamic and power flow models are used when analysis involving thousands of nodes are required due to the computational demands when simulating large numbers of nodes. The level of complexity of the future EPG will dramatically increase due to large-scale deployment of variable renewable generation, active load and distributed generation resources, adaptive protection and control systems, and price-responsive demand. High-fidelity modeling of this future grid will require significant advances in coupled, multi-scale tools and their use on high performance computing (HPC) platforms. This LDRD report demonstrates SNL's capability to apply HPC resources to these 3 tasks: (1) High-fidelity, large-scale modeling of power system dynamics; (2) Statistical assessment of grid security via Monte-Carlo simulations of cyber attacks; and (3) Development of models to predict variability of solar resources at locations where little or no ground-based measurements are available.

More Details

TYPE SAND Report YEAR 2011

DOI OSTI

Six Degrees of Scientific Data: Reading Patterns for Extreme Scale Science IO

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Conference YEAR 2011

OSTI

Redundant computing for exascale systems

Ferreira, Kurt; Stearley, Jon S.; Oldfield, Ron A.; Laros, James H.; Pedretti, Kevin T.T.; Brightwell, Ronald B.

Exascale systems will have hundred thousands of compute nodes and millions of components which increases the likelihood of faults. Today, applications use checkpoint/restart to recover from these faults. Even under ideal conditions, applications running on more than 50,000 nodes will spend more than half of their total running time saving checkpoints, restarting, and redoing work that was lost. Redundant computing is a method that allows an application to continue working even when failures occur. Instead of each failure causing an application interrupt, multiple failures can be absorbed by the application until redundancy is exhausted. In this paper we present a method to analyze the benefits of redundant computing, present simulation results of the cost, and compare it to other proposed methods for fault resilience.

More Details

TYPE SAND Report YEAR 2010

DOI OSTI

Trilinos I/O Support: Capability Update

Oldfield, Ron A.; Sjaardema, Gregory D.

Abstract not provided.

More Details

TYPE Presentation YEAR 2010

OSTI

Trilinos I/O Support: Proposed I/O Software for FY11

Oldfield, Ron A.; Sjaardema, Gregory D.

Abstract not provided.

More Details

TYPE Presentation YEAR 2010

OSTI

Managing variability in the IO performance of petascale storage systems

Oldfield, Ron A.

Significant challenges exist for achieving peak or even consistent levels of performance when using IO systems at scale. They stem from sharing IO system resources across the processes of single large-scale applications and/or multiple simultaneous programs causing internal and external interference, which in turn, causes substantial reductions in IO performance. This paper presents interference effects measurements for two different file systems at multiple supercomputing sites. These measurements motivate developing a 'managed' IO approach using adaptive algorithms varying the IO system workload based on current levels and use areas. An implementation of these methods deployed for the shared, general scratch storage system on Oak Ridge National Laboratory machines achieves higher overall performance and less variability in both a typical usage environment and with artificially introduced levels of 'noise'. The latter serving to clearly delineate and illustrate potential problems arising from shared system usage and the advantages derived from actively managing it.

More Details

TYPE Conference YEAR 2010

OSTI

High Performance Computing and Informatics

Oldfield, Ron A.; Bader, Brett W.

Abstract not provided.

More Details

TYPE Presentation YEAR 2010

OSTI

Hey! You Got Your DWA in My HPC : Experiences Integrating Netezza and Cray XT3

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2010

OSTI

Application-Level Data Services

Oldfield, Ron A.

Abstract not provided.

More Details

TYPE Presentation YEAR 2010

OSTI

rMPI : increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron A.; Laros, James H.; Pedretti, Kevin P.; Stearley, Jon S.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

On the path to exascale

International Journal of Distributed Systems and Technologies

Alvin, Kenneth F.; Barrett, Brian B.; Brightwell, Ronald B.; Dosanjh, Sudip S.; Geist, Al; Hemmert, Karl S.; Heroux, Michael; Kothe, Doug; Murphy, Richard C.; Nichols, Jeff; Oldfield, Ron A.; Rodrigues, Arun; Vetter, Jeffrey S.

There is considerable interest in achieving a 1000 fold increase in supercomputing power in the next decade, but the challenges are formidable. In this paper, the authors discuss some of the driving science and security applications that require Exascale computing (a million, trillion operations per second). Key architectural challenges include power, memory, interconnection networks and resilience. The paper summarizes ongoing research aimed at overcoming these hurdles. Topics of interest are architecture aware and scalable algorithms, system simulation, 3D integration, new approaches to system-directed resilience and new benchmarks. Although significant progress is being made, a broader international program is needed.

More Details

TYPE Journal Article YEAR 2010

Scopus OSTI

System Software Research for Extreme-Scale Computing

Oldfield, Ron A.; Brightwell, Ronald B.; Pedretti, Kevin P.; Riesen, Rolf; Ferreira, Kurt; Kelly, Suzanne M.; Laros, James H.

Abstract not provided.

More Details

TYPE Presentation YEAR 2010

OSTI

Advanced I/O for large-scale scientific applications

Oldfield, Ron A.

As scientific simulations scale to use petascale machines and beyond, the data volumes generated pose a dual problem. First, with increasing machine sizes, the careful tuning of IO routines becomes more and more important to keep the time spent in IO acceptable. It is not uncommon, for instance, to have 20% of an application's runtime spent performing IO in a 'tuned' system. Careful management of the IO routines can move that to 5% or even less in some cases. Second, the data volumes are so large, on the order of 10s to 100s of TB, that trying to discover the scientifically valid contributions requires assistance at runtime to both organize and annotate the data. Waiting for offline processing is not feasible due both to the impact on the IO system and the time required. To reduce this load and improve the ability of scientists to use the large amounts of data being produced, new techniques for data management are required. First, there is a need for techniques for efficient movement of data from the compute space to storage. These techniques should understand the underlying system infrastructure and adapt to changing system conditions. Technologies include aggregation networks, data staging nodes for a closer parity to the IO subsystem, and autonomic IO routines that can detect system bottlenecks and choose different approaches, such as splitting the output into multiple targets, staggering output processes. Such methods must be end-to-end, meaning that even with properly managed asynchronous techniques, it is still essential to properly manage the later synchronous interaction with the storage system to maintain acceptable performance. Second, for the data being generated, annotations and other metadata must be incorporated to help the scientist understand output data for the simulation run as a whole, to select data and data features without concern for what files or other storage technologies were employed. All of these features should be attained while maintaining a simple deployment for the science code and eliminating the need for allocation of additional computational resources.

More Details

TYPE SAND Report YEAR 2010

DOI OSTI

Lightweight storage and overlay networks for fault tolerance

Oldfield, Ron A.

The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands to millions of processors, In such environments, it is critical to have fault-tolerance mechanisms, including checkpoint/restart, that scale with the size of applications and the percentage of the system on which the applications execute. For application-driven, periodic checkpoint operations, the state-of-the-art does not provide a scalable solution. For example, on today's massive-scale systems that execute applications which consume most of the memory of the employed compute nodes, checkpoint operations generate I/O that consumes nearly 80% of the total I/O usage. Motivated by this observation, this project aims to improve I/O performance for application-directed checkpoints through the use of lightweight storage architectures and overlay networks. Lightweight storage provide direct access to underlying storage devices. Overlay networks provide caching and processing capabilities in the compute-node fabric. The combination has potential to signifcantly reduce I/O overhead for large-scale applications. This report describes our combined efforts to model and understand overheads for application-directed checkpoints, as well as implementation and performance analysis of a checkpoint service that uses available compute nodes as a network cache for checkpoint operations.

More Details

TYPE SAND Report YEAR 2010

DOI OSTI

Increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Oldfield, Ron A.; Stearley, Jon S.; Laros, James H.; Pedretti, Kevin T.T.; Brightwell, Ronald B.

Petaflops systems will have tens to hundreds of thousands of compute nodes which increases the likelihood of faults. Applications use checkpoint/restart to recover from these faults, but even under ideal conditions, applications running on more than 30,000 nodes will likely spend more than half of their total run time saving checkpoints, restarting, and redoing work that was lost. We created a library that performs redundant computations on additional nodes allocated to the application. An active node and its redundant partner form a node bundle which will only fail, and cause an application restart, when both nodes in the bundle fail. The goal of this library is to learn whether this can be done entirely at the user level, what requirements this library places on a Reliability, Availability, and Serviceability (RAS) system, and what its impact on performance and run time is. We find that our redundant MPI layer library imposes a relatively modest performance penalty for applications, but that it greatly reduces the number of applications interrupts. This reduction in interrupts leads to huge savings in restart and rework time. For large-scale applications the savings compensate for the performance loss and the additional nodes required for redundant computations.

More Details

TYPE SAND Report YEAR 2009

DOI OSTI

HPC application fault-tolerance using transparent redundant computation

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron A.; Brightwell, Ronald B.; Laros, James H.; Pedretti, Kevin P.

As the core count of HPC machines continue to grow in size, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability which uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.

More Details

TYPE Conference YEAR 2009

OSTI