Publications Search

Assessment of Data-Management Infrastructure Needs for Production Use of Advanced Machine Learning and Artificial Intelligence: Tri-Lab Level II Milestone (8554)

Oldfield, Ron; Allan, Benjamin A.; Doutriaux, Charles; Lewis, Katherine; Ahrens, James; Sims, Benjamin; Sweeney, Christine; Banesh, Divya; Wofford, Quincy

A robust data-management infrastructure is a key enabler for National Security Enterprise (NSE) capabilities in artificial intelligence and machine learning. This document describes efforts from a team of researchers at Sandia National Laboratories, Los Alamos National Laboratory, and Livermore National Laboratory to complete ASC Level II milestone #8854 “Assessment of Data-Management Infrastructure Needs for Production use of Advanced Machine learning and Artificial Intelligence.”

More Details

TYPE SAND Report YEAR 2023

DOI OSTI

AI for Security: NNSA's Role in AI4SES

Oldfield, Ron

Abstract not provided.

More Details

TYPE Conference Presentation YEAR 2022

DOI OSTI

The ASC Advanced Machine Learning Initiative at Sandia National Laboratories: FY21 Accomplishments and FY22 Plans

Oldfield, Ron; Kramer, S.L.B.; Rushdi, Ahmad; Foulk, James W.; Emery, John M.; Kuberry, Paul; Ray, Jaideep; Ackerman, Sarah; Cyr, Eric C.; Saavedra, Gary; Hughes, Clayton; Cardwell, Suma G.; Smith, J.D.

Abstract not provided.

More Details

TYPE Presentation YEAR 2021

OSTI

Memo regarding the Final Review of FY21 ASC L2 Milestone 7840: Neural Mini-Apps for Future Heterogeneous HPC Systems

Oldfield, Ron; Plimpton, Steven J.; Foulk, James W.; Poliakoff, David; Sornborger, Andrew

The final review for the FY21 Advanced Simulation and Computing (ASC) Computational Systems and Software Environments (CSSE) L2 Milestone #7840 was conducted on August 25th, 2021 at Sandia National Laboratories in Albuquerque, New Mexico. The review committee/panel unanimously agreed that the milestone has been successfully completed, exceeding expectations on several of the key deliverables.

More Details

TYPE Other Report YEAR 2021

DOI OSTI

The ASC Advanced Machine Learning Initiative at Sandia National Laboratories

Oldfield, Ron

Abstract not provided.

More Details

TYPE Presentation YEAR 2021

OSTI

SNL ATDM Software Technologies. ECP Capability Assessment Report for Software Technologies

Oldfield, Ron; Wolf, Michael; Brightwell, Ronald B.

The Exascale Computing Project (ECP) Capability Assessment Report for Software Technologies at Sandia National Laboratories is provided. The projects are now aggregated to include Kokkos, Kokkos Kernels, VTK-m Operating Systems, and On-Node Runtime efforts. Key challenges and solution strategies are presented for each.

More Details

TYPE Other Report YEAR 2020

DOI OSTI

FY20 CSSE L2 Milestone 7186

Templet Jr., Gary J.; Glickman, Matthew R.; Kordenbrock, Todd; Levy, Scott L.N.; Lofstead, Gerald F.; Mauldin, Jeff; Otahal, Thomas J.; Ulmer, Craig; Widener, Patrick; Oldfield, Ron

Abstract not provided.

More Details

TYPE Presentation YEAR 2020

OSTI

Data Services for Visualization and Analysis - ASC Level II Milestone (7186)

Templet Jr., Gary J.; Glickman, Matthew R.; Kordenbrock, Todd; Levy, Scott L.N.; Lofstead, Gerald F.; Mauldin, Jeff; Otahal, Thomas J.; Ulmer, Craig; Widener, Patrick; Oldfield, Ron

A new in transit Data Service is presented and compared to the traditional file-based workflow and the newly refactored in situ Catalyst workflow. Each workflow is enabled by the IOSS mesh interface equipped with data management layers for Exodus and CGNS (file-based), Catalyst (in situ), and FAODEL (in transit). FAODEL is a distributed object store that can transmit data across MPI allocations. Catalyst is a Para View-based visualization capability developed as part of the CSSE Data Services effort. The workflows considered here take SPARC data into Catalyst for visualization post-processing. Although still in unoptimized form, we show that the in transit approach is a viable alternative to file-based and in situ workflows and offers several advantages to both simulation and post-processing developers. Since IOSS is a mature interface with wide adoption across Sandia and externally, each workflow can be reconfigured to use different simulations that generate mesh data and post-processing tools that consume it.

More Details

TYPE SAND Report YEAR 2020

DOI OSTI

September 2019 ECP ST Project Review

Trujillo, Gabrielle; Turner, D.Z.; Brightwell, Ronald B.; Oldfield, Ron; Clay, Robert L.

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

Data Science and Computer Science Research at Sandia National Laboratories

Oldfield, Ron

Abstract not provided.

More Details

TYPE Presentation YEAR 2019

OSTI

ASCR Workshop on In Situ Data Management

Peterka, Tom; Bard, Deborah; Bennett, Janine C.; Bethel, E.W.; Oldfield, Ron; Pouchard, Line; Sweeney, Christine; Wolf, Matthew

In January 2019, the U.S. Department of Energy, Office of Science program in Advanced Scientific Computing Research, convened a workshop to identify priority research directions for in situ data management (ISDM). The workshop defined ISDM as the practices, capabilities, and procedures to control the organization of data and enable the coordination and communication among heterogeneous tasks, executing simultaneously in a high-performance computing system, cooperating toward a common objective. The workshop revealed two primary, interdependent motivations for processing and managing data in situ. The first motivation is that the in situ methodology enables scientific discovery from a broad range of data sources over a wide scale of computing platforms: leadership-class systems, clusters, clouds, workstations, and embedded devices at the edge. The successful development of ISDM capabilities will benefit real-time decision-making, design optimization, and data-driven scientific discovery. The second motivation is the need to decrease data volumes. ISDM can make critical contributions to managing large data volumes from computations and experiments to minimize data movement, save storage space, and boost resource efficiency, often while simultaneously increasing scientific precision.

More Details

TYPE Other Report YEAR 2019

DOI OSTI

SNL Data and Visualization: ML Projects at Sandia

Owen, Steven J.; Siefert, Christopher; Vineyard, Craig M.; Oldfield, Ron

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Sandia Day @ Georgia Tech Technical Breakout Sessions

Weaver, Karla; Oldfield, Ron; Fang, H.E.; Gehl, Michael; Muller, Richard P.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Data Science and Computer Science R&D at SNL

Oldfield, Ron

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

SNL ATDM: In-situ Compression with ParaView/TuckerMPI

Kolla, Hemanth; Oldfield, Ron; Otahal, Thomas J.; Baker, Gavin M.; Mauldin, Jeffrey A.; Kolda, Tamara G.; Moreland, Kenneth D.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

SNL ATDM Data and Visualization: IOSS and FAODEL

Oldfield, Ron; Ulmer, Craig; Sjaardema, Gregory D.

The SNL ATDM Data and Visualization project is developing data management software to improve how applications store and exchange large datasets efficiently on Exascale platforms. The data portion of this project is composed of two related efforts: (1) production work focused on improving Sandia's IOSS library for mesh datasets and (2) research work focused on developing new communication software named FAODEL that enables applications in a workflow to exchange data more efficiently.

More Details

TYPE Other Report YEAR 2018

DOI OSTI

SPARC: Demonstrate burst-buffer-based checkpoint/restart on ATS-1

Oldfield, Ron; Ulmer, Craig; Widener, Patrick; Ward, Harry L.

Recent high-performance computing (HPC) platforms such as the Trinity Advanced Technology System (ATS-1) feature burst buffer resources that can have a dramatic impact on an application’s I/O performance. While these non-volatile memory (NVM) resources provide a new tier in the storage hierarchy, developers must find the right way to incorporate the technology into their applications in order to reap the benefits. Similar to other laboratories, Sandia is actively investigating ways in which these resources can be incorporated into our existing libraries and workflows without burdening our application developers with excessive, platform-specific details. This FY18Q1 milestone summaries our progress in adapting the Sandia Parallel Aerodynamics and Reentry Code (SPARC) in Sandia’s ATDM program to leverage Trinity’s burst buffers for checkpoint/restart operations. We investigated four different approaches with varying tradeoffs in this work: (1) simply updating job script to use stage-in/stage out burst buffer directives, (2) modifying SPARC to use LANL’s hierarchical I/O (HIO) library to store/retrieve checkpoints, (3) updating Sandia’s IOSS library to incorporate the burst buffer in all meshing I/O operations, and (4) modifying SPARC to use our Kelpie distributed memory library to store/retrieve checkpoints. Team members were successful in generating initial implementation for all four approaches, but were unable to obtain performance numbers in time for this report (reasons: initial problem sizes were not large enough to stress I/O, and SPARC refactor will require changes to our code). When we presented our work to the SPARC team, they expressed the most interest in the second and third approaches. The HIO work was favored because it is lightweight, unobtrusive, and should be portable to ATS-2. The IOSS work is seen as a long-term solution, and is favored because all I/O work (including checkpoints) can be deferred to a single library.

More Details

TYPE Other Report YEAR 2017

DOI OSTI

Final Review of FY17 ASC CSSE L2 Milestone #6018 entitled "Analyzing Power Usage Characteristics of Workloads Running on Trinity"

Hoekstra, Robert J.; Hammond, Simon; Hemmert, Karl S.; Gentile, Ann C.; Oldfield, Ron; Lang, Mike; Martin, Steve

The presentation documented the technical approach of the team and summary of the results with sufficient detail to demonstrate both the value and the completion of the milestone. A separate SAND report was also generated with more detail to supplement the presentation.

More Details

TYPE Other Report YEAR 2017

DOI OSTI

ATDM Data Warehouse: Data Management Services for Exascale Computing

Ulmer, Craig; Oldfield, Ron; Kordenbrock, Todd; Levy, Scott L.N.; Lofstead, Gerald F.; Mukherjee, Shyamali; Templet, Gary J.; Widener, Patrick

Abstract not provided.

More Details

TYPE Presentation YEAR 2017

OSTI

Reducing I/O variability using dynamic I/O path characterization in petascale storage systems

Journal of Supercomputing

Son, Seung W.; Sehrish, Saba; Liao, Wei K.; Oldfield, Ron; Choudhary, Alok

In petascale systems with a million CPU cores, scalable and consistent I/O performance is becoming increasingly difficult to sustain mainly because of I/O variability. The I/O variability is caused by concurrently running processes/jobs competing for I/O or a RAID rebuild when a disk drive fails. We present a mechanism that stripes across a selected subset of I/O nodes with the lightest workload at runtime to achieve the highest I/O bandwidth available in the system. In this paper, we propose a probing mechanism to enable application-level dynamic file striping to mitigate I/O variability. We implement the proposed mechanism in the high-level I/O library that enables memory-to-file data layout transformation and allows transparent file partitioning using subfiling. Subfiling is a technique that partitions data into a set of files of smaller size and manages file access to them, making data to be treated as a single, normal file to users. We demonstrate that our bandwidth probing mechanism can successfully identify temporally slower I/O nodes without noticeable runtime overhead. Experimental results on NERSC’s systems also show that our approach isolates I/O variability effectively on shared systems and improves overall collective I/O performance with less variation.

More Details

TYPE Journal Article YEAR 2017

DOI OSTI Scopus

ATDM Data Warehouse

Ulmer, Craig; Kordenbrock, Todd; Levy, Scott L.N.; Lofstead, Gerald F.; Mukherjee, Shyamali; Sjaardema, Gregory D.; Templet, Gary J.; Widener, Patrick; Oldfield, Ron

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Demonstrate and Evaluate Advanced Analysis Visualization and I/O Capabilities for the SIERRA Toolkit

Oldfield, Ron; Crossno, Patricia J.; Otahal, Thomas J.; Fabian, Nathan

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

ATDM Data Management FY2015: Data Warehouse Progress Report

Ulmer, Craig; Fabian, Nathan; Kordenbrock, Todd; Mukherjee, Shyamali; Oldfield, Ron; Templet, Gary J.

The Advanced Technology Development and Mitigation (ATDM) program at Sandia National Laboratories is a new effort to build next-generation simulation codes that will map well to upcoming exascale computing platforms. Rather than follow traditional single- program, multiple data (SPMD) programming techniques, ATDM is developing applications in an asynchronous many task (AMT) form that describes work as a graph of tasks that have data dependencies. The data management team is focused on developing a data warehouse for ATDM that will enable tasks to store and exchange data objects efficiently. This report summarizes the data management teams efforts during FY15, and documents: (1) an initial API and implementation for the data warehouses key/value store, (2) API requirements for use with ATDMs runtime, (3) initial requirements for storing ATDM-specific data, and (4) the current organization of software components that will be used by the data warehouse.

More Details

TYPE SAND Report YEAR 2015

DOI OSTI

Delta: Data Reduction for Integrated Application Workflows

Lofstead, Gerald F.; Jean-Baptiste, Gregory; Oldfield, Ron

Integrated Application Workflows (IAWs) run multiple simulation workflow components concurrently on an HPC resource connecting these components using compute area resources and compensating for any performance or data processing rate mismatches. These IAWs require high frequency and high volume data transfers between compute nodes and staging area nodes during the lifetime of a large parallel computation. The available network band-width between the two areas may not be enough to efficiently support the data movement. As the processing power available to compute resources increases, the requirements for this data transfer will become more difficult to satisfy and perhaps will not be satisfiable at all since network capabilities are not expanding at a comparable rate. Furthermore, energy consumption in HPC environments is expected to grow by an order of magnitude as exascale systems become a reality. The energy cost of moving large amounts of data frequently will contribute to this issue. It is necessary to reduce the volume of data without reducing the quality of data when it is being processed and analyzed. Delta resolves the issue by addressing the lifetime data transfer operations. Delta removes subsequent identical copies of already transmitted data during transfers and restores those copies once the data has reached the destination. Delta is able to identify duplicated information and determine the most space efficient way to represent it. Initial tests show about 50% reduction in data movement while maintaining the same data quality and transmission frequency.

More Details

TYPE SAND Report YEAR 2015

DOI OSTI

Storage Systems and Input/Output to Support Extreme Scale Science

Ross, Robert; Grider, Gary; Felix, Evan; Gary, Mark; Klasky, Scott; Oldfield, Ron; Shipman, Galen; Wu, John

Storage systems are a foundational component of computational, experimental, and observational science today. The success of Department of Energy (DOE) activities in these areas is inextricably tied to the usability, performance, and reliability of storage and input/output (I/O) technologies.

More Details

TYPE Other Report YEAR 2015

DOI OSTI

Publications

Search results