Astra, deployed in 2018, was the first petascale supercomputer to utilize processors based on the ARM instruction set. The system was also the first deployed under Sandia's Vanguard program, which seeks to provide an evaluation vehicle for novel technologies that, with refinement, could be utilized in demanding, large-scale HPC environments. In addition to ARM processors, the machine incorporated several other important first-of-a-kind developments, including new approaches to cooling the datacenter and machine. This article documents our experiences building a power measurement and control infrastructure for Astra. While power measurement is often beyond the control of users today, our experiences show that the accurate measurement, cataloging, and evaluation of power is critical to the successful deployment of a large-scale platform. While such infrastructure exists in part for other architectures, Astra required new development to support the novel Marvell ThunderX2 processor used in its compute nodes. In addition to documenting the measurement of power during system bring-up and subsequent routine use, we present results on controlling the power usage of the processor, an area of growing interest as data centers and supercomputing sites look to improve compute/energy efficiency and find additional sources of full-system optimization.
Physics-constrained machine learning is emerging as an important topic in machine learning for physics. One of the most significant advantages of incorporating physics constraints into machine learning methods is that the resulting model requires significantly less data to train, and because physical rules are built into the formulation itself, the predictions are expected to be physically plausible. The Gaussian process (GP) is perhaps the most common machine learning method for small datasets. In this paper, we investigate the possibility of constraining a GP formulation with monotonicity on three materials datasets: one experimental and two computational. The monotonic GP is compared against the regular GP, and a significant reduction in posterior variance is observed. The monotonic GP is strictly monotonic in the interpolation regime; in the extrapolation regime, the monotonic effect fades as one moves beyond the training dataset. Imposing monotonicity comes at a small accuracy cost compared to the regular GP. The monotonic GP is perhaps most useful in applications where data are scarce and noisy, and monotonicity is supported by strong physical evidence.
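To make the idea concrete, the sketch below shows one simple way to impose monotonicity on a GP posterior: fit an ordinary GP, draw posterior sample paths, and keep only the non-decreasing ones, then compare the pointwise spread. This rejection-based illustration is an assumption of ours, not the constrained-GP formulation of the paper; the toy data and hyperparameters are likewise illustrative.

```python
# Minimal monotonic-GP illustration via posterior rejection sampling.
import numpy as np

def rbf(a, b, ell=0.4, sf=1.0):
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell)**2)

rng = np.random.default_rng(0)
x_tr = np.array([0.05, 0.3, 0.5, 0.7, 0.95])   # toy monotone training data
y_tr = np.array([0.10, 0.35, 0.50, 0.80, 0.90])
x_te = np.linspace(0.0, 1.0, 60)               # prediction grid

noise = 1e-2
K = rbf(x_tr, x_tr) + noise * np.eye(len(x_tr))
Ks = rbf(x_te, x_tr)
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_tr))
mu = Ks @ alpha                                 # posterior mean
V = np.linalg.solve(L, Ks.T)
cov = rbf(x_te, x_te) - V.T @ V + 1e-8 * np.eye(len(x_te))

# Draw posterior paths; retain only the monotone (non-decreasing) ones.
paths = rng.multivariate_normal(mu, cov, size=4000)
mono = paths[np.all(np.diff(paths, axis=1) >= 0, axis=1)]
print(f"kept {len(mono)} of {len(paths)} posterior paths")
if len(mono):
    print("mean posterior std, regular  GP:", paths.std(axis=0).mean())
    print("mean posterior std, monotone GP:", mono.std(axis=0).mean())
```

The drop in pointwise standard deviation for the retained paths mirrors the posterior-variance reduction the abstract reports; rejection sampling scales poorly, which is why dedicated monotonic-GP formulations constrain the derivative process directly.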
Integrated computational materials engineering (ICME) models have been a crucial building block for modern materials development, relieving heavy reliance on experiments and significantly accelerating the materials design process. However, ICME models are also computationally expensive, particularly with respect to time integration for dynamics, which hinders the ability to study statistical ensembles and thermodynamic properties of large systems for long time scales. To alleviate the computational bottleneck, we propose to model the evolution of statistical microstructure descriptors as a continuous-time stochastic process using a non-linear Langevin equation, where the probability density function (PDF) of the statistical microstructure descriptors, which are also the quantities of interest (QoIs), is modeled by the Fokker-Planck equation. We discuss how to calibrate the drift and diffusion terms of the Fokker-Planck equation from both theoretical and computational perspectives. The calibrated Fokker-Planck equation can then be used as a stochastic reduced-order model to simulate the evolution of the PDF of the statistical microstructure descriptors. Taking statistical microstructure descriptors as the QoIs, we demonstrate our proposed methodology in three ICME models: kinetic Monte Carlo, phase field, and molecular dynamics simulations.
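For concreteness, the generic form of this pairing (our notation, not necessarily the report's) is a scalar Langevin equation for a descriptor $x(t)$ together with the Fokker-Planck equation governing its PDF $p(x,t)$:

```latex
dx = \mu(x,t)\,dt + \sigma(x,t)\,dW_t,
\qquad
\frac{\partial p}{\partial t}
= -\frac{\partial}{\partial x}\bigl[\mu(x,t)\,p\bigr]
+ \frac{1}{2}\,\frac{\partial^{2}}{\partial x^{2}}\bigl[\sigma^{2}(x,t)\,p\bigr]
```

Calibrating the reduced-order model then amounts to estimating the drift $\mu$ and diffusion $\sigma$ from ensembles of ICME simulations.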
Uncertainty quantification (UQ) plays a major role in verification and validation for computational engineering models and simulations, and establishes trust in the predictive capability of computational models. In the materials science and engineering context, where the process-structure-property-performance linkage is well known to be the only road map from manufacturing to engineering performance, numerous integrated computational materials engineering (ICME) models have been developed across a wide spectrum of length- and time-scales to relieve the burden of resource-intensive experiments. Within the structure-property linkage, crystal plasticity finite element method (CPFEM) models have been widely used because they are one of the few ICME toolboxes that allow numerical predictions, providing the bridge from microstructure to materials properties and performance. Several constitutive models have been proposed in the last few decades to capture the mechanics and plasticity behavior of materials. While some UQ studies have been performed, the robustness and uncertainty of these constitutive models have not been rigorously established. In this work, we apply a stochastic collocation (SC) method, which is mathematically rigorous and widely used in the field of UQ, to quantify the uncertainty of three of the most commonly used constitutive models in CPFEM, namely phenomenological models (with and without twinning) and dislocation-density-based constitutive models, for three different crystal structures: face-centered cubic (fcc) copper (Cu), body-centered cubic (bcc) tungsten (W), and hexagonal close-packed (hcp) magnesium (Mg). Our numerical results not only quantify the uncertainty of these constitutive models in the stress-strain curve, but also analyze the global sensitivity of the underlying constitutive parameters with respect to the initial yield behavior, which may be helpful for future robust constitutive model calibration efforts.
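As a minimal illustration of the SC idea (the toy model and uniform input below are assumptions, not the CPFEM setting): evaluate the expensive model only at deterministic quadrature nodes, then recover output statistics from the quadrature rule.

```python
# Minimal stochastic-collocation sketch: propagate X ~ U(-1, 1) through a
# nonlinear stand-in "model" f using Gauss-Legendre collocation points.
import numpy as np

f = lambda x: np.exp(0.5 * x) + 0.1 * x**3   # stand-in for the expensive model

nodes, weights = np.polynomial.legendre.leggauss(8)  # 8 collocation points
w = weights / 2.0                                    # normalize for U(-1, 1)

vals = f(nodes)                       # only 8 deterministic model runs
mean = np.sum(w * vals)
var = np.sum(w * vals**2) - mean**2
print(f"SC mean = {mean:.6f}, SC variance = {var:.6f}")

# Monte Carlo reference: needs far more model evaluations for similar accuracy.
x = np.random.default_rng(1).uniform(-1, 1, 200_000)
print(f"MC mean = {f(x).mean():.6f}, MC variance = {f(x).var():.6f}")
```

The same construction extends to multiple uncertain constitutive parameters via tensor or sparse grids, which is what makes SC attractive when each model run is a full CPFEM simulation.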
The Spent Fuel and Waste Science and Technology (SFWST) Campaign of the U.S. Department of Energy (DOE) Office of Nuclear Energy (NE), Office of Spent Fuel & Waste Disposition (SFWD) is conducting research and development (R&D) on geologic disposal of spent nuclear fuel (SNF) and high-level nuclear waste (HLW). A high priority for SFWST disposal R&D is disposal system modeling (Sassani et al. 2021). The SFWST Geologic Disposal Safety Assessment (GDSA) work package is charged with developing a disposal system modeling and analysis capability for evaluating generic disposal system performance for nuclear waste in geologic media. This report describes fiscal year (FY) 2022 advances of the GDSA performance assessment (PA) development groups of the SFWST Campaign. The common mission of these groups is to develop a geologic disposal system modeling capability for nuclear waste that can be used to probabilistically assess the performance of generic disposal options and generic sites. The modeling capability under development is called GDSA Framework (pa.sandia.gov). GDSA Framework is a coordinated set of codes and databases designed for probabilistically simulating the release and transport of disposed radionuclides from a repository to the biosphere for post-closure performance assessment. Primary components of GDSA Framework include PFLOTRAN to simulate the major features, events, and processes (FEPs) over time; Dakota to propagate uncertainty and analyze sensitivities; meshing codes to define the domain; and various other software for rendering properties, processing data, and visualizing results.
The Spent Fuel and Waste Science and Technology (SFWST) Campaign of the U.S. Department of Energy Office of Nuclear Energy, Office of Spent Fuel and Waste Disposition (SFWD), has been conducting research and development on generic deep geologic disposal systems (i.e., geologic repositories). This report describes specific activities in fiscal year (FY) 2022 associated with the Geologic Disposal Safety Assessment (GDSA) Repository Systems Analysis (RSA) work package within the SFWST Campaign. The overall objective of the GDSA RSA work package is to develop generic deep geologic repository concepts and system performance assessment (PA) models in several host-rock environments, and to simulate and analyze these generic repository concepts and models using the GDSA Framework toolkit and other tools as needed.
Making reliable predictions in the presence of uncertainty is critical to high-consequence modeling and simulation activities, such as those encountered at Sandia National Laboratories. Surrogate or reduced-order models are often used to mitigate the expense of performing quality uncertainty analyses with high-fidelity, physics-based codes. However, phenomenological surrogate models do not always adhere to important physics and system properties. This project develops surrogate models that integrate physical theory with experimental data through a maximally informative framework that accounts for the many uncertainties present in computational modeling problems. Correlations between relevant outputs are preserved through the use of multi-output or co-predictive surrogate models; known physical properties (specifically monotonicity) are also preserved; and unknown physics and phenomena are detected using a causal analysis. By endowing surrogate models with key properties of the physical system being studied, their predictive power is arguably enhanced, allowing for reliable simulations and analyses at a reduced computational cost.
This document provides basic background information and initial enabling guidance for computational analysts to develop and utilize GitOps practices within the Common Engineering Environment (CEE) and High Performance Computing (HPC) computational environments at Sandia National Laboratories through GitLab/Jacamar-runner-based workflows.
Modeling real-world phenomena to any degree of accuracy is a challenge that the scientific research community has navigated since its foundation. Lack of information and limited computational and observational resources necessitate modeling assumptions which, when invalid, lead to model-form error (MFE). The work reported herein explored a novel method to represent model-form uncertainty (MFU) that combines Bayesian statistics with the emerging field of universal differential equations (UDEs). The fundamental principle behind UDEs is simple: use the known equational forms that govern a dynamical system where they are available, and incorporate data-driven approaches – in this case neural networks (NNs) – embedded within the governing equations to learn the interaction terms that are underrepresented. Utilizing epidemiology as our motivating exemplar, this report highlights the challenges of modeling novel infectious diseases while introducing ways to incorporate NN approximations to the MFE. Prior to embarking on a Bayesian calibration, we first explored methods to augment the standard (non-Bayesian) UDE training procedure to account for uncertainty and increase the robustness of training. In addition, it is often the case that uncertainty in observations is significant, whether due to randomness or to lack of precision in the measurement process. This uncertainty typically manifests as “noisy” observations that deviate from a true underlying signal. To account for such variability, the NN approximation to the MFE is endowed with a probabilistic representation and is updated using available observational data in a Bayesian framework. By representing the MFU explicitly and deploying an embedded, data-driven model, this approach enables an agile, expressive, and interpretable method for representing MFU. In this report we provide evidence that Bayesian UDEs show promise as a novel framework for science-based, data-driven MFU representation, while emphasizing that significant advances must be made in the calibration of Bayesian NNs to ensure a robust calibration procedure.
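The sketch below illustrates the UDE structure in the epidemiological setting: a standard SIR model in which a small neural network augments the known transmission term. The weights here are random placeholders; in the framework described above they would be trained on data (and ultimately given a Bayesian posterior). The model form and parameters are illustrative assumptions, not the report's calibrated model.

```python
# Schematic universal differential equation (UDE): SIR dynamics plus an
# NN correction standing in for the unknown model-form error.
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.5, (8, 2)), np.zeros(8)   # tiny MLP: R^2 -> R
W2, b2 = rng.normal(0, 0.5, (1, 8)), np.zeros(1)

def nn(s, i):
    """NN approximation to the unknown interaction term (the MFE)."""
    h = np.tanh(W1 @ np.array([s, i]) + b1)
    return float(W2 @ h + b2)

def sir_ude(t, y, beta=0.3, gamma=0.1):
    s, i, r = y
    interaction = beta * s * i + nn(s, i)   # known physics + learned correction
    return [-interaction, interaction - gamma * i, gamma * i]

sol = solve_ivp(sir_ude, (0, 160), [0.99, 0.01, 0.0],
                t_eval=np.linspace(0, 160, 9))
print(sol.y[1])   # infected fraction over time
```

Training would adjust the NN weights so that trajectories match observations; the Bayesian treatment replaces the point-estimate weights with a posterior distribution, turning the MFE correction into an explicit MFU representation.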
Predictive design of REHEDS experiments with radiation-hydrodynamic simulations requires knowledge of material properties (e.g., equations of state (EOS), transport coefficients, and radiation physics). Interpreting experimental results requires accurate models of diagnostic observables (e.g., detailed emission, absorption, and scattering spectra). In conditions of Local Thermodynamic Equilibrium (LTE), these material properties and observables can be pre-computed with relatively high accuracy and subsequently tabulated on simple temperature-density grids for fast look-up by simulations. When radiation and electron temperatures fall out of equilibrium, however, non-LTE effects can profoundly change material properties and diagnostic signatures. Accurately and efficiently incorporating these non-LTE effects has been a longstanding challenge for simulations. At present, most simulations include non-LTE effects by invoking highly simplified inline models. These inline non-LTE models are both much slower than table look-up and significantly less accurate than the detailed models used to populate LTE tables and diagnose experimental data through post-processing or inversion. Because inline non-LTE models are slow, designers avoid them whenever possible, which leads to known inaccuracies from using tabular LTE. Because inline models are simple, they are inconsistent with tabular data from detailed models, leading to ill-known inaccuracies, and they cannot generate detailed synthetic diagnostics suitable for direct comparisons with experimental data. This project addresses the challenge of generating and utilizing efficient, accurate, and consistent non-equilibrium material data along three complementary but relatively independent research lines. First, we have developed a relatively fast and accurate non-LTE average-atom model based on density functional theory (DFT) that provides a complete set of EOS, transport, and radiative data, and have rigorously tested it against more sophisticated first-principles multi-atom DFT models, including time-dependent DFT. Next, we have developed a tabular scheme and interpolation methods that compactly capture non-LTE effects for use in simulations and have implemented these tables in the GORGON magneto-hydrodynamic (MHD) code. Finally, we have developed post-processing tools that use detailed tabulated non-LTE data to directly predict experimental observables from simulation output.
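The sketch below shows the tabular look-up pattern the abstract contrasts with inline models: pre-compute a property on a temperature-density grid, then interpolate in log-log space at run time. The grid ranges, stand-in data, and interpolation choices are illustrative assumptions, not GORGON's actual tables.

```python
# Table look-up sketch: interpolate a pre-computed property on a
# temperature-density grid instead of evaluating an inline model.
import numpy as np
from scipy.interpolate import RegularGridInterpolator

logT = np.linspace(0, 3, 31)       # log10 temperature (eV), assumed range
logRho = np.linspace(-6, 1, 29)    # log10 density (g/cc), assumed range
LT, LR = np.meshgrid(logT, logRho, indexing="ij")
log_opacity = 2.0 - 1.5 * LT + 0.8 * LR          # stand-in tabulated data

table = RegularGridInterpolator((logT, logRho), log_opacity)

def opacity(T_eV, rho_gcc):
    """Fast run-time look-up: interpolate the table in log-log space."""
    return 10.0 ** table([[np.log10(T_eV), np.log10(rho_gcc)]])[0]

print(opacity(50.0, 1e-3))
```

The non-LTE extension described in the abstract amounts to adding further table dimensions (e.g., a radiation temperature or non-equilibrium parameter) while keeping this same fast interpolation path.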
The Spent Fuel and Waste Science and Technology (SFWST) Campaign of the U.S. Department of Energy (DOE) Office of Nuclear Energy (NE), Office of Fuel Cycle Technology (FCT) is conducting research and development (R&D) on geologic disposal of spent nuclear fuel (SNF) and high-level nuclear waste (HLW). Two high priorities for SFWST disposal R&D are design concept development and disposal system modeling. These priorities are directly addressed in the SFWST Geologic Disposal Safety Assessment (GDSA) control account, which is charged with developing a geologic repository system modeling and analysis capability, and the associated software, GDSA Framework, for evaluating disposal system performance for nuclear waste in geologic media. GDSA Framework is supported by the SFWST Campaign and its predecessor, the Used Fuel Disposition (UFD) Campaign.
An approach to numerically modeling relativistic magnetrons, in which the electrons are represented with a relativistic fluid, is described. A principal effect in the operation of a magnetron is space-charge-limited (SCL) emission of electrons from the cathode. We have developed an approximate SCL emission boundary condition for the fluid electron model. This boundary condition prescribes the flux of electrons as a function of the normal component of the electric field on the boundary. We show the results of a benchmarking activity that applies the fluid SCL boundary condition to the one-dimensional Child-Langmuir diode problem and a canonical two-dimensional diode problem. Simulation results for a two-dimensional A6 magnetron are then presented. Computed bunching of the electron cloud occurs and coincides with significant microwave power generation. Numerical convergence of the solution is considered. Sharp gradients in the solution quantities at the diocotron resonance, spanning an interval of three to four grid cells in the most well-resolved case, are present and likely affect convergence.
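For reference, the one-dimensional benchmark mentioned above is governed by the classical (non-relativistic) Child-Langmuir law, which gives the space-charge-limited current density across a planar gap of width $d$ at applied voltage $V$:

```latex
J_{\mathrm{CL}} \;=\; \frac{4\,\epsilon_0}{9}\,\sqrt{\frac{2e}{m_e}}\;\frac{V^{3/2}}{d^{2}}
```

Verifying that the fluid SCL emission boundary condition recovers this limiting current in steady state is the essence of the one-dimensional benchmarking activity described above.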
A semi-analytic fluid model has been developed for characterizing relativistic electron emission across a warm diode gap. Here we demonstrate the use of this model in (i) verifying multi-fluid codes in modeling compressible relativistic electron flows (the EMPIRE-Fluid code is used as an example; see also Ref. 1), (ii) elucidating key physics mechanisms characterizing the influence of compressibility and relativistic injection speed of the electron flow, and (iii) characterizing the regimes over which a fluid model recovers physically reasonable solutions.
This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system, and is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaptation of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 targets the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.
Adams, Brian H.; Bohnhoff, William J.; Dalbey, Keith R.; Ebeida, Mohamed S.; Eddy, John P.; Eldred, Michael S.; Hooper, Russell W.; Hough, Patricia D.; Hu, Kenneth T.; Jakeman, John D.; Khalil, Mohammad; Maupin, Kathryn A.; Monschke, Jason A.; Ridgway, Elliott M.; Rushdi, Ahmad A.; Seidl, Daniel T.; Stephens, John A.; Swiler, Laura P.; Laros, James H.; Winokur, Justin G.
The Dakota toolkit provides a flexible and extensible interface between simulation codes and iterative analysis methods. Dakota contains algorithms for optimization with gradient and nongradient-based methods; uncertainty quantification with sampling, reliability, and stochastic expansion methods; parameter estimation with nonlinear least squares methods; and sensitivity/variance analysis with design of experiments and parameter study methods. These capabilities may be used on their own or as components within advanced strategies such as surrogate-based optimization, mixed integer nonlinear programming, or optimization under uncertainty. By employing object-oriented design to implement abstractions of the key components required for iterative systems analyses, the Dakota toolkit provides a flexible and extensible problem-solving environment for design and performance analysis of computational models on high performance computers. This report serves as a user's manual for the Dakota software and provides capability overviews and procedures for software execution, as well as a variety of example studies.
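The interface between Dakota and a simulation code is commonly a black-box analysis driver: Dakota writes a parameters file, invokes the driver, and reads back a results file of response values. The sketch below shows that pattern in schematic form; the file parsing is deliberately simplified and is not Dakota's exact parameters-file format, and the response function is a stand-in, not a real simulation.

```python
# Schematic Dakota-style analysis driver: read parameters, run a stand-in
# "simulation", write responses. Invoked as: driver.py params.in results.out
import sys

def main(params_path, results_path):
    # Read "value descriptor" pairs (simplified parameters-file parsing).
    values = {}
    with open(params_path) as fh:
        for line in fh:
            tok = line.split()
            if len(tok) == 2:
                try:
                    values[tok[1]] = float(tok[0])
                except ValueError:
                    pass  # skip header/bookkeeping lines

    # Stand-in "simulation": a simple response of two design variables.
    x1, x2 = values.get("x1", 0.0), values.get("x2", 0.0)
    response = (x1 - 1.0) ** 2 + 100.0 * (x2 - x1 ** 2) ** 2

    with open(results_path, "w") as fh:
        fh.write(f"{response:.10e} f\n")   # one response value per line

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

Because the coupling is file-based, the same driver can serve any of Dakota's methods, from parameter studies to optimization to sampling-based UQ, without modification.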
Reverse engineering (RE) analysts struggle to address critical questions about the safety of binary code accurately and promptly, and their supporting program analysis tools are simply wrong sometimes. The analysis tools have to approximate in order to provide any information at all, but this means that they introduce uncertainty into their results. And those uncertainties chain from analysis to analysis. We hypothesize that exposing sources, impacts, and control of uncertainty to human binary analysts will allow the analysts to approach their hardest problems with high-powered analytic techniques that they know when to trust. Combining expertise in binary analysis algorithms, human cognition, uncertainty quantification, verification and validation, and visualization, we pursue research that should benefit binary software analysis efforts across the board. We find a strong analogy between RE and exploratory data analysis (EDA); we begin to characterize sources and types of uncertainty found in practice in RE (both in the process and in supporting analyses); we explore a domain-specific focus on uncertainty in pointer analysis, showing that more precise models do help analysts answer small information flow questions faster and more accurately; and we test a general population with domain-general sudoku problems, showing that adding "knobs" to an analysis does not significantly slow down performance. This document describes our explorations in uncertainty in binary analysis.
Bayesian optimization (BO) is an efficient and flexible global optimization framework that is applicable to a very wide range of engineering applications. To extend the capability of classical BO, many variants, including multi-objective, multi-fidelity, parallelized, and latent-variable formulations, have been proposed to address its limitations. In this work, we propose a novel multi-objective BO formalism, called srMO-BO-3GP, to solve multi-objective optimization problems in a sequential setting. Three different Gaussian processes (GPs) are stacked together, each assigned a different task: the first GP approximates a single objective computed from the multi-objective definition, the second GP learns the unknown constraints, and the third learns the uncertain Pareto frontier. At each iteration, a multi-objective augmented Tchebycheff function is adopted to convert the multi-objective problem to a single-objective one, where a ridge regularization term is also introduced to smooth the single-objective function. Finally, we couple the third GP with the classical BO framework to promote the convergence and diversity of the Pareto frontier through the acquisition function, which balances exploitation and exploration. The proposed framework is demonstrated on several numerical benchmark functions, as well as a thermomechanical finite element model for flip-chip package design optimization.
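In its generic form (our notation, with weights $\lambda_i$, utopia point $z_i^{*}$, and a small augmentation coefficient $\rho > 0$ playing the smoothing role attributed above to the ridge term; the paper's exact regularization may differ), the augmented Tchebycheff scalarization is:

```latex
F(\mathbf{x}) \;=\; \max_{i}\;\lambda_i\,\bigl|f_i(\mathbf{x}) - z_i^{*}\bigr|
\;+\; \rho\,\sum_{i}\bigl|f_i(\mathbf{x}) - z_i^{*}\bigr|
```

Sweeping the weights $\lambda_i$ across iterations lets a single-objective BO loop trace out different points on the Pareto frontier, including non-convex regions that a plain weighted sum cannot reach.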
Scientific applications run on high-performance computing (HPC) systems are critical to many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation, and even failures, that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or after a run completes. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments (EMPIRE) modeling and design tool, producing 1 TB of application data and 50 TB of system data. EMPIRE developers remarked that this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for an expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase the total scientific output of Sandia's HPC users.
This report details the results of a three-fold investigation of sensitivity analysis (SA) for machine learning (ML) explainability (MLE): (1) the mathematical assessment of the fidelity of an explanation with respect to a learned ML model, (2) quantifying the trustworthiness of a prediction, and (3) the impact of MLE on the efficiency of end users, assessed through multiple user studies. We focused on the cybersecurity domain because its data are inherently non-intuitive. As ML is being used in an increasing number of domains, including domains where being wrong can elicit high consequences, MLE has been proposed as a means of generating trust in learned ML models by end users. However, little analysis has been performed to determine whether the explanations accurately represent the target model and whether they themselves should be trusted beyond subjective inspection. Current state-of-the-art MLE techniques only provide a list of important features based on heuristic measures and/or make certain assumptions about the data and the model that are not representative of real-world data and models. Further, most are designed without considering their usefulness to an end user in a broader context. To address these issues, we present a notion of explanation fidelity based on Shapley values from cooperative game theory. We find that all of the investigated MLE methods produce explanations that are incongruent with the ML model being explained, because they make critical assumptions about feature independence and linear feature interactions for computational reasons. We also find that, in deployment, explanations are rarely used, for a variety of reasons: several other tools are trusted more than the explanations, and there is little incentive to use the explanations at all. In the cases where explanations are used, we found a danger that they persuade end users to wrongly accept false positives and false negatives. However, ML model developers and maintainers find the explanations more useful in helping to ensure that the ML model does not have obvious biases. In light of these findings, we suggest a number of future directions, including developing MLE methods that directly model non-linear feature interactions and adopting design principles that take into account the usefulness of explanations to the end user. We also augment explanations with a set of trustworthiness measures that quantify geometric aspects of the data to determine whether the model output should be trusted.
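For reference, the Shapley value underlying this fidelity notion attributes to feature $i$ its average marginal contribution to a value function $v$ over all subsets $S$ of the feature set $N$:

```latex
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
\frac{|S|!\,\bigl(|N|-|S|-1\bigr)!}{|N|!}\,
\bigl[v(S \cup \{i\}) - v(S)\bigr]
```

Exact evaluation is exponential in $|N|$, which is precisely why practical MLE methods approximate it under the feature-independence and linearity assumptions critiqued above.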
Ship tracks are quasi-linear cloud patterns produced by the interaction of ship emissions with low boundary-layer clouds. They are visible throughout the diurnal cycle in satellite images from space-borne assets like the Advanced Baseline Imagers (ABI) aboard the National Oceanic and Atmospheric Administration Geostationary Operational Environmental Satellites (GOES-R). However, complex atmospheric dynamics often make it difficult to identify and characterize the formation and evolution of tracks. Ship tracks have the potential to increase a cloud's albedo and reduce the impact of global warming. Thus, it is important to study these patterns to better understand the complex atmospheric interactions between aerosols and clouds, improve our climate models, and examine the efficacy of climate interventions such as marine cloud brightening. Over the course of this three-year project, we developed novel data-driven techniques that advance our ability to assess the effects of ship emissions on marine environments and the risks of future marine cloud brightening efforts. The three main technical contributions documented here are a method to track aerosol injections using optical flow, a stochastic simulation model for track formation, and an automated detection algorithm for efficient identification of ship tracks in large datasets.
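The sketch below illustrates the optical-flow idea behind the aerosol-injection tracker: estimate a dense motion field between two consecutive frames and advect a seed point along it. It uses OpenCV's Farneback method on synthetic stand-in frames; the frames, seed point, and parameters are assumptions, not the project's actual GOES-R pipeline.

```python
# Dense optical flow sketch: recover frame-to-frame motion and advect a
# hypothetical injection point one time step along the estimated flow.
import cv2
import numpy as np

rng = np.random.default_rng(0)
prev = (rng.random((256, 256)) * 255).astype(np.uint8)  # stand-in frame;
curr = np.roll(prev, shift=(2, 3), axis=(0, 1))         # shifted by (3, 2) in (x, y)

flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

x, y = 120, 240                 # hypothetical injection point
dx, dy = flow[y, x]             # flow stores (dx, dy) per pixel
print(f"track moved from ({x}, {y}) to ({x + dx:.1f}, {y + dy:.1f})")
```

In the real setting the frames would be successive ABI images, and chaining these per-step displacements over the diurnal cycle yields the trajectory of an aerosol injection as it forms a track.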