In this paper, we introduce and analyze a new class of optimal control problems constrained by elliptic equations with uncertain fractional exponents. We utilize risk measures to formulate the resulting optimization problem. We develop a functional analytic framework, study the existence of solutions, and rigorously derive the first-order optimality conditions. Additionally, we employ a sample-based approximation for the uncertain exponent and the finite element method to discretize in space. We prove the rate of convergence for the optimal risk-neutral controls when using a quadrature approximation for the uncertain exponent and conclude with illustrative examples.
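For orientation, a schematic risk-neutral instance of this problem class can be written as follows (the notation is ours, and the paper treats general risk measures, of which the expectation shown here is the simplest):

$$
\min_{z}\ \mathbb{E}_{s}\!\left[\tfrac{1}{2}\,\|u(s,z)-u_d\|_{L^2(\Omega)}^{2}\right] + \tfrac{\alpha}{2}\,\|z\|_{L^2(\Omega)}^{2}
\quad\text{subject to}\quad (-\Delta)^{s}\,u(s,z) = z \ \text{in } \Omega,
$$

with homogeneous conditions on the boundary or complement of $\Omega$ (depending on the fractional model), where the fractional exponent $s$ is a random variable, $u_d$ is a target state, and $\alpha>0$ is a regularization weight.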
We propose a vertical TFET using atomic precision advanced manufacturing (APAM) to create an abrupt buried n++-doped source. We developed a gate stack that preserves the APAM source while accumulating holes above it, with the goal of band-to-band tunneling (BTBT) perpendicular to the gate, which is critical for the proposed device. A metal-insulator-semiconductor (MIS) capacitor shows hole accumulation above the APAM source, corroborated by simulation, demonstrating the TFET’s feasibility.
Interval Assignment (IA) is the problem of selecting the number of mesh edges (intervals) for each curve for conforming quad and hex meshing. The interval vector x is fundamentally integer-valued, yet many approaches perform floating-point optimization and convert a floating-point solution into an integer solution. We avoid such steps: we start integer and stay integer. Incremental Interval Assignment (IIA) uses integer linear algebra (Hermite normal form) to find an initial solution to the matrix equation Ax = b satisfying the meshing constraints. Solving for the reduced row echelon form provides integer vectors spanning the nullspace of A. We add vectors from the nullspace to improve the initial solution. Compared to floating-point optimization approaches, IIA is faster and always produces an integer solution. The potential drawback is that there is no theoretical guarantee that the solution is optimal, but in practice we achieve solutions close to the user goals. The software is freely available.
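As a toy illustration of the nullspace-improvement step, consider two curves on opposite sides of a mapped surface that must receive equal interval counts, so A = [1, -1] and b = 0. The following minimal numpy sketch (the constraint, goals, and greedy deviation measure are invented for illustration; this is not the IIA code) starts from an integer particular solution and repeatedly adds integer nullspace vectors while doing so reduces the deviation from the floating-point user goals:

import numpy as np

A = np.array([[1, -1]])          # meshing constraint: x1 - x2 = 0
x = np.array([1, 1])             # integer particular solution of A x = b (here b = 0)
nullspace = [np.array([1, 1])]   # integer vector spanning the nullspace of A
goals = np.array([4.4, 7.6])     # user-requested (floating-point) interval counts

def badness(x):
    return np.sum(np.abs(x - goals))   # deviation from the user goals

improved = True
while improved:
    improved = False
    for v in nullspace:
        for step in (v, -v):
            if np.all(x + step >= 1) and badness(x + step) < badness(x):
                x = x + step
                improved = True

print(x)   # stays integer throughout; this toy run settles at [5 5]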
Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface as compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and the linear layer as providing an optimal fit of the basis to data. By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training. The size of the Hessian in the second-order step scales only with the number of weights in the linear layer and not with the depth and width of the hidden layers; furthermore, the approach is applicable to arbitrary hidden layer architectures. Previous work applying this adaptive basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and this work can be viewed as an extension of that approach to classification problems. We first prove that the resulting Hessian matrix is symmetric positive semi-definite, and that the Newton step realizes a global minimizer. By studying classification of manufactured two-dimensional point cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained using NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting gains of second-order methods over gradient descent. A Tensorflow implementation of the algorithm is available at github.com/rgp62/.
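A minimal numpy sketch of the linear-layer Newton solve, written for the binary cross-entropy case with fixed hidden-layer features Phi and a small damping term added for numerical safety (the released implementation is in TensorFlow and handles the multi-class case):

import numpy as np

def newton_linear_layer(Phi, y, w, n_steps=5, damping=1e-6):
    # Phi: (N, d) adaptive-basis features produced by the hidden layers
    # y:   (N,) labels in {0, 1};  w: (d,) linear-layer weights
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-Phi @ w))        # predicted probabilities
        g = Phi.T @ (p - y)                       # gradient of the cross-entropy loss
        s = p * (1.0 - p)
        H = (Phi * s[:, None]).T @ Phi + damping * np.eye(Phi.shape[1])   # PSD Hessian
        w = w - np.linalg.solve(H, g)             # (damped) Newton step
    return w

In the hybrid scheme, a solve of this kind would alternate with ordinary gradient-descent updates of the hidden-layer weights that generate Phi.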
Boolean functions and binary arithmetic operations are central to standard computing paradigms. Accordingly, many advances in computing have focused on making these operations more efficient as well as on exploring what they can compute. To best leverage the advantages of novel computing paradigms it is important to consider what unique computing approaches they offer. However, for any special-purpose co-processor, Boolean functions and binary arithmetic operations are useful for, among other things, avoiding unnecessary I/O on and off the co-processor by pre- and post-processing data on-device. This is especially true for spiking neuromorphic architectures, where these basic operations are not fundamental low-level operations. Instead, these functions require specific implementation. Here we discuss the implications of an advantageous streaming binary encoding method as well as a handful of circuits designed to exactly compute elementary Boolean and binary operations.
CSPlib is an open source software library for analyzing general ordinary differential equation (ODE) systems and detailed chemical kinetic ODE systems. It relies on the computational singular perturbation (CSP) method for the analysis of these systems. The software provides support for: general ODE models (gODE model class) for computing source terms and Jacobians for a generic ODE system; a TChem model (ChemElemODETChem model class) for computing the source term, Jacobian, other necessary chemical reaction data, as well as the rates of progress for a homogeneous batch reactor using an elementary step detailed chemical kinetic reaction mechanism (this class relies on the TChem [2] library); a set of functions to compute essential elements of CSP analysis (Kernel class), including the eigensolution of the Jacobian matrix, CSP basis vectors and co-vectors, time scales (reciprocals of the magnitudes of the Jacobian eigenvalues), mode amplitudes, CSP pointers, and the number of exhausted modes (this class relies on the Tines library); a set of functions to compute the eigensolution of the Jacobian matrix using the Tines library's GPU eigensolver; and a set of functions to compute CSP indices (Index class), including participation indices and both slow and fast importance indices.
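For example, the time scales reported by the Kernel class are the reciprocals of the Jacobian eigenvalue magnitudes, and the mode amplitudes are the CSP co-vectors applied to the source term; a generic numpy illustration of these two quantities (assuming a diagonalizable Jacobian; this is not the CSPlib API) is:

import numpy as np

def csp_kernel_quantities(jacobian, source):
    eigvals, A = np.linalg.eig(jacobian)     # columns of A play the role of CSP basis vectors
    B = np.linalg.inv(A)                     # rows of B play the role of CSP co-vectors
    time_scales = 1.0 / np.abs(eigvals)      # reciprocals of the eigenvalue magnitudes
    amplitudes = B @ source                  # mode amplitudes f = B.g
    return time_scales, amplitudes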
In this study, a complete inelastic equation of state (IEOS) for solids is developed based on a superposition of thermodynamic energy potentials. The IEOS allows for a tensorial stress state by including an isochoric hyperelastic Helmholtz potential in addition to the zero-kelvin isotherm and lattice vibration energy contributions. Inelasticity is introduced through the nonlinear equations of finite strain plasticity, which utilize the temperature-dependent Johnson–Cook yield model. Material failure is incorporated into the model by a coupling of the damage history variable to the energy potentials. The numerical evaluation of the IEOS requires a nonlinear solution of stress, temperature, and history variables associated with elastic trial states for stress and temperature. The model is implemented into the ALEGRA shock and multi-physics code, and the applications presented include single element deformation paths, the Taylor anvil problem, and an energetically driven thermo-mechanical problem.
We present a new evaluation framework for implicit and explicit (IMEX) Runge–Kutta time-stepping schemes. The new framework uses a linearized nonhydrostatic system of normal modes. We utilize the framework to investigate the stability of IMEX methods and their dispersion and dissipation of gravity, Rossby, and acoustic waves. We test the new framework on a variety of IMEX schemes and use it to develop and analyze a set of second-order low-storage IMEX Runge–Kutta methods with a high Courant–Friedrichs–Lewy (CFL) number. We show that the new framework is more selective than the 2-D acoustic system previously used in the literature. Schemes that are stable for the 2-D acoustic system are not stable for the system of normal modes.
Chris Saunders and three technologists are in high demand from Sandia’s deep learning teams, and they are kept busy building new clusters of computer nodes for researchers who need the power of supercomputing on a smaller scale. Sandia researchers working on Laboratory Directed Research & Development (LDRD) projects, or on innovative ideas for solutions on short timeframes, formulate new ideas on old themes and frequently rely on smaller cluster machines to help solve problems before introducing their code to larger HPC resources. These research teams need an agile hardware and software environment where nascent ideas can be tested and cultivated on a smaller scale.
Accurate and timely weather predictions are critical to many aspects of society, with a profound impact on our economy, general well-being, and national security. In particular, our ability to forecast severe weather systems is necessary not only to avoid injuries and fatalities but also to minimize infrastructure damage and maximize mitigation strategies. The weather community has developed a range of sophisticated numerical models that are executed at various spatial and temporal scales in an attempt to issue global, regional, and local forecasts in pseudo real time. The accuracy, however, depends on the time period of the forecast, the nonlinearities of the dynamics, and the target spatial resolution. Significant uncertainties plague these predictions, including errors in initial conditions, material properties, data, and model approximations. To address these shortcomings, continuous data collection occurs at an effort level that is even larger than that of the modeling process. It has been demonstrated that the accuracy of the predictions depends on the quality of the data and is, to a certain extent, independent of the sophistication of the numerical models. Data assimilation has become one of the more critical steps in the overall weather prediction business, and consequently substantial improvements in the quality of the data would have transformational benefits. This paper describes the use of infrasound inversion technology, enabled through exascale computing, that could potentially achieve orders of magnitude improvement in data quality and therefore transform weather predictions, with significant impact on many aspects of our society.
After decades of R&D, quantum computers comprising more than 2 qubits are appearing. If this progress is to continue, the research community requires a capability for precise characterization (“tomography”) of these enlarged devices, which will enable benchmarking, improvement, and finally certification as mission-ready. As world leaders in characterization, with our gate set tomography (GST) method representing the current state of the art, the project team is keenly aware that every existing protocol is either (1) catastrophically inefficient for more than 2 qubits, or (2) not rich enough to predict device behavior. GST scales poorly, while the popular randomized benchmarking technique only measures a single aggregated error probability. This project explored a new insight: that the combinatorial explosion plaguing standard GST could be avoided by using an ansatz of few-qubit interactions to build a complete, efficient model for multi-qubit errors. We developed this approach, prototyped it, and tested it on a cutting-edge quantum processor developed by Rigetti Quantum Computing (RQC), a US-based startup. We implemented our new models within Sandia’s PyGSTi open-source code, and tested them experimentally on the RQC device by probing crosstalk. We found two major results: first, our schema worked and is viable for further development; second, while the Rigetti device is indeed a “real” 8-qubit quantum processor, its behavior fluctuated significantly over time while we were experimenting with it, and this drift made it difficult to fit our models of crosstalk to the data.
Hoekstra, Robert J.; Malone, C.M.; Montoya, D.R.; Ferencz, M.R.; Kuhl, A.L.; Wagner, J.
The review was conducted on May 8-9, 2017 at the University of Utah. Overall the review team was impressed with the work presented and found that the CCMSC had met or exceeded the Year 3 milestones. Specific details, comments, and recommendations are included in this document.
This article concerns modeling unsaturated deformable porous media as an equivalent single-phase and single-force-state peridynamic material through the effective force state. The balance equations of linear momentum and mass of unsaturated porous media are presented by defining relevant peridynamic states. The energy balance of unsaturated porous media is utilized to derive the effective force state for the solid skeleton, which is an energy conjugate to the nonlocal deformation state of the solid, and the suction force state. Through an energy equivalence, a multiphase constitutive correspondence principle is built between classical unsaturated poromechanics and peridynamic unsaturated poromechanics. The multiphase correspondence principle provides a means to incorporate advanced constitutive models from classical unsaturated porous media theory directly into unsaturated peridynamic poromechanics. Numerical simulations of localized failure in unsaturated porous media under different matric suctions are presented to demonstrate the feasibility of modeling the mechanical behavior of such three-phase materials as an equivalent single-phase peridynamic material through the effective force state concept.
If quantum information processors are to fulfill their potential, the diverse errors that affect them must be understood and suppressed. But errors typically fluctuate over time, and the most widely used tools for characterizing them assume static error modes and rates. This mismatch can cause unheralded failures, misidentified error modes, and wasted experimental effort. Here, we demonstrate a spectral analysis technique for resolving time dependence in quantum processors. Our method is fast, simple, and statistically sound. It can be applied to time-series data from any quantum processor experiment. We use data from simulations and trapped-ion qubit experiments to show how our method can resolve time dependence when applied to popular characterization protocols, including randomized benchmarking, gate set tomography, and Ramsey spectroscopy. In the experiments, we detect instability and localize its source, implement drift control techniques to compensate for this instability, and then demonstrate that the instability has been suppressed.
Lozanovski, Bill; Downing, David; Tino, Rance; Du Plessis, Anton; Tran, Phuong; Jakeman, John D.; Shidid, Darpan; Emmelmann, Claus; Qian, Ma; Choong, Peter; Brandt, Milan; Leary, Martin
Additive Manufacturing (AM), commonly referred to as 3D printing, offers the ability to not only fabricate geometrically complex lattice structures but parts in which lattice topologies in-fill volumes bounded by complex surface geometries. However, current AM processes produce defects on the strut and node elements which make up the lattice structure. This creates an inherent difference between the as-designed and as-fabricated geometries, which negatively affects predictions (via numerical simulation) of the lattice's mechanical performance. Although experimental and numerical analysis of an AM lattice's bulk structure, unit cell and struts have been performed, there exists almost no research data on the mechanical response of the individual as-manufactured lattice node elements. This research proposes a methodology that, for the first time, allows non-destructive quantification of the mechanical response of node elements within an as-manufactured lattice structure. A custom-developed tool is used to extract and classify each individual node geometry from micro-computed tomography scans of an AM fabricated lattice. Voxel-based finite element meshes are generated for numerical simulation and the mechanical response distribution is compared to that of the idealised computer-aided design model. The method demonstrates compatibility with Uncertainty Quantification methods that provide opportunities for efficient prediction of a population of nodal responses from sampled data. Overall, the non-destructive and automated nature of the node extraction and response evaluation is promising for its application in qualification and certification of additively manufactured lattice structures.
Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020
The Message Passing Interface (MPI) standard allows user-level threads to concurrently call into an MPI library. While this feature is currently rarely used, there is considerable interest from developers in adopting it in the near future. There is reason to believe that multithreaded communication may incur additional message processing overheads in terms of the number of items searched during demultiplexing and the amount of time spent searching, because it has the potential to increase the number of messages exchanged and to introduce non-deterministic message ordering. Therefore, understanding the implications of adding multithreading to MPI applications is important for future application development. One strategy for advancing this understanding is through 'low-cost' benchmarks that emulate full communication patterns using fewer resources. For example, while a complete, 'real-world' multithreaded halo exchange requires 9 or 27 nodes, the low-cost alternative needs only two, making it deployable on systems where acquiring resources is difficult because of high utilization (e.g., busy capacity-computing systems), or impossible because the necessary resources do not exist (e.g., testbeds with too few nodes). While such benchmarks have been proposed, the reported results have been limited to a single architecture or derived indirectly through simulation, and no attempt has been made to confirm that a low-cost benchmark accurately captures features of full (non-emulated) exchanges. Moreover, benchmark code has not been made publicly available. The purpose of the study presented in this paper is to quantify how accurately the low-cost benchmark captures the matching behavior of the full, real-world benchmark. In the process, we also advocate for the feasibility and utility of the low-cost benchmark. We present a 'real-world' benchmark implementing a full multithreaded halo exchange on 9 and 27 nodes, as defined by 5-point and 9-point 2D stencils, and 7-point and 27-point 3D stencils. Likewise, we present a 'low-cost' benchmark that emulates these communication patterns using only two nodes. We then confirm, across multiple architectures, that the low-cost benchmark gives accurate estimates of both the number of items searched during message processing and the time spent processing those messages. Finally, we demonstrate the utility of the low-cost benchmark by using it to profile the performance impact of state-of-the-art Mellanox ConnectX-5 hardware support for offloaded MPI message demultiplexing. To facilitate further research on the effects of multithreaded MPI on message matching behavior, the source of our two benchmarks is to be included in the next release version of the Sandia MPI Micro-Benchmark Suite.
Parallel implementations of linear iterative solvers generally alternate between phases of data exchange and phases of local computation. Increasingly large problem sizes and more heterogeneous compute architectures make load balancing and the design of low-latency network interconnects that are able to satisfy the communication requirements of linear solvers very challenging tasks. In particular, global communication patterns such as inner products become increasingly limiting at scale. We explore the use of asynchronous communication based on one-sided Message Passing Interface primitives in the context of domain decomposition solvers. In particular, a scalable asynchronous two-level Schwarz method is presented. We discuss practical issues encountered in the development of a scalable solver and show experimental results obtained on a state-of-the-art supercomputer system that illustrate the benefits of asynchronous solvers in load-balanced as well as load-imbalanced scenarios. Using the novel method, we observe speedups of up to four times over its classical synchronous equivalent.
One of the most severe obstacles to increasing the longevity of tungsten-based plasma facing components, such as divertor tiles, is the surface deterioration driven by sub-surface helium bubble formation and rupture. Supported by experimental observations at PISCES, this work uses molecular dynamics simulations to identify the microscopic mechanisms underlying suppression of helium bubble formation by the introduction of plasma-borne beryllium. Simulations of the initial surface material (crystalline W), early-time Be exposure (amorphous W-Be) and final WBe2 intermetallic surfaces were used to highlight the effect of Be. Significant differences in He retention, depth distribution and cluster size were observed in the cases with beryllium present. Helium resided much closer to the surface in the Be cases, with nearly 80% of the total helium inventory located within the first 2 nm. Moreover, coarsening of the He depth profile due to bubble formation is suppressed due to a one-hundred-fold decrease in He mobility in WBe2 relative to crystalline W. This is further evidenced by the drastic reduction in He cluster sizes, even though both the amorphous W-Be and WBe2 intermetallic phases were observed to retain nearly twice as much He during cumulative implantation studies.
In support of analyst requests for Mobile Guardian Transport studies, researchers at Sandia National Laboratories have expanded data types for the Slycat ensemble-analysis and visualization tool to include 3D surface meshes. This new capability represents a significant advance in our ability to perform detailed comparative analysis of simulation results. Analyzing mesh data rather than images provides greater flexibility for post-processing exploratory analysis.
This report details work to study trade-offs in topology and network bandwidth for potential interconnects in the exascale (2021-2022) timeframe. The work was done using multiple interconnect models across two parallel discrete event simulators. Results from each independent simulator are shown and discussed and the areas of agreement and disagreement are explored.
This paper presents a multifidelity uncertainty quantification framework called MFNets. We seek to address three existing challenges that arise when experimental and simulation data from different sources are used to enhance statistical estimation and prediction with quantified uncertainty. Specifically, we demonstrate that MFNets can (1) fuse heterogeneous data sources arising from simulations with different parameterizations, e.g., simulation models with different uncertain parameters or data sets collected under different environmental conditions; (2) encode known relationships among data sources to reduce data requirements; and (3) improve the robustness of existing multifidelity approaches to corrupted data. MFNets construct a network of latent variables (LVs) to facilitate the fusion of data from an ensemble of sources of varying credibility and cost. These LVs are posited as explanatory variables that provide the source of correlation in the observed data. Furthermore, MFNets provide a way to encode prior physical knowledge to enable efficient estimation of statistics and/or construction of surrogates via conditional independence relations on the LVs. We highlight the utility of our framework with a number of theoretical results which assess the quality of the posterior mean as a frequentist estimator and compare it to standard sampling approaches that use single fidelity, multilevel, and control variate Monte Carlo estimators. We also use the proposed framework to derive the Monte Carlo-based control variate estimator entirely from the use of Bayes' rule and linear-Gaussian models, which to our knowledge is the first such derivation. Finally, we demonstrate the ability to work with different uncertain parameters across different models.
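For context, the classical control-variate Monte Carlo estimator that the paper rederives from Bayes' rule and linear-Gaussian models has the familiar textbook form below (a generic numpy sketch with invented argument names; not the MFNets implementation):

import numpy as np

def control_variate_mean(hf, lf, lf_mean):
    # hf: high-fidelity samples; lf: correlated low-fidelity samples at the same inputs
    # lf_mean: known (or cheaply estimated) mean of the low-fidelity model
    cov = np.cov(hf, lf)
    alpha = cov[0, 1] / cov[1, 1]             # variance-minimizing control-variate weight
    return np.mean(hf) - alpha * (np.mean(lf) - lf_mean)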
Alemazkoor, Negin; Rachunok, Benjamin; Chavas, Daniel R.; Staid, Andrea S.; Louhghalam, Arghavan; Nateghi, Roshanak; Tootkaboni, Mazdak
Nine in ten major outages in the US have been caused by hurricanes. Long-term outage risk is a function of climate change-triggered shifts in hurricane frequency and intensity; yet projections of both remain highly uncertain. However, outage risk models do not account for the epistemic uncertainties in physics-based hurricane projections under climate change, largely due to the extreme computational complexity. Instead, they use simple probabilistic assumptions to model such uncertainties. Here, we propose a transparent and efficient framework to, for the first time, bridge the physics-based hurricane projections and intricate outage risk models. We find that uncertainty in projections of the frequency of weaker storms explains over 95% of the uncertainty in outage projections; thus, reducing this uncertainty will greatly improve outage risk management. We also show that the expected annual fraction of affected customers exhibits large variances, warranting the adoption of robust resilience investment strategies and climate-informed regulatory frameworks.
Partial differential equations (PDEs) are used with huge success to model phenomena across all scientific and engineering disciplines. However, across an equally wide swath, there exist situations in which PDEs fail to adequately model observed phenomena, or are not the best available model for that purpose. On the other hand, in many situations, nonlocal models that account for interaction occurring at a distance have been shown to more faithfully and effectively model observed phenomena that involve possible singularities and other anomalies. Here, we consider a generic nonlocal model, beginning with a short review of its definition, the properties of its solutions, its mathematical analysis, and specific concrete examples. We then provide extensive discussions about numerical methods, including finite element, finite difference and spectral methods, for determining approximate solutions of the nonlocal models considered. In that discussion, we pay particular attention to a special class of nonlocal models that are the most widely studied in the literature, namely those involving fractional derivatives. The article ends with brief considerations of several modelling and algorithmic extensions, which serve to show the wide applicability of nonlocal modelling.
Tranchida, Julien G.; Dos Santos, Gonzalo; Aparicio, Romina; Linares, D.; Miranda, E.N.; Pastor, Gustavo M.; Bringa, Eduardo M.
In this paper, the magnetic behavior of bcc iron nanoclusters, with diameters between 2 and 8 nm, is investigated by means of spin dynamics simulations coupled to molecular dynamics, using a distance-dependent exchange interaction. Finite-size effects in the total magnetization as well as the influence of the free surface and the surface/core proportion of the nanoclusters are analyzed in detail for a wide temperature range, going beyond the cluster and bulk Curie temperatures. Comparison is made with experimental data and with theoretical models based on the mean-field Ising model adapted to small clusters, taking into account the influence of low-coordinated spins at free surfaces. Our results for the temperature dependence of the average magnetization per atom M(T), including the thermalization of the translational lattice degrees of freedom, are in very good agreement with available experimental measurements on small Fe nanoclusters. In contrast, significant discrepancies with experiment are observed if the translational degrees of freedom are artificially frozen. The finite-size effects on M(T) are found to be particularly important near the cluster Curie temperature. The simulated magnetization above the Curie temperature scales with cluster size as predicted by models assuming short-range magnetic ordering. Analytical approximations to the magnetization as a function of temperature and size are proposed.
Antz, Hartwig; Boman, Erik G.; Gates, Mark; Kruger, Scott; Li, Sherry; Loe, Jennifer A.; Osei-Kuffuor, Daniel; Tomov, Stan; Tsai, Yaohung M.; Meier Yang, Ulrike
The use of multiple types of precision in mathematical software has the potential to increase its performance on new heterogeneous architectures. The xSDK project focuses on both the investigation and development of multiprecision algorithms and their inclusion into xSDK member libraries. This report summarizes current efforts on including and/or using mixed precision capabilities in the math libraries Ginkgo, heFFTe, hypre, MAGMA, PETSc/TAO, SLATE, SuperLU, and Trilinos, including KokkosKernels. It contains both numerical results from libraries that already provide mixed precision capabilities and descriptions of the strategies to incorporate multiprecision into established libraries.
Neuromorphic computing is a critical future technology for the computing industry, but it has yet to achieve its promise and has struggled to establish a cohesive research community. A large part of the challenge is that full realization of the potential of brain inspiration requires advances in device hardware, computing architectures, and algorithms. This simultaneous development across technology scales is unprecedented in the computing field. This article presents a strategy, framed by market and policy pressures, for moving past these current technological and cultural hurdles to realize the full impact of neuromorphic computing across the technology landscape. Achieving the full potential of brain-derived algorithms as well as post-complementary metal-oxide-semiconductor (CMOS) scaling neuromorphic hardware requires appropriately balancing the near-term opportunities of deep learning applications with the long-term potential of less understood opportunities in neural computing.
The application of deep learning toward discovery of data-driven models requires careful application of inductive biases to obtain a description of physics which is both accurate and robust. We present here a framework for discovering continuum models from high fidelity molecular simulation data. Our approach applies a neural network parameterization of governing physics in modal space, allowing a characterization of differential operators while providing structure which may be used to impose biases related to symmetry, isotropy, and conservation form. Here, we demonstrate the effectiveness of our framework for a variety of physics, including local and nonlocal diffusion processes and single and multiphase flows. For the flow physics, we demonstrate that this approach leads to a learned operator that generalizes to system characteristics not included in the training sets, such as variable particle sizes, densities, and concentrations.
Here, we describe recent efforts to improve our predictive modeling of rate-dependent behavior at, or near, a phase transition using molecular dynamics simulations. Cadmium sulfide (CdS) is a well-studied material that undergoes a solid-solid phase transition from the wurtzite to the rock salt structure between 3 and 9 GPa. Atomistic simulations are used to investigate the dominant transition mechanisms as a function of orientation, size, and rate. We found that the final rock salt orientations were determined relative to the initial wurtzite orientation, and that these final orientations differed between the two initial orientations and the two pressure regimes studied. The CdS solid-solid phase transition is studied both for a bulk single crystal and for polymer-encapsulated spherical nanoparticles of various sizes.
Signal arrival-time estimation plays a critical role in a variety of downstream seismic analyses, including location estimation and source characterization. Any arrival-time errors propagate through subsequent data-processing results. In this article, we detail a general framework for refining estimated seismic signal arrival times along with full estimation of their associated uncertainty. Using the standard short-term average/long-term average (STA/LTA) threshold algorithm to identify a search window, we demonstrate how to refine the pick estimate through two different approaches. In both cases, new waveform realizations are generated through bootstrap algorithms to produce full a posteriori estimates of the uncertainty of the onset arrival time of the seismic signal. The onset arrival uncertainty estimates provide additional data-derived information from the signal and have the potential to influence seismic analysis along several fronts.
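A minimal numpy sketch of the standard STA/LTA trigger used to identify the search window is given below (window lengths and the threshold are placeholders, not the paper's settings); the pick refinement and bootstrap uncertainty estimation would then operate on the waveform inside the returned window:

import numpy as np

def sta_lta_ratio(trace, sta_n, lta_n):
    energy = np.asarray(trace, dtype=float) ** 2
    sta = np.convolve(energy, np.ones(sta_n) / sta_n, mode="same")   # short-term average
    lta = np.convolve(energy, np.ones(lta_n) / lta_n, mode="same")   # long-term average
    return sta / np.maximum(lta, 1e-12)

def trigger_index(ratio, threshold=3.0):
    above = np.nonzero(ratio > threshold)[0]
    return int(above[0]) if above.size else None   # first threshold crossing, or None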
Component coupling is a crucial part of climate models, such as DOE's E3SM (Caldwell et al., 2019). A common coupling strategy in climate models is for their components to exchange flux data from the previous time-step. This approach effectively performs a single step of an iterative solution method for the monolithic coupled system, which may lead to instabilities and loss of accuracy. In this paper we formulate an Interface-Flux-Recovery (IFR) coupling method which improves upon the conventional coupling techniques in climate models. IFR starts from a monolithic formulation of the coupled discrete problem and then uses a Schur complement to obtain an accurate approximation of the flux across the interface between the model components. This decouples the individual components and allows one to solve them independently by using schemes that are optimized for each component. To demonstrate the feasibility of the method, we apply IFR to a simplified ocean–atmosphere model for heat-exchange coupled through the so-called bulk condition, common in ocean–atmosphere systems. We then solve this model on matching and non-matching grids to estimate numerically the convergence rates of the IFR coupling scheme.
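Schematically, and in our notation for a generic two-component discretization with interface flux $\lambda$, the IFR step eliminates the subdomain unknowns to recover the flux from a Schur complement:

$$
\begin{pmatrix} A_{11} & 0 & B_1 \\ 0 & A_{22} & B_2 \\ C_1 & C_2 & D \end{pmatrix}
\begin{pmatrix} u_1 \\ u_2 \\ \lambda \end{pmatrix}
=
\begin{pmatrix} f_1 \\ f_2 \\ g \end{pmatrix}
\;\Longrightarrow\;
\bigl(D - C_1 A_{11}^{-1} B_1 - C_2 A_{22}^{-1} B_2\bigr)\lambda
= g - C_1 A_{11}^{-1} f_1 - C_2 A_{22}^{-1} f_2 ,
$$

after which $A_{ii} u_i = f_i - B_i \lambda$ can be solved independently for each component with its own scheme.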
Arm processors have been explored in HPC for several years; however, there has not yet been a demonstration of viability for supporting large-scale production workloads. In this paper, we offer a retrospective on the process of bringing up Astra, the first petascale supercomputer based on 64-bit Arm processors, and validating its ability to run production HPC applications. Through this process several immature technology gaps were addressed, including software stack enablement, Linux bugs at scale, thermal management issues, power management capabilities, and advanced container support. From this experience, several lessons learned are formulated that contributed to the successful deployment of Astra. These insights can be helpful in accelerating the deployment and maturation of other first-seen HPC technologies. With Astra now supporting many users running a diverse set of production applications at multi-thousand-node scales, we believe this constitutes strong supporting evidence that Arm is a viable technology for even the largest-scale supercomputer deployments.
We present an effort to port the nonhydrostatic atmosphere dynamical core of the Energy Exascale Earth System Model (E3SM) to efficiently run on a variety of architectures, including conventional CPU, many-core CPU, and GPU. We specifically target cloud-resolving resolutions of 3 km and 1 km. To express on-node parallelism we use the C++ library Kokkos, which allows us to achieve a performance portable code in a largely architecture-independent way. Our C++ implementation is at least as fast as the original Fortran implementation on IBM Power9 and Intel Knights Landing processors, proving that the code refactor did not compromise the efficiency on CPU architectures. On the other hand, when using the GPUs, our implementation is able to achieve 0.97 Simulated Years Per Day, running on the full Summit supercomputer. To the best of our knowledge, this is the highest throughput achieved to date by any global atmosphere dynamical core running at such resolutions.
The Computer Science Research Institute (CSRI) brings university faculty and students to Sandia for focused collaborative research on Department of Energy (DOE) computer and computational science problems. The institute provides an opportunity for university researchers to learn about problems in computer and computational science at DOE laboratories. Participants conduct leading-edge research, interact with scientists and engineers at the laboratories, and help transfer results of their research to programs at the labs. Some specific CSRI research interest areas are: scalable solvers, optimization, adaptivity and mesh refinement, graph-based, discrete, and combinatorial algorithms, uncertainty estimation, mesh generation, dynamic load-balancing, virus and other malicious-code defense, visualization, scalable cluster computers, data-intensive computing, environments for scalable computing, parallel input/output, advanced architectures, and theoretical computer science. The CSRI Summer Program is organized by CSRI and typically includes the organization of a weekly seminar series and the publication of a summer proceedings. In 2020, the CSRI summer program was executed completely virtually; all student interns worked from home, due to the COVID-19 pandemic.
Research shows that individuals often overestimate their knowledge and performance without realizing they have done so, which can lead to faulty technical outcomes. This phenomenon is known as the Dunning-Kruger effect (Kruger & Dunning, 1999). This research sought to determine if some individuals were more prone to overestimating their performance due to underlying personality and cognitive characteristics. To test our hypothesis, we first collected individual difference measures. Next, we asked participants to estimate their performance on three performance tasks to assess the likelihood of overestimation. We found that some individuals may be more prone to overestimating their performance than others, and that faulty problem-solving abilities and low skill may be to blame. Encouraging individuals to think critically through all options and to consult with others before making a high-consequence decision may reduce overestimation.
Proceedings of MCHPC 2020: Workshop on Memory Centric High Performance Computing, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Many-core systems are beginning to feature novel large, high-bandwidth intermediate memory as a visible part of the memory hierarchy. This paper discusses how to make use of intermediate memory when composing matrix multiply with transpose to compute $A A^{T}$. We re-purpose the cache-oblivious approach developed by Frigo et al. and apply it to the composition of a bandwidth-bound kernel (transpose) with a compute-bound kernel (matrix multiply). Particular focus is on regions of matrix shapes far from square that are not usually considered. Our codes are simpler than optimized codes, but reasonably close in performance. Perhaps of more importance, we develop a paradigm for how to construct other codes using intermediate memories.
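The following Python sketch conveys the recursive structure (the fast_bytes threshold is a notional stand-in for the intermediate-memory capacity; a strictly cache-oblivious code would simply recurse down to constant-size blocks, and this is not the paper's optimized implementation):

import numpy as np

def aat(A, C, fast_bytes=1 << 16):
    # accumulate C += A @ A.T by recursively splitting the largest dimension
    m, k = A.shape
    if min(m, k) <= 1 or A.nbytes + C.nbytes <= fast_bytes:
        C += A @ A.T                          # base case: block fits in fast memory
        return
    if m >= k:                                # split rows of A -> 2x2 block update of C
        h = m // 2
        aat(A[:h], C[:h, :h], fast_bytes)
        aat(A[h:], C[h:, h:], fast_bytes)
        C[:h, h:] += A[:h] @ A[h:].T          # off-diagonal blocks
        C[h:, :h] = C[:h, h:].T               # exploit symmetry of A A^T
    else:                                     # split the shared (k) dimension
        h = k // 2
        aat(A[:, :h], C, fast_bytes)
        aat(A[:, h:], C, fast_bytes)

# usage: C = np.zeros((m, m)); aat(A, C) leaves C equal to A @ A.T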
Predictions of the bulk-scale thermal conductivity of solids using non-equilibrium molecular dynamics (MD) simulations have relied on the linear extrapolation of the thermal resistivity versus the reciprocal of the system length in the simulations. Several studies have reported deviation of the extrapolation from linearity near the micro-scale, raising a concern about its applicability to large systems. To investigate this issue, the present work conducted extensive MD simulations of silicon with two different potentials (EDIP and Tersoff-II) for unprecedented length scales up to 10.3 μm and simulation times up to 530 ns. For large systems ≥0.35 μm in size, the non-linearity of the extrapolation of the reciprocal of the thermal conductivity is mostly due to ignoring the dependence of the thermal conductivity on temperature. To account for such dependence, the present analysis fixes the temperature range for determining the gradient for calculating the thermal conductivity values. However, short systems ≤0.23 μm in size show significant non-linearity in the calculated thermal conductivity values using a temperature window of 500 ± 10 K from the simulation results with the EDIP potential. Since these system sizes are shorter than the mean phonon free path in EDIP (~0.22 μm), the nonlinearity may be attributed to phonon transport. For the MD simulations with the Tersoff-II potential there is no significant non-linearity in the calculated thermal conductivity values for systems ranging in size from 0.05 to 5.4 μm.
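The extrapolation itself is the standard fit of thermal resistivity (1/k) against inverse length (1/L), with the intercept giving the bulk value; a short numpy illustration follows (the lengths and conductivities below are made-up placeholders, not the paper's data):

import numpy as np

L = np.array([0.05, 0.12, 0.23, 0.35, 1.1, 5.4])     # system length, micrometers (placeholder)
k = np.array([18.0, 31.0, 44.0, 55.0, 83.0, 105.0])  # conductivity, W/(m K) (placeholder)

slope, intercept = np.polyfit(1.0 / L, 1.0 / k, 1)   # fit 1/k = intercept + slope * (1/L)
k_bulk = 1.0 / intercept                             # extrapolate to 1/L -> 0
print(f"extrapolated bulk conductivity: {k_bulk:.1f} W/(m K)")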
Proceedings of PMBS 2020: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Dominguez-Trujillo, Jered; Haskins, Keira; Khouzani, Soheila J.; Leap, Christopher; Tashakkori, Sahba; Wofford, Quincy; Estrada, Trilce; Bridges, Patrick G.; Widener, Patrick W.
Performance variation deriving from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impacts at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict; modeling assumptions which are stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its implementation in MPI, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.
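A toy Monte Carlo illustration of why the maximum interval length is the right quantity for bulk-synchronous workloads follows (the base time and noise model are synthetic, not measurements): even modest per-process noise inflates the synchronized interval as the process count grows.

import numpy as np

rng = np.random.default_rng(0)
base, noise = 1.0, 0.05
for nprocs in (1, 64, 4096):
    t = base + rng.exponential(noise, size=(10000, nprocs))   # per-process interval times
    print(nprocs, t.max(axis=1).mean())                       # expected synchronized interval length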
Adams, Brian M.; Bohnhoff, William J.; Dalbey, Keith R.; Ebeida, Mohamed S.; Eddy, John P.; Eldred, Michael S.; Hooper, Russell W.; Hough, Patricia D.; Hu, Kenneth T.; Jakeman, John D.; Khalil, Mohammad; Maupin, Kathryn A.; Monschke, Jason A.; Ridgway, Elliott M.; Rushdi, Ahmad; Seidl, Daniel T.; Stephens, John A.; Winokur, Justin G.
The Dakota toolkit provides a flexible and extensible interface between simulation codes and iterative analysis methods. Dakota contains algorithms for optimization with gradient and nongradient-based methods; uncertainty quantification with sampling, reliability, and stochastic expansion methods; parameter estimation with nonlinear least squares methods; and sensitivity/variance analysis with design of experiments and parameter study methods. These capabilities may be used on their own or as components within advanced strategies such as surrogate-based optimization, mixed integer nonlinear programming, or optimization under uncertainty. By employing object-oriented design to implement abstractions of the key components required for iterative systems analyses, the Dakota toolkit provides a flexible and extensible problem-solving environment for design and performance analysis of computational models on high performance computers. This report serves as a user’s manual for the Dakota software and provides capability overviews and procedures for software execution, as well as a variety of example studies.
Proceedings of IPDRM 2020: 4th Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
As network speeds increase, the overhead of processing incoming messages is becoming onerous enough that many manufacturers now provide network interface cards (NICs) with offload capabilities to handle these overheads. This increase in NIC capabilities creates an opportunity to enable computation on data in situ on the NIC. These enhanced NICs can be classified into several different categories of SmartNICs. SmartNICs present an interesting opportunity for future runtime software designs. Locating runtime software in the network, as opposed to at the host level, leads to radical new distributed runtime possibilities that were not practical prior to SmartNICs. In the process of transitioning to a radically different runtime software design for SmartNICs, there are intermediate steps, such as offloading current runtime software onto a SmartNIC, that also present interesting possibilities. This paper describes SmartNIC design and how SmartNICs can be leveraged to offload current-generation runtime software and lead to future, radically different in-network distributed runtime systems.
Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Multithreaded MPI applications are gaining popularity in scientific and high-performance computing. While the combination of programming models is suited to support current parallel hardware, it moves threading models and their interaction with MPI into focus. With the advent of new threading libraries, the flexibility to select the threading implementation of choice is becoming an important usability feature. Open MPI has traditionally avoided componentizing its threading model, relying on code inlining and static initialization to minimize potential impacts on runtime fast paths and synchronization. This paper describes the implementation of generic threading runtime support in Open MPI using the Opal Modular Component Architecture. This architecture allows the programmer to select a threading library at compile- or run-time, providing both static initialization of threading primitives as well as dynamic instantiation of threading objects. In this work, we present the implementation, define the required interfaces, and discuss trade-offs of dynamic and static initialization.
Proceedings of IA3 2020: 10th Workshop on Irregular Applications: Architectures and Algorithms, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Graph coloring is often used in parallelizing scientific computations that run in distributed and multi-GPU environments; it identifies sets of independent data that can be updated in parallel. Many algorithms exist for graph coloring on a single GPU or in distributed memory, but hybrid MPI+GPU algorithms have been unexplored until this work, to the best of our knowledge. We present several MPI+GPU coloring approaches that use implementations of the distributed coloring algorithms of Gebremedhin et al. and the shared-memory algorithms of Deveci et al. The on-node parallel coloring uses implementations in KokkosKernels, which provide parallelization for both multicore CPUs and GPUs. We further extend our approaches to solve for distance-2 coloring, giving the first known distributed and multi-GPU algorithm for this problem. In addition, we propose novel methods to reduce communication in distributed graph coloring. Our experiments show that our approaches operate efficiently on inputs too large to fit on a single GPU and scale up to graphs with 76.7 billion edges running on 128 GPUs.
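For readers unfamiliar with the kernel being parallelized, a minimal sequential greedy coloring (illustration only; the paper's algorithms are distributed MPI+GPU implementations) shows what a valid distance-1 coloring is:

def greedy_color(adjacency):
    # adjacency: dict mapping each vertex to an iterable of neighbor vertices
    colors = {}
    for v in adjacency:
        used = {colors[u] for u in adjacency[v] if u in colors}
        c = 0
        while c in used:                 # pick the smallest color not used by a neighbor
            c += 1
        colors[v] = c
    return colors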
This report summarizes the work performed under the project "Linear Programming in Strongly Polynomial Time." Linear programming (LP) is a classic combinatorial optimization problem heavily used directly and as an enabling subroutine in integer programming (IP). Specifically, IP is the same as LP except that some solution variables must take integer values (e.g., to represent yes/no decisions). Together LP and IP have many applications in resource allocation, including general logistics and infrastructure design and vulnerability analysis. The project was motivated by the PI's recent success developing methods to efficiently sample Voronoi vertices (essentially finding nearest neighbors in high-dimensional point sets) in arbitrary dimension. This method seems applicable to exploring the high-dimensional convex feasible space of an LP problem. Although the project did not find a provably strongly-polynomial algorithm, it explored multiple algorithm classes. The new medial simplex algorithms may still lead to solvers with improved provable complexity. We describe the medial simplex algorithms and some relevant structural and complexity results. We also designed a novel parallel LP algorithm based on our geometric insights and implemented it in the Spoke-LP code. A major part of the computational step is many independent vector dot products. Our parallel algorithm distributes the problem constraints across processors. Current commercial and high-quality free LP solvers require all problem details to fit onto a single processor or multicore node. Our new algorithm might enable the solution of problems too large for any current LP solvers. We describe our new algorithm, give preliminary proof-of-concept experiments, and describe a new generator for arbitrarily large LP instances.
In this paper, we present an optimization-based coupling method for local and nonlocal continuum models. Our approach casts the coupling of the models as a control problem where the states are the solutions of the nonlocal and local equations, the objective is to minimize their mismatch on the overlap of the local and nonlocal problem domains, and the virtual controls are the nonlocal volume constraint and the local boundary condition. We present the method in the context of local-to-nonlocal diffusion coupling. Numerical examples illustrate the theoretical properties of the approach.
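Schematically, and in our notation for the diffusion setting, the control problem has the form

$$
\min_{\theta_n,\,\theta_l}\ \tfrac12\,\|u_n - u_l\|_{L^2(\Omega_o)}^{2}
\quad\text{subject to}\quad
-\mathcal{L}_\delta u_n = f \ \text{in } \Omega_n,\ \ u_n = \theta_n \ \text{in the volume-constraint region},
\qquad
-\Delta u_l = f \ \text{in } \Omega_l,\ \ u_l = \theta_l \ \text{on } \partial\Omega_l ,
$$

where $\Omega_o$ is the overlap of the local and nonlocal subdomains, $\mathcal{L}_\delta$ is the nonlocal diffusion operator, and the virtual controls $\theta_n$ and $\theta_l$ play the roles of the nonlocal volume constraint and the local boundary condition.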
The lack of a reliable method to evaluate the convergence of molecular dynamics simulations has contributed to discrepancies in different areas of molecular dynamics. In the present work, the method of information entropy is introduced to molecular dynamics for stationarity assessment. The Shannon information entropy formalism is used to monitor the convergence of the atom motion to a steady state in a continuous spatial domain and is also used to assess the stationarity of calculated multidimensional fields, such as the temperature field, in a discrete spatial domain. It is demonstrated in this work that monitoring the information entropy of the atom position matrix provides a clear indicator of reaching steady state in radiation damage simulations, non-equilibrium molecular dynamics thermal conductivity computations, and simulations of Poiseuille and Couette flow in nanochannels. A main advantage of the present technique is that it is non-local and relies on fundamental quantities available in all molecular dynamics simulations. Unlike monitoring average temperature, the technique is applicable to simulations that conserve total energy, such as reverse non-equilibrium molecular dynamics thermal conductivity computations, and to simulations where energy dissipates through a boundary, as in radiation damage simulations. The method is applied to simulations of iron using the Tersoff/ZBL splined potential, of silicon using the Stillinger-Weber potential, and of a Lennard-Jones fluid. Its applicability to both solids and fluids shows that the technique has potential for generalization to other areas in molecular dynamics.
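A minimal numpy sketch of the position-entropy monitor follows (the bin count is a placeholder and the box is assumed rectangular with the origin at a corner; this is not the paper's implementation): the entropy of the binned atom positions is tracked over successive snapshots, and a plateau indicates that the atom motion has reached a steady state.

import numpy as np

def position_entropy(positions, box, bins=20):
    # positions: (N, 3) atom coordinates; box: (Lx, Ly, Lz) simulation cell dimensions
    hist, _ = np.histogramdd(positions, bins=bins, range=list(zip([0.0] * 3, box)))
    p = hist.ravel() / hist.sum()        # occupancy distribution over spatial bins
    p = p[p > 0]
    return -np.sum(p * np.log(p))        # Shannon information entropy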
PyGSTi is a Python software package for assessing and characterizing the performance of quantum computing processors. It can be used as a standalone application, or as a library, to perform a wide variety of quantum characterization, verification, and validation (QCVV) protocols on as-built quantum processors. We outline pyGSTi's structure, and what it can do, using multiple examples. We cover its main characterization protocols with end-to-end implementations. These include gate set tomography, randomized benchmarking on one or many qubits, and several specialized techniques. We also discuss and demonstrate how power users can customize pyGSTi and leverage its components to create specialized QCVV protocols and solve user-specific problems.
This work presents the design of nonlinear stabilization techniques for the finite element discretization of the Euler equations in both steady and transient form. Implicit time integration is used in the case of the transient form. A differentiable local bounds preserving method has been developed, which combines a Rusanov artificial diffusion operator and a differentiable shock detector. Nonlinear stabilization schemes are usually stiff and highly nonlinear. This issue is mitigated by the differentiability properties of the proposed method. Moreover, in order to further improve the nonlinear convergence, we also propose a continuation method for a subset of the stabilization parameters. The resulting method has been successfully applied to steady and transient problems with complex shock patterns. Numerical experiments show that it is able to provide sharp and well-resolved shocks. The importance of the differentiability is assessed by comparing the new scheme with its non-differentiable counterpart. Numerical experiments suggest that, for up to moderate nonlinear tolerances, the method exhibits improved robustness and nonlinear convergence behavior for steady problems. In the case of transient problems, we also observe a reduction in computational cost.
This report summarizes work done under the Verification, Validation, and Uncertainty Quantification (VVUQ) thrust area of the North American Energy Resilience Model (NAERM) Program. The specific task of interest described in this report is focused on sensitivity analysis of scenarios involving failures of both wind turbines and thermal generators under extreme cold-weather temperature conditions as would be observed in a Polar Vortex event.
A technique called the splice method for coupling local to peridynamic subregions of a body is described. The method relies on ghost nodes, whose values of displacement are interpolated from nearby physical nodes, to make each subregion visible to the other. In each time step, the nodes in each subregion treat the nodes in the other subregion as boundary conditions. Adaptively changing the subregions is possible through the creation and deletion of ghost nodes. Example problems in 2D and 3D illustrate how the method is used to perform multiscale modeling of fracture and impact events within a larger structure.
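A one-dimensional illustration of the ghost-node idea (not the production splice implementation; the linear interpolation and sorted node coordinates are simplifying assumptions) is simply:

import numpy as np

def fill_ghosts(ghost_x, phys_x, phys_u):
    # ghost_x: ghost-node coordinates in one subregion
    # phys_x, phys_u: sorted coordinates and displacements of nearby physical nodes
    #                 in the other subregion
    return np.interp(ghost_x, phys_x, phys_u)   # interpolated ghost displacements

Each subregion would call this at every time step, treating the filled ghost values as boundary conditions.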
This report is an institutional record of experiments conducted to explore performance of a vendor installation of CephFS on the SNL stria cluster. Comparisons between CephFS, the Lustre parallel file system, and NFS were done using the IOR and MDTEST benchmarking tools, a test program which uses the SEACAS/Trilinos IOSS library, and the checkpointing activity performed by the LAMMPS molecular dynamics simulation.
Understanding the effects of contaminant plasmas generated within the Z machine at Sandia is critical to understanding current loss mechanisms. The plasmas are generated at the accelerator electrode surfaces and include desorbed species found in the surface and substrate of the walls. These desorbed species can become ionized. The timing and location of contaminant species desorbed from the wall surface depend non-linearly on the local surface temperature. For accurate modeling, it is necessary to utilize wall heating models to estimate the amount and timing of material desorption. One of these heating mechanisms is Joule heating. We propose several extended semi-analytic magnetic diffusion heating models for computing surface Joule heating and demonstrate their effects for several representative current histories. We quantitatively assess under what circumstances these extensions to classical formulas may provide a validatable improvement to the understanding of contaminant desorption timing.
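As a crude point of reference for the classical limit, a deliberately simplified explicit finite-difference stand-in for one-dimensional magnetic diffusion with Joule heating is sketched below (constant material properties, a placeholder linear-ramp drive field, and no thermal conduction; this is a numerical stand-in, not the report's semi-analytic models):

import numpy as np

mu0 = 4e-7 * np.pi
eta, rho_c = 2e-7, 3.5e6                 # resistivity (ohm m) and volumetric heat capacity (J/m^3/K), placeholders
dx, dt, nx, nt = 1e-5, 1e-10, 400, 2000  # grid spacing (m), time step (s), cells, steps
D = eta / mu0                            # magnetic diffusivity; explicit stability needs D*dt/dx**2 < 0.5

def drive_field(t, B0=200.0, t_rise=1e-7):
    return B0 * min(t / t_rise, 1.0)     # placeholder linear-ramp drive field, tesla

B = np.zeros(nx)                         # tangential field inside the conductor
T = np.zeros(nx)                         # temperature rise
for n in range(nt):
    B[0] = drive_field(n * dt)           # surface value set by the assumed current history
    B[1:-1] += D * dt / dx**2 * (B[2:] - 2.0 * B[1:-1] + B[:-2])
    J = np.gradient(B, dx) / mu0         # current density from curl B in 1D
    T += eta * J**2 * dt / rho_c         # Joule heating (thermal conduction neglected)

print(f"surface temperature rise: {T[0]:.0f} K")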