A Monolithic Algebraic Multigrid Approach for Coupled Multiphysics Problems using the MueLu Framework
Abstract not provided.
Experimental Mechanics
This work explores the effect of the ill-posedness of the motion-estimation problem on uncertainty quantification for digital image correlation (DIC) (Sutton et al. [2009]). We develop a correction factor for standard uncertainty estimates based on the cosine of the angle between the true motion and the image gradients, in an integral sense over a subregion of the image. This correction factor accounts for variability in the DIC solution that was previously unaccounted for when considering only image noise, interpolation bias, contrast, and software settings such as subset size and spacing.
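A minimal sketch of the kind of alignment quantity described above: a subset-averaged cosine between an assumed motion direction and the local image gradients. The function name, weighting, and normalization are illustrative assumptions, not the paper's exact correction factor.

```python
# Illustrative only: average |cos(theta)| between an assumed motion direction and
# the image gradient directions over a subset. The published correction factor's
# exact weighting and normalization are not reproduced here.
import numpy as np

def mean_alignment(image_subset, motion_dir):
    """Subset-averaged |cos(angle)| between `motion_dir` and image gradients."""
    gy, gx = np.gradient(image_subset.astype(float))
    grad = np.stack([gx.ravel(), gy.ravel()], axis=1)
    norms = np.linalg.norm(grad, axis=1)
    mask = norms > 1e-12                      # ignore flat (gradient-free) pixels
    unit_grad = grad[mask] / norms[mask, None]
    u = np.asarray(motion_dir, dtype=float)
    u /= np.linalg.norm(u)
    return float(np.mean(np.abs(unit_grad @ u)))

rng = np.random.default_rng(0)
subset = rng.random((21, 21))                 # stand-in for a speckle-pattern subset
print(mean_alignment(subset, motion_dir=[1.0, 0.0]))
```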
Environmental Modelling and Software
Sensitivity analysis (SA) is en route to becoming an integral part of mathematical modeling. The tremendous potential benefits of SA are, however, yet to be fully realized, both for advancing mechanistic and data-driven modeling of human and natural systems, and in support of decision making. In this perspective paper, a multidisciplinary group of researchers and practitioners revisit the current status of SA, and outline research challenges in regard to both theoretical frameworks and their applications to solve real-world problems. Six areas are discussed that warrant further attention, including (1) structuring and standardizing SA as a discipline, (2) realizing the untapped potential of SA for systems modeling, (3) addressing the computational burden of SA, (4) progressing SA in the context of machine learning, (5) clarifying the relationship and role of SA to uncertainty quantification, and (6) evolving the use of SA in support of decision making. An outlook for the future of SA is provided that underlines how SA must underpin a wide variety of activities to better serve science and society.
Computer Methods in Applied Mechanics and Engineering
We present a fully discrete approximation technique for the compressible Navier–Stokes equations that is second-order accurate in time and space, semi-implicit, and guaranteed to be invariant domain preserving. The restriction on the time step is the standard hyperbolic CFL condition, i.e., τ ≲ O(h)/V, where V is a reference velocity scale and h is the typical mesh size.
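A minimal sketch of evaluating a hyperbolic CFL-type bound of this form; the safety factor C below is an assumption, not a constant taken from the paper.

```python
# Minimal sketch of a hyperbolic CFL-type step estimate, tau <= C*h/V.
# The safety factor C is an assumption; the scheme's precise constant is not given here.
def cfl_time_step(h, V, C=0.5):
    return C * h / V

print(cfl_time_step(h=1e-3, V=340.0))  # e.g. mesh size 1 mm, reference speed 340 m/s
```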
Mechanics Research Communications
The variational multiscale (VMS) formulation is used to develop residual-based VMS large eddy simulation (LES) models for Rayleigh-Bénard convection. The resulting model is a mixed model that incorporates the VMS model and an eddy viscosity model. The Wall-Adapting Local Eddy-viscosity (WALE) model is used as the eddy viscosity model in this work. The new LES models were implemented in the finite element code Drekar. Simulations are performed using continuous, piecewise linear finite elements. The simulations ranged from Ra = 10^6 to Ra = 10^14 and were conducted at Pr = 1 and Pr = 7. Two domains were considered: a two-dimensional domain of aspect ratio 2 with a fluid confined between two parallel plates and a three-dimensional cylinder of aspect ratio 1/4. The Nusselt number from the VMS results is compared against three-dimensional direct numerical simulations and experiments. In all cases, the VMS results are in good agreement with existing literature.
Physics of Plasmas
Macroscopic simulations of dense plasmas rely on detailed microscopic information that can be computationally expensive and is difficult to verify experimentally. In this work, we delineate the accuracy boundary between microscale simulation methods by comparing Kohn-Sham density functional theory molecular dynamics (KS-MD) and radial pair potential molecular dynamics (RPP-MD) for a range of elements, temperatures, and densities. By extracting the optimal RPP from KS-MD data using force matching, we constrain its functional form and dismiss classes of potentials that assume a constant power law for small interparticle distances. Our results show excellent agreement between RPP-MD and KS-MD for multiple metrics of accuracy at temperatures of only a few electron volts. The use of RPPs offers an orders-of-magnitude decrease in computational cost and indicates that three-body potentials are not required beyond temperatures of a few eV. Due to its efficiency, the validated RPP-MD provides an avenue for reducing errors due to finite-size effects, which can be on the order of ∼20%.
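A generic force-matching sketch (not the authors' implementation): represent the radial pair force on a fixed basis and fit the coefficients to reference forces by linear least squares. The Gaussian basis, distance range, and synthetic "reference" forces below are placeholders.

```python
# Generic force-matching sketch: fit pair-force coefficients to reference (e.g., KS-MD)
# forces by linear least squares. All data and the basis here are synthetic stand-ins.
import numpy as np

def pair_force_design_matrix(r_samples, centers, width):
    # Gaussian radial basis evaluated at the sampled interparticle distances
    return np.exp(-((r_samples[:, None] - centers[None, :]) / width) ** 2)

rng = np.random.default_rng(1)
r = rng.uniform(1.5, 6.0, size=2000)                          # sampled pair distances
f_ref = np.exp(-r) * (1.0 + 0.3 * rng.normal(size=r.size))    # stand-in reference forces

centers = np.linspace(1.5, 6.0, 20)
A = pair_force_design_matrix(r, centers, width=0.3)
coeffs, *_ = np.linalg.lstsq(A, f_ref, rcond=None)            # force-matched coefficients
print(coeffs[:5])
```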
Learning 3D representations that generalize well to arbitrarily oriented inputs is a challenge of practical importance in applications varying from computer vision to physics and chemistry. We propose a novel multi-resolution convolutional architecture for learning over concentric spherical feature maps, of which the single sphere representation is a special case. Our hierarchical architecture is based on alternatively learning to incorporate both intra-sphere and inter-sphere information. We show the applicability of our method for two different types of 3D inputs, mesh objects, which can be regularly sampled, and point clouds, which are irregularly distributed. We also propose an efficient mapping of point clouds to concentric spherical images, thereby bridging spherical convolutions on grids with general point clouds. We demonstrate the effectiveness of our approach in improving state-of-the-art performance on 3D classification tasks with rotated data.
Recently, Graph Neural Networks (GNNs) have received a lot of interest because of their success in learning representations from graph-structured data. However, GNNs exhibit different compute and memory characteristics compared to traditional Deep Neural Networks (DNNs). Graph convolutions require feature aggregations from neighboring nodes (known as the aggregation phase), which leads to highly irregular data accesses. GNNs also have a very regular compute phase that can be broken down into matrix multiplications (known as the combination phase). All recently proposed GNN accelerators utilize different dataflows and microarchitecture optimizations for these two phases. Different communication strategies between the two phases have also been used. However, as more custom GNN accelerators are proposed, it becomes harder to qualitatively classify them and quantitatively contrast them. In this work, we present a taxonomy to describe several diverse dataflows for running GNN inference on accelerators. This provides a structured way to describe and compare the design space of GNN accelerators.
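The two phases reduce to a sparse and a dense matrix product, which is what gives them such different access patterns; a minimal sketch of one graph-convolution layer:

```python
# Minimal sketch of the two GNN phases for one layer: aggregation is a sparse product
# with the adjacency matrix (irregular accesses); combination is a dense GEMM with the
# layer weights (regular compute). Sizes and normalization are illustrative.
import numpy as np
import scipy.sparse as sp

num_nodes, in_feat, out_feat = 5, 8, 4
A = sp.random(num_nodes, num_nodes, density=0.4, format="csr")  # adjacency (unnormalized)
X = np.random.rand(num_nodes, in_feat)                          # node features
W = np.random.rand(in_feat, out_feat)                           # layer weights

H_agg = A @ X            # aggregation phase: irregular, sparse accesses
H_out = H_agg @ W        # combination phase: regular dense GEMM
print(H_out.shape)
```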
We study both conforming and non-conforming versions of the practical DPG method for the convection-reaction problem. We determine that the most common approach for DPG stability analysis (construction of a local Fortin operator) is infeasible for the convection-reaction problem. We then develop a line of argument based on the direct construction of a global Fortin operator; we find that employing a polynomial enrichment for the test space does not suffice for this purpose, motivating the introduction of a (two-element) subgrid mesh. The argument combines mathematical analysis with numerical experiments.
The explosion of both sensors and GPS-enabled devices has resulted in position/time data being the next big frontier for data analytics. However, many of the problems associated with large numbers of trajectories do not necessarily have an analog in many of the historic big-data applications such as text and image analysis. Modern trajectory analytics exploits much of the cutting-edge research in machine learning, statistics, computational geometry, and other disciplines. We show that doing trajectory analytics at scale requires fundamentally changing the way the information is represented, through a feature-vector approach. We then demonstrate the ability to solve large trajectory analytics problems using this representation.
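An illustrative sketch of turning a raw (t, x, y) trajectory into a fixed-length feature vector; the particular features below (path length, duration, speed statistics, straightness) are generic examples, not necessarily the features used in this work.

```python
# Illustrative feature-vector extraction for a trajectory of (t, x, y) samples.
import numpy as np

def trajectory_features(t, x, y):
    dx, dy, dt = np.diff(x), np.diff(y), np.diff(t)
    seg = np.hypot(dx, dy)
    speed = seg / np.maximum(dt, 1e-9)
    return np.array([
        seg.sum(),                                                    # total path length
        t[-1] - t[0],                                                 # duration
        speed.mean(),                                                 # mean speed
        speed.std(),                                                  # speed variability
        np.hypot(x[-1] - x[0], y[-1] - y[0]) / max(seg.sum(), 1e-9),  # straightness
    ])

t = np.linspace(0, 100, 50)
x, y = np.cos(t / 10), np.sin(t / 10)
print(trajectory_features(t, x, y))
```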
This report focuses on the two primary goals set forth in Sandia's TAFI effort, referred to here under the name Kebab. The first goal is to overlay a trajectory onto a large database of historical trajectories, all with sampling rates very different from that of the original track. We demonstrate a fast method to accomplish this, even for databases that hold over a million tracks. The second goal is to demonstrate that these matched historical trajectories can be used to make predictions about unknown qualities associated with the original trajectory. As part of this work, we also examine the problem of defining the qualities of a trajectory in a reproducible way.
Computational Particle Mechanics
The peridynamic theory of solid mechanics is applied to modeling the deformation and fracture of micrometer-sized particles made of organic crystalline material. A new peridynamic material model is proposed to reproduce the elastic–plastic response, creep, and fracture that are observed in experiments. The model is implemented in a three-dimensional, meshless Lagrangian simulation code. In the small deformation, elastic regime, the model agrees well with classical Hertzian contact analysis for a sphere compressed between rigid plates. Under higher load, material and geometrical nonlinearity is predicted, leading to fracture. Finally, the material parameters for the energetic material CL-20 are evaluated from nanoindentation test data on the cyclic compression and failure of micrometer-sized grains.
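The small-deformation reference mentioned above is the classical Hertzian relation for one sphere-flat contact, F = (4/3) E* sqrt(R) δ^(3/2) with E* = E/(1 - ν²) against a rigid plate. A short sketch follows; the parameter values are placeholders, not CL-20 properties.

```python
# Classical Hertzian contact of an elastic sphere against a rigid flat:
# F = (4/3) * E_eff * sqrt(R) * delta**1.5, with E_eff = E / (1 - nu**2).
# This is the standard reference solution; the values below are placeholders only.
import math

def hertz_force(delta, R, E, nu):
    E_eff = E / (1.0 - nu**2)
    return (4.0 / 3.0) * E_eff * math.sqrt(R) * delta**1.5

print(hertz_force(delta=1e-8, R=5e-6, E=20e9, nu=0.3))  # force in newtons (SI inputs)
```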
In this position paper we address challenges and opportunities relating to the design and codesign of application-specific circuits. Given our background as computational scientists, our perspective is from the viewpoint of a highly motivated application developer as opposed to a career computer architect.
MLIR (Multi-Level Intermediate Representation) is an extensible compiler framework that supports high-level data structures and operation constructs. These higher-level code representations are particularly applicable to the artificial intelligence and machine learning (AI/ML) domain, allowing developers to more easily support upcoming heterogeneous AI/ML accelerators and to develop flexible domain-specific compilers/frameworks with higher-level intermediate representations (IRs) and advanced compiler optimizations. Using MLIR within the LLVM compiler framework is expected to yield significant improvements in the quality of generated machine code, which in turn will result in improved performance and hardware efficiency.
For digital twins (DTs) to become a central fixture in mission critical systems, a better understanding is required of potential modes of failure, quantification of uncertainty, and the ability to explain a model’s behavior. These aspects are particularly important as the performance of a digital twin will evolve during model development and deployment for real-world operations.
Optimization Online Repository
Abstract not provided.
Computer Methods in Applied Mechanics and Engineering
A key challenge to nonlocal models is the analytical complexity of deriving them from first principles, and frequently their use is justified a posteriori. In this work we extract nonlocal models from data, circumventing these challenges and providing data-driven justification for the resulting model form. Extracting data-driven surrogates is a major challenge for machine learning (ML) approaches, due to nonlinearities and lack of convexity — it is particularly challenging to extract surrogates which are provably well-posed and numerically stable. Our scheme not only yields a convex optimization problem, but also allows extraction of nonlocal models whose kernels may be partially negative while maintaining well-posedness even in small-data regimes. To achieve this, based on established nonlocal theory, we embed in our algorithm sufficient conditions on the non-positive part of the kernel that guarantee well-posedness of the learnt operator. These conditions are imposed as inequality constraints to meet the requisite conditions of the nonlocal theory. We demonstrate this workflow for a range of applications, including reproduction of manufactured nonlocal kernels; numerical homogenization of Darcy flow associated with a heterogeneous periodic microstructure; nonlocal approximation to high-order local transport phenomena; and approximation of globally supported fractional diffusion operators by truncated kernels.
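A much-simplified sketch of the constrained-regression idea: fit discrete kernel values so that a nonlocal operator reproduces observed input/output pairs while imposing sign constraints on the kernel entries. The actual sufficient conditions on the non-positive part of the kernel used in the paper are more involved than the simple bounds shown here, and all data below are synthetic stand-ins.

```python
# Simplified sketch: convex least-squares fit of kernel values with box constraints,
# standing in for the paper's well-posedness conditions on the non-positive part.
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(0)
n_obs, n_kernel = 200, 15
A = rng.normal(size=(n_obs, n_kernel))     # features built from u (stand-in data)
b = rng.normal(size=n_obs)                 # corresponding L[u] observations (stand-in)

# Allow mildly negative kernel values near the origin, require nonnegativity elsewhere.
lb = np.full(n_kernel, 0.0)
lb[:3] = -0.1
res = lsq_linear(A, b, bounds=(lb, np.inf))   # convex problem with inequality constraints
print(res.x)
```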
Results in Applied Mathematics
We present an optimization-based coupling method for local and nonlocal continuum models. Our approach couches the coupling of the models into a control problem where the states are the solutions of the nonlocal and local equations, the objective is to minimize their mismatch on the overlap of the local and nonlocal problem domains, and the virtual controls are the nonlocal volume constraint and the local boundary condition. We present the method in the context of Local-to-Nonlocal diffusion coupling. Numerical examples illustrate the theoretical properties of the approach.
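A schematic statement of the control formulation described above may help fix ideas; the notation is generic (Ω_o is the overlap region, θ_n and θ_l are the virtual controls) and sketches the abstract's description rather than the paper's exact operators:

\[
\min_{\theta_n,\,\theta_l}\ \tfrac12\,\|u_n - u_l\|_{L^2(\Omega_o)}^2
\quad \text{subject to} \quad
\mathcal{L}_\delta u_n = f \ \text{in } \Omega_n,\qquad u_n = \theta_n \ \text{in the nonlocal interaction volume},
\]
\[
-\Delta u_l = f \ \text{in } \Omega_l,\qquad u_l = \theta_l \ \text{on } \partial\Omega_l .
\]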
Proceedings - International Symposium on High-Performance Computer Architecture
Over the past decade as Moore's Law has slowed, the need for new forms of computation that can provide sustainable performance improvements has risen. A new method, called in situ computing, has shown great potential to accelerate matrix vector multiplication (MVM), an important kernel for a diverse range of applications from neural networks to scientific computing. Existing in situ accelerators for scientific computing, however, have a significant limitation: these accelerators provide no acceleration for preconditioning, a key bottleneck in linear solvers and in scientific computing workflows. This paper enables in situ acceleration for state-of-the-art linear solvers by demonstrating how to use a new in situ matrix inversion accelerator for analog preconditioning. As existing techniques that enable high precision and scalability for in situ MVM are inapplicable to in situ matrix inversion, new techniques to compensate for circuit non-idealities are proposed. Additionally, a new approach to bit slicing that enables splitting operands across multiple devices without external digital logic is proposed. For scalability, this paper demonstrates how in situ matrix inversion kernels can work in tandem with existing domain decomposition techniques to accelerate the solutions of arbitrarily large linear systems. The analog kernel can be directly integrated into existing preconditioning workflows, leveraging several well-optimized numerical linear algebra tools to improve the behavior of the circuit. The result is an analog preconditioner that is more effective (up to 50% fewer iterations) than the widely used incomplete LU factorization preconditioner, ILU(0), while also reducing the energy and execution time of each approximate solve operation by 1025x and 105x respectively.
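A sketch of the software-side integration described above: an approximate inverse (here a simple numerical stand-in for the analog matrix-inversion kernel) is wrapped as a preconditioner for a Krylov solve. This illustrates only the numerical-linear-algebra plumbing, not the analog hardware or the paper's specific workflow.

```python
# Stand-in "analog" preconditioner: apply an inexact inverse of A inside GMRES.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 200
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")  # model system
b = np.ones(n)

# Inexact inverse of a perturbed copy of A plays the role of the analog block inverse.
A_approx_inv = spla.splu(sp.csc_matrix(A + 0.05 * sp.eye(n)))
M = spla.LinearOperator((n, n), matvec=A_approx_inv.solve)

x, info = spla.gmres(A, b, M=M)
print(info, np.linalg.norm(A @ x - b))
```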
Proceedings - International Symposium on High-Performance Computer Architecture
Non-volatile memories (NVMs) have the characteristics of both traditional storage systems (persistent) and traditional memory systems (byte-addressable). However, they suffer from high write latency and have a limited write endurance. Researchers have proposed hybrid memory systems that combine DRAM and NVM, utilizing the lower latency of the DRAM to hide some of the shortcomings of the NVM, improving the system's performance by caching resident NVM data in the DRAM. However, this can nullify the persistency of the cached pages, leading to a question of trade-offs in terms of performance and reliability. In this paper, we propose Stealth-Persist, a novel architecture support feature that allows applications that need persistence to run in the DRAM while maintaining the persistency features provided by the NVM. Stealth-Persist creates the illusion of a persistent memory for the application to use, while utilizing the DRAM for performance optimizations. Our experimental results show that Stealth-Persist improves the performance by 42.02% for persistent applications.
Proceedings - International Symposium on High-Performance Computer Architecture
The exponential growth of data has driven technology providers to develop new protocols, such as cache coherent interconnects and memory semantic fabrics, to help users and facilities leverage advances in memory technologies to satisfy these growing memory and storage demands. Using these new protocols, fabric-attached memories (FAM) can be directly attached to a system interconnect and be easily integrated with a variety of processing elements (PEs). Moreover, systems that support FAM can be smoothly upgraded and allow multiple PEs to share the FAM memory pools using well-defined protocols. The sharing of FAM between PEs allows efficient data sharing, improves memory utilization, reduces cost by allowing flexible integration of different PEs and memory modules from several vendors, and makes it easier to upgrade the system. One promising use case for FAMs is in High-Performance Computing (HPC) systems, where the underutilization of memory is a major challenge. However, adopting FAMs in HPC systems brings new challenges. In addition to cost, flexibility, and efficiency, one particular problem that requires rethinking is virtual memory support for security and performance. To address these challenges, this paper presents decoupled access control and address translation (DeACT), a novel virtual memory implementation that supports HPC systems equipped with FAM. Compared to the state-of-the-art two-level translation approach, DeACT achieves a speedup of up to 4.59x (1.8x on average) without compromising security.
In this presentation we discuss recent results on using the SpiNNaker neuromorphic platform (48-chip model) for deep learning neural network inference. We use the Sandia Labs-developed Whetstone spiking deep learning library to train deep multi-layer perceptrons and convolutional neural networks suitable for the spiking substrate on the neural hardware architecture. By using the massively parallel nature of SpiNNaker, we are able to achieve, under certain network topologies, substantial network tiling and, consequently, impressive inference throughput. Such high-throughput systems may have eventual application in remote sensing applications where large images need to be chipped, scanned, and processed quickly. Additionally, we explore complex topologies that push the limits of the SpiNNaker routing hardware and investigate how that impacts the mapping of software-implemented networks to on-hardware instantiations.
Journal of Physical Chemistry C
Diborane (B2H6) is a promising molecular precursor for atomic precision p-type doping of silicon that has recently been experimentally demonstrated [Škereň et al., Nat. Electron. 2020]. We use density functional theory (DFT) calculations to determine the reaction pathway for diborane dissociating into a species that will incorporate as electrically active substitutional boron after adsorbing onto the Si(100)-2×1 surface. Our calculations indicate that diborane must overcome an energy barrier to adsorb, explaining the experimentally observed low sticking coefficient (<1 × 10^-4 at room temperature) and suggesting that heating can be used to increase the adsorption rate. Upon sticking, diborane has an ≈50% chance of splitting into two BH3 fragments versus merely losing hydrogen to form a dimer such as B2H4. As boron dimers are likely electrically inactive, whether this latter reaction occurs is shown to be predictive of the incorporation rate. The dissociation process proceeds with significant energy barriers, necessitating the use of high temperatures for incorporation. Using the barriers calculated from DFT, we parameterize a kinetic Monte Carlo model that predicts the incorporation statistics of boron as a function of the initial depassivation geometry, dose, and anneal temperature. Our results suggest that the dimer nature of diborane inherently limits its doping density as an acceptor precursor and furthermore that heating the boron dimers to split before exposure to silicon can lead to poor selectivity on hydrogen and halogen resists. This suggests that, while diborane works as an atomic precision acceptor precursor, other non-dimerized acceptor precursors may lead to higher incorporation rates at lower temperatures.
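The kinetic Monte Carlo machinery referred to above rests on Arrhenius rates built from the DFT barriers; a minimal sketch of one KMC step follows. The barriers and prefactor are placeholders, not the values computed in the paper.

```python
# Minimal kinetic Monte Carlo step with Arrhenius rates (placeholder barriers/prefactor).
import numpy as np

KB_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def arrhenius(prefactor, barrier_ev, T):
    return prefactor * np.exp(-barrier_ev / (KB_EV * T))

def kmc_step(barriers_ev, T, prefactor=1e13, rng=np.random.default_rng()):
    rates = arrhenius(prefactor, np.asarray(barriers_ev), T)
    total = rates.sum()
    event = rng.choice(len(rates), p=rates / total)   # pick an event proportionally to its rate
    dt = -np.log(rng.random()) / total                # advance time (Gillespie)
    return event, dt

print(kmc_step(barriers_ev=[0.8, 1.1, 1.4], T=600.0))
```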
Programmable accelerators have become commonplace in modern computing systems. Advances in programming models and the availability of unprecedented amounts of data have created a space for massively parallel accelerators capable of maintaining context for thousands of concurrent threads resident on-chip. These threads are grouped and interleaved on a cycle-by-cycle basis among several massively parallel computing cores. One path for the design of future supercomputers relies on an ability to model the performance of these massively parallel cores at scale. The SST framework has been proven to scale up to run simulations containing tens of thousands of nodes. A previous report described the initial integration of the open-source, execution-driven GPU simulator, GPGPU-Sim, into the SST framework. This report discusses the results of the integration and how to use the new GPU component in SST. It also provides examples of what it can be used to analyze and a correlation study showing how closely the execution matches that of an NVIDIA V100 GPU when running kernels and mini-apps.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Performance variation diagnosis in High-Performance Computing (HPC) systems is a challenging problem due to the size and complexity of the systems. Application performance variation leads to premature termination of jobs, decreased energy efficiency, or wasted computing resources. Manual root-cause analysis of performance variation based on system telemetry has become an increasingly time-intensive process as it relies on human experts and the size of telemetry data has grown. Recent methods use supervised machine learning models to automatically diagnose previously encountered performance anomalies in compute nodes. However, supervised machine learning models require large labeled data sets for training. This labeled data requirement is restrictive for many real-world application domains, including HPC systems, because collecting labeled data is challenging and time-consuming, especially considering anomalies that sparsely occur. This paper proposes a novel semi-supervised framework that diagnoses previously encountered performance anomalies in HPC systems using a limited number of labeled data points, which is more suitable for production system deployment. Our framework first learns performance anomalies’ characteristics by using historical telemetry data in an unsupervised fashion. In the following process, we leverage supervised classifiers to identify anomaly types. While most semi-supervised approaches do not typically use anomalous samples, our framework takes advantage of a few labeled anomalous samples to classify anomaly types. We evaluate our framework on a production HPC system and on a testbed HPC cluster. We show that our proposed framework achieves 60% F1-score on average, outperforming state-of-the-art supervised methods by 11%, and maintains an average 0.06% anomaly miss rate.
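A sketch of the two-stage idea only, not the authors' pipeline: learn a compact representation from plentiful unlabeled telemetry, then train a supervised classifier for anomaly type on the few labeled windows available. PCA and a random forest are stand-ins for the framework's actual components, and the data are synthetic.

```python
# Two-stage sketch: unsupervised representation from unlabeled telemetry,
# supervised anomaly-type classification from a small labeled set.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_unlabeled = rng.normal(size=(5000, 64))            # telemetry feature windows, no labels
X_labeled = rng.normal(size=(60, 64))                # small labeled set
y_labeled = rng.integers(0, 4, size=60)              # anomaly types (0 = healthy)

encoder = PCA(n_components=16).fit(X_unlabeled)      # unsupervised stage
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(encoder.transform(X_labeled), y_labeled)     # supervised stage on few labels

print(clf.predict(encoder.transform(X_labeled[:5])))
```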
2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
On high-performance computing (HPC) systems, job allocation strategies control the placement of a job among available nodes. As the placement changes a job's communication performance, allocation can significantly affect the execution times of many HPC applications. Existing allocation strategies typically make decisions based on resource limits, network topology, communication patterns, etc. However, system network performance at runtime is seldom consulted in allocation, even though it significantly affects job execution times. In this work, we demonstrate using monitoring data to improve HPC systems' performance by proposing a Network-Data-Driven (NeDD) job allocation framework, which monitors the network performance of an HPC system at runtime and allocates resources based on both network performance and job characteristics. NeDD characterizes system network performance by collecting the network traffic statistics on each router link, and it characterizes a job's sensitivity to network congestion by collecting Message Passing Interface (MPI) statistics. During allocation, NeDD pairs network-sensitive (network-insensitive) jobs with nodes whose parent routers have low (high) network traffic. Through experiments on a large HPC system, we demonstrate that NeDD reduces the execution time of parallel applications by 11% on average and up to 34%.
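An illustrative sketch of the pairing rule described above: rank queued jobs by measured MPI sensitivity, rank candidate nodes by their parent router's traffic, and give the most network-sensitive jobs the least-congested nodes. The scoring and data layout are simplified stand-ins for NeDD's actual implementation.

```python
# Simplified greedy pairing: most network-sensitive jobs get the least-congested nodes.
jobs = [           # (job_id, mpi_sensitivity): higher = more congestion-sensitive
    ("jobA", 0.9), ("jobB", 0.2), ("jobC", 0.6),
]
nodes = [          # (node_id, parent_router_traffic): lower = less congested
    ("n1", 0.8), ("n2", 0.1), ("n3", 0.4),
]

jobs_by_sensitivity = sorted(jobs, key=lambda j: j[1], reverse=True)
nodes_by_traffic = sorted(nodes, key=lambda n: n[1])

allocation = {job: node for (job, _), (node, _) in zip(jobs_by_sensitivity, nodes_by_traffic)}
print(allocation)   # {'jobA': 'n2', 'jobC': 'n3', 'jobB': 'n1'}
```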
Proceedings of the ASME Design Engineering Technical Conference
Bayesian optimization (BO) is a flexible and powerful framework that is suitable for computationally expensive simulation-based applications and guarantees statistical convergence to the global optimum. While remaining one of the most popular optimization methods, its capability is hindered by the size of the data, the dimensionality of the considered problem, and the nature of sequential optimization. These scalability issues are intertwined with each other and must be tackled simultaneously. In this work, we propose the Scalable3-BO framework, which employs sparse GP as the underlying surrogate model to cope with Big Data and is equipped with a random embedding to efficiently optimize high-dimensional problems with low effective dimensionality. The Scalable3-BO framework is further equipped with an asynchronous parallelization feature, which fully exploits the computational resources on HPC within a computational budget. As a result, the proposed Scalable3-BO framework is scalable in three independent respects: with respect to data size, dimensionality, and computational resources on HPC. The goal of this work is to push the frontiers of BO beyond its well-known scalability issues and minimize the wall-clock waiting time for optimizing high-dimensional, computationally expensive applications. We demonstrate the capability of Scalable3-BO with 1 million data points, 10,000-dimensional problems, and 20 concurrent workers in an HPC environment.
Proceedings - IEEE International Conference on Cluster Computing, ICCC
The FAIR principles of open science (Findable, Accessible, Interoperable, and Reusable) have had transformative effects on modern large-scale computational science. In particular, they have encouraged more open access to and use of data, an important consideration as collaboration among teams of researchers accelerates and the use of workflows by those teams to solve problems increases. How best to apply the FAIR principles to workflows themselves, and software more generally, is not yet well understood. We argue that the software engineering concept of technical debt management provides a useful guide for application of those principles to workflows, and in particular that it implies reusability should be considered as 'first among equals'. Moreover, our approach recognizes a continuum of reusability where we can make explicit and selectable the tradeoffs required in workflows for both their users and developers. To this end, we propose a new abstraction approach for reusable workflows, with demonstrations for both synthetic workloads and real-world computational biology workflows. Through application of novel systems and tools that are based on this abstraction, these experimental workflows are refactored to rightsize the granularity of workflow components to efficiently fill the gap between end-user simplicity and general customizability. Our work makes it easier to selectively reason about and automate the connections between trade-offs across user and developer concerns when exposing degrees of freedom for reuse. Additionally, by exposing fine-grained reusability abstractions we enable performance optimizations, as we demonstrate on both institutional-scale and leadership-class HPC resources.
CEUR Workshop Proceedings
Second-order optimizers hold intriguing potential for deep learning, but suffer from increased cost and sensitivity to the non-convexity of the loss surface as compared to gradient-based approaches. We introduce a coordinate descent method to train deep neural networks for classification tasks that exploits global convexity of the cross-entropy loss in the weights of the linear layer. Our hybrid Newton/Gradient Descent (NGD) method is consistent with the interpretation of hidden layers as providing an adaptive basis and the linear layer as providing an optimal fit of the basis to data. By alternating between a second-order method to find globally optimal parameters for the linear layer and gradient descent to train the hidden layers, we ensure an optimal fit of the adaptive basis to data throughout training. The size of the Hessian in the second-order step scales only with the number of weights in the linear layer and not with the depth and width of the hidden layers; furthermore, the approach is applicable to arbitrary hidden-layer architectures. Previous work applying this adaptive basis perspective to regression problems demonstrated significant improvements in accuracy at reduced training cost, and this work can be viewed as an extension of that approach to classification problems. We first prove that the resulting Hessian matrix is symmetric semi-definite and that the Newton step realizes a global minimizer. By studying classification of manufactured two-dimensional point cloud data, we demonstrate both an improvement in validation error and a striking qualitative difference in the basis functions encoded in the hidden layer when trained using NGD. Application to image classification benchmarks for both dense and convolutional architectures reveals improved training accuracy, suggesting gains of second-order methods over gradient descent. A Tensorflow implementation of the algorithm is available at github.com/rgp62/.
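A minimal numpy sketch of the alternation for a binary toy problem, not the released implementation: gradient steps update the hidden layer, and a Newton step solves the convex output-layer problem on the current adaptive basis. Architecture, ridge term, and step sizes are illustrative choices.

```python
# Alternate a Newton step on the linear (output) layer with a gradient step on the
# hidden layer; binary cross-entropy toy problem, illustrative hyperparameters.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)            # toy labels

W1 = rng.normal(scale=0.5, size=(2, 16)); b1 = np.zeros(16)   # hidden layer
w2 = np.zeros(16)                                             # linear (output) layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for it in range(50):
    Phi = np.tanh(X @ W1 + b1)                       # adaptive basis from hidden layer
    # Newton step for the convex output-layer fit (cross-entropy + small ridge term)
    p = sigmoid(Phi @ w2)
    g = Phi.T @ (p - y) + 1e-3 * w2
    H = Phi.T @ (Phi * (p * (1 - p))[:, None]) + 1e-3 * np.eye(16)
    w2 -= np.linalg.solve(H, g)
    # One gradient step on the hidden layer with the output layer fixed
    p = sigmoid(Phi @ w2)
    dZ = np.outer(p - y, w2) * (1 - Phi**2)          # backprop through tanh
    W1 -= 1e-2 * X.T @ dZ
    b1 -= 1e-2 * dZ.sum(axis=0)

print("train accuracy:", ((sigmoid(np.tanh(X @ W1 + b1) @ w2) > 0.5) == y).mean())
```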
Proceedings of SPIE - The International Society for Optical Engineering
Neural network approaches have periodically been explored in the pursuit of high-performing SAR ATR solutions. With deep neural networks (DNNs) now offering many state-of-the-art solutions to computer vision tasks, neural networks are once again being revisited for ATR processing. Here, we characterize and explore a suite of neural network architectural topologies. In doing so, we assess how different architectural approaches impact performance and consider the associated computational costs. This includes characterizing network depth, width, scale, connectivity patterns, as well as convolution layer optimizations. We have explored a suite of architectural topologies applied to both the canonical MSTAR dataset, as well as the more operationally realistic Synthetic and Measured Paired and Labeled Experiment (SAMPLE) dataset. The latter pairs high-fidelity computational models of targets with actual measured SAR data. Effectively, this dataset offers the ability to train a DNN on simulated data and test the network performance on measured data. Not only does our in-depth architecture topology analysis offer insight into how different architectural approaches impact performance, but we have also trained DNNs attaining state-of-the-art performance on both datasets. Furthermore, beyond just accuracy, we also assess how efficiently an accelerator architecture executes these neural networks. Specifically, using an analytical assessment tool, we forecast energy and latency for an edge-TPU-like architecture. Taken together, this tradespace exploration offers insight into the interplay of accuracy, energy, and latency for executing these networks.
Minerals, Metals and Materials Series
Process-structure linkage is one of the most important topics in materials science, because virtually all information related to a material, including its manufacturing process, lies in the microstructure itself. Therefore, to learn more about the process, one must start by thoroughly examining the microstructure. This gives rise to inverse problems in the context of process-structure linkages, which attempt to identify the processes that were used to manufacture the given microstructure. In this work, we present an inverse problem for structure-process linkages, which we solve using asynchronous parallel Bayesian optimization that exploits parallel computing resources. We demonstrate the effectiveness of the method using a kinetic Monte Carlo model for grain growth simulation.
SIAM Journal on Numerical Analysis
Reproducing kernel (RK) approximations are meshfree methods that construct shape functions from sets of scattered data. We present an asymptotically compatible (AC) RK collocation method for nonlocal diffusion models with Dirichlet boundary condition. The numerical scheme is shown to be convergent to both nonlocal diffusion and its corresponding local limit as the nonlocal interaction vanishes. The analysis is carried out on a special family of rectilinear Cartesian grids for a linear RK method with designed kernel support. The key idea for the stability of the RK collocation scheme is to compare the collocation scheme with the standard Galerkin scheme, which is stable. In addition, assembling the stiffness matrix of the nonlocal problem requires costly computational resources because high-order Gaussian quadrature is necessary to evaluate the integral. We thus provide a remedy to the problem by introducing a quasi-discrete nonlocal diffusion operator for which no numerical quadrature is further needed after applying the RK collocation scheme. The quasi-discrete nonlocal diffusion operator combined with RK collocation is shown to be convergent to the correct local diffusion problem by taking the limits of nonlocal interaction and spatial resolution simultaneously. The theoretical results are then validated with numerical experiments. We additionally illustrate a connection between the proposed technique and an existing optimization-based approach built on generalized moving least squares.
AIAA Scitech 2021 Forum
Aeroengines ingest foreign object debris such as sand, which eventually erode components through repeated impacts. Due to the wide feature space, modeling and simulations are needed to rapidly assess the erosion behavior of materials such as composites. Peridynamic simulations were performed to analyze erosion of SiC/SiC composite due to sand impacts, which gives direct insight into the impact erosion mechanism and amounts. The erosion data was strongly correlated to impact velocity and angle, providing predictive equations.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Solving dense systems of linear equations is essential in applications encountered in physics, mathematics, and engineering. This paper describes our current efforts toward the development of the ADELUS package for current and next-generation distributed, accelerator-based, high-performance computing platforms. The package solves dense linear systems using partial-pivoting LU factorization on distributed-memory systems with CPUs/GPUs. The matrix is block-mapped onto distributed memory on CPUs/GPUs and is solved as if it were torus-wrapped for an optimal balance of computation and communication. A permutation operation is performed to restore the results so the torus-wrap distribution is transparent to the user. This package targets performance portability by leveraging the abstractions provided in the Kokkos and Kokkos Kernels libraries. Comparisons of the performance gains versus the state-of-the-art SLATE and DPLASMA GESV functionalities on the Summit supercomputer are provided. Preliminary performance results from large-scale electromagnetic simulations using ADELUS are also presented. The solver achieves 7.7 Petaflops on 7600 GPUs of the Sierra supercomputer, translating to 16.9% efficiency.
SIAM Journal on Scientific Computing
The purpose of this paper is to study a Helmholtz problem with a spectral fractional Laplacian, instead of the standard Laplacian. Recently, it has been established that such a fractional Helmholtz problem better captures the underlying behavior in geophysical electromagnetics. We establish the well-posedness and regularity of this problem. We introduce a hybrid spectral-finite element approach to discretize it and show well-posedness of the discrete system. In addition, we derive a priori discretization error estimates. Finally, we introduce an efficient solver that scales as well as the best possible solver for the classical integer-order Helmholtz equation. We conclude with several illustrative examples that confirm our theoretical findings.
Proceedings of the International Conference on Mathematics and Computational Methods Applied to Nuclear Science and Engineering, M and C 2021
Conditional Point Sampling (CoPS) is a recently developed stochastic media transport algorithm that has demonstrated a high degree of accuracy in 1-D and 3-D calculations for binary mixtures with Markovian mixing statistics. In theory, CoPS has the capacity to be accurate for material structures beyond just those with Markovian statistics. However, realizing this capability will require development of conditional probability functions (CPFs) that are based, not on explicit Markovian properties, but rather on latent properties extracted from material structures. Here, we describe a first step towards extracting these properties by developing CPFs using deep neural networks (DNNs). Our new approach lays the groundwork for enabling accurate transport on many classes of stochastic media. We train DNNs on ternary stochastic media with Markovian mixing statistics and compare their CPF predictions to those made by existing CoPS CPFs, which are derived based on Markovian mixing properties. We find that the DNN CPF predictions usually outperform the existing approximate CPF predictions, but with wider variance. In addition, even when trained on only one material volume realization, the DNN CPFs are shown to make accurate predictions on other realizations that have the same internal mixing behavior. We show that it is possible to form a useful CoPS CPF by using a DNN to extract correlation properties from realizations of stochastically mixed media, thus establishing a foundation for creating CPFs for mixtures other than those with Markovian mixing, where it may not be possible to derive an accurate analytical CPF.
CEUR Workshop Proceedings
The data-driven discrete exterior calculus (DDEC) structure provides a novel machine learning architecture for discovering structure-preserving models which govern data, allowing, for example, machine learning of reduced-order models for complex continuum-scale physical systems. In this work, we present a Greedy Fiedler Spectral (GFS) partitioning method to obtain a chain complex structure to support DDEC models, incorporating synthetic data obtained from high-fidelity solutions to partial differential equations. We provide justification for the effectiveness of the resulting chain complex and demonstrate a DDEC model built on it and trained for Darcy flow on a heterogeneous domain.
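The spectral ingredient behind a Fiedler-based partitioner is bisection by the sign of the Laplacian's second eigenvector; a basic sketch on a small connectivity graph follows (the greedy refinement step of GFS is omitted, and the graph is a toy example).

```python
# Basic Fiedler-vector bisection of a connectivity graph (toy example).
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (3, 4), (4, 5), (5, 6), (6, 4)]
n = 7
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
L = np.diag(A.sum(axis=1)) - A                 # graph Laplacian

evals, evecs = np.linalg.eigh(L)
fiedler = evecs[:, 1]                          # eigenvector of the 2nd-smallest eigenvalue
part = fiedler >= 0.0                          # sign split gives the two parts
print(part.astype(int))
```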
International Series in Operations Research and Management Science
A key strategy for protecting municipal water supplies is the use of sensors to detect the presence of contaminants in associated water distribution systems. Deploying a contamination warning system involves the placement of a limited number of sensors—placed in order to maximize the level of protection afforded. Researchers have proposed several models and algorithms for generating such placements, each optimizing with respect to a different design objective. The use of disparate design objectives raises several questions: (1) What is the relationship between optimal sensor placements for different design objectives? and (2) Is there any risk in focusing on specific design objectives? We model the sensor placement problem via a mixed-integer programming formulation of the well-known p-median problem from facility location theory to answer these questions. Our model can express a broad range of design objectives. Using three large test networks, we show that optimal solutions with respect to one design objective are often highly sub-optimal with respect to other design objectives. However, it is sometimes possible to construct solutions that are simultaneously near-optimal with respect to a range of design objectives. The design of contamination warning systems thus requires careful and simultaneous consideration of multiple, disparate design objectives.
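For reference, the classical p-median formulation that the sensor-placement model builds on can be written as follows, where i indexes contamination incidents, j indexes candidate sensor locations, d_{ij} is the impact of incident i if first detected at location j, and p is the sensor budget; the paper's specific impact measures and side constraints extend this basic form:

\[
\min \sum_{i}\sum_{j} d_{ij}\, x_{ij}
\quad\text{s.t.}\quad
\sum_{j} x_{ij} = 1 \ \forall i,\qquad
x_{ij} \le y_j \ \forall i,j,\qquad
\sum_{j} y_j = p,\qquad
x_{ij},\, y_j \in \{0,1\}.
\]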
2021 Silicon Nanoelectronics Workshop, SNW 2021
We propose a vertical TFET using atomic precision advanced manufacturing (APAM) to create an abrupt buried n++-doped source. We developed a gate stack that preserves the APAM source to accumulate holes above it, with a goal of band-to-band tunneling (BTBT) perpendicular to the gate – critical for the proposed device. A metal-insulator-semiconductor (MIS) capacitor shows hole accumulation above the APAM source, corroborated by simulation, demonstrating the TFET’s feasibility.
CSPlib is an open-source software library for analyzing general ordinary differential equation (ODE) systems and detailed chemical kinetic ODE systems. It relies on the computational singular perturbation (CSP) method for the analysis of these systems. The software provides support for: general ODE models (gODE model class) for computing source terms and Jacobians for a generic ODE system; a TChem model (ChemElemODETChem model class) for computing the source term, Jacobian, other necessary chemical reaction data, as well as the rates of progress for a homogeneous batch reactor using an elementary-step detailed chemical kinetic reaction mechanism, relying on the TChem [2] library; a set of functions to compute essential elements of CSP analysis (Kernel class), including computation of the eigensolution of the Jacobian matrix, CSP basis vectors and co-vectors, time scales (reciprocals of the magnitudes of the Jacobian eigenvalues), mode amplitudes, CSP pointers, and the number of exhausted modes, relying on the Tines library; a set of functions to compute the eigensolution of the Jacobian matrix using the Tines library GPU eigensolver; and a set of functions to compute CSP indices (Index class), including participation indices and both slow and fast importance indices.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
A source-to-source (S2S) compiler is a type of translator that accepts the source code of a program written in a programming language as its input and produces equivalent source code in the same or a different programming language. S2S techniques are commonly used to enable fluent translation between high-level programming languages, to perform large-scale refactoring operations, and to facilitate instrumentation for dynamic analysis. Negative perceptions about S2S's applicability in High Performance Computing (HPC) are studied and evaluated here. This is a first study that brings to light reasons why scientists do not use source-to-source techniques for HPC. The primary audience for this paper is those considering S2S technology in their HPC application work.
Proceedings of the 29th International Meshing Roundtable, IMR 2021
Interval Assignment (IA) is the problem of selecting the number of mesh edges (intervals) for each curve for conforming quad and hex meshing. The vector of intervals x is fundamentally integer-valued, yet many approaches perform floating-point optimization and convert a floating-point solution into an integer solution. We avoid such steps: we start integer, stay integer. Incremental Interval Assignment (IIA) uses integer linear algebra (Hermite normal form) to find an initial solution to the matrix equation Ax = b satisfying the meshing constraints. Solving for reduced row echelon form provides integer vectors spanning the nullspace of A. We add vectors from the nullspace to improve the initial solution. Compared to floating-point optimization approaches, IIA is faster and always produces an integer solution. The potential drawback is that there is no theoretical guarantee that the solution is optimal, but in practice we achieve solutions close to the user goals. The software is freely available.
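A sketch of the "start integer, stay integer" idea using SymPy: given an integer particular solution of Ax = b, integer vectors spanning the nullspace of A can be added to move toward the user's interval goals without ever leaving the integers. The toy constraint matrix is an assumption for illustration, and the Hermite-normal-form machinery of the actual IIA solver is not reproduced.

```python
# Integer nullspace moves on a toy interval-assignment system (illustrative only).
import math
from sympy import Matrix

A = Matrix([[1, -1, 0, 0],
            [0, 1, -1, 0],
            [0, 0, 1, -1]])   # toy constraints: adjacent curves must get equal intervals
b = Matrix([0, 0, 0])

x0 = Matrix([4, 4, 4, 4])     # an integer particular solution (A*x0 == b)
assert A * x0 == b

# Rational nullspace basis from row reduction, scaled to integer vectors.
null_int = []
for v in A.nullspace():
    scale = math.lcm(*[int(entry.q) for entry in v])   # clear denominators
    null_int.append(v * scale)

x = x0 + 2 * null_int[0]      # still integer and still satisfies the constraints
print(list(x), A * x == b)
```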
Proceedings - IEEE International Conference on Cluster Computing, ICCC
Persistent memory (PMEM) devices can achieve comparable performance to DRAM while providing significantly more capacity. This has made the technology compelling as an expansion to main memory. Rethinking PMEM as storage devices can offer a high performance buffering layer for HPC applications to temporarily, but safely store data. However, modern parallel I/O libraries, such as HDF5 and pNetCDF, are complicated and introduce significant software and metadata overheads when persisting data to these storage devices, wasting much of their potential. In this work, we explore the potential of PMEM as storage through pMEMCPY: a simple, lightweight, and portable I/O library for storing data in persistent memory. We demonstrate that our approach is up to 2x faster than other popular parallel I/O libraries under real workloads.
Journal of Parallel and Distributed Computing
Rendezvous algorithms encode a communication pattern that is useful when processors sending data do not know who the receiving processors should be, or vice versa. The idea is to define an intermediate decomposition where datums from different sending processors can "rendezvous" to perform a computation, in a manner that both the senders and eventual receivers of the results can identify the appropriate rendezvous processor. Originally designed for interpolating between overlaid grids with independent parallel decompositions (Plimpton et al., 2004), we have recently found rendezvous algorithms useful for a variety of operations in particle- or grid-based simulation codes when running large problems on large numbers of processors. In particular, we show they can perform well when a load-balanced intermediate decomposition is randomized and not spatial, requiring all-to-all communication to move data between processors. In this case rendezvous algorithms leverage the large bisection communication bandwidths which parallel machines provide. We describe how rendezvous algorithms work in a scientific computing context and give specific examples for molecular dynamics and Direct Simulation Monte Carlo codes which result in dramatic performance improvements versus simpler algorithms which do not scale as well. We explain how a generic rendezvous algorithm can be implemented, and also point out similarities with the MapReduce paradigm popularized by Google and Hadoop.
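A minimal rendezvous pattern with mpi4py, illustrative rather than the paper's code: each rank hashes a datum's key to a rendezvous rank and an all-to-all exchange moves the data, so neither senders nor eventual receivers need to know about each other directly.

```python
# Minimal rendezvous pattern: hash keys to an intermediate rank, exchange all-to-all.
# Run with e.g.:  mpirun -np 4 python rendezvous_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank owns some (key, value) datums; keys are global IDs.
my_data = [(rank * 10 + i, f"val-from-{rank}") for i in range(3)]

# Route each datum to its rendezvous rank by hashing the key.
outgoing = [[] for _ in range(size)]
for key, value in my_data:
    outgoing[key % size].append((key, value))

# All-to-all exchange: every rank receives the datums that rendezvous on it.
incoming = comm.alltoall(outgoing)
rendezvous_items = [item for bucket in incoming for item in bucket]
print(f"rank {rank} computes on {len(rendezvous_items)} rendezvous datums")
```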
Proceedings of the International Conference on Mathematics and Computational Methods Applied to Nuclear Science and Engineering, M and C 2021
The accurate construction of a surrogate model is an effective and efficient strategy for performing Uncertainty Quantification (UQ) analyses of expensive and complex engineering systems. Surrogate models are especially powerful whenever the UQ analysis requires the computation of statistics which are difficult and prohibitively expensive to obtain via a direct sampling of the model, e.g. high-order moments and probability density functions. In this paper, we discuss the construction of a polynomial chaos expansion (PCE) surrogate model for radiation transport problems for which quantities of interest are obtained via Monte Carlo simulations. In this context, it is imperative to account for the statistical variability of the simulator as well as the variability associated with the uncertain parameter inputs. More formally, in this paper we focus on understanding the impact of the Monte Carlo transport variability on the recovery of the PCE coefficients. We are able to identify the contribution of both the number of uncertain parameter samples and the number of particle histories simulated per sample in the PCE coefficient recovery. Our theoretical results indicate an accuracy improvement when using few Monte Carlo histories per random sample with respect to configurations with an equivalent computational cost. These theoretical results are numerically illustrated for a simple synthetic example and two configurations of a one-dimensional radiation transport problem in which a slab is represented by means of materials with uncertain cross sections.
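A sketch of PCE coefficient recovery by least squares from noisy Monte Carlo QoI evaluations: each "simulation" returns the quantity of interest plus noise whose variance shrinks with the number of particle histories per sample. The response function, sample counts, and Legendre basis below are purely illustrative.

```python
# PCE coefficient recovery by least squares from noisy QoI samples (illustrative values).
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(0)
n_samples, n_histories, degree = 40, 1000, 4

xi = rng.uniform(-1.0, 1.0, size=n_samples)            # uncertain parameter samples
qoi_exact = 1.0 + 0.5 * xi + 0.25 * xi**2               # "true" response (stand-in)
qoi_mc = qoi_exact + rng.normal(scale=1.0 / np.sqrt(n_histories), size=n_samples)

# Vandermonde-type matrix of Legendre polynomials P_0..P_degree at the samples
V = legendre.legvander(xi, degree)
coeffs, *_ = np.linalg.lstsq(V, qoi_mc, rcond=None)
print(coeffs)      # noisy estimates of the PCE coefficients
```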
35th AAAI Conference on Artificial Intelligence, AAAI 2021
This work proposes an approach for latent-dynamics learning that exactly enforces physical conservation laws. The method comprises two steps. First, the method computes a low-dimensional embedding of the high-dimensional dynamical-system state using deep convolutional autoencoders. This defines a low-dimensional nonlinear manifold on which the state is subsequently enforced to evolve. Second, the method defines a latent-dynamics model that associates with the solution to a constrained optimization problem. Here, the objective function is defined as the sum of squares of conservation-law violations over control volumes within a finite-volume discretization of the problem; nonlinear equality constraints explicitly enforce conservation over prescribed subdomains of the problem. Under modest conditions, the resulting dynamics model guarantees that the time-evolution of the latent state exactly satisfies conservation laws over the prescribed subdomains.
2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
We present StressBench, a network benchmarking framework written for testing MPI operations and file I/O concurrently. It is designed specifically to execute MPI communication and file access patterns that are representative of real-world scientific applications. Existing tools consider either the worst case congestion with small abstract patterns or peak performance with simplistic patterns. StressBench allows for a richer study of congestion by allowing orchestration of network load scenarios that are representative of those typically seen at HPC centres, something that is difficult to achieve with existing tools. We demonstrate the versatility of the framework from micro benchmarks through to finely controlled congested runs across a cluster. Validation of the results using four proxy application communication schemes within StressBench against parent applications shows a maximum difference of 15%. Using the I/O modeling capabilities of StressBench, we are able to quantify the impact of file I/O on application traffic showing how it can be used in procurement and performance studies.
2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
Both the data science and scientific computing communities are embracing GPU acceleration for their most demanding workloads. For scientific computing applications, the massive volume of code and diversity of hardware platforms at supercomputing centers has motivated a strong effort toward performance portability. This property of a program, denoting its ability to perform well on multiple architectures and varied datasets, is heavily dependent on the choice of parallel programming model and which features of the programming model are used. In this paper, we evaluate performance portability in the context of a data science workload in contrast to a scientific computing workload, evaluating the same sparse matrix kernel on both. Among our implementations of the kernel in different performance-portable programming models, we find that many struggle to consistently achieve performance improvements using the GPU compared to simple one-line OpenMP parallelization on high-end multicore CPUs. We show one that does, and its performance approaches and sometimes even matches that of vendor-provided GPU math libraries.
Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these “domain-specific” accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. There are several challenges when designing new algorithms and mapping approaches to execute the algorithms for a target problem on new hardware. Previous works have addressed these challenges individually. To address these challenges as a whole, in this work we present a HW-SW codesign ecosystem for spatial accelerators called Union within the popular MLIR compiler infrastructure. Our framework allows exploring different algorithms and their mappings on several accelerator cost models. Union also includes a plug-and-play library of accelerator cost models and mappers which can easily be extended. The algorithms and accelerator cost models are connected via a novel mapping abstraction that captures the map space of spatial accelerators which can be systematically pruned based on constraints from the hardware, workload, and mapper. We demonstrate the value of Union for the community with several case studies which examine offloading different tensor operations (CONV/GEMM/Tensor Contraction) on diverse accelerator architectures using different mapping schemes.
JOM
Determining process–structure–property linkages is one of the key objectives in material science, and uncertainty quantification plays a critical role in understanding both process–structure and structure–property linkages. In this work, we seek to learn a distribution of microstructure parameters that are consistent in the sense that the forward propagation of this distribution through a crystal plasticity finite element model matches a target distribution on materials properties. This stochastic inversion formulation infers a distribution of acceptable/consistent microstructures, as opposed to a deterministic solution, which expands the range of feasible designs in a probabilistic manner. To solve this stochastic inverse problem, we employ a recently developed uncertainty quantification framework based on push-forward probability measures, which combines techniques from measure theory and Bayes’ rule to define a unique and numerically stable solution. This approach requires making an initial prediction using an initial guess for the distribution on model inputs and solving a stochastic forward problem. To reduce the computational burden in solving both stochastic forward and stochastic inverse problems, we combine this approach with a machine learning Bayesian regression model based on Gaussian processes and demonstrate the proposed methodology on two representative case studies in structure–property linkages.
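A minimal sketch of the push-forward-based update, with a scalar toy forward map standing in for the crystal plasticity model and all distributions assumed for illustration only:

```python
# Data-consistent stochastic inversion via push-forward measures: accept initial
# samples with probability proportional to pi_obs(Q(lambda)) / pi_pf(Q(lambda)).
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(1)

def Q(lam):                                    # stand-in for the forward model (structure -> property)
    return lam**2 + 0.1 * lam

lam_init = rng.uniform(-1.0, 2.0, 20000)       # initial guess on the microstructure parameter
q_init = Q(lam_init)

pi_pf = gaussian_kde(q_init)                   # push-forward of the initial density
pi_obs = norm(loc=1.0, scale=0.2)              # target (observed) property distribution

ratio = pi_obs.pdf(q_init) / pi_pf(q_init)
accept = rng.uniform(size=lam_init.size) < ratio / ratio.max()
lam_consistent = lam_init[accept]              # samples from the consistent distribution
print(lam_consistent.size, "consistent samples")
```

In the paper's setting the expensive forward model would be replaced by the Gaussian-process surrogate, which is what makes the repeated evaluations in this rejection step affordable.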
Minerals, Metals and Materials Series
Microstructure reconstruction is a long-standing problem in experimental and computational materials science that numerous approaches have attempted to solve. However, most approaches treat the microstructure as discrete phases, which reduces the quality of the resulting microstructures and limits their usage to the computational, rather than the experimental, level of fidelity. In this work, we applied our previously proposed approach [41] to generate synthetic microstructure images at the experimental level of fidelity for the UltraHigh Carbon Steel DataBase (UHCSDB) [13].
Proceedings of the 2021 International Topical Meeting on Probabilistic Safety Assessment and Analysis, PSA 2021
The prevalence effect is the observation that, in visual search tasks, as the signal (target) to noise (non-target) ratio becomes smaller, humans are more likely to miss the target when it does occur. Studied extensively in the basic literature [e.g., 1, 2], this effect has implications for real-world settings such as security guards monitoring physical facilities for attacks. Importantly, what seems to drive the effect is the development of a response bias based on learned sensitivity to the statistical likelihood of a target [e.g., 3-5]. This paper presents results from two experiments aimed at understanding how target prevalence impacts the ability of individuals to detect a target on the 1000th trial of a series of 1000 trials. The first experiment employed the traditional prevalence effect paradigm. This paradigm involves search for a perfect capital letter T amidst imperfect Ts. In a between-subjects design, our subjects experienced target prevalence rates of 50/50, 1/10, 1/100, or 1/1000. In all conditions, the final trial was always a target. The second (ongoing) experiment replicates this design using a notional physical facility in a mod/sim environment. This simulation enables triggering different intrusion detection sensors by simulated characters and events (e.g., people, animals, weather). In this experiment, subjects viewed 1000 “alarm” events and were asked to characterize each as either a nuisance alarm (e.g., set off by an animal) or an attack. As with the basic visual search study, the final trial was always an attack.
CEUR Workshop Proceedings
We show that machine learning can improve the accuracy of simulations of stress waves in one-dimensional composite materials. We propose a data-driven technique to learn nonlocal constitutive laws for stress wave propagation models. The method is an optimization-based technique in which the nonlocal kernel function is approximated via Bernstein polynomials. The kernel, including both its functional form and parameters, is derived so that when used in a nonlocal solver, it generates solutions that closely match high-fidelity data. The optimal kernel therefore acts as a homogenized nonlocal continuum model that accurately reproduces wave motion in a smaller-scale, more detailed model that can include multiple materials. We apply this technique to wave propagation within a heterogeneous bar with a periodic microstructure. Several one-dimensional numerical tests illustrate the accuracy of our algorithm. The optimal kernel is demonstrated to reproduce high-fidelity data for a composite material in applications that are substantially different from the problems used as training data.
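To make the parameterization concrete, the following hedged sketch (the grid, coefficients, and Riemann-sum quadrature are illustrative choices, not the paper's discretization) evaluates a one-dimensional nonlocal operator whose kernel is a Bernstein expansion on [0, δ]:

```python
# Nonlocal operator with a Bernstein-polynomial kernel; in the learning problem
# the coefficients `coeffs` would be optimized to match high-fidelity wave data.
import numpy as np
from scipy.special import comb

def bernstein(k, n, t):
    """Bernstein basis polynomial B_{k,n}(t) on [0, 1]."""
    return comb(n, k) * t**k * (1.0 - t)**(n - k)

def kernel(r, coeffs, delta):
    """Kernel K(r) = sum_k c_k B_{k,n}(r / delta) for 0 <= r <= delta."""
    t = r / delta
    n = len(coeffs) - 1
    return sum(c * bernstein(k, n, t) for k, c in enumerate(coeffs))

def nonlocal_operator(u, x, coeffs, delta):
    """Riemann-sum quadrature of  L u(x_i) = sum_j K(|x_i - x_j|) (u_j - u_i) dx."""
    dx = x[1] - x[0]
    Lu = np.zeros_like(u)
    for i, xi in enumerate(x):
        mask = (np.abs(x - xi) <= delta) & (np.abs(x - xi) > 0)
        r = np.abs(x[mask] - xi)
        Lu[i] = np.sum(kernel(r, coeffs, delta) * (u[mask] - u[i])) * dx
    return Lu

x = np.linspace(0.0, 1.0, 201)
u = np.sin(2.0 * np.pi * x)
Lu = nonlocal_operator(u, x, coeffs=[1.0, 0.5, 0.25], delta=0.05)   # example coefficients
```

Because the kernel is linear in the Bernstein coefficients, fitting them to high-fidelity solutions reduces to a well-structured optimization problem, which is part of what makes this parameterization attractive.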
Abstract not provided.
Abstract not provided.
SIAM Journal on Control and Optimization
In this paper, we introduce and analyze a new class of optimal control problems constrained by elliptic equations with uncertain fractional exponents. We utilize risk measures to formulate the resulting optimization problem. We develop a functional analytic framework, study the existence of solutions, and rigorously derive the first-order optimality conditions. Additionally, we employ a sample-based approximation for the uncertain exponent and the finite element method to discretize in space. We prove the rate of convergence for the optimal risk-neutral controls when using quadrature approximation for the uncertain exponent and conclude with illustrative examples.
Quantum
Gate set tomography (GST) is a protocol for detailed, predictive characterization of logic operations (gates) on quantum computing processors. Early versions of GST emerged around 2012-13, and since then it has been refined, demonstrated, and used in a large number of experiments. This paper presents the foundations of GST in comprehensive detail. The most important feature of GST, compared to older state and process tomography protocols, is that it is calibration-free. GST does not rely on pre-calibrated state preparations and measurements. Instead, it characterizes all the operations in a gate set simultaneously and self-consistently, relative to each other. Long sequence GST can estimate gates with very high precision and efficiency, achieving Heisenberg scaling in regimes of practical interest. In this paper, we cover GST’s intellectual history, the techniques and experiments used to achieve its intended purpose, data analysis, gauge freedom and fixing, error bars, and the interpretation of gauge-fixed estimates of gate sets. Our focus is fundamental mathematical aspects of GST, rather than implementation details, but we touch on some of the foundational algorithmic tricks used in the pyGSTi implementation.
Abstract not provided.
Abstract not provided.
2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
In this work, we show that reduced-communication algorithms for distributed stochastic gradient descent improve the time per epoch and strong scaling for the Generalized Canonical Polyadic (GCP) tensor decomposition, but at a cost: achieving convergence becomes more difficult. The implementation, based on MPI, shows that while one-sided algorithms offer a path to asynchronous execution, the performance benefits of optimized allreduce are difficult to surpass.
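For context, the synchronous baseline that the reduced-communication variants are compared against looks roughly like the following sketch (the local gradient is a placeholder, not the actual GCP stochastic gradient):

```python
# Schematic synchronous distributed SGD: one blocking allreduce per step.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def local_gradient(theta):
    # placeholder for the GCP stochastic gradient computed from local tensor entries
    return np.sin(theta)

theta = np.zeros(10)
for step in range(100):
    g_local = local_gradient(theta)
    g_global = np.empty_like(g_local)
    comm.Allreduce(g_local, g_global, op=MPI.SUM)   # communication-heavy collective
    theta -= 0.01 * (g_global / comm.Get_size())
```

The reduced-communication variants replace or relax this collective (e.g., via one-sided, asynchronous updates), which is where the convergence difficulties described above originate.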
Abstract not provided.
Numerical simulations of Greenland and Antarctic ice sheets involve the solution of large-scale highly nonlinear systems of equations on complex shallow geometries. This work is concerned with the construction of Schwarz preconditioners for the solution of the associated tangent problems, which are challenging for solvers mainly because of the strong anisotropy of the meshes and wildly changing boundary conditions that can lead to poorly constrained problems on large portions of the domain. Here, two-level GDSW (Generalized Dryja–Smith–Widlund) type Schwarz preconditioners are applied to different land ice problems, i.e., a velocity problem, a temperature problem, as well as the coupling of the former two problems. We employ the MPI-parallel implementation of multi-level Schwarz preconditioners provided by the package FROSch (Fast and Robust Schwarz) from the Trilinos library. The strength of the proposed preconditioner is that it yields out-of-the-box scalable and robust preconditioners for the single-physics problems. To our knowledge, this is the first time that two-level Schwarz preconditioners have been applied to the ice sheet problem and that a scalable preconditioner has been used for the coupled problem. The preconditioner for the coupled problem differs from previous monolithic GDSW preconditioners in the sense that decoupled extension operators are used to compute the values in the interior of the subdomains. Several approaches for improving the performance, such as reuse strategies and shared-memory OpenMP parallelization, are explored as well. Our numerical study considers both uniform meshes of varying resolution for the Antarctic ice sheet and nonuniform meshes for the Greenland ice sheet. We present several weak and strong scaling studies confirming the robustness of the approach and the parallel scalability of the FROSch implementation. Among the highlights of the numerical results are a weak scaling study for up to 32K processor cores (8K MPI ranks and 4 OpenMP threads) and 566M degrees of freedom for the velocity problem, as well as a strong scaling study for up to 4K processor cores (and MPI ranks) and 68M degrees of freedom for the coupled problem.
International Journal for Uncertainty Quantification
Network modeling is a powerful tool to enable rapid analysis of complex systems that can be challenging to study directly using physical testing. Two approaches are considered: emulation and simulation. The former runs real software on virtualized hardware, while the latter mimics the behavior of network components and their interactions in software. Although emulation provides an accurate representation of physical networks, this approach alone cannot guarantee the characterization of the system under realistic operative conditions. Operative conditions for physical networks are often characterized by intrinsic variability (payload size, packet latency, etc.) or a lack of precise knowledge regarding the network configuration (bandwidth, delays, etc.); therefore uncertainty quantification (UQ) strategies should also be employed. UQ strategies require multiple evaluations of the system, with a number of evaluation instances that roughly increases with the problem dimensionality, i.e., the number of uncertain parameters. It follows that a typical UQ workflow for network modeling based on emulation can easily become unattainable due to its prohibitive computational cost. In this paper, a multifidelity sampling approach is discussed and applied to network modeling problems. The main idea is to optimally fuse information coming from simulations, which are a low-fidelity version of the emulation problem of interest, in order to decrease the estimator variance. By reducing the estimator variance in a sampling approach it is usually possible to obtain more reliable statistics and therefore a more reliable system characterization. Several network problems of increasing difficulty are presented. For each of them, the performance of the multifidelity estimator is compared with that of its single-fidelity counterpart, namely Monte Carlo sampling. For all the test problems studied in this work, the multifidelity estimator demonstrated increased efficiency with respect to Monte Carlo.
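For reference, a standard two-fidelity control-variate estimator of the kind used in multifidelity sampling (stated in generic notation; the paper's estimator may differ in details) has the form

\[
\hat{Q}^{\mathrm{MF}} \;=\; \frac{1}{N}\sum_{i=1}^{N} Q_{\mathrm{HF}}(\omega_i)
\;+\; \alpha \left( \frac{1}{M}\sum_{j=1}^{M} Q_{\mathrm{LF}}(\omega_j) \;-\; \frac{1}{N}\sum_{i=1}^{N} Q_{\mathrm{LF}}(\omega_i) \right),
\qquad M \gg N,
\]

where the expensive high-fidelity (emulation) model is evaluated only N times, the cheap low-fidelity (simulation) model is evaluated M times, and the weight α is chosen from the estimated correlation between the two fidelities to minimize the estimator variance.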
Abstract not provided.
Abstract not provided.
SIAM Journal on Numerical Analysis
We consider the integral definition of the fractional Laplacian and analyze a linear-quadratic optimal control problem for the so-called fractional heat equation; control constraints are also considered. We derive existence and uniqueness results, first-order optimality conditions, and regularity estimates for the optimal variables. To discretize the state equation we propose a fully discrete scheme that relies on an implicit finite difference discretization in time combined with a piecewise linear finite element discretization in space. We derive stability results and a novel L²(0,T;L²(Ω)) a priori error estimate. On the basis of the aforementioned solution technique, we propose a fully discrete scheme for our optimal control problem that discretizes the control variable with piecewise constant functions, and we derive a priori error estimates for it. We illustrate the theory with one- and two-dimensional numerical experiments.
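For concreteness, the integral definition of the fractional Laplacian referred to above is

\[
(-\Delta)^s u(x) \;=\; C(n,s)\, \mathrm{p.v.}\!\int_{\mathbb{R}^n} \frac{u(x)-u(y)}{|x-y|^{n+2s}}\, dy, \qquad s \in (0,1),
\]

where C(n,s) is a normalization constant and p.v. denotes the Cauchy principal value; the nonlocality of this operator is what makes the discretization and the error analysis delicate.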
ESAIM: Mathematical Modelling and Numerical Analysis
In this paper we consider 2D nonlocal diffusion models with a finite nonlocal horizon parameter δ characterizing the range of nonlocal interactions, and consider the treatment of Neumann-like boundary conditions that have proven challenging for discretizations of nonlocal models. We propose a new generalization of classical local Neumann conditions by converting the local flux to a correction term in the nonlocal model, which provides an estimate for the nonlocal interactions of each point with points outside the domain. While existing 2D nonlocal flux boundary conditions have been shown to exhibit at most first-order convergence to the local counterpart as δ → 0, the proposed Neumann-type boundary formulation recovers the local case as O(δ²) in the L∞(Ω) norm, which is optimal considering the O(δ²) convergence of the nonlocal equation to its local limit away from the boundary. We analyze the application of this new boundary treatment to the nonlocal diffusion problem, and present conditions under which the solution of the nonlocal boundary value problem converges to the solution of the corresponding local Neumann problem as the horizon is reduced. To demonstrate the applicability of this nonlocal flux boundary condition to more complicated scenarios, we extend the approach to less regular domains, numerically verifying that we preserve second-order convergence for non-convex domains with corners. Based on the new formulation for the nonlocal boundary condition, we develop an asymptotically compatible meshfree discretization, obtaining a solution to the nonlocal diffusion equation with mixed boundary conditions that converges at a rate of O(δ²).
Abstract not provided.
Computer Aided Chemical Engineering
In power grid operation, optimal power flow (OPF) problems are solved several times per day to find economically optimal generator setpoints that balance given load demands. Ideally, we seek an optimal solution that is also “N-1 secure”, meaning the system can absorb contingency events such as transmission line or generator failure without loss of service. Current practice is to solve the OPF problem and then check a subset of contingencies against heuristic values, resulting in, at best, suboptimal solutions. Unfortunately, online solution of the OPF problem including the full N-1 contingencies (i.e., a two-stage stochastic programming formulation) is intractable for even modest-sized electrical grids. To address this challenge, this work presents an efficient method to embed N-1 security constraints into the solution of the OPF by using Neural Network (NN) models to represent the security boundary. Our approach introduces a novel sampling technique, as well as a tuneable parameter that allows operators to balance the conservativeness of the security model within the OPF problem. Our results show that we are able to solve contingency formulations, to local optimality, for larger grids than reported in the literature using nonlinear programming (NLP) formulations with embedded NN models. Solutions found with the NN constraint require marginally more computational time but are more secure against contingency events.
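The flavor of the formulation can be sketched as follows; everything here (the two-generator system, costs, demand, and the "trained" network weights) is hypothetical, and a generic NLP solver stands in for the authors' toolchain:

```python
# Toy economic dispatch with a smooth NN surrogate for the N-1 security boundary.
import numpy as np
from scipy.optimize import minimize

cost = np.array([20.0, 35.0])          # generator costs in $/MWh (illustrative)
demand = 150.0                         # total load in MW (illustrative)

# pretend these weights came from training on labelled secure/insecure dispatches
W1 = np.array([[0.05, -0.01], [-0.02, 0.03]])
b1 = np.array([-3.0, 0.5])
w2, b2 = np.array([-2.0, 1.0]), 0.5

def security_margin(p):
    """NN surrogate of the security boundary; >= tau is treated as 'N-1 secure'."""
    return w2 @ np.tanh(W1 @ p + b1) + b2

tau = 0.0                              # tuneable conservativeness parameter
res = minimize(lambda p: cost @ p, x0=np.array([75.0, 75.0]),
               bounds=[(0.0, 120.0), (0.0, 120.0)],
               constraints=[{"type": "eq", "fun": lambda p: p.sum() - demand},
                            {"type": "ineq", "fun": lambda p: security_margin(p) - tau}])
print(res.x, res.fun)
```

With these illustrative weights the security constraint is active at the optimum, so the dispatch backs off the cheapest generator relative to the unconstrained solution, which is exactly the behavior the embedded NN constraint is meant to produce.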
Proceedings - 2021 International Conference on Rebooting Computing, ICRC 2021
Boolean functions and binary arithmetic operations are central to standard computing paradigms. Accordingly, many advances in computing have focused upon how to make these operations more efficient as well as exploring what they can compute. To best leverage the advantages of novel computing paradigms it is important to consider what unique computing approaches they offer. However, for any special-purpose co-processor, Boolean functions and binary arithmetic operations are useful for, among other things, avoiding unnecessary I/O on-and-off the co-processor by pre- and post-processing data on-device. This is especially true for spiking neuromorphic architectures where these basic operations are not fundamental low-level operations. Instead, these functions require specific implementation. Here we discuss the implications of an advantageous streaming binary encoding method as well as a handful of circuits designed to exactly compute elementary Boolean and binary operations.
Abstract not provided.
This report presents the results of a collaborative effort under the Verification, Validation, and Uncertainty Quantification (VVUQ) thrust area of the North American Energy Resilience Model (NAERM) program. The goal of the effort described in this report was to integrate the Dakota software with the NAERM software framework to demonstrate sensitivity analysis of a co-simulation for NAERM.
SIAM Journal on Optimization
This paper develops a novel limited-memory method to solve dynamic optimization problems. The memory requirements for such problems often present a major obstacle, particularly for problems with PDE constraints such as optimal flow control, full waveform inversion, and optical tomography. In these problems, PDE constraints uniquely determine the state of a physical system for a given control; the goal is to find the value of the control that minimizes an objective. While the control is often low dimensional, the state is typically more expensive to store. This paper suggests using randomized matrix approximation to compress the state as it is generated and shows how to use the compressed state to reliably solve the original dynamic optimization problem. Concretely, the compressed state is used to compute approximate gradients and to apply the Hessian to vectors. The approximation error in these quantities is controlled by the target rank of the sketch. This approximate first- and second-order information can readily be used in any optimization algorithm. As an example, we develop a sketched trust-region method that adaptively chooses the target rank using a posteriori error information and provably converges to a stationary point of the original problem. Numerical experiments with the sketched trust-region method show promising performance on challenging problems such as the optimal control of an advection-reaction-diffusion equation and the optimal control of fluid flow past a cylinder.
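The core compression idea can be sketched with a streaming, two-sided randomized sketch (the shapes, the toy forward step, and the target rank below are assumptions for illustration, not the paper's configuration):

```python
# Streaming low-rank compression of a state trajectory: build sketches as the
# states are generated, discard the states, and reconstruct approximations later.
import numpy as np

rng = np.random.default_rng(2)
n_state, n_steps, rank = 2000, 200, 20
s = 2 * rank + 1                                   # size of the second (co-range) sketch

Omega = rng.standard_normal((n_steps, rank))       # right test matrix
Psi = rng.standard_normal((s, n_state))            # left test matrix
Y = np.zeros((n_state, rank))                      # range sketch,    Y = X @ Omega
W = np.zeros((s, n_steps))                         # co-range sketch, W = Psi @ X
X = np.zeros((n_state, n_steps))                   # full trajectory, kept only to measure the error

# toy forward model whose trajectory lives on a hidden 15-dimensional subspace
U, _ = np.linalg.qr(rng.standard_normal((n_state, 15)))
z = rng.standard_normal(15)
for k in range(n_steps):
    z = 0.95 * z + 0.05 * rng.standard_normal(15)  # stand-in time step
    state = U @ z
    Y += np.outer(state, Omega[k])                 # streaming updates: no state history stored
    W[:, k] = Psi @ state
    X[:, k] = state

Q, _ = np.linalg.qr(Y)                             # orthonormal basis for the sketched range
Z, *_ = np.linalg.lstsq(Psi @ Q, W, rcond=None)    # X ≈ Q (Psi Q)^+ W
print(np.linalg.norm(X - Q @ Z) / np.linalg.norm(X))
```

In the optimization setting, the reconstructed states feed the approximate gradient and Hessian-vector products, and the target rank plays the role of the accuracy knob that the sketched trust-region method adapts.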
SIAM Journal on Scientific Computing
Polynomial preconditioning can improve the convergence of the Arnoldi method for computing eigenvalues. Such preconditioning significantly reduces the cost of orthogonalization; for difficult problems, it can also reduce the number of matrix-vector products. Parallel computations can particularly benefit from the reduction of communication-intensive operations. The GMRES algorithm provides a simple and effective way of generating the preconditioning polynomial. For some problems high-degree polynomials are especially effective, but they can lead to stability problems that must be mitigated. A two-level "double polynomial preconditioning" strategy provides an effective way to generate high-degree preconditioners.
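As a hedged illustration of the idea (a power-basis least-squares construction on a synthetic matrix; a practical implementation would instead use the Arnoldi/GMRES recurrence, precisely because the power basis becomes unstable at high degree):

```python
# Build a degree-d minimum-residual polynomial p so that A p(A) ≈ I, then apply
# p(A) as a polynomial preconditioner using only matrix-vector products.
import numpy as np

rng = np.random.default_rng(3)
n, d = 400, 8
A = np.diag(np.linspace(1.0, 10.0, n)) + 0.01 * rng.standard_normal((n, n))
b = rng.standard_normal(n)

# Krylov-type columns A b, A^2 b, ..., A^(d+1) b
K = np.empty((n, d + 1))
v = b.copy()
for j in range(d + 1):
    v = A @ v
    K[:, j] = v
coeffs, *_ = np.linalg.lstsq(K, b, rcond=None)     # minimizes ||b - A p(A) b||

def apply_poly(A, coeffs, v):
    """Return p(A) v = sum_k coeffs[k] * A^k v via repeated mat-vecs."""
    out = np.zeros_like(v)
    w = v.copy()
    for c in coeffs:
        out += c * w
        w = A @ w
    return out

# sanity check: A p(A) b should be close to b
print(np.linalg.norm(A @ apply_poly(A, coeffs, b) - b) / np.linalg.norm(b))
```

Running Arnoldi on the preconditioned operator A p(A) (or p(A) A) then targets a spectrum clustered near one, which is the mechanism behind the reduced orthogonalization cost described above.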
2019 15th Hypervelocity Impact Symposium, HVIS 2019
In this work we evaluated the effects that equations of state (EOS) and strength models have on shaped-charge jet (SCJ) development using the Sandia National Laboratories multiphysics shock code, ALEGRA. Results were quantified using a Lagrangian tracer particle following liner collapse, passing through the compression zone, and flowing into the jet tip. We found consistent results among several EOS: 3320, 3331, and 3337. The 3325 EOS generated a measurable low-density, hollow region near the jet tip, which appears to be reflected in a lower internal energy of the jet. At this time, we cannot tell, experimentally, if such a hollow region exists. The 3337 EOS is recent, well documented [6], and produces results similar to 3320 [3]. The various strength models produced more noticeable differences. In terms of internal energy and temperature, SGL had the largest values, followed by PTW, ZA, and finally JC and MTS, which were quite similar to each other. We looked at melt conditions in the SGL and JC models using the 3337 EOS. The SGL model reported a liquid region along the jet axis all the way to the tip, seemingly consistent with experiment, while the JC model does not indicate any phase transition. None of the other yield models indicated melt along the jet axis. For all EOS and strength models, we found similar results for the velocity history of the jet tip as measured against experiment using photon Doppler velocimetry.
2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
Poisson Tensor Factorization (PTF) is an important data analysis method for analyzing patterns and relationships in multiway count data. In this work, we consider several algorithms for computing a low-rank PTF of tensors with sparse count data values via maximum likelihood estimation. Such an approach reduces to solving a nonlinear, non-convex optimization problem, which can leverage considerable parallel computation due to the structure of the problem. However, since the maximum likelihood estimator corresponds to the global minimizer of this optimization problem, it is important to consider how effective methods are at both leveraging this inherent parallelism as well as computing a good approximation to the global minimizer. In this work we present comparisons of multiple methods for PTF that illustrate the tradeoffs in computational efficiency and accurately computing the maximum likelihood estimator. We present results using synthetic and real-world data tensors to demonstrate some of the challenges when choosing a method for a given tensor.
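For reference, the maximum-likelihood objective in the Poisson case (the standard form; the paper's scaling or regularization may differ) minimizes the negative log-likelihood

\[
f(\mathcal{M}) \;=\; \sum_{i} \big( m_{i} - x_{i} \log m_{i} \big),
\]

where x_i are the observed counts and m_i are the corresponding entries of the low-rank model \(\mathcal{M}\); the constant \(\log(x_i!)\) terms are dropped because they do not affect the minimizer. The non-convexity of this objective in the factor matrices is what makes the choice of optimization method, and not only its parallel efficiency, important.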
Communications in Computer and Information Science
Several recent workshops conducted by the DOE Advanced Scientific Computing Research program have established the fact that the complexity of developing applications and executing them on high-performance computing (HPC) systems is rising at a rate which will make it nearly impossible to continue to achieve higher levels of performance and scalability. Absent an alternative approach to managing this ever-growing complexity, HPC systems will become increasingly difficult to use. A more holistic approach to designing and developing applications and managing system resources is required. This paper outlines a research strategy for managing this increasing complexity by providing the programming environment, software stack, and hardware capabilities needed for autonomous resource management of HPC systems. Developing portable applications for a variety of HPC systems of varying scale requires a paradigm shift from the current approach, where applications are painstakingly mapped to individual machine resources, to an approach where machine resources are automatically mapped and optimized to applications as they execute. Achieving such automated resource management for HPC systems is a daunting challenge that requires significant sustained investment in exploring new approaches and novel capabilities in software and hardware that span the spectrum from programming systems to device-level mechanisms. This paper provides an overview of the functionality needed to enable autonomous resource management and optimization and describes the components currently being explored at Sandia National Laboratories to help support this capability.
This report describes the high-level accomplishments from the Plasma Science and Engineering Grand Challenge LDRD at Sandia National Laboratories. The Laboratory has a need to demonstrate predictive capabilities to model plasma phenomena in order to rapidly accelerate engineering development in several mission areas. The purpose of this Grand Challenge LDRD was to advance the fundamental models, methods, and algorithms along with supporting electrode science foundation to enable a revolutionary shift towards predictive plasma engineering design principles. This project integrated the SNL knowledge base in computer science, plasma physics, materials science, applied mathematics, and relevant application engineering to establish new cross-laboratory collaborations on these topics. As an initial exemplar, this project focused efforts on improving multi-scale modeling capabilities that are utilized to predict the electrical power delivery on large-scale pulsed power accelerators. Specifically, this LDRD was structured into three primary research thrusts that, when integrated, enable complex simulations of these devices: (1) the exploration of multi-scale models describing the desorption of contaminants from pulsed power electrodes, (2) the development of improved algorithms and code technologies to treat the multi-physics phenomena required to predict device performance, and (3) the creation of a rigorous verification and validation infrastructure to evaluate the codes and models across a range of challenge problems. These components were integrated into initial demonstrations of the largest simulations of multi-level vacuum power flow completed to date, executed on the leading HPC computing machines available in the NNSA complex today. These preliminary studies indicate relevant pulsed power engineering design simulations can now be completed on the order of several days, a significant improvement over pre-LDRD levels of performance.
World Congress in Computational Mechanics and ECCOMAS Congress
Software development for high-performance scientific computing continues to evolve in response to increased parallelism and the advent of on-node accelerators, in particular GPUs. While these hardware advancements have the potential to significantly reduce turnaround times, they also present implementation and design challenges for engineering codes. We investigate the use of two strategies to mitigate these challenges: the Kokkos library for performance portability across disparate architectures, and the DARMA/vt library for asynchronous many-task scheduling. We investigate the application of Kokkos within the NimbleSM finite element code and the LAMÉ constitutive model library. We explore the performance of DARMA/vt applied to NimbleSM contact mechanics algorithms. Software engineering strategies are discussed, followed by performance analyses of relevant solid mechanics simulations which demonstrate the promise of Kokkos and DARMA/vt for accelerated engineering simulators.
Abstract not provided.
35th AAAI Conference on Artificial Intelligence, AAAI 2021
We present a method for learning the dynamics of complex physical processes described by time-dependent nonlinear partial differential equations (PDEs). Our particular interest lies in extrapolating solutions in time beyond the range of the temporal domain used in training. Our choice for a baseline method is the physics-informed neural network (PINN), because this method parameterizes not only the solutions but also the equations that describe the dynamics of physical processes. We demonstrate that PINNs perform poorly on extrapolation tasks in many benchmark problems. To address this, we propose a novel method for better training PINNs and demonstrate that our newly enhanced PINNs can accurately extrapolate solutions in time. Our method shows up to 72% smaller errors than existing methods in terms of the standard L2-norm metric.
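For context, the standard PINN training loss that the enhanced method builds on (stated in generic notation; the weighting and sampling of the terms are where methods typically differ) is

\[
\mathcal{L}(\theta) \;=\; \frac{1}{N_u}\sum_{i=1}^{N_u} \big| u_\theta(x_i, t_i) - u_i \big|^2
\;+\; \frac{1}{N_r}\sum_{j=1}^{N_r} \big| \partial_t u_\theta(x_j, t_j) + \mathcal{N}[u_\theta](x_j, t_j) \big|^2,
\]

where the first term fits the available data and initial/boundary conditions, and the second penalizes the PDE residual at collocation points; extrapolation quality hinges on how well the residual term constrains the network outside the training window.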
AIAA Scitech 2021 Forum
Environmental Barrier Coatings (EBC) protect ceramic matrix composites from exposure to high-temperature moisture present in turbine operation through their dense top coats. However, moisture is able to diffuse and oxidize the Si bond coat to form the Thermally Grown Oxide (TGO), a layer of SiO2 in which the incorporation of O causes swelling and stress. At sufficient TGO-based swelling, the EBC will fail due to increased damage such as delamination. A multiscale simulation framework has been developed to link operating conditions of a high-performance turbine to the failure modes of the EBC. Computational fluid dynamics (CFD) simulations of the E3 turbine were performed and compared to prior literature data to demonstrate the fidelity of the Loci/CHEM software in determining the flow conditions on the turbine blade surface. Boundary-condition data for pressure and heat flux were then determined from the CFD simulations, providing the temperature at the bond coat. Peridynamics was used to model the microscale TGO growth. A swelling model that links moisture concentration to strain at the TGO due to the volume increase from oxidation was demonstrated, coupling moisture transport to localized strain and directly observing TGO growth and the corresponding damage. This framework is generalized and can be adapted to a range of EBC microstructures and operating conditions.
2019 15th Hypervelocity Impact Symposium, HVIS 2019
The shock hydrodynamics code ALEGRA and the optimization and uncertainty quantification toolkit Dakota are used to calibrate and select between three competing steel yield models, taking uncertainties in the system into account. A Bayesian model selection procedure is used to choose between the models in a systematic, automated fashion, within an uncertainty quantification workflow. Time-series penetration data of a long tungsten-alloy rod impacting a hardened steel plate at approximately 1250 m/s, along with their measurement uncertainty, are used to calibrate and select between the models. The procedure finds that between the Johnson–Cook, Steinberg–Guinan–Lund, and Zerilli–Armstrong stress models, Zerilli–Armstrong performs the best.
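The model-selection step ranks the candidate stress models by their posterior probabilities, which in the standard Bayesian formulation (generic notation, not the specific Dakota configuration) are

\[
P(M_k \mid \mathcal{D}) \;=\; \frac{ p(\mathcal{D}\mid M_k)\, P(M_k) }{ \sum_{j} p(\mathcal{D}\mid M_j)\, P(M_j) },
\qquad
p(\mathcal{D}\mid M_k) \;=\; \int p(\mathcal{D}\mid \theta, M_k)\, p(\theta \mid M_k)\, d\theta,
\]

so the evidence integral automatically penalizes models that can only fit the penetration data with finely tuned parameters, which is what allows the workflow to prefer one yield model in a systematic, automated fashion.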
International Journal for Uncertainty Quantification
We propose a learning algorithm for discovering unknown parameterized dynamical systems by using observational data of the state variables. Our method builds upon and extends recent work on discovering unknown dynamical systems, in particular work using deep neural networks (DNNs). We propose a DNN structure, largely based upon the residual network (ResNet), to not only learn the unknown form of the governing equation but also to account for the random effect embedded in the system, which is generated by the random parameters. Once the DNN model is successfully constructed, it is able to produce system predictions over longer time horizons and for arbitrary parameter values. For uncertainty quantification, it allows us to conduct uncertainty analysis by evaluating solution statistics over the parameter space.
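Schematically (in notation assumed here), the ResNet-style model treats one network pass as one step of the unknown flow map, conditioned on the random parameters α:

\[
\mathbf{x}_{n+1} \;=\; \mathbf{x}_n \;+\; \mathcal{N}_\theta\big(\mathbf{x}_n,\, \boldsymbol{\alpha}\big),
\]

so that repeated application of the trained network yields predictions over longer time horizons, and sampling α through the same model produces the solution statistics used for uncertainty analysis.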