Quantum information processors promise fast algorithms for problems inaccessible to classical computers. But since qubits are noisy and error-prone, they will depend on fault-tolerant quantum error correction (FTQEC) to compute reliably. Quantum error correction can protect against general noise if, and only if, the error in each physical qubit operation is smaller than a certain threshold. The threshold for general errors is quantified by their diamond norm. Until now, qubits have been assessed primarily by randomized benchmarking, which reports a different error rate that is not sensitive to all errors and cannot be compared directly to diamond norm thresholds. Here we use gate set tomography to completely characterize operations on a trapped-Yb+-ion qubit and demonstrate with greater than 95% confidence that they satisfy a rigorous threshold for FTQEC (diamond norm ≤ 6.7 × 10⁻⁴).
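For reference, the error measure being bounded is the diamond-norm distance between the implemented operation and its ideal target (the standard definition, not notation quoted from the paper):

\[ \epsilon_\diamond(\mathcal{E}) \;=\; \tfrac{1}{2}\,\big\|\mathcal{E}-\mathcal{U}\big\|_\diamond, \qquad \|\Phi\|_\diamond \;=\; \sup_{\rho}\,\big\|(\Phi\otimes\mathbb{1})(\rho)\big\|_1, \]

where \(\mathcal{U}\) is the ideal gate and the supremum runs over density matrices on the system together with an ancilla of equal dimension.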
We provide the first demonstration that molecular-level methods based on gas kinetic theory and molecular chaos can simulate turbulence and its decay. The direct simulation Monte Carlo (DSMC) method, a molecular-level technique for simulating gas flows that resolves phenomena from molecular to hydrodynamic (continuum) length scales, is applied to simulate the Taylor-Green vortex flow. The DSMC simulations reproduce the Kolmogorov -5/3 law and agree well with the turbulent kinetic energy and energy dissipation rate obtained from direct numerical simulation of the Navier-Stokes equations using a spectral method. This agreement provides strong evidence that molecular-level methods for gases can be used to investigate turbulent flows quantitatively.
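For reference, the inertial-range law that the DSMC simulations reproduce is the standard Kolmogorov spectrum

\[ E(k) \;=\; C_K\,\varepsilon^{2/3}\,k^{-5/3}, \]

where \(\varepsilon\) is the energy dissipation rate and \(C_K \approx 1.5\) is the Kolmogorov constant.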
International Journal of Computational Fluid Dynamics
Gel, Aytekin; Hu, Jonathan J.; Ould-Ahmed-Vall, El M.; Kalinkin, Alexander A.
Legacy codes remain a crucial element of today's simulation-based engineering ecosystem due to the extensive validation process and investment in such software. The rapid evolution of high-performance computing architectures necessitates the modernization of these codes. One approach to modernization is a complete overhaul of the code. However, this could require extensive investments, such as rewriting in modern languages, new data constructs, etc., which would necessitate systematic verification and validation to re-establish the credibility of the computational models. The current study advocates a more incremental approach and is the culmination of several modernization efforts on the legacy code MFIX, an open-source computational fluid dynamics code that has evolved over several decades, is widely used in multiphase flows, and is still being developed by the National Energy Technology Laboratory. Two different modernization approaches, 'bottom-up' and 'top-down', are illustrated. Preliminary results show up to 8.5x improvement at the selected kernel level with the first approach, and up to 50% improvement in total simulated time with the second, for the demonstration cases and target HPC systems employed.
Time integration is a central component of most transient simulations. It coordinates many of the major parts of a simulation, e.g., the residual calculation with a transient solver, the solution with the output, various operator-split physics, and forward and adjoint solutions for inversion. Despite the variety of transient simulations, there is a common set of algorithms and procedures for advancing transient solutions of ordinary differential equations (ODEs) and differential-algebraic equations (DAEs). Rythmos is a collection of these algorithms that can be used for the solution of transient simulations. It provides common time-integration methods, such as Backward and Forward Euler, Explicit and Implicit Runge-Kutta, and Backward-Difference Formulas. It can also provide sensitivities and adjoint components for transient simulations. Rythmos is a package within Trilinos and requires some other packages (e.g., Teuchos and Thyra) to provide basic time-integration capabilities. It can also be coupled with several other Trilinos packages to provide additional capabilities (e.g., AztecOO and Belos for linear solutions, and NOX for nonlinear solutions). The documentation is broken down into three parts: Theory Manual, User's Manual, and Developer's Guide. The Theory Manual contains the basic theory of the time integrators, the nomenclature and mathematical structure utilized within Rythmos, and verification results demonstrating that the designed order of accuracy is achieved. The User's Manual provides information on how to use Rythmos, a description of input parameters through Teuchos Parameter Lists, and a description of convergence test examples. The Developer's Guide is a high-level discussion of the design and structure of Rythmos to provide information to developers for the continued development of capabilities. Details of individual components can be found in the Doxygen webpages.
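To make the simplest of these methods concrete, here is a generic Backward Euler step in Python. This is a textbook sketch, not the Rythmos/Thyra API; the function names, the use of SciPy's fsolve for the implicit solve, and the test problem are all illustrative choices.

```python
import numpy as np
from scipy.optimize import fsolve

def backward_euler(f, y0, t0, t1, n_steps):
    """Integrate y' = f(t, y) with Backward Euler: at each step, solve
    the implicit equation y_{n+1} - y_n - dt * f(t_{n+1}, y_{n+1}) = 0."""
    dt = (t1 - t0) / n_steps
    t, y = t0, np.atleast_1d(np.asarray(y0, dtype=float))
    for _ in range(n_steps):
        t += dt
        y = fsolve(lambda z: z - y - dt * f(t, z), y)  # implicit solve
    return y

# Stiff linear decay: stable even with a step far larger than an explicit
# method would tolerate.
print(backward_euler(lambda t, y: -50.0 * (y - np.cos(t)), [1.0], 0.0, 1.0, 20))
```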
Solving sparse linear systems arising from the discretization of elliptic partial differential equations (PDEs) is an important building block in many engineering applications. Sparse direct solvers can solve general linear systems but are usually slower and use much more memory than effective iterative solvers. To overcome these two disadvantages, a hierarchical solver (LoRaSp) based on H2-matrices was introduced in [22]. Here, we have developed a parallel version of the algorithm in LoRaSp to solve large sparse linear systems on distributed memory machines. On a single processor, the factorization time of our parallel solver scales almost linearly with the problem size for three-dimensional problems, as opposed to the quadratic scaling of many existing sparse direct solvers. Moreover, our solver leads to an almost constant number of iterations when used as a preconditioner for Poisson problems. On more than one processor, our algorithm achieves significant speedups over sequential runs. With this parallel algorithm, we are able to solve large problems much faster than many existing packages, as demonstrated by the numerical experiments.
This report describes findings from the culminating experiment of the LDRD project entitled, "Analyst-to-Analyst Variability in Simulation-Based Prediction". For this experiment, volunteer participants solving a given test problem in engineering and statistics were interviewed at different points in their solution process. These interviews are used to trace differing solutions to differing solution processes, and differing processes to differences in reasoning, assumptions, and judgments. The issue that the experiment was designed to illuminate -- our paucity of understanding of the ways in which humans themselves have an impact on predictions derived from complex computational simulations -- is a challenging and open one. Although solution of the test problem by analyst participants in this experiment has taken much more time than originally anticipated, and is continuing past the end of this LDRD, this project has provided a rare opportunity to explore analyst-to-analyst variability in significant depth, from which we derive evidence-based insights to guide further explorations in this important area.
Improved validation for models of complex systems has been a primary focus over the past year for the Resilience in Complex Systems Research Challenge. This document describes a set of research directions that are the result of distilling those ideas into three categories of research -- epistemic uncertainty, strong tests, and value of information. The content of this document can be used to transmit valuable information to future research activities, update the Resilience in Complex Systems Research Challenge's roadmap, inform the upcoming FY18 Laboratory Directed Research and Development (LDRD) call and research proposals, and facilitate collaborations between Sandia and external organizations. The recommended research directions can provide topics for collaborative research, development of proposals, workshops, and other opportunities.
Remote temperature sensing is essential for applications in enclosed vessels, where feedthroughs or optical access points are not possible. A unique sensing method for measuring the temperature of multiple closely spaced points is proposed using permanent magnets and several three-axis magnetic field sensors. The magnetic field theory for multiple magnets is discussed and a solution technique is presented. Experimental calibration procedures, solution inversion considerations, and methods for optimizing the magnet orientations are described in order to obtain low-noise temperature estimates. The experimental setup and the properties of permanent magnets are shown. Finally, experiments were conducted to determine the temperature of nine magnets in different configurations over a temperature range of 5 °C to 60 °C and for a sensor-to-magnet distance of up to 35 mm. To show the possible applications of this sensing system for measuring temperatures through metal walls, additional experiments were conducted inside an opaque 304 stainless steel cylinder.
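For orientation, the field that each three-axis sensor measures can be modeled, to leading order, by the standard point-dipole expression (textbook magnetostatics, not a formula quoted from the paper), with temperature entering through the magnet's moment \(\mathbf{m}(T)\):

\[ \mathbf{B}(\mathbf{r}) \;=\; \frac{\mu_0}{4\pi r^3}\,\Big(3\,(\mathbf{m}\cdot\hat{\mathbf{r}})\,\hat{\mathbf{r}} \;-\; \mathbf{m}\Big). \]

Inverting the measured fields for the moments of each magnet, given a calibration of moment against temperature, then yields the temperature estimates.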
DRAM technology is the main building block of main memory; however, DRAM scaling is becoming very challenging. The main issues for DRAM scaling are the increasing error rates with each new generation, the geometric and physical constraints of scaling the capacitor part of the DRAM cells, and the high power consumption caused by the continuous need for refreshing cell values. At the same time, emerging Non-Volatile Memory (NVM) technologies, such as Phase-Change Memory (PCM), are promising replacements for DRAM. NVMs, when compared to current technologies, e.g., NAND-based flash, have latencies comparable to DRAM. Additionally, NVMs are non-volatile, which eliminates the need for refresh power and enables persistent memory applications. Finally, NVMs have promising densities and the potential for multi-level cell (MLC) storage.
This document provides a detailed overview of the stereo correlation algorithm and triangulation formulation used in the Digital Image Correlation Engine (DICe) to triangulate three-dimensional motion in space given the image coordinates and camera calibration parameters.
The familiar story of Moore's law is actually inaccurate. This article corrects the story, leading to different projections for the future. Moore's law is a fluid idea whose definition changes over time. It thus doesn't have the ability to 'end,' as is popularly reported, but merely takes different forms as the semiconductor and computer industries evolve.
Si-MOS-based QD qubits are attractive due to their compatibility with the current semiconductor industry. We introduce a highly tunable, MOS-foundry-compatible qubit design that couples an electrostatic quantum dot (QD) with an implanted donor. We show for the first time coherent two-axis control of a two-electron spin logical qubit that evolves under the QD-donor exchange interaction and the hyperfine interaction with the donor nucleus. The two interactions are tuned electrically with surface gate voltages to provide control of both qubit axes. Qubit decoherence is influenced by charge noise, which is of similar strength to that in epitaxial systems such as GaAs and Si/SiGe.
Dynamic materials experiments on the Z-machine are beginning to reach a regime where traditional analysis techniques break down. Time dependent phenomena such as strength and phase transition kinetics often make the data obtained in these experiments difficult to interpret. We present an inverse analysis methodology to infer the equation of state (EOS) from velocimetry data in these types of experiments, building on recent advances in the propagation of uncertain EOS information through a hydrocode simulation. An example is given for a shock-ramp experiment in which tantalum was shock compressed to 40 GPa followed by a ramp to 80 GPa. The results are found to be consistent with isothermal compression and Hugoniot data in this regime.
We present a new meshless method for scalar diffusion equations, which is motivated by their compatible discretizations on primal-dual grids. Unlike the latter, though, our approach is truly meshless because it only requires the graph of nearby neighbor connectivity of the discretization points x_i. This graph defines a local primal-dual grid complex with a virtual dual grid, in the sense that specification of the dual metric attributes is implicit in the method's construction. Our method combines a topological gradient operator on the local primal grid with a generalized moving least squares approximation of the divergence on the local dual grid. We show that the resulting approximation of the div-grad operator maintains polynomial reproduction to arbitrary orders and yields a meshless method, which attains O(h^m) convergence in both the L2- and H1-norms, similar to mixed finite element methods. We demonstrate this convergence on curvilinear domains using manufactured solutions in two and three dimensions. Application of the new method to problems with discontinuous coefficients reveals solutions that are qualitatively similar to those of compatible mesh-based discretizations.
Many engineering design problems can be formulated in the framework of partial differential equation (PDE) constrained optimization. The discretization of a PDE leads to multiple levels of resolution with varying degrees of numerical solution accuracy. Coarse discretizations require less computational time at the expense of increased error. Often there are also reduced fidelity models available, with simplifications to the physics models that are computationally easier to solve. This research develops an up to second-order consistent multilevel-multifidelity (MLMF) optimization scheme that exploits the reduced cost resulting from coarse discretization and reduced fidelity to more efficiently converge to the optimum of a fine-grid high-fidelity problem. This scheme distinguishes multilevel approaches applied to discretizations from multifidelity approaches applied to model forms, and navigates both hierarchies to accelerate convergence. Additive, multiplicative, or a combination of both corrections can be applied to the sub-problems to enforce up to second-order consistency with the fine-grid high-fidelity results. The MLMF optimization algorithm is a wrapper around a subproblem optimization solver, and the MLMF scheme is provably convergent if the subproblem optimizer is provably convergent. Heuristics are developed for efficiently tuning optimization tolerances and iterations at each level and fidelity based on relative solution cost. Accelerated convergence is demonstrated for a simple one-dimensional problem and aerodynamic shape optimization of a transonic airfoil.
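A minimal sketch of the additive-correction idea in Python (the notation and names are mine, not the paper's): the cheap model is shifted so that its value and gradient match the expensive model at the current iterate, which is the first-order consistency condition; second-order consistency would additionally match a Hessian (or Hessian-approximation) term.

```python
import numpy as np

def additive_correction(f_lo, g_lo, f_hi, g_hi, xc):
    """Corrected low-fidelity model matching the high-fidelity value and
    gradient at the anchor point xc (first-order consistency)."""
    df = f_hi(xc) - f_lo(xc)          # value mismatch at xc
    dg = g_hi(xc) - g_lo(xc)          # gradient mismatch at xc
    return lambda x: f_lo(x) + df + dg @ (np.asarray(x) - xc)

# Toy example: a crude surrogate corrected against the "true" model at xc.
f_hi = lambda x: np.sin(x[0]) + x[1]**2
g_hi = lambda x: np.array([np.cos(x[0]), 2.0 * x[1]])
f_lo = lambda x: x[0] + x[1]**2
g_lo = lambda x: np.array([1.0, 2.0 * x[1]])
fc = additive_correction(f_lo, g_lo, f_hi, g_hi, np.array([0.3, -0.2]))
```

The corrected subproblem model fc agrees with the high-fidelity model to first order at xc, which is what makes trust-region-style convergence guarantees possible.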
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Mkrtchyan, Vahan; Parekh, Ojas D.; Segev, Danny; Subramani, K.
Motivated by applications in risk management of computational systems, we focus our attention on a special case of the partial vertex cover problem, where the underlying graph is assumed to be a tree. Here, we consider four possible versions of this setting, depending on whether vertices and edges are weighted or not. Two of these versions, where edges are assumed to be unweighted, are known to be polynomial-time solvable. However, the computational complexity of this problem with weighted edges, and possibly with weighted vertices, has not been determined yet. The main contribution of this paper is to resolve these questions by fully characterizing which variants of partial vertex cover remain intractable in trees and which can be efficiently solved. In particular, we propose a pseudo-polynomial DP-based algorithm for the most general case, with weights on both edges and vertices, which we prove to be NP-hard. This algorithm provides a polynomial-time solution method when weights are limited to edges and, combined with additional scaling ideas, leads to an FPTAS for the general case. A secondary contribution of this work is to propose a novel way of using centroid decompositions in trees, which could be useful in other settings as well.
We propose, theoretically investigate, and numerically validate an algorithm for the Monte Carlo solution of least-squares polynomial approximation problems in a collocation framework. Our investigation is motivated by applications in the collocation approximation of parametric functions, which frequently entails construction of surrogates via orthogonal polynomials. A standard Monte Carlo approach would draw samples according to the density defining the orthogonal polynomial family. Our proposed algorithm instead samples with respect to the (weighted) pluripotential equilibrium measure of the domain, and subsequently solves a weighted least-squares problem, with weights given by evaluations of the Christoffel function. We present theoretical analysis to motivate the algorithm, and numerical results that show our method is superior to standard Monte Carlo methods in many situations of interest.
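A minimal sketch of this sampling-and-weighting scheme for the Legendre case on [-1, 1], where the (unweighted) pluripotential equilibrium measure is the arcsine distribution; all function and parameter names here are illustrative, not drawn from the paper's codes.

```python
import numpy as np
from numpy.polynomial import legendre

def christoffel_weighted_lsq(f, deg, n_samples, seed=0):
    rng = np.random.default_rng(seed)
    # Sample the arcsine (equilibrium) measure on [-1, 1]: x = cos(pi * U).
    x = np.cos(np.pi * rng.uniform(size=n_samples))
    # Orthonormal Legendre basis w.r.t. the uniform probability density.
    V = legendre.legvander(x, deg) * np.sqrt(2.0 * np.arange(deg + 1) + 1.0)
    # Christoffel-function weights: w(x) = N / sum_k p_k(x)^2, with N = deg + 1.
    w = (deg + 1) / np.sum(V**2, axis=1)
    sw = np.sqrt(w)
    # Weighted least squares for the expansion coefficients.
    coeffs, *_ = np.linalg.lstsq(V * sw[:, None], f(x) * sw, rcond=None)
    return coeffs

c = christoffel_weighted_lsq(np.exp, deg=10, n_samples=200)
```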
Progressive hedging, though an effective heuristic for solving stochastic mixed integer programs (SMIPs), is not guaranteed to converge in this case. Here, we describe BBPH, a branch and bound algorithm that uses PH at each node in the search tree such that, given sufficient time, it will always converge to a globally optimal solution. In addition to providing a theoretically convergent “wrapper” for PH applied to SMIPs, computational results demonstrate that for some difficult problem instances branch and bound can find improved solutions after exploring only a few nodes.
Biological neural networks continue to inspire new developments in algorithms and microelectronic hardware to solve challenging data processing and classification problems. Here, we survey the history of neural-inspired and neuromorphic computing in order to examine the complex and intertwined trajectories of the mathematical theory and hardware developed in this field. Early research focused on adapting existing hardware to emulate the pattern recognition capabilities of living organisms. Contributions from psychologists, mathematicians, engineers, neuroscientists, and other professions were crucial to maturing the field from narrowly-tailored demonstrations to more generalizable systems capable of addressing difficult problem classes such as object detection and speech recognition. Algorithms that leverage fundamental principles found in neuroscience such as hierarchical structure, temporal integration, and robustness to error have been developed, and some of these approaches are achieving world-leading performance on particular data classification tasks. In addition, novel microelectronic hardware is being developed to perform logic and to serve as memory in neuromorphic computing systems with optimized system integration and improved energy efficiency. Key to such advancements was the incorporation of new discoveries in neuroscience research, the transition away from strict structural replication and towards the functional replication of neural systems, and the use of mathematical theory frameworks to guide algorithm and hardware developments.
DeBenedictis, Erik; Badaroglu, Mustafa; Chen, An; Conte, Thomas M.; Gargini, Paolo
Rather than continue the expensive and time-consuming quest for transistor replacement, the authors argue that 3D chips coupled with new computer architectures can keep Moore's law on its traditional scaling path.
We propose an algorithm for recovering sparse orthogonal polynomial expansions via collocation. A standard sampling approach for recovering sparse polynomials uses Monte Carlo sampling, from the density of orthogonality, which results in poor function recovery when the polynomial degree is high. Our proposed approach aims to mitigate this limitation by sampling with respect to the weighted equilibrium measure of the parametric domain and subsequently solves a preconditioned ℓ1-minimization problem, where the weights of the diagonal preconditioning matrix are given by evaluations of the Christoffel function. Our algorithm can be applied to a wide class of orthogonal polynomial families on bounded and unbounded domains, including all classical families. We present theoretical analysis to motivate the algorithm and numerical results that show our method is superior to standard Monte Carlo methods in many situations of interest. Numerical examples are also provided to demonstrate that our proposed algorithm leads to comparable or improved accuracy even when compared with Legendre- and Hermite-specific algorithms.
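In equation form (standard basis-pursuit notation; the exact weighting convention in the paper may differ), the preconditioned problem is

\[ \min_{c} \; \|c\|_1 \quad \text{subject to} \quad W\,\Phi\,c \;=\; W f, \]

where \(\Phi_{ij} = \phi_j(x_i)\) collects the sampled polynomials, \(f_i = f(x_i)\), and \(W\) is the diagonal preconditioner whose entries are given by evaluations of the Christoffel function at the samples.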
In many settings, multi-tasking and interruption are commonplace. Multi-tasking has been a popular subject of recent research, but a multitasking paradigm normally allows the subject some control over the timing of the task switch. In this paper we focus on interruptions—situations in which the subject has no control over the timing of task switches. We consider three types of task: verbal (reading comprehension), visual search, and monitoring/situation awareness. Using interruptions from 30 s to 2 min in duration, we found a significant effect in each case, but with different effect sizes. For the situation awareness task, we experimented with interruptions of varying duration and found a non-linear relation between the duration of the interruption and its after-effect on performance, which may correspond to a task-dependent interruption threshold, which is lower for more dynamic tasks.
Extreme-scale computational science increasingly demands multiscale and multiphysics formulations. Combining software developed by independent groups is imperative: no single team has resources for all predictive science and decision support capabilities. Scientific libraries provide high-quality, reusable software components for constructing applications with improved robustness and portability. However, without coordination, many libraries cannot be easily composed. Namespace collisions, inconsistent arguments, lack of third-party software versioning, and additional difficulties make composition costly. The Extreme-scale Scientific Software Development Kit (xSDK) defines community policies to improve code quality and compatibility across independently developed packages (hypre, PETSc, SuperLU, Trilinos, and Alquimia) and provides a foundation for addressing broader issues in software interoperability, performance portability, and sustainability. The xSDK provides turnkey installation of member software and seamless combination of aggregate capabilities, and it marks first steps toward extreme-scale scientific software ecosystems from which future applications can be composed rapidly with assured quality and scalability.
Fault tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. While fully uncoordinated approaches have been shown to incur significant delays, the degree of synchronization required to keep overheads low has not yet been thoroughly addressed. In this paper, we use a simulation-based approach to show the impact of synchronization on local checkpoint activity. Specifically, we show that the degree of synchronization needed to keep the impact of local checkpointing low is attainable with current technology for a number of key production HPC workloads. Our work provides a critical analysis and comparison of synchronization and local checkpointing. This enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics available.
Many applications, such as PDE based simulations and machine learning, apply BLAS/LAPACK routines to large groups of small matrices. While existing batched BLAS APIs provide meaningful speedup for this problem type, a non-canonical data layout enabling cross-matrix vectorization may provide further significant speedup. In this paper, we propose a new compact data layout that interleaves matrices in blocks according to the SIMD vector length. We combine this compact data layout with a new interface to BLAS/LAPACK routines that can be used within a hierarchical parallel application. Our layout provides up to 14×, 45×, and 27× speedup against OpenMP loops around optimized DGEMM, DTRSM and DGETRF kernels, respectively, on the Intel Knights Landing architecture. We discuss the compact batched BLAS/LAPACK implementations in two libraries, KokkosKernels and Intel® Math Kernel Library. We demonstrate the APIs in a line solver for coupled PDEs. Finally, we present detailed performance analysis of our kernels.
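The core idea of the compact layout can be sketched in a few lines of Python/NumPy (an illustration of the interleaving only; the actual KokkosKernels and Intel MKL implementations and APIs differ): matrices are grouped into blocks of the SIMD vector length, and the matrix index becomes the fastest-varying "lane" dimension, so one vector instruction can process the same (i, j, k) entry of VL matrices at once.

```python
import numpy as np

VL = 8  # assumed SIMD vector length (e.g., AVX-512 doubles)

def pack_interleaved(A):
    """Pack a batch A[b, i, j] into interleaved blocks C[blk, i, j, lane],
    where blk = b // VL and lane = b % VL."""
    nb, m, n = A.shape
    assert nb % VL == 0
    return np.ascontiguousarray(A.reshape(nb // VL, VL, m, n).transpose(0, 2, 3, 1))

def batched_matmul_interleaved(Ac, Bc):
    # One elementwise multiply-accumulate over all VL lanes per (i, k, j).
    return np.einsum('bikl,bkjl->bijl', Ac, Bc)

A = np.random.rand(64, 5, 5)
B = np.random.rand(64, 5, 5)
C = batched_matmul_interleaved(pack_interleaved(A), pack_interleaved(B))
```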
The dentate gyrus forms a critical link between the entorhinal cortex and CA3 by providing a sparse version of the signal. Concurrent with this increase in sparsity, a widely accepted theory suggests the dentate gyrus performs pattern separation: similar inputs yield decorrelated outputs. Although an active region of study and theory, few logically rigorous arguments detail the dentate gyrus's (DG) coding. We suggest a theoretically tractable, combinatorial model for this action. The model provides formal methods for a highly redundant, arbitrarily sparse, and decorrelated output signal. To explore the value of this model framework, we assess how suitable it is for two notable aspects of DG coding: how it can handle the highly structured grid cell representation in the input entorhinal cortex region, and the presence of adult neurogenesis, which has been proposed to produce a heterogeneous code in the DG. We find that tailoring the model to grid cell input yields expansion parameters consistent with the literature. In addition, the heterogeneous coding reflects the activity gradation observed experimentally. Finally, we connect this approach with more conventional binary threshold neural circuit models via a formal embedding.
Researchers are now considering alternatives to the von Neumann computer architecture as a way to improve performance. The current approach of simulating benchmark applications favors continued use of the von Neumann architecture, but architects can help overcome this bias.
In many aerospace applications, it is critical to be able to model fluid-structure interactions. In particular, correctly predicting the power spectral density of pressure fluctuations at surfaces can be important for assessing potential resonances and failure modes. Current turbulence modeling methods, such as wall-modeled Large Eddy Simulation and Detached Eddy Simulation, cannot reliably predict these pressure fluctuations for many applications of interest. The focus of this paper is on efforts to use data-driven machine learning methods to learn correction terms for the wall pressure fluctuation spectrum. In particular, the non-locality of the wall pressure fluctuations in a compressible boundary layer is investigated using random forests and neural networks trained and evaluated on Direct Numerical Simulation data.
We establish an atomistic view of the high- and low-temperature phases of iron/steel, as well as some elements of the phase transition between these phases on cooling. In particular, we examine the four most common orientation relationships between the high-temperature austenite and low-temperature ferrite phases seen in experiment. With a thorough understanding of these relationships, we are prepared to set up various atomistic simulations, using techniques such as Density Functional Theory and Molecular Dynamics, to further study the phase transition, in particular quantities needed for Phase Field Modeling, such as the free energies of bulk phases and the phase transition front propagation velocity.
The development of scramjet engines is an important research area for advancing hypersonic and orbital flights. Progress toward optimal engine designs requires both accurate flow simulations and uncertainty quantification (UQ). However, performing UQ for scramjet simulations is challenging due to the large number of uncertain parameters involved and the high computational cost of flow simulations. We address these difficulties by applying UQ algorithms and numerical methods to the large eddy simulation of the HIFiRE scramjet configuration. First, global sensitivity analysis is conducted to identify influential uncertain input parameters, helping reduce the stochastic dimension of the problem and discover sparse representations. Second, as models of different fidelity are available and inevitably used in the overall UQ assessment, a framework for quantifying and propagating the uncertainty due to model error is introduced. These methods are demonstrated on a non-reacting scramjet unit problem with a parameter space of up to 24 dimensions, using 2D and 3D geometries with static and dynamic treatments of the turbulence subgrid model.
When solving partial differential equations (PDEs) with random inputs, it is often computationally inefficient to merely propagate samples of the input probability law (or an approximation thereof) because the input law may not accurately capture the behavior of critical system responses that depend on the PDE solution. To further complicate matters, in many applications it is critical to accurately approximate the “risk” associated with the statistical tails of the system responses, not just the statistical moments. In this paper, we develop an adaptive sampling and local reduced basis method for approximately solving PDEs with random inputs. Our method determines a set of parameter atoms and an associated (implicit) Voronoi partition of the parameter domain on which we build local reduced basis approximations of the PDE solution. In addition, we extend our adaptive sampling approach to accurately compute measures of risk evaluated at quantities of interest that depend on the PDE solution.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Haddock, Walker; Curry, Matthew L.; Bangalore, Purushotham V.; Skjellum, Anthony
High-performance computing (HPC) demands high bandwidth and low latency in I/O performance, which has led to the development of storage systems and I/O software components that strive to provide ever-greater performance. However, capital and energy budgets, along with increasing storage capacity requirements, have motivated the search for lower-cost, large storage systems for HPC. With Burst Buffer technology increasing the bandwidth and reducing the latency for I/O between the compute and storage systems, the back-end storage bandwidth and latency requirements can be reduced, especially underneath an adequately sized modern parallel file system. Cloud computing has led to the development of large, low-cost storage solutions whose design has focused on high capacity, availability, and low energy consumption at lowest cost. Cloud computing storage systems leverage replication and erasure coding technology to provide high availability at much lower cost than traditional HPC storage systems. Leveraging cloud storage infrastructure and concepts in HPC would be economically valuable, offering cost-effective performance for certain storage tiers. To enable the use of cloud storage technologies for HPC, we study the architecture for interfacing cloud storage between the HPC parallel file systems and the archive storage. In this paper, we report our comparison of two erasure coding implementations for the Ceph file system. We compare measurements of various degrees of sharding that are relevant for HPC applications. We show that the Gibraltar GPU erasure coding library outperforms a CPU implementation of an erasure coding plugin for the Ceph object storage system, opening the potential for new ways to architect such storage systems based on Ceph.
High performance computing (HPC) is undergoing a dramatic change in computing architectures. Next-generation HPC systems are being based primarily on many-core processing units and general purpose graphics processing units (GPUs). A computing node on a next-generation system can be, and in practice is, heterogeneous in nature, involving multiple memory spaces and multiple execution spaces. This presents a challenge for the development of application codes that wish to compute at the extreme scales afforded by these next-generation HPC technologies and systems - the best parallel programming model for one system is not necessarily the best parallel programming model for another. This inevitably raises the following question: how does an application code achieve high performance on disparate computing architectures without having entirely different, or at least significantly different, code paths, one for each architecture? This question has given rise to the term ‘performance portability’, a notion concerned with porting application code performance from architecture to architecture using a single code base. In this paper, we present the work being done at Sandia National Labs to develop a performance portable compressible CFD code that is targeting the ‘leadership’ class supercomputers the National Nuclear Security Administration (NNSA) is acquiring over the course of the next decade.
Visual inspection research has a long history spanning the 20th century and continuing to the present day. Current efforts in multiple venues demonstrate that visual inspection continues to have a vital role for many different types of tasks in the 21st century. The nature of this role spans the range from traditional human visual inspection to fully automated detection of defects. Consequently, today's practitioners must not only successfully identify and apply lessons learned from the past, but also explore new areas of research in order to derive solutions for modern day issues such as those presented by introducing automation during inspection. A key lesson from past research indicates that the factors that can degrade performance will persist today, unless care is taken to design the inspection process appropriately.
The purpose of this document is to compare and contrast metrics that may be considered for use in validating computational models. Metrics suitable for use in one application, scenario, and/or quantity of interest may not be acceptable in another; these notes merely provide information that may be used as guidance in selecting a validation metric.
We consider heuristic and optimal solutions to a discrete geometric bin packing problem that arises in a resource allocation problem. An imaging sensor is assigned to collect data over a large area, but some subregions are more valuable than others. To capture these high-value regions with higher fidelity, we can assign some number of non-overlapping rectangular subsets, called “subfootprints.” The sensor image is partitioned into squares called “chips,” and each chip is further partitioned into pixels. Pixels may have different values. Subfootprints are restricted to rectangular collections of chips, but we are free to choose different rectangle heights, widths, and areas. We seek the optimal arrangement over the family of possible rectangle shapes and sizes. We provide a mixed-integer linear program optimization formulation, as well as a greedy heuristic, to solve this problem. For the meta-problem, we have some freedom to align the chip boundaries to different pixels; however, it is too expensive to solve the optimization formulation for each alignment. We show that the greedy heuristic can inform which pixel alignments are worth solving the optimization over. We use a variant of k-means clustering to group greedy solutions by their transport shape-similarity. For each cluster, we run the optimization problem over the greedy layout with the highest value. In practice this efficiently explores the geometric configuration space and produces solutions close to the global optimum. We show a contrived example using surveillance of the Mississippi River. Our software is available as open source in the GitHub repository “GeoPlace.”
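To illustrate the flavor of the integer-programming side (a toy selection model of my own construction, far simpler than the paper's full formulation): choose at most a fixed number of candidate rectangles to maximize total value, subject to pairwise non-overlap constraints. This sketch assumes SciPy >= 1.9 for scipy.optimize.milp.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

# Hypothetical candidate rectangles: value v[i]; pairs that overlap.
v = np.array([9.0, 7.5, 6.0, 4.0])
overlaps = [(0, 1), (1, 2)]     # x_i + x_j <= 1 for each overlapping pair
max_rects = 2                   # at most this many subfootprints

A = np.zeros((len(overlaps) + 1, len(v)))
for r, (i, j) in enumerate(overlaps):
    A[r, [i, j]] = 1.0          # non-overlap constraint row
A[-1, :] = 1.0                  # cardinality constraint row
con = LinearConstraint(A, ub=[1.0] * len(overlaps) + [max_rects])

res = milp(c=-v, integrality=np.ones_like(v),
           bounds=Bounds(0, 1), constraints=con)
print(np.round(res.x))          # selected rectangles (binary indicators)
```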
Information loss from a computation implies energy dissipation due to Landauer’s Principle. Thus, increasing the amount of useful computational work that can be accomplished within a given energy budget will eventually require increasing the degree to which our computing technologies avoid information loss, i.e., are logically reversible. But the traditional definition of logical reversibility is actually more restrictive than is necessary to avoid information loss and energy dissipation due to Landauer’s Principle. As a result, the operations that have traditionally been viewed as the atomic elements of reversible logic, such as Toffoli gates, are not really the simplest primitives that one can use for the design of reversible hardware. Arguably, a complete theoretical framework for reversible computing should provide a more general, parsimonious foundation for practical engineering. To this end, we use a rigorous quantitative formulation of Landauer’s Principle to develop the theory of Generalized Reversible Computing (GRC), which precisely characterizes the minimum requirements for a computation to avoid information loss and the consequent energy dissipation, showing that a much broader range of computations are, in fact, reversible than is acknowledged by traditional reversible computing theory. This paper summarizes the foundations of GRC theory and briefly presents a few of its applications.
Water utilities are vulnerable to a wide variety of human-caused and natural disasters. The Water Network Tool for Resilience (WNTR) is a new open source Python™ package designed to help water utilities investigate resilience of water distribution systems to hazards and evaluate resilience-enhancing actions. In this paper, the WNTR modeling framework is presented and a case study is described that uses WNTR to simulate the effects of an earthquake on a water distribution system. The case study illustrates that the severity of damage is not only a function of system integrity and earthquake magnitude, but also of the available resources and repair strategies used to return the system to normal operating conditions. While earthquakes are particularly concerning since buried water distribution pipelines are highly susceptible to damage, the software framework can be applied to other types of hazards, including power outages and contamination incidents.
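As an indication of how the package is used (a minimal sketch based on WNTR's documented basics; the input file name is a stand-in, and hazard and repair-strategy modeling involve additional steps not shown):

```python
import wntr

# Build a water network model from an EPANET INP file.
wn = wntr.network.WaterNetworkModel('Net3.inp')

# Run a hydraulic simulation and extract node pressures over time.
sim = wntr.sim.WNTRSimulator(wn)
results = sim.run_sim()
pressure = results.node['pressure']
```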
With the growing complexity and scale of high performance computing (HPC) systems, application performance variation has become a significant challenge in efficient and resilient system management. Application performance variation can be caused by resource contention as well as software- and firmware-related problems, and can lead to premature job termination, reduced performance, and wasted compute platform resources. To effectively alleviate this problem, system administrators must detect and identify the anomalies that are responsible for performance variation and take preventive actions. However, diagnosing anomalies is often a difficult task given the vast amount of noisy and high-dimensional data being collected via a variety of system monitoring infrastructures. In this paper, we present a novel framework that uses machine learning to automatically diagnose previously encountered performance anomalies in HPC systems. Our framework leverages resource usage and performance counter data collected during application runs. We first convert the collected time series data into statistical features that retain application characteristics to significantly reduce the computational overhead of our technique. We then use machine learning algorithms to learn anomaly characteristics from this historical data and to identify the types of anomalies observed while running applications. We evaluate our framework both on an HPC cluster and on a public cloud, and demonstrate that our approach outperforms current state-of-the-art techniques in detecting anomalies, reaching an F-score over 0.97.
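The feature-extraction step can be sketched as follows (an illustrative reconstruction with invented feature choices and array shapes, not the authors' code): each per-metric time series is collapsed into a handful of statistics, and a standard classifier is trained on the resulting feature vectors.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def featurize(ts):
    """Collapse one monitoring time series into statistical features."""
    return [ts.mean(), ts.std(), stats.skew(ts), stats.kurtosis(ts),
            np.percentile(ts, 5), np.percentile(ts, 95)]

# X_raw: (runs, metrics, timesteps) of monitoring data; y: anomaly labels.
rng = np.random.default_rng(0)
X_raw = rng.random((200, 4, 512))
y = rng.integers(0, 3, size=200)     # e.g., healthy / contention / leak

X = np.array([[v for metric in run for v in featurize(metric)]
              for run in X_raw])
clf = RandomForestClassifier(n_estimators=100).fit(X[:150], y[:150])
print(f1_score(y[150:], clf.predict(X[150:]), average='macro'))
```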
We show how to place a set of seed points such that a given piecewise linear complex is the union of some faces in the resulting Voronoi diagram. The seeds are placed on sufficiently small spheres centered at input vertices and are arranged into little circles around each half-edge where every seed is mirrored across the associated triangle. The Voronoi faces common to the seeds of such arrangements yield a mesh conforming to the input complex. If the input contains sharp angles, then additional seeds are needed, analogous to nonobtuse refinement. Finally, we propose local optimizations to reduce the number of seeds and output facets.
Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today's network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload simple packet processing functions to the network card. To demonstrate the potential of the model, we design a cycle-accurate simulation environment by combining the network simulator LogGOPSim and the CPU simulator gem5. We implement offloaded message matching, datatype processing, and collective communications and demonstrate transparent full-application speedups. Furthermore, we show how sPIN can be used to accelerate redundant in-memory filesystems and several other use cases. Our work investigates a portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL. We show how such network acceleration enables an eco-system that can significantly speed up applications and system services.
The Rim-to-Rim Wearables At The Canyon for Health (R2R WATCH) study examines metrics recordable on commercial off-the-shelf (COTS) devices that are most relevant and reliable for the earliest possible indication of a health or performance decline. This is accomplished through collaboration between Sandia National Laboratories (SNL) and The University of New Mexico (UNM), where the two organizations team up to collect physiological, cognitive, and biological markers from volunteer hikers who attempt the Rim-to-Rim (R2R) hike at the Grand Canyon. Three forms of data are collected as hikers travel from rim to rim: physiological data through wearable devices, cognitive data through a cognitive task taken every 3 hours, and blood samples obtained before and after completing the hike. Data is collected from both civilian and warfighter hikers. Once the data is obtained, it is analyzed to understand the effectiveness of each COTS device and the validity of the data collected. We also aim to identify which physiological and cognitive phenomena collected by wearable devices are the most relatable to overall health and task performance in extreme environments, and of these, to ascertain which markers provide the earliest yet reliable indication of health decline. Finally, we analyze the data for significant differences between civilians' and warfighters' markers and their relationship to performance. The main portion of the R2R WATCH study is funded by the Defense Threat Reduction Agency (DTRA, Project CB10359); UNM is currently funding all activities related to bloodwork (SAND2017-1872 C). This paper describes the experimental design and methodology for the first year of the R2R WATCH project.
Industry's inability to reduce logic gates' energy consumption is slowing growth in an important part of the worldwide economy. Some scientists argue that alternative approaches could greatly reduce energy consumption. These approaches entail myriad technical and political issues.
Approximate methods for electronic structure, implemented in sophisticated computer codes and married to ever-more powerful computing platforms, have become invaluable in chemistry and materials science. The maturing and consolidation of quantum chemistry codes since the 1980s, based upon explicitly correlated electronic wave functions, has made them a staple of modern molecular chemistry. The impact of first-principles electronic structure in physics and materials science, by contrast, has lagged owing to the extra formal and computational demands of bulk calculations.
We investigate a novel application of deep neural networks to modeling of errors in prediction of surface pressure fluctuations beneath a compressible, turbulent flow. In this context, the truth solution is given by Direct Numerical Simulation (DNS) data, while the predictive model is a wall-modeled Large Eddy Simulation (LES). The neural network provides a means to map relevant statistical flow-features within the LES solution to errors in prediction of wall pressure spectra. We simulate a number of flat plate turbulent boundary layers using both DNS and wall-modeled LES to build up a database with which to train the neural network. We then apply machine learning techniques to develop an optimized neural network model for the error in terms of relevant flow features.
This paper examines the variability of predicted responses when multiple stress-strain curves (reflecting variability from replicate material tests) are propagated through a transient dynamics finite element model of a ductile steel can being slowly crushed. An elastic-plastic constitutive model is employed in the large-deformation simulations. Over 70 response quantities of interest (including displacements, stresses, strains, and calculated measures of material damage) are tracked in the simulations. Each response quantity’s behavior varies according to the particular stress-strain curves used for the materials in the model. The present work assigns the same material to all the can parts: lids, walls, and weld. We desire to estimate response variability due to variability of the input material curves. When only a few stress-strain curve samples are available from material testing, response variance will usually be significantly underestimated. This is undesirable for many engineering purposes. A simple classical statistical method, Tolerance Intervals, is tested for effectively compensating for sparse stress-strain curve data. The method is found to perform well on the highly nonlinear input-to-output response mappings and non-standard response distributions in the can-crush problem. The results and discussion in this paper, and further studies referenced, support a proposition that the method will apply similarly well for other sparsely sampled random functions.
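A minimal sketch of the tolerance-interval calculation for a normally distributed response (using Howe's classical two-sided k-factor approximation; the sample values below are invented):

```python
import numpy as np
from scipy import stats

def tolerance_interval(samples, coverage=0.95, confidence=0.95):
    """Two-sided normal tolerance interval via Howe's k-factor: bounds
    that contain `coverage` of the population with `confidence`."""
    n = len(samples)
    z = stats.norm.ppf((1 + coverage) / 2)
    chi2 = stats.chi2.ppf(1 - confidence, n - 1)
    k = z * np.sqrt((n - 1) * (1 + 1 / n) / chi2)
    m, s = np.mean(samples), np.std(samples, ddof=1)
    return m - k * s, m + k * s

# Five replicate response values (hypothetical): sparse data widens the
# interval well beyond +/- 2 sample standard deviations.
print(tolerance_interval([312.0, 305.5, 318.2, 309.9, 314.1]))
```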
Emerging novel architectures for shared memory parallel computing are incorporating increasingly creative innovations to deliver higher memory performance. A notable exemplar of this phenomenon is the Multi-Channel DRAM (MCDRAM) that is included in the Intel® Xeon Phi™ processors. In this paper, we examine techniques to use OpenMP to exploit the high bandwidth of MCDRAM by staging data. In particular, we implement double buffering using OpenMP sections and tasks to explicitly manage movement of data into MCDRAM. We compare our double-buffered approach to a non-buffered implementation and to Intel’s cache mode, in which the system manages the MCDRAM as a transparent cache. We also demonstrate the sensitivity of performance to parameters such as dataset size and the distribution of threads between compute and copy operations.
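The copy/compute overlap at the heart of double buffering can be sketched in a few lines of Python (a language-agnostic illustration of the pattern only; the paper's implementation uses OpenMP sections and tasks in C/C++, with MCDRAM as the fast staging memory):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def double_buffered(blocks, compute, copy_to_fast):
    """Stage block i+1 into fast memory while computing on block i."""
    with ThreadPoolExecutor(max_workers=1) as copier:
        staged = copier.submit(copy_to_fast, blocks[0])
        results = []
        for nxt in blocks[1:]:
            current = staged.result()                   # wait for staging
            staged = copier.submit(copy_to_fast, nxt)   # prefetch next block
            results.append(compute(current))            # overlap with copy
        results.append(compute(staged.result()))
    return results

out = double_buffered([np.ones(4)] * 3, compute=np.sum, copy_to_fast=np.copy)
```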
Silicon-based metal-oxide-semiconductor quantum dots are prominent candidates for high-fidelity, manufacturable qubits. Due to silicon's band structure, additional low-energy states persist in these devices, presenting both challenges and opportunities. Although the physics governing these valley states has been the subject of intense study, quantitative agreement between experiment and theory remains elusive. Here, we present data from an experiment probing the valley states of quantum dot devices and develop a theory that is in quantitative agreement with both this and a recently reported experiment. Through sampling millions of realistic cases of interface roughness, our method provides evidence that the valley physics between the two samples is essentially the same.
We propose a new particle-in-cell (PIC) method for the simulation of plasmas based on a recently developed, unconditionally stable solver for the wave equation. This method is not subject to a CFL restriction, limiting the ratio of the time step size to the spatial step size, typical of explicit methods, while maintaining computational cost and code complexity comparable to such explicit schemes. We describe the implementation in one and two dimensions for both electrostatic and electromagnetic cases, and present the results of several standard test problems, showing good agreement with theory with time step sizes much larger than allowed by typical CFL restrictions.
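For reference, the restriction being avoided is the standard CFL stability condition of explicit schemes on a uniform grid in d dimensions,

\[ c\,\Delta t \;\le\; \frac{\Delta x}{\sqrt{d}}, \]

which ties the time step to the spatial resolution; the unconditionally stable wave solver removes this coupling.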
Here, first-principles molecular dynamics simulations were used to investigate the dissociation of sarin (GB) on the calcium silicate hydrate (CSH) mineral tobermorite (TBM), a surrogate for cement. CSH minerals (including TBM) and amorphous materials of similar composition are the major components of Portland cement, the binding agent of concrete. Metadynamics simulations were used to investigate the effect of the TBM surface and confinement in a microscale pore on the mechanism and free energy of dissociation of GB. Our results indicate that both the adsorption site and the humidity of the local environment significantly affect the sarin dissociation energy. In particular, sarin dissociation in a low-water environment occurs via a dealkylation mechanism, which is consistent with previous experimental studies.
2016 IEEE International Conference on Rebooting Computing, ICRC 2016 - Conference Proceedings
DeBenedictis, Erik; Frank, Michael P.; Ganesh, Natesh; Anderson, Neal G.
At roughly kT energy dissipation per operation, the thermodynamic energy efficiency "limits" of Moore's Law were unimaginably far off in the 1960s. However, current computers operate at only 100-10,000 times this limit, suggesting that historical rates of efficiency scaling must soon slow. This paper reviews the justification for the ∼kT per operation limit in the context of processors for von Neumann-class computer architectures of the 1960s. We then reapply the fundamental arguments to contemporary applications and identify a new direction for future computing in which the ultimate efficiency limits would be much further out. New nanodevices with high-level functions that aggregate the functionality of several logic gates and some local memory may be the right building blocks for much more energy efficient execution of emerging applications, such as neural networks.
Amidst the rising impact of machine learning and the popularity of deep neural networks, learning theory is not a solved problem. With the emergence of neuromorphic computing as a means of addressing the von Neumann bottleneck, it is not simply a matter of employing existing algorithms on new hardware technology; rather, richer theory is needed to guide advances. In particular, there is a need for a richer understanding of the role of adaptivity in neural learning to provide a foundation upon which architectures and devices may be built. Modern machine learning algorithms lack adaptive learning, in that they are dominated by a costly training phase after which they no longer learn. The brain, on the other hand, is continuously learning and provides a basis from which new mathematical theories may be developed to greatly enrich the computational capabilities of learning systems. Game theory provides one alternative mathematical perspective for analyzing strategic interactions and, as such, is well suited to learning theory.
Continuing to improve computational energy efficiency will soon require developing and deploying new operational paradigms for computation that circumvent the fundamental thermodynamic limits that apply to conventionally-implemented Boolean logic circuits. In particular, Landauer's principle tells us that irreversible information erasure requires a minimum energy dissipation of kT ln 2 per bit erased, where k is Boltzmann's constant and T is the temperature of the available heat sink. However, correctly applying this principle requires carefully characterizing what actually constitutes "information erasure" within a given physical computing mechanism. In this paper, we show that abstract combinational logic networks can validly be considered to contain no information beyond that specified in their input, and that, because of this, appropriately-designed physical implementations of even multi-layer networks can in fact be updated in a single step while incurring no greater theoretical minimum energy dissipation than is required to update their inputs. Furthermore, this energy can approach zero if the network state is updated adiabatically via a reversible transition process. Our novel operational paradigm for updating logic networks suggests an entirely new class of hardware devices and circuits that can be used to reversibly implement Boolean logic with energy dissipation far below the Landauer limit.
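To put a number on the limit (standard arithmetic, assuming a room-temperature heat sink at T = 300 K):

\[ E_{\min} \;=\; kT\ln 2 \;\approx\; (1.38\times10^{-23}\,\mathrm{J/K})\,(300\,\mathrm{K})\,(0.693) \;\approx\; 2.9\times10^{-21}\,\mathrm{J} \;\approx\; 18\,\mathrm{meV} \]

per bit erased; adiabatic, reversible updating is how the paper proposes to dissipate far less than this per logic operation.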
For decades, neural networks have shown promise for next-generation computing, and recent breakthroughs in machine learning techniques, such as deep neural networks, have provided state-of-the-art solutions for inference problems. However, these networks require thousands of training processes and are poorly suited for the precise computations required in scientific or similar arenas. The emergence of dedicated spiking neuromorphic hardware creates a powerful computational paradigm which can be leveraged towards these exact scientific or otherwise objective computing tasks. We forego any learning process and instead construct the network graph by hand. In turn, the networks produce guaranteed success often with easily computable complexity. We demonstrate a number of algorithms exemplifying concepts central to spiking networks including spike timing and synaptic delay. We also discuss the application of cross-correlation particle image velocimetry and provide two spiking algorithms; one uses time-division multiplexing, and the other runs in constant time.
We address practical limits of energy efficiency scaling for logic and memory. Scaling of logic will end with unreliable operation, making computers probabilistic as a side effect. The errors can be corrected or tolerated, but overhead will increase with further scaling. We address the tradeoff between scaling and error correction that yields minimum energy per operation, finding new error correction methods with energy consumption limits about 2× below current approaches. The maximum energy efficiency for memory depends on several other factors. Adiabatic and reversible methods applied to logic have promise, but overheads have precluded practical use. However, the regular array structure of memory arrays tends to reduce overhead and makes adiabatic memory a viable option. This paper reports an adiabatic memory that has been tested at about 85× improvement over standard designs for energy efficiency. Combining these approaches could set energy efficiency expectations for processor-in-memory computing systems.
Bouchard, Kristofer E.; Aimone, James B.; Chun, Miyoung; Dean, T.; Denker, Michael; Diesmann, Markus; Donofrio, David D.; Frank, Loren M.; Kasthuri, Narayanan; Koch, C.; Ruebel, Oliver; Simon, Horst D.; Sommer, Friedrich T.; Prabhat
Opportunities offered by new neuro-technologies are threatened by lack of coherent plans to analyze, manage, and understand the data. High-performance computing will allow exploratory analysis of massive datasets stored in standardized formats, hosted in open repositories, and integrated with simulations.