In many aerospace applications, it is critical to be able to model fluid-structure interactions. In particular, correctly predicting the power spectral density of pressure fluctuations at surfaces can be important for assessing potential resonances and failure modes. Current turbulence modeling methods, such as wall-modeled Large Eddy Simulation and Detached Eddy Simulation, cannot reliably predict these pressure fluctuations for many applications of interest. The focus of this paper is on efforts to use data-driven machine learning methods to learn correction terms for the wall pressure fluctuation spectrum. In particular, the non-locality of the wall pressure fluctuations in a compressible boundary layer is investigated using random forests and neural networks trained and evaluated on Direct Numerical Simulation data.
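As a hedged illustration of the kind of data-driven correction described above, the short Python sketch below fits a random forest to map boundary-layer flow features to a wall-pressure-spectrum correction; the feature and target definitions are hypothetical placeholders, not the quantities used in the study.

# Illustrative sketch only: regress a wall-pressure-spectrum correction on
# hypothetical non-dimensional boundary-layer features using a random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Placeholder features, e.g. a friction Reynolds number, Mach number, and a
# local pressure-gradient parameter extracted from the modeled solution.
X = rng.uniform(size=(n, 3))
# Placeholder target: log-ratio of DNS to modeled wall-pressure PSD at one frequency.
y = 0.5 * X[:, 0] - 0.2 * X[:, 1] ** 2 + 0.1 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out R^2:", model.score(X_te, y_te))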
We establish an atomistic view of the high- and low-temperature phases of iron/steel, as well as some elements of the phase transition between these phases on cooling. In particular, we examine the four most common orientation relationships between the high-temperature austenite and low-temperature ferrite phases seen in experiment. With a thorough understanding of these relationships, we are prepared to set up various atomistic simulations, using techniques such as Density Functional Theory and Molecular Dynamics, to further study the phase transition and, in particular, the quantities needed for Phase Field Modeling, such as the free energies of the bulk phases and the propagation velocity of the phase transition front.
The development of scramjet engines is an important research area for advancing hypersonic and orbital flight. Progress toward optimal engine designs requires both accurate flow simulations and uncertainty quantification (UQ). However, performing UQ for scramjet simulations is challenging due to the large number of uncertain parameters involved and the high computational cost of flow simulations. We address these difficulties by applying a combination of UQ algorithms and numerical methods to the large eddy simulation of the HIFiRE scramjet configuration. First, global sensitivity analysis is conducted to identify influential uncertain input parameters, helping to reduce the stochastic dimension of the problem and to discover sparse representations. Second, as models of different fidelity are available and inevitably used in the overall UQ assessment, a framework for quantifying and propagating the uncertainty due to model error is introduced. These methods are demonstrated on a non-reacting scramjet unit problem with a parameter space of up to 24 dimensions, using 2D and 3D geometries with static and dynamic treatments of the turbulence subgrid model.
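For readers unfamiliar with global sensitivity analysis, the sketch below estimates first-order Sobol indices with a pick-and-freeze (Saltelli-style) Monte Carlo estimator on a cheap stand-in function; in the actual study the evaluations are expensive LES runs and the estimator details may differ.

# Pick-and-freeze estimate of first-order Sobol indices for a stand-in model.
import numpy as np

def model(x):                     # hypothetical quantity of interest
    return x[:, 0] + 2.0 * x[:, 1] ** 2 + 0.5 * x[:, 0] * x[:, 2]

rng = np.random.default_rng(1)
d, N = 3, 20000
A = rng.uniform(size=(N, d))
B = rng.uniform(size=(N, d))
fA, fB = model(A), model(B)
var = np.var(np.concatenate([fA, fB]))

for i in range(d):
    ABi = A.copy()
    ABi[:, i] = B[:, i]           # vary only the i-th input
    S1 = np.mean(fB * (model(ABi) - fA)) / var
    print(f"first-order Sobol index S_{i+1} ~ {S1:.3f}")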
When solving partial differential equations (PDEs) with random inputs, it is often computationally inefficient to merely propagate samples of the input probability law (or an approximation thereof) because the input law may not accurately capture the behavior of critical system responses that depend on the PDE solution. To further complicate matters, in many applications it is critical to accurately approximate the “risk” associated with the statistical tails of the system responses, not just the statistical moments. In this paper, we develop an adaptive sampling and local reduced basis method for approximately solving PDEs with random inputs. Our method determines a set of parameter atoms and an associated (implicit) Voronoi partition of the parameter domain on which we build local reduced basis approximations of the PDE solution. In addition, we extend our adaptive sampling approach to accurately compute measures of risk evaluated at quantities of interest that depend on the PDE solution.
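A minimal sketch of the local reduced basis idea follows, assuming a set of parameter atoms and one basis per atom: a new parameter is assigned to its nearest atom (its implicit Voronoi cell) and approximated in that atom's local basis. All names, dimensions, and the stand-in solver are illustrative, not the method's actual components.

# Nearest-atom (Voronoi cell) assignment plus projection onto a local basis.
import numpy as np

atoms = np.array([[0.2, 0.3], [0.7, 0.8], [0.5, 0.1]])          # parameter atoms
# One local basis per atom; columns would span snapshots gathered near that atom.
local_bases = [np.linalg.qr(np.random.rand(100, 5))[0] for _ in atoms]

def local_rb_approx(mu, full_order_solve):
    """Approximate u(mu) in the basis attached to the nearest parameter atom."""
    k = int(np.argmin(np.linalg.norm(atoms - mu, axis=1)))      # Voronoi cell of mu
    V = local_bases[k]
    u = full_order_solve(mu)      # stand-in; a real RB method solves a small projected system
    return V @ (V.T @ u)          # projection onto the local reduced basis

u_hat = local_rb_approx(np.array([0.25, 0.35]), lambda mu: np.random.rand(100))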
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Haddock, Walker; Curry, Matthew L.; Bangalore, Purushotham V.; Skjellum, Anthony
High-performance computing (HPC) demands high bandwidth and low latency in I/O performance, which has driven the development of storage systems and I/O software components that strive to provide ever greater performance. However, capital and energy budgets, along with increasing storage capacity requirements, have motivated the search for lower-cost, large storage systems for HPC. With Burst Buffer technology increasing the bandwidth and reducing the latency of I/O between the compute and storage systems, the back-end storage bandwidth and latency requirements can be reduced, especially underneath an adequately sized modern parallel file system. Cloud computing has led to the development of large, low-cost storage solutions whose design has focused on high capacity, availability, and low energy consumption at lowest cost. Cloud storage systems leverage replication and erasure coding technology to provide high availability at much lower cost than traditional HPC storage systems. Leveraging certain cloud storage infrastructure and concepts in HPC would be economically valuable in terms of cost-effective performance for certain storage tiers. To enable the use of cloud storage technologies for HPC, we study the architecture for interfacing cloud storage between the HPC parallel file systems and the archive storage. In this paper, we report our comparison of two erasure coding implementations for the Ceph file system. We compare measurements of various degrees of sharding that are relevant for HPC applications. We show that the Gibraltar GPU erasure coding library outperforms a CPU implementation of an erasure coding plugin for the Ceph object storage system, opening the potential for new ways to architect such storage systems based on Ceph.
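To make the erasure coding terminology concrete, the toy Python sketch below splits an object into k data shards plus a single XOR parity shard and rebuilds one lost shard. Production Ceph plugins (and the Gibraltar GPU library) use Reed-Solomon codes supporting multiple parity shards; this is only a minimal illustration of sharding and recovery.

# Toy erasure coding: k data shards + 1 XOR parity shard, single-shard recovery.
def encode(data: bytes, k: int):
    shard_len = -(-len(data) // k)                        # ceiling division
    shards = [bytearray(data[i*shard_len:(i+1)*shard_len].ljust(shard_len, b"\0"))
              for i in range(k)]
    parity = bytearray(shard_len)
    for s in shards:
        parity = bytearray(a ^ b for a, b in zip(parity, s))
    return shards, parity

def recover(shards, parity, lost_index):
    rebuilt = bytearray(parity)
    for i, s in enumerate(shards):
        if i != lost_index:
            rebuilt = bytearray(a ^ b for a, b in zip(rebuilt, s))
    return rebuilt

shards, parity = encode(b"hello erasure coded world", k=4)
assert recover(shards, parity, lost_index=2) == shards[2]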
High performance computing (HPC) is undergoing a dramatic change in computing architectures. Next-generation HPC systems are being based primarily on many-core processing units and general purpose graphics processing units (GPUs). A computing node on a next-generation system can be, and in practice is, heterogeneous in nature, involving multiple memory spaces and multiple execution spaces. This presents a challenge for the development of application codes that wish to compute at the extreme scales afforded by these next-generation HPC technologies and systems: the best parallel programming model for one system is not necessarily the best parallel programming model for another. This inevitably raises the following question: how does an application code achieve high performance on disparate computing architectures without having entirely different, or at least significantly different, code paths, one for each architecture? This question has given rise to the term ‘performance portability’, a notion concerned with porting application code performance from architecture to architecture using a single code base. In this paper, we present the work being done at Sandia National Laboratories to develop a performance-portable compressible CFD code that targets the ‘leadership-class’ supercomputers the National Nuclear Security Administration (NNSA) is acquiring over the course of the next decade.
Visual inspection research has a long history spanning the 20th century and continuing to the present day. Current efforts in multiple venues demonstrate that visual inspection continues to play a vital role in many different types of tasks in the 21st century. The nature of this role spans the range from traditional human visual inspection to fully automated detection of defects. Consequently, today's practitioners must not only identify and apply lessons learned from the past, but also explore new areas of research in order to derive solutions for modern-day issues, such as those presented by introducing automation during inspection. A key lesson from past research is that the factors that can degrade performance will persist today unless care is taken to design the inspection process appropriately.
The purpose of this document is to compare and contrast metrics that may be considered for use in validating computational models. Metrics suitable for use in one application, scenario, and/or quantity of interest may not be acceptable in another; these notes merely provide information that may be used as guidance in selecting a validation metric.
We consider heuristic and optimal solutions to a discrete geometric bin packing problem that arises in a resource allocation problem. An imaging sensor is assigned to collect data over a large area, but some subregions are more valuable than others. To capture these high-value regions with higher fidelity, we can assign some number of non-overlapping rectangular subsets, called “subfootprints.” The sensor image is partitioned into squares called “chips,” and each chip is further partitioned into pixels. Pixels may have different values. Subfootprints are restricted to rectangular collections of chips, but we are free to choose different rectangle heights, widths, and areas. We seek the optimal arrangement over the family of possible rectangle shapes and sizes. We provide a mixed-integer linear programming formulation, as well as a greedy heuristic, to solve this problem. For the meta-problem, we have some freedom to align the chip boundaries to different pixels, but it is too expensive to solve the optimization formulation for each alignment. We show, however, that the greedy heuristic can inform which pixel alignments are worth solving the optimization over. We use a variant of k-means clustering to group greedy solutions by a transport-based measure of shape similarity. For each cluster, we run the optimization problem over the greedy layout with the highest value. In practice this efficiently explores the geometric configuration space and produces solutions close to the global optimum. We show a contrived example using surveillance of the Mississippi River. Our software is available as open source in the GitHub repository “GeoPlace.”
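The greedy heuristic can be pictured with the hedged sketch below: repeatedly place the highest-value rectangle of chips, drawn from an allowed set of shapes, that does not overlap anything already placed. The value grid, shape family, and stopping rule here are placeholders; the paper's formulation contains the real constraints.

# Greedy placement of non-overlapping rectangular "subfootprints" over a chip grid.
import numpy as np

value = np.random.rand(8, 10)           # per-chip values (stand-in data)
shapes = [(2, 3), (3, 2), (2, 2)]       # allowed rectangle shapes (rows, cols)
occupied = np.zeros_like(value, dtype=bool)

def best_placement():
    best = None
    for h, w in shapes:
        for r in range(value.shape[0] - h + 1):
            for c in range(value.shape[1] - w + 1):
                if occupied[r:r+h, c:c+w].any():
                    continue
                v = value[r:r+h, c:c+w].sum()
                if best is None or v > best[0]:
                    best = (v, r, c, h, w)
    return best

layout = []
for _ in range(4):                       # place up to four subfootprints
    placed = best_placement()
    if placed is None:
        break
    v, r, c, h, w = placed
    occupied[r:r+h, c:c+w] = True
    layout.append(placed)
print(layout)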
Information loss from a computation implies energy dissipation due to Landauer’s Principle. Thus, increasing the amount of useful computational work that can be accomplished within a given energy budget will eventually require increasing the degree to which our computing technologies avoid information loss, i.e., are logically reversible. But the traditional definition of logical reversibility is actually more restrictive than is necessary to avoid information loss and energy dissipation due to Landauer’s Principle. As a result, the operations that have traditionally been viewed as the atomic elements of reversible logic, such as Toffoli gates, are not really the simplest primitives that one can use for the design of reversible hardware. Arguably, a complete theoretical framework for reversible computing should provide a more general, parsimonious foundation for practical engineering. To this end, we use a rigorous quantitative formulation of Landauer’s Principle to develop the theory of Generalized Reversible Computing (GRC), which precisely characterizes the minimum requirements for a computation to avoid information loss and the consequent energy dissipation, showing that a much broader range of computations are, in fact, reversible than is acknowledged by traditional reversible computing theory. This paper summarizes the foundations of GRC theory and briefly presents a few of its applications.
Water utilities are vulnerable to a wide variety of human-caused and natural disasters. The Water Network Tool for Resilience (WNTR) is a new open source Python™ package designed to help water utilities investigate the resilience of water distribution systems to hazards and evaluate resilience-enhancing actions. In this paper, the WNTR modeling framework is presented and a case study is described that uses WNTR to simulate the effects of an earthquake on a water distribution system. The case study illustrates that the severity of damage is not only a function of system integrity and earthquake magnitude, but also of the available resources and repair strategies used to return the system to normal operating conditions. While earthquakes are particularly concerning, since buried water distribution pipelines are highly susceptible to damage, the software framework can be applied to other types of hazards, including power outages and contamination incidents.
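As a brief orientation to the package, the sketch below loads an EPANET network model with WNTR, runs a hydraulic simulation, and reads back node pressures. The input file path is a placeholder, and module paths reflect the documented WNTR interface but may differ across versions; the earthquake fragility and repair-strategy features used in the case study are layered on top of this basic workflow.

# Minimal (hedged) WNTR workflow: load a network, simulate hydraulics, inspect pressures.
import wntr

wn = wntr.network.WaterNetworkModel("Net3.inp")    # placeholder EPANET .inp file
sim = wntr.sim.EpanetSimulator(wn)
results = sim.run_sim()
pressure = results.node["pressure"]                # time-indexed node pressures
print(pressure.head())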
With the growing complexity and scale of high performance computing (HPC) systems, application performance variation has become a significant challenge in efficient and resilient system management. Application performance variation can be caused by resource contention as well as software- and firmware-related problems, and can lead to premature job termination, reduced performance, and wasted compute platform resources. To effectively alleviate this problem, system administrators must detect and identify the anomalies that are responsible for performance variation and take preventive actions. However, diagnosing anomalies is often a difficult task given the vast amount of noisy and high-dimensional data being collected via a variety of system monitoring infrastructures. In this paper, we present a novel framework that uses machine learning to automatically diagnose previously encountered performance anomalies in HPC systems. Our framework leverages resource usage and performance counter data collected during application runs. We first convert the collected time series data into statistical features that retain application characteristics to significantly reduce the computational overhead of our technique. We then use machine learning algorithms to learn anomaly characteristics from this historical data and to identify the types of anomalies observed while running applications. We evaluate our framework both on an HPC cluster and on a public cloud, and demonstrate that our approach outperforms current state-of-the-art techniques in detecting anomalies, reaching an F-score over 0.97.
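The two-stage approach can be summarized by the sketch below: collapse each monitored time series into a handful of statistical features, then train a supervised classifier on labeled historical runs to diagnose the anomaly type. The specific features, labels, and classifier here are illustrative stand-ins for those developed in the paper.

# Statistical feature extraction from monitoring time series, then classification.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def featurize(series_matrix):
    # series_matrix: (n_metrics, n_timesteps) resource-usage/counter data for one run
    return np.concatenate([series_matrix.mean(axis=1),
                           series_matrix.std(axis=1),
                           np.percentile(series_matrix, 95, axis=1)])

rng = np.random.default_rng(2)
runs = [rng.normal(loc=rng.uniform(0, 3), size=(10, 500)) for _ in range(200)]
X = np.array([featurize(r) for r in runs])
y = rng.integers(0, 3, size=len(runs))             # anomaly-type labels (synthetic)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict(X[:5]))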
We show how to place a set of seed points such that a given piecewise linear complex is the union of some faces in the resulting Voronoi diagram. The seeds are placed on sufficiently small spheres centered at input vertices and are arranged into little circles around each half-edge where every seed is mirrored across the associated triangle. The Voronoi faces common to the seeds of such arrangements yield a mesh conforming to the input complex. If the input contains sharp angles, then additional seeds are needed, analogous to nonobtuse refinement. Finally, we propose local optimizations to reduce the number of seeds and output facets.
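The mirroring step relies on a standard reflection across the plane of a triangle, so that the two mirrored seeds share a Voronoi face lying on that triangle. The sketch below shows only this geometric primitive; the full seed-placement and refinement logic is described in the paper.

# Reflect a candidate seed across the supporting plane of a triangle (a, b, c).
import numpy as np

def mirror_across_triangle(p, a, b, c):
    n = np.cross(b - a, c - a)
    n = n / np.linalg.norm(n)                      # unit normal of the triangle's plane
    return p - 2.0 * np.dot(p - a, n) * n

a, b, c = np.array([0., 0., 0.]), np.array([1., 0., 0.]), np.array([0., 1., 0.])
seed = np.array([0.2, 0.2, 0.05])
print(mirror_across_triangle(seed, a, b, c))       # -> [0.2, 0.2, -0.05]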
Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today's network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload simple packet processing functions to the network card. To demonstrate the potential of the model, we design a cycle-accurate simulation environment by combining the network simulator LogGOPSim and the CPU simulator gem5. We implement offloaded message matching, datatype processing, and collective communications and demonstrate transparent full-application speedups. Furthermore, we show how sPIN can be used to accelerate redundant in-memory filesystems and several other use cases. Our work investigates a portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL. We show how such network acceleration enables an ecosystem that can significantly speed up applications and system services.
The Rim-to-Rim Wearables At The Canyon for Health (R2R WATCH) study examines metrics recordable on commercial off-the-shelf (COTS) devices that are most relevant and reliable for the earliest possible indication of a health or performance decline. This is accomplished through a collaboration between Sandia National Laboratories (SNL) and The University of New Mexico (UNM), in which the two organizations team up to collect physiological, cognitive, and biological markers from volunteer hikers who attempt the Rim-to-Rim (R2R) hike at the Grand Canyon. Three forms of data are collected as hikers travel from rim to rim: physiological data through wearable devices, cognitive data through a cognitive task taken every 3 hours, and blood samples obtained before and after completing the hike. Data are collected from both civilian and warfighter hikers. Once the data are obtained, they are analyzed to understand the effectiveness of each COTS device and the validity of the data collected. We also aim to identify which physiological and cognitive phenomena collected by wearable devices are most closely related to overall health and task performance in extreme environments, and to ascertain which of these markers provide the earliest yet reliable indication of health decline. Finally, we analyze the data for significant differences between civilians’ and warfighters’ markers and their relationship to performance. The main portion of the R2R WATCH study is funded by the Defense Threat Reduction Agency (DTRA, Project CB10359; SAND2017-1872 C), while UNM currently funds all activities related to bloodwork. This paper describes the experimental design and methodology for the first year of the R2R WATCH project.
Industry's inability to reduce logic gates' energy consumption is slowing growth in an important part of the worldwide economy. Some scientists argue that alternative approaches could greatly reduce energy consumption. These approaches entail myriad technical and political issues.
Approximate methods for electronic structure, implemented in sophisticated computer codes and married to ever-more-powerful computing platforms, have become invaluable in chemistry and materials science. The maturing and consolidation of quantum chemistry codes since the 1980s, based upon explicitly correlated electronic wave functions, has made them a staple of modern molecular chemistry. In physics and materials science, however, the impact of first-principles electronic structure has lagged, owing to the extra formal and computational demands of bulk calculations.
We investigate a novel application of deep neural networks to modeling errors in the prediction of surface pressure fluctuations beneath a compressible, turbulent flow. In this context, the truth solution is given by Direct Numerical Simulation (DNS) data, while the predictive model is a wall-modeled Large Eddy Simulation (LES). The neural network provides a means to map relevant statistical flow features within the LES solution to errors in the prediction of wall pressure spectra. We simulate a number of flat plate turbulent boundary layers using both DNS and wall-modeled LES to build up a database with which to train the neural network. We then apply machine learning techniques to develop an optimized neural network model for the error in terms of relevant flow features.
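A hedged sketch of the mapping described above follows: a small fully connected network is fit from LES-derived statistical features to the error in the predicted wall-pressure spectrum at a set of frequencies. The feature and target definitions are synthetic placeholders, not those used in the study.

# Fit a small neural network mapping LES features to spectral prediction errors.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
n_samples, n_features, n_freqs = 500, 6, 32
X = rng.normal(size=(n_samples, n_features))             # LES-derived features (placeholder)
E = np.tanh(X @ rng.normal(size=(n_features, n_freqs)))  # spectral errors (synthetic)

net = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
net.fit(X, E)
print("training R^2:", net.score(X, E))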
This paper examines the variability of predicted responses when multiple stress-strain curves (reflecting variability from replicate material tests) are propagated through a transient dynamics finite element model of a ductile steel can being slowly crushed. An elastic-plastic constitutive model is employed in the large-deformation simulations. Over 70 response quantities of interest (including displacements, stresses, strains, and calculated measures of material damage) are tracked in the simulations. Each response quantity’s behavior varies according to the particular stress-strain curves used for the materials in the model. The present work assigns the same material to all the can parts: lids, walls, and weld. We desire to estimate response variability due to variability of the input material curves. When only a few stress-strain curve samples are available from material testing, response variance will usually be significantly underestimated. This is undesirable for many engineering purposes. A simple classical statistical method, Tolerance Intervals, is tested for effectively compensating for sparse stress-strain curve data. The method is found to perform well on the highly nonlinear input-to-output response mappings and non-standard response distributions in the can-crush problem. The results and discussion in this paper, and further studies referenced, support a proposition that the method will apply similarly well for other sparsely sampled random functions.
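For concreteness, the sketch below computes a classical two-sided normal tolerance interval using Howe's approximation, the kind of sparse-data compensation the paper evaluates; the sample values are made up, and the paper's exact procedure and coverage/confidence settings may differ.

# Two-sided (coverage, confidence) normal tolerance interval via Howe's approximation.
import numpy as np
from scipy import stats

def two_sided_tolerance_interval(x, coverage=0.90, conf=0.95):
    n = len(x)
    nu = n - 1
    z = stats.norm.ppf(0.5 + coverage / 2.0)
    chi2 = stats.chi2.ppf(1.0 - conf, df=nu)              # lower-tail chi-square quantile
    k = np.sqrt(nu * (1.0 + 1.0 / n) * z**2 / chi2)       # Howe's tolerance factor
    xbar, s = np.mean(x), np.std(x, ddof=1)
    return xbar - k * s, xbar + k * s

samples = np.array([412., 395., 430., 405., 418.])        # e.g. one response from 5 runs
print(two_sided_tolerance_interval(samples))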
Emerging novel architectures for shared memory parallel computing are incorporating increasingly creative innovations to deliver higher memory performance. A notable exemplar of this phenomenon is the Multi-Channel DRAM (MCDRAM) included in the Intel® Xeon Phi™ processors. In this paper, we examine techniques to use OpenMP to exploit the high bandwidth of MCDRAM by staging data. In particular, we implement double buffering using OpenMP sections and tasks to explicitly manage movement of data into MCDRAM. We compare our double-buffered approach to a non-buffered implementation and to Intel’s cache mode, in which the system manages the MCDRAM as a transparent cache. We also demonstrate the sensitivity of performance to parameters such as dataset size and the distribution of threads between compute and copy operations.
Silicon-based metal-oxide-semiconductor quantum dots are prominent candidates for high-fidelity, manufacturable qubits. Due to silicon's band structure, additional low-energy states persist in these devices, presenting both challenges and opportunities. Although the physics governing these valley states has been the subject of intense study, quantitative agreement between experiment and theory remains elusive. Here, we present data from an experiment probing the valley states of quantum dot devices and develop a theory that is in quantitative agreement with both this and a recently reported experiment. Through sampling millions of realistic cases of interface roughness, our method provides evidence that the valley physics between the two samples is essentially the same.
We propose a new particle-in-cell (PIC) method for the simulation of plasmas based on a recently developed, unconditionally stable solver for the wave equation. This method is not subject to a CFL restriction limiting the ratio of the time step size to the spatial step size, typical of explicit methods, while maintaining computational cost and code complexity comparable to such explicit schemes. We describe the implementation in one and two dimensions for both electrostatic and electromagnetic cases, and present the results of several standard test problems, showing good agreement with theory at time step sizes much larger than typical CFL restrictions allow.
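For reference, the restriction that explicit wave-equation schemes must satisfy, and that the unconditionally stable solver avoids, is the standard CFL condition:

\[
  \frac{c\,\Delta t}{\Delta x} \le C_{\max},
  \qquad C_{\max} = \mathcal{O}(1) \quad (\text{e.g., } C_{\max} = 1 \text{ for the 1D explicit scheme}),
\]

where \(c\) is the wave speed, \(\Delta t\) the time step, and \(\Delta x\) the spatial step.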
Here, first-principles molecular dynamics simulations were used to investigate the dissociation of sarin (GB) on the calcium silicate hydrate (CSH) mineral tobermorite (TBM), a surrogate for cement. CSH minerals (including TBM) and amorphous materials of similar composition are the major components of Portland cement, the binding agent of concrete. Metadynamics simulations were used to investigate the effect of the TBM surface and confinement in a microscale pore on the mechanism and free energy of dissociation of GB. Our results indicate that both the adsorption site and the humidity of the local environment significantly affect the sarin dissociation energy. In particular, sarin dissociation in a low-water environment occurs via a dealkylation mechanism, which is consistent with previous experimental studies.
2016 IEEE International Conference on Rebooting Computing, ICRC 2016 - Conference Proceedings
DeBenedictis, Erik; Frank, Michael P.; Ganesh, Natesh; Anderson, Neal G.
At roughly kT energy dissipation per operation, the thermodynamic energy efficiency "limits" of Moore's Law were unimaginably far off in the 1960s. However, current computers operate at only 100-10,000 times this limit, forming an argument that historical rates of efficiency scaling must soon slow. This paper reviews the justification for the ∼kT per operation limit in the context of processors for von Neumann-class computer architectures of the 1960s. We then reapply the fundamental arguments to contemporary applications and identify a new direction for future computing in which the ultimate efficiency limits would be much further out. New nanodevices with high-level functions that aggregate the functionality of several logic gates and some local memory may be the right building blocks for much more energy-efficient execution of emerging applications such as neural networks.
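For reference, the Landauer-style bound alluded to above, evaluated at room temperature (T = 300 K), is

\[
  E_{\min} = kT\ln 2 \approx (1.38\times10^{-23}\,\mathrm{J/K})(300\,\mathrm{K})(0.693)
  \approx 2.9\times10^{-21}\,\mathrm{J} \approx 0.018\,\mathrm{eV},
\]

so operating at 100-10,000 times this limit corresponds to roughly 10^-19 to 10^-17 J per operation.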
Amidst the rising impact of machine learning and the popularity of deep neural networks, learning theory is not a solved problem. With the emergence of neuromorphic computing as a means of addressing the von Neumann bottleneck, it is not simply a matter of employing existing algorithms on new hardware technology; rather, richer theory is needed to guide advances. In particular, there is a need for a richer understanding of the role of adaptivity in neural learning to provide a foundation upon which architectures and devices may be built. Modern machine learning algorithms lack adaptive learning, in that they are dominated by a costly training phase after which they no longer learn. The brain, on the other hand, learns continuously and provides a basis from which new mathematical theories may be developed to greatly enrich the computational capabilities of learning systems. Game theory provides one alternative mathematical perspective for analyzing strategic interactions and, as such, is well suited to learning theory.
Continuing to improve computational energy efficiency will soon require developing and deploying new operational paradigms for computation that circumvent the fundamental thermodynamic limits that apply to conventionally-implemented Boolean logic circuits. In particular, Landauer's principle tells us that irreversible information erasure requires a minimum energy dissipation of kT ln 2 per bit erased, where k is Boltzmann's constant and T is the temperature of the available heat sink. However, correctly applying this principle requires carefully characterizing what actually constitutes "information erasure" within a given physical computing mechanism. In this paper, we show that abstract combinational logic networks can validly be considered to contain no information beyond that specified in their input, and that, because of this, appropriately-designed physical implementations of even multi-layer networks can in fact be updated in a single step while incurring no greater theoretical minimum energy dissipation than is required to update their inputs. Furthermore, this energy can approach zero if the network state is updated adiabatically via a reversible transition process. Our novel operational paradigm for updating logic networks suggests an entirely new class of hardware devices and circuits that can be used to reversibly implement Boolean logic with energy dissipation far below the Landauer limit.
For decades, neural networks have shown promise for next-generation computing, and recent breakthroughs in machine learning techniques, such as deep neural networks, have provided state-of-the-art solutions for inference problems. However, these networks require thousands of training processes and are poorly suited for the precise computations required in scientific or similar arenas. The emergence of dedicated spiking neuromorphic hardware creates a powerful computational paradigm which can be leveraged towards these exact scientific or otherwise objective computing tasks. We forego any learning process and instead construct the network graph by hand. In turn, the networks produce guaranteed success often with easily computable complexity. We demonstrate a number of algorithms exemplifying concepts central to spiking networks including spike timing and synaptic delay. We also discuss the application of cross-correlation particle image velocimetry and provide two spiking algorithms; one uses time-division multiplexing, and the other runs in constant time.
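As background for the cross-correlation PIV application, the sketch below shows the classical (non-spiking) correlation step: two interrogation windows are correlated via the FFT and the displacement is read off the correlation peak. The spiking implementations in the paper realize this operation differently; this sketch only illustrates the underlying computation.

# Classical cross-correlation PIV step: estimate the pixel shift between two windows.
import numpy as np

def piv_displacement(win_a, win_b):
    A = np.fft.fft2(win_a - win_a.mean())
    B = np.fft.fft2(win_b - win_b.mean())
    corr = np.fft.fftshift(np.real(np.fft.ifft2(A * np.conj(B))))
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    center = np.array(corr.shape) // 2
    return center - np.array(peak)                  # integer-pixel displacement estimate

rng = np.random.default_rng(4)
frame1 = rng.random((32, 32))
frame2 = np.roll(frame1, shift=(3, -2), axis=(0, 1))   # pattern shifted by (3, -2) pixels
print(piv_displacement(frame1, frame2))                # -> [ 3 -2]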
We address practical limits of energy efficiency scaling for logic and memory. Scaling of logic will end with unreliable operation, making computers probabilistic as a side effect. The errors can be corrected or tolerated, but overhead will increase with further scaling. We examine the tradeoff between scaling and error correction that yields the minimum energy per operation, finding new error correction methods with energy consumption limits about 2× below current approaches. The maximum energy efficiency for memory depends on several other factors. Adiabatic and reversible methods applied to logic have promise, but overheads have precluded practical use. However, the regular array structure of memory arrays tends to reduce overhead and makes adiabatic memory a viable option. This paper reports an adiabatic memory that has been tested at about an 85× improvement in energy efficiency over standard designs. Combining these approaches could set energy efficiency expectations for processor-in-memory computing systems.
Bouchard, Kristofer E.; Aimone, James B.; Chun, Miyoung; Dean, T.; Denker, Michael; Diesmann, Markus; Donofrio, David D.; Frank, Loren M.; Kasthuri, Narayanan; Koch, C.; Ruebel, Oliver; Simon, Horst D.; Sommer, Friedrich T.; Prabhat
Opportunities offered by new neuro-technologies are threatened by lack of coherent plans to analyze, manage, and understand the data. High-performance computing will allow exploratory analysis of massive datasets stored in standardized formats, hosted in open repositories, and integrated with simulations.