For at least the last 20 years, many have tried to create a general resource management system to support interoperability across various concurrent libraries. Previous strategies all suffered from additional toolchain requirements and/or a shared programming model that assumed it owned or controlled access to all resources available to the program, and none achieved widespread adoption. The ubiquity of OpenMP, coupled with C++ developing a standard way to describe many different concurrent paradigms (C++23 executors), would allow OpenMP to assume the role of a general resource manager without requiring user code to be written directly in OpenMP. With a few added features, such as the ability to use otherwise idle threads to execute tasks and to specify a task "width", many interesting concurrent frameworks could be developed in native OpenMP and achieve high performance. Further, one could create concrete C++ OpenMP executors that support general C++ executor-based codes, which would allow Fortran, C, and C++ codes to share the same underlying concurrent framework whether expressed as native OpenMP or using language-specific features. Effectively, OpenMP would become the de facto solution for a problem that has long plagued the HPC community.
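As a concrete point of reference, the following minimal C++ sketch shows the existing OpenMP tasking model that such frameworks would build on. The proposed "width" clause does not exist in current OpenMP, so the comment only marks where such a hypothetical extension might appear.

```cpp
#include <cstdio>
#include <omp.h>

// Baseline OpenMP tasking (compile with e.g. -fopenmp). The proposed
// "width" clause (a team of threads per task) is hypothetical and is
// shown only as a comment; it is not valid OpenMP today.
int main() {
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < 8; ++i) {
            // Hypothetical extension: "#pragma omp task width(4)" could
            // request four threads, possibly otherwise idle ones, for
            // this task.
            #pragma omp task firstprivate(i)
            {
                std::printf("task %d ran on thread %d\n",
                            i, omp_get_thread_num());
            }
        }
        #pragma omp taskwait
    }
    return 0;
}
```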
As clock speeds have stagnated, the number of cores per node has drastically increased to improve processor throughput, yet most scalable system software was designed and developed for single-threaded environments. Multithreaded environments are becoming increasingly prominent as application developers optimize their codes to leverage the full performance of the processor; however, these environments are incompatible with a number of assumptions that have driven scalable system software development. This paper presents a case study of this mismatch, focusing on MPI message matching. MPI message matching has been designed and optimized for traditional serial execution. The reduced determinism in the order of MPI calls can significantly reduce the performance of MPI message matching, potentially exceeding the time-per-iteration targets of many applications. Several proposed techniques attempt to address these issues and enable multithreaded MPI usage. These approaches highlight a number of tradeoffs that make adapting MPI message matching complex. This case study and its proposed solutions highlight a number of general concepts that need to be leveraged in the design of next-generation scalable system software.
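To make the cost concrete, here is an illustrative sketch, not any real MPI implementation's code, of the classic posted-receive matching scan whose behavior degrades under multithreading; the type and function names are ours.

```cpp
#include <list>
#include <optional>

// Illustrative posted-receive match list: an arriving message is matched
// by a linear scan on (source, tag), honoring wildcards. When threads
// post receives in nondeterministic order, the list grows and this O(n)
// scan is the cost the case study above measures.
constexpr int ANY_SOURCE = -1;   // stand-in for MPI_ANY_SOURCE
constexpr int ANY_TAG    = -1;   // stand-in for MPI_ANY_TAG

struct PostedRecv {
    int   source;
    int   tag;
    void* buffer;
};

std::optional<PostedRecv> match(std::list<PostedRecv>& posted,
                                int src, int tag)
{
    for (auto it = posted.begin(); it != posted.end(); ++it) {
        const bool src_ok = (it->source == ANY_SOURCE || it->source == src);
        const bool tag_ok = (it->tag    == ANY_TAG    || it->tag    == tag);
        if (src_ok && tag_ok) {
            PostedRecv hit = *it;   // first match wins (MPI ordering rule)
            posted.erase(it);
            return hit;
        }
    }
    return std::nullopt;  // unmatched: message joins the unexpected queue
}
```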
In this paper, we develop software for decomposing sparse tensors that is portable to and performant on a variety of multicore, manycore, and GPU computing architectures. The result is a single code whose performance matches optimized architecture-specific implementations. The key to a portable approach is to determine multiple levels of parallelism that can be mapped in different ways to different architectures, and we explain how to do this for the matricized tensor times Khatri-Rao product (MTTKRP), which is the key kernel in canonical polyadic tensor decomposition. Our implementation leverages the Kokkos framework, which enables a single code to achieve high performance across multiple architectures that differ in how they approach fine-grained parallelism. We also introduce a new construct for portable thread-local arrays, which we call compile-time polymorphic arrays. Not only are the specifics of our approaches and implementation interesting for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes. As a last step in optimizing performance, we modify the MTTKRP algorithm itself to do a permuted traversal of tensor nonzeros to reduce atomic-write contention. We test the performance of our implementation on 16- and 68-core Intel CPUs and the K80 and P100 NVIDIA GPUs, showing that we are competitive with state-of-the-art architecture-specific codes while having the advantage of being able to run on a variety of architectures.
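The kernel itself can be summarized compactly. The sketch below is plain C++ with OpenMP rather than the paper's Kokkos implementation; it shows mode-0 MTTKRP over coordinate-format nonzeros and the atomic update whose contention the permuted traversal reduces. Names and data layout are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Mode-0 MTTKRP for a third-order sparse tensor in coordinate form:
//   M(i,:) += x(i,j,k) * ( B(j,:) .* C(k,:) )  for each nonzero (i,j,k).
// Distinct nonzeros can share a row i, so the update needs an atomic.
struct Coo { std::vector<int> i, j, k; std::vector<double> val; };

void mttkrp_mode0(const Coo& X, int R,
                  const std::vector<double>& B,   // J x R, row-major
                  const std::vector<double>& C,   // K x R, row-major
                  std::vector<double>& M)         // I x R, row-major
{
    const std::size_t nnz = X.val.size();
    #pragma omp parallel for
    for (std::size_t n = 0; n < nnz; ++n) {
        const double v = X.val[n];
        const int bi = X.j[n] * R, ci = X.k[n] * R, mi = X.i[n] * R;
        for (int r = 0; r < R; ++r) {
            const double contrib = v * B[bi + r] * C[ci + r];
            #pragma omp atomic
            M[mi + r] += contrib;   // contention here motivates permutation
        }
    }
}
```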
The Artificial Intelligence (AI) revolution foretold during the 1960s is well underway in the second decade of the twenty-first century. Its period of phenomenal growth likely lies ahead. AI-operated machines and technologies will extend the reach of Homo sapiens far beyond the biological constraints imposed by evolution: outward further into deep space, as well as inward into the nano-world of DNA sequences and relevant medical applications. And yet, we believe, there are crucial lessons that biology can offer that will enable a prosperous future for AI. For machines in general, and for AIs especially, operating over extended periods or in extreme environments will require energy usage orders of magnitude more efficient than exists today. In many operational environments, energy sources will be constrained. An AI's design and function may depend on the type of energy source, as well as its availability and accessibility. Any plans for AI devices operating in a challenging environment must begin with the question of how they are powered, where fuel is located, how energy is stored and made available to the machine, and how long the machine can operate on specific energy units. While one of the key advantages of AI use is to reduce the dimensionality of a complex problem, the fact remains that some energy is required for functionality. Hence, the materials and technologies that provide the needed energy represent a critical challenge for future use scenarios of AI and should be integrated into their design. Here we look to the brain and other aspects of biology as inspiration for Biomimetic Research for Energy-efficient AI Designs (BREAD).
The MPI multithreading model has been historically difficult to optimize; the interface that it provides for threads was designed as a process-level interface. This model has led to implementations that treat function calls as critical regions and protect them with locks to avoid race conditions. We hypothesize that an interface designed specifically for threads can provide superior performance to current approaches and even outperform single-threaded MPI. In this paper, we describe a design for partitioned communication in MPI that we call finepoints. First, we assess the existing communication models for MPI two-sided communication and then introduce finepoints as a hybrid that combines the best features of each existing MPI communication model. In addition, partitioned communication created with finepoints leverages new network hardware features that cannot be exploited with current MPI point-to-point semantics, making this new approach useful both now and in the future. To demonstrate the validity of our hypothesis, we implement a finepoints library and show improvements against a state-of-the-art multithreaded optimized Open MPI implementation on a Cray XC40 with an Aries network. Our experiments demonstrate up to a 12× reduction in wait time for completion of send operations. This new model is shown working on a nuclear reactor physics neutron-transport proxy application, providing up to a 26.1% improvement in communication time and up to a 4.8% improvement in runtime over the best-performing MPI communication mode, single-threaded MPI.
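To show the shape of such an interface, the sketch below uses the partitioned-communication calls later standardized in MPI 4.0 (MPI_Psend_init, MPI_Pready), which follow the same design; the finepoints library's own API may differ, so treat this as an approximation.

```cpp
#include <mpi.h>
#include <vector>

// Partitioned send in the MPI 4.0 style: each thread marks its own
// partition ready as soon as it finishes producing it, letting the
// library push data to the NIC incrementally instead of serializing on
// one lock. Requires MPI initialized with MPI_THREAD_MULTIPLE.
void partitioned_send(std::vector<double>& buf, int parts, int dest,
                      MPI_Comm comm)
{
    const MPI_Count per_part = buf.size() / parts;   // assume divisible
    MPI_Request req;
    MPI_Psend_init(buf.data(), parts, per_part, MPI_DOUBLE,
                   dest, /*tag=*/0, comm, MPI_INFO_NULL, &req);
    MPI_Start(&req);
    #pragma omp parallel for
    for (int p = 0; p < parts; ++p) {
        // ... each thread fills buf[p*per_part .. (p+1)*per_part) ...
        MPI_Pready(p, req);        // this partition may be sent now
    }
    MPI_Wait(&req, MPI_STATUS_IGNORE);
    MPI_Request_free(&req);
}
```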
We describe new machine-learning-based methods to defeature CAD models for tetrahedral meshing. Using machine learning predictions of mesh quality for geometric features of a CAD model prior to meshing, we can identify potential problem areas and improve meshing outcomes by presenting a prioritized list of suggested geometric operations to users. Our machine learning models are trained using a combination of geometric and topological features from the CAD model and local quality metrics for ground truth. We demonstrate a proof-of-concept implementation of the resulting workflow using Sandia's Cubit Geometry and Meshing Toolkit.
There is a need for optimization strategies that can efficiently solve large-scale, nonlinear optimization problems. Many problem classes, including design under uncertainty, are inherently structured and can be accelerated with decomposition approaches. This paper describes a second-order multiplier update for the alternating direction method of multipliers (ADMM) to solve nonlinear stochastic programming problems. We exploit connections between ADMM and the Schur-complement decomposition to derive an accelerated version of ADMM. Specifically, we study the effectiveness of performing a Newton-Raphson algorithm to compute multiplier estimates for the method of multipliers (MM). We interpret ADMM as a decomposable version of MM and propose modifications to the multiplier update of the standard ADMM scheme based on improvements observed in MM. The modifications to the ADMM algorithm seek to accelerate solutions of optimization problems for design under uncertainty, and the numerical effectiveness of the approaches is demonstrated on a set of ten stochastic programming problems. Practical strategies for improving computational performance are discussed along with comparisons between the algorithms. We observe that the second-order update achieves convergence in fewer unconstrained minimizations for MM on general nonlinear problems. In the case of ADMM, the second-order update significantly reduces the number of subproblem solves for convex quadratic programs (QPs).
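In the assumed standard notation for ADMM applied to min f(x) + g(z) subject to Ax + Bz = c, with dual function d(λ) (our symbols, not necessarily the paper's), the two updates being contrasted are:

```latex
\begin{align*}
\lambda^{k+1} &= \lambda^k + \rho\left(Ax^{k+1} + Bz^{k+1} - c\right)
  && \text{standard first-order update} \\
\lambda^{k+1} &= \lambda^k - \left(\nabla^2 d(\lambda^k)\right)^{-1}
                 \nabla d(\lambda^k),
  \quad \nabla d(\lambda^k) = Ax^{k+1} + Bz^{k+1} - c
  && \text{second-order (Newton) update}
\end{align*}
```

The first-order step is gradient ascent on the dual; replacing it with a Newton step on the dual is what "second-order multiplier update" refers to here.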
Drinking water utilities use booster stations to maintain chlorine residuals throughout water distribution systems. Booster stations could also be used as part of an emergency response plan to minimize health risks in the event of an unintentional or malicious contamination incident. The benefit of booster stations for emergency response depends on several factors, including the reaction between chlorine and an unknown contaminant species, the fate and transport of the contaminant in the water distribution system, and the time delay between detection and initiation of boosted levels of chlorine. This paper takes these aspects into account and proposes a mixed-integer linear program formulation for optimizing the placement of booster stations for emergency response. A case study is used to explore the ability of optimally placed booster stations to reduce the impact of contamination in water distribution systems.
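Schematically, and with symbols that are ours rather than the paper's exact formulation, the optimization takes the form of a scenario-weighted impact minimization with a budget on station count:

```latex
% Schematic only: sets, symbols, and the impact measure are assumptions.
\begin{align*}
\min_{x,\,d}\ & \sum_{s \in S} p_s \, d_s
  && \text{scenario-weighted contamination impact}\\
\text{s.t.}\ & \sum_{n \in N} x_n \le B
  && \text{budget of } B \text{ booster stations}\\
& d_s \ge \text{(linear impact of scenario } s \text{ under placements } x)
  && \forall s \in S\\
& x_n \in \{0,1\} && \forall n \in N
\end{align*}
```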
In the field of semiconductor quantum dot spin qubits, there is growing interest in leveraging the unique properties of hole-carrier systems and their intrinsically strong spin-orbit coupling to engineer novel qubits. Recent advances in semiconductor heterostructure growth have made available high quality, undoped Ge/SiGe quantum wells, consisting of a pure strained Ge layer flanked by Ge-rich SiGe layers above and below. These quantum wells feature heavy hole carriers and a cubic Rashba-type spin-orbit interaction. Here, we describe progress toward realizing spin qubits in this platform, including development of multi-metal-layer gated device architectures, device tuning protocols, and charge-sensing capabilities. Iterative improvement of a three-layer metal gate architecture has significantly enhanced device performance over that achieved using an earlier single-layer gate design. We discuss ongoing, simulation-informed work to fine-tune the device geometry, as well as efforts toward a single-spin qubit demonstration.
A set of algorithms based on characteristic discontinuous Galerkin methods is presented for tracer transport on the sphere. The algorithms are designed to reduce message passing interface communication volume per unit of simulated time relative to current methods generally, and to the spectral element scheme employed by the U.S. Department of Energy's Exascale Earth System Model (E3SM) specifically. Two methods are developed to enforce discrete mass conservation when the transport schemes are coupled to a separate dynamics solver: constrained transport and Jacobian-combined transport. A communication-efficient method is introduced to enforce tracer consistency between the transport scheme and dynamics solver; this method also provides the transport scheme's shape preservation capability. A subset of the algorithms derived here is implemented in E3SM and shown to improve transport performance by a factor of 2.2 for the model's standard configuration with 40 tracers at the strong scaling limit of one element per core.
While peak shaving is commonly used to reduce power costs, chemical process facilities that can reduce power consumption on demand during emergencies (e.g., extreme weather events) bring additional value through improved resilience. For process facilities to effectively negotiate demand response (DR) contracts and make investment decisions regarding flexibility, they need to quantify their additional value to the grid. We present a grid-centric mixed-integer stochastic programming framework to determine the value of DR for improving grid resilience in place of capital investments that can be cost prohibitive for system operators. We formulate problems using both a linear approximation and a nonlinear alternating current power flow model. Our numerical results with both models demonstrate that DR can be used to reduce the capital investment necessary for resilience, increasing the value that chemical process facilities bring through DR. However, the linearized model often underestimates the amount of DR needed in our case studies.
Photodetection plays a key role in basic science and technology, with exquisite performance having been achieved down to the single-photon level. Further improvements in photodetectors would open new possibilities across a broad range of scientific disciplines and enable new types of applications. However, it is still unclear what is possible in terms of ultimate performance and what properties are needed for a photodetector to achieve such performance. Here, we present a general modeling framework for photodetectors whereby the photon field, the absorption process, and the amplification process are all treated as one coupled quantum system. The formalism naturally handles field states with single or multiple photons as well as a variety of detector configurations and includes a mathematical definition of ideal photodetector performance. The framework reveals how specific photodetector architectures introduce limitations and tradeoffs for various performance metrics, providing guidance for optimization and design.
Large-scale collaborative scientific software projects require more knowledge than any one person typically possesses. This makes coordination and communication of knowledge and expertise a key factor in creating and safeguarding software quality, without which we cannot have sustainable software. However, as researchers attempt to scale up the production of software, they are confronted by problems of awareness and understanding. This presents an opportunity to develop better practices and tools that directly address these challenges. To that end, we conducted a case study of developers of the Trilinos project. We surveyed the software development challenges they face and show how those problems are connected with what developers know and how they communicate. Based on these data, we provide a series of practicable recommendations and outline a path forward for future research.
Physical security systems (PSS) and humans are inescapably tied in the current physical security paradigm. Yet physical security system evaluations often end at the console that displays information to the human. That is, these evaluations do not account for human-in-the-loop factors that can greatly impact the performance of the security system, even though methods for doing so are well established. This paper highlights two examples of methods for evaluating the human component of the current physical security system. The first method is qualitative, focusing on the information the human needs to adequately monitor alarms on a physical site. The second objectively measures the impact of false alarm rates on threat detection. These types of human-centric evaluations are often treated as unnecessary or not cost effective under the belief that human cognition is straightforward and errors can be either trained away or mitigated with technology. These assumptions are not always correct, the results are often surprising, and the weaknesses involved can often only be identified with objective assessments of human-system performance. Thus, taking the time to perform human element evaluations can identify unintuitive human-system weaknesses and provide significant cost savings in the form of mitigating vulnerabilities and reducing costly system patches or retrofits to correct an issue after the system has been deployed.
The ECP/VTK-m project is providing the core capabilities to perform scientific visualization on exascale architectures. It fills the critical feature gap of performing visualization and analysis on processors such as graphics processors and many-integrated-core devices. The results of this project will be delivered in tools like ParaView, VisIt, and Ascent, as well as in stand-alone form. Moreover, these projects depend on this ECP effort to make effective use of ECP architectures.
This work explores the current performance and scaling of a fully-implicit stabilized unstructured finite element (FE) variational multiscale (VMS) capability for large-scale simulations of 3D incompressible resistive magnetohydrodynamics (MHD). The large-scale linear systems that are generated by a Newton nonlinear solver approach are iteratively solved by preconditioned Krylov subspace methods. The efficiency of this approach is critically dependent on the scalability and performance of the algebraic multigrid preconditioner. This study considers the performance of the numerical methods as recently implemented in the second-generation Trilinos implementation that is 64-bit compliant and is not limited by the 32-bit global identifiers of the original Epetra-based Trilinos. The study presents representative results for a Poisson problem on 1.6 million cores of an IBM Blue Gene/Q platform to demonstrate very large-scale parallel execution. Additionally, results for a more challenging steady-state MHD generator and a transient solution of a benchmark MHD turbulence calculation for the full resistive MHD system are also presented. These results are obtained on up to 131,000 cores of a Cray XC40 and one million cores of a BG/Q system.
Emerging memory devices, such as resistive crossbars, have the capacity to store large amounts of data in a single array. Acquiring the data stored in large-capacity crossbars in a sequential fashion can become a bottleneck. We present practical methods, based on sparse sampling, to quickly acquire sparse data stored on emerging memory devices that support the basic summation kernel, reducing the acquisition time from linear to sub-linear. The experimental results show that at least an order of magnitude improvement in acquisition time can be achieved when the data are sparse. Finally, we show that the energy cost associated with our approach is competitive with that of the sequential method.
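A minimal software emulation of the acquisition step may help fix ideas; recovery by a sparse solver is omitted, and the subset-selection rule here is an assumption rather than the paper's design.

```cpp
#include <random>
#include <vector>

// Emulated acquisition via the summation kernel: each of m measurements
// is the sum of a random subset of cells, obtained by the device in one
// operation. With m much smaller than the array size, a sparse solver
// can then reconstruct the (sparse) stored data, giving sub-linear
// acquisition time.
std::vector<double> acquire(const std::vector<double>& cells, int m,
                            unsigned seed = 42)
{
    std::mt19937 rng(seed);
    std::bernoulli_distribution pick(0.5);   // each cell joins a sum w.p. 1/2
    std::vector<double> y(m, 0.0);
    for (int row = 0; row < m; ++row) {
        // One summation-kernel invocation (emulated here in software).
        for (double c : cells) {
            if (pick(rng)) y[row] += c;
        }
    }
    return y;   // m << cells.size() measurements to feed a sparse solver
}
```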
With Non-Volatile Memories (NVMs) beginning to enter the mainstream computing market, it is time to consider how to secure NVM-equipped computing systems. The recent Meltdown and Spectre attacks are evidence that security must be intrinsic to computing systems and not added as an afterthought. Processor vendors are taking the first steps and are beginning to build security primitives into commodity processors. One security primitive that is associated with the use of emerging NVMs is memory encryption. Memory encryption, while necessary, is very challenging when used with NVMs because it exacerbates the write endurance problem. Secure architectures use cryptographic metadata that must be persisted and restored to allow secure recovery of data in the event of power loss. Specifically, encryption counters must be persistent to enable secure and functional recovery of an interrupted system. However, the cost of ensuring and maintaining persistence for these counters can be significant. In this paper, we propose a novel scheme to maintain encryption counters without the need for frequent updates. Our new memory controller design, Osiris, repurposes memory Error-Correction Codes (ECCs) to enable fast restoration and recovery of encryption counters. To evaluate our design, we use Gem5 to run eight memory-intensive workloads selected from SPEC2006 and U.S. Department of Energy (DoE) proxy applications. Compared to a write-through counter-cache scheme, on average, Osiris can eliminate 48.7% of the memory writes (increasing lifetime by 1.95×) and reduce the performance overhead from 51.5% (for write-through) to only 5.8%. Furthermore, without the need for a backup battery or extra power-supply hold-up time, Osiris performs better than a battery-backed write-back scheme (5.8% vs. 6.6% overhead) and has less write traffic (2.6% vs. 5.9% overhead).
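The recovery idea admits a compact sketch. The following is a hedged illustration of ECC-based counter recovery, not Osiris's actual code; the function names are hypothetical. Counters are persisted only periodically, so after a crash the stored counter may lag the true one by a bounded amount; because a wrong counter decrypts to noise that fails the ECC check, recovery can try candidates forward from the stale value until ECC passes.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper, provided elsewhere: decrypt the memory line with
// the candidate counter and report whether its ECC check passes.
bool decrypt_and_check_ecc(const uint8_t* line, uint64_t counter);

std::optional<uint64_t> recover_counter(const uint8_t* line,
                                        uint64_t stale_counter,
                                        uint64_t max_lag)  // persistence interval N
{
    for (uint64_t c = stale_counter; c <= stale_counter + max_lag; ++c) {
        if (decrypt_and_check_ecc(line, c)) {
            return c;            // ECC sanity check passed: counter recovered
        }
    }
    return std::nullopt;         // fall back to heavier-weight recovery
}
```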
Nanocrystalline metals offer significant improvements in structural performance over conventional alloys. However, their performance is constrained by grain boundary instability and limited ductility. Solute segregation has been proposed as a stabilization mechanism; however, the solute atoms can embrittle grain boundaries and further degrade the toughness. In the present study, we confirm the embrittling effect of solute segregation in Pt-Au alloys. More importantly, however, we show that inhomogeneous chemical segregation to the grain boundary can lead to a new toughening mechanism termed compositional crack arrest. Energy dissipation is facilitated by nanocrack networks that form when cracks arrest at regions of the grain boundaries starved of the embrittling element. This mechanism, in concert with triple-junction crack arrest, provides pathways to optimize both thermal stability and energy dissipation. A combination of in situ tensile deformation experiments and molecular dynamics simulations elucidates both the embrittling and toughening processes that can occur as a function of solute content.
Deep neural networks are often computationally expensive, during both the training stage and the inference stage. Training is always expensive because back-propagation requires high-precision floating-point multiplication and addition. However, various mathematical optimizations may be employed to reduce the computational cost of inference. Optimized inference is important for reducing power consumption and latency and for increasing throughput. This chapter introduces the central approaches for optimizing deep neural network inference: pruning "unnecessary" weights, quantizing weights and inputs, sharing weights between layer units, compressing weights before transferring them from main memory, distilling large high-performance models into smaller models, and decomposing convolutional filters to reduce multiply-and-accumulate operations. Using a unified notation, we provide a mathematical and algorithmic description of these inference optimization methods.
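As one concrete instance of the methods listed above, here is a minimal sketch of symmetric per-tensor int8 weight quantization; real frameworks add per-channel scales, zero points, and calibration, so this shows only the core arithmetic q = round(w / s), with s chosen so that max|w| maps to 127.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric per-tensor int8 quantization: returns q with w ~= scale * q.
std::vector<int8_t> quantize(const std::vector<float>& w, float& scale)
{
    float max_abs = 0.f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    scale = (max_abs > 0.f) ? max_abs / 127.f : 1.f;

    std::vector<int8_t> q(w.size());
    for (std::size_t i = 0; i < w.size(); ++i) {
        const float r = std::round(w[i] / scale);
        q[i] = static_cast<int8_t>(std::clamp(r, -127.f, 127.f));
    }
    return q;   // inference then uses int8 multiplies, rescaling by `scale`
}
```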
A forensics investigation after a breach often uncovers network and host indicators of compromise (IOCs) that can be deployed to sensors to allow early detection of the adversary in the future. Over time, the adversary will change tactics, techniques, and procedures (TTPs), which will also change the data generated. If the IOCs are not kept up-to-date with the adversary's new TTPs, the adversary will no longer be detected once all of the IOCs become invalid. Tracking the Known (TTK) is the problem of keeping IOCs, in this case regular expressions (regexes), up-to-date with a dynamic adversary. Our framework solves the TTK problem in an automated, cyclic fashion to bracket a previously discovered adversary. This tracking is accomplished through a data-driven approach of self-adapting a given model based on its own detection capabilities. In our initial experiments, we found that the true positive rate (TPR) of the adaptive solution degrades much less significantly over time than that of the naïve solution, suggesting that self-updating the model allows the continued detection of positives (i.e., adversaries). The cost for this performance is in the false positive rate (FPR), which increases over time for the adaptive solution but remains constant for the naïve solution. However, the difference in overall detection performance, as measured by the area under the curve (AUC), between the two methods is negligible. This result suggests that self-updating the model over time should be done in practice to continue to detect known, evolving adversaries.
Shor's groundbreaking quantum algorithm for integer factoring provides an exponential speedup over the best-known classical algorithms. In the 20 years since Shor's algorithm was conceived, only a handful of fundamental quantum algorithmic kernels, generally providing modest polynomial speedups over classical algorithms, have been invented. To better understand the potential advantage quantum resources provide over their classical counterparts, one may consider resources other than the execution time of algorithms. Quantum approximation algorithms direct the power of quantum computing toward optimization problems where quantum resources provide higher-quality solutions instead of faster execution times. We provide a new rigorous analysis of the recent Quantum Approximate Optimization Algorithm (QAOA), demonstrating that it provably outperforms the best-known classical approximation algorithm for special hard cases of the fundamental Maximum Cut graph-partitioning problem. We also develop new types of classical approximation algorithms for finding near-optimal low-energy states of physical systems arising in condensed matter by extending seminal discrete optimization techniques. Our interdisciplinary work seeks to unearth new connections between discrete optimization and quantum information science.
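For reference, the QAOA ansatz for MaxCut in its standard textbook form (background, not the paper's new analysis) is:

```latex
% Depth-p QAOA for MaxCut on a graph G = (V, E).
\[
|\gamma,\beta\rangle
  = e^{-i\beta_p B}\, e^{-i\gamma_p C} \cdots
    e^{-i\beta_1 B}\, e^{-i\gamma_1 C}\, |{+}\rangle^{\otimes n},
\qquad
B = \sum_{j \in V} \sigma^x_j,
\quad
C = \sum_{(j,k) \in E} \tfrac{1}{2}\bigl(1 - \sigma^z_j \sigma^z_k\bigr),
\]
\[
\text{and the algorithm maximizes }
F_p(\gamma,\beta) = \langle \gamma,\beta |\, C \,| \gamma,\beta \rangle
\text{ over the } 2p \text{ angles.}
\]
```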
High-resolution simulation of viscous fingering can offer an accurate and detailed prediction for subsurface engineering processes involving fingering phenomena. The fully implicit discontinuous Galerkin (DG) method has been shown to be an accurate and stable method to model viscous fingering with high Peclet number and mobility ratio. In this paper, we present two techniques to speed up large-scale simulations of this kind. The first technique relies on a simple p-adaptive scheme in which high-order basis functions are employed only in elements near the finger fronts, where the concentration changes sharply. As a result, the number of degrees of freedom is significantly reduced, and the simulation yields almost identical results to the more expensive simulation with uniform high-order elements throughout the mesh. The second technique for speedup involves improving solver efficiency. We present an algebraic multigrid (AMG) preconditioner that allows the DG matrix to leverage the robust AMG preconditioner designed for the continuous Galerkin (CG) finite element method. The resulting preconditioner works effectively for fixed-order DG as well as p-adaptive DG problems. With the improvements provided by p-adaptivity and AMG preconditioning, we can perform the high-resolution three-dimensional viscous fingering simulations required for miscible displacement with high Peclet number and mobility ratio in greater detail than before for well injection problems.
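The marking logic of such a p-adaptive scheme can be sketched in a few lines; the indicator below (local concentration range against a tolerance) is an illustrative stand-in for the paper's actual criterion, and all names are ours.

```cpp
#include <cstddef>
#include <vector>

// Illustrative p-adaptive marking pass: keep high-order basis functions
// only in elements whose local concentration variation exceeds a
// tolerance, i.e. elements near the finger fronts. Inputs are per-element
// min/max of the transported concentration field.
std::vector<int> assign_orders(const std::vector<double>& c_min,
                               const std::vector<double>& c_max,
                               int p_low, int p_high, double tol)
{
    std::vector<int> order(c_min.size(), p_low);
    for (std::size_t e = 0; e < c_min.size(); ++e) {
        if (c_max[e] - c_min[e] > tol)   // sharp local variation: front nearby
            order[e] = p_high;           // enrich to high order
    }
    return order;  // most elements stay low order, cutting DOFs sharply
}
```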
Gate-controllable spin-orbit coupling is often one requisite for spintronic devices. For practical spin field-effect transistors, another essential requirement is ballistic spin transport, where the spin precession length is shorter than the mean free path such that the gate-controlled spin precession is not randomized by disorder. In this letter, we report the observation of a gate-induced crossover from weak localization to weak anti-localization in the magneto-resistance of a high-mobility two-dimensional hole gas in a strained germanium quantum well. From the magneto-resistance, we extract the phase-coherence time, spin-orbit precession time, spin-orbit energy splitting, and cubic Rashba coefficient over a wide density range. The mobility and the mean free path increase with increasing hole density, while the spin precession length decreases due to increasingly stronger spin-orbit coupling. As the density becomes larger than ∼6 × 10¹¹ cm⁻², the spin precession length becomes shorter than the mean free path, and the system enters the ballistic spin transport regime. We also report here the numerical methods and code developed for calculating the magneto-resistance in the ballistic regime, where the commonly used HLN and ILP models for analyzing weak localization and anti-localization are not valid. These results pave the way toward silicon-compatible spintronic devices.
Here, the feasibility of a Neumann series expansion of Maxwell's equations in the electrostatic limit is investigated for potentially rapid and approximate subsurface imaging of geologic features proximal to metallic infrastructure in an oilfield environment. While generally useful for efficient modeling of mild conductivity perturbations in uncluttered settings, we raise the question of its suitability for situations, such as oilfields, where metallic artifacts are pervasive and, in some cases, in direct electrical contact with the conductivity perturbation on which the Neumann series is computed. Convergence of the Neumann series and its residual error are computed using the hierarchical finite element framework for a canonical oilfield model consisting of an "L"-shaped, steel-cased well, energized by a steady-state electrode, and penetrating a small set of mildly conducting fractures near the heel of the well. For a given node spacing h in the finite element mesh, we find that the Neumann series is ultimately convergent if the conductivity is small enough, a result consistent with previous presumptions on the necessity of small conductivity perturbations. However, we also demonstrate that the spectral radius of the Neumann series operator grows as ∼1/h, suggesting that in the limit of the continuous problem h → 0, the Neumann series is intrinsically divergent for all conductivity perturbations, regardless of their smallness. The hierarchical finite element methodology itself is critically analyzed and shown to possess the h² error convergence of traditional linear finite elements, thereby supporting the conclusion of an inescapably divergent Neumann series for this benchmark example. Application of the Neumann series to oilfield problems with metallic clutter should therefore be done with careful consideration of the coupling between infrastructure and geology; the methods used here are demonstrably useful in such circumstances.
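In the assumed operator notation (ours, not necessarily the authors'), the series in question is the standard Neumann expansion about the background conductivity:

```latex
% A(\sigma) is the discretized electrostatic operator, split as
% A = A_0 + \delta A, with A_0 = A(\sigma_0) the background part and
% \delta A the conductivity perturbation.
\[
A\,u = f
\quad\Longrightarrow\quad
u \;=\; \sum_{k=0}^{\infty} \bigl(-A_0^{-1}\,\delta A\bigr)^{k} u_0,
\qquad u_0 = A_0^{-1} f ,
\]
\[
\text{which converges iff}\quad
\rho\bigl(A_0^{-1}\,\delta A\bigr) < 1 .
\]
```

The finding above that this spectral radius grows like ∼1/h is exactly the convergence condition that fails in the continuum limit h → 0.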
Accurate and efficient constitutive modeling remains a cornerstone issue for solid mechanics analysis. Over the years, the LAMÉ advanced material model library has grown to address this challenge by implementing models capable of describing material systems spanning soft polymers to stiff ceramics, including both isotropic and anisotropic responses. Inelastic behaviors including (visco)plasticity, damage, and fracture have all been incorporated for use in various analyses. This multitude of options and flexibility, however, comes at the cost of complexity in the resulting implementation. Therefore, to enhance confidence and enable the utilization of the LAMÉ library in application, this effort seeks to document and verify the various models in the LAMÉ library. Specifically, the broader strategy, organization, and interface of the library itself are first presented. The physical theory, numerical implementation, and user guide for a large set of models are then discussed. Importantly, a number of verification tests are performed with each model, not only to build confidence in the model itself but also to highlight some important response characteristics and features that may be of interest to end users. Finally, looking ahead to the future, approaches to add material models to this library and further expand its capabilities are presented.