Traditional interpolation techniques for particle tracking include binning and convolutional formulas that use pre-determined (i.e., closed-form, parametric) kernels. In many instances, the particles are introduced as point sources in time and space, so the cloud of particles (either in space or time) is a discrete representation of the Green's function of an underlying PDE. As such, each particle is a sample drawn from the Green's function, and the particle ensemble should therefore be distributed according to it. In short, the kernel of a convolutional interpolation of the particle sample “cloud” should be a replica of the cloud itself. This idea gives rise to an iterative method by which the form of the kernel may be discerned in the process of interpolating the Green's function. When the Green's function is a density, this method is broadly applicable to interpolating a kernel density estimate based on random data drawn from a single distribution. We formulate and construct the algorithm and demonstrate its ability to perform kernel density estimation of skewed and/or heavy-tailed data, including breakthrough curves.
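To make the self-consistent idea concrete, a minimal sketch follows: the kernel is seeded as a Gaussian and then repeatedly replaced by the re-centered, renormalized density estimate it produces. This is an illustrative reading of the iteration, not the paper's exact algorithm; the seed bandwidth, grid representation, and fixed iteration count are assumptions, and the rescaling details a production method would need are elided.

```python
import numpy as np

def shifted(f, grid, center):
    # Evaluate a grid-sampled function f (treated as centered at
    # grid.mean()) recentered at `center`; zero outside the grid.
    return np.interp(grid - center + grid.mean(), grid, f, left=0.0, right=0.0)

def iterative_kde(samples, grid, n_iter=5):
    """Self-consistent KDE sketch: the kernel of the next iteration is
    the density estimate produced by the current one."""
    dx = grid[1] - grid[0]
    bw = samples.std() * len(samples) ** (-0.2)        # Silverman-style seed
    kernel = np.exp(-0.5 * ((grid - grid.mean()) / bw) ** 2)
    kernel /= kernel.sum() * dx                        # normalize to a density
    for _ in range(n_iter):
        # Convolutional estimate: average the kernel placed at each sample.
        dens = np.mean([shifted(kernel, grid, s) for s in samples], axis=0)
        # New kernel: the estimate itself, moved back to the grid center.
        m = np.sum(grid * dens) * dx                   # center of mass
        kernel = shifted(dens, grid, 2 * grid.mean() - m)
        kernel /= kernel.sum() * dx
    return dens
```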
Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement, and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has created a demand for multiprecision strategies at the level of linear algebra algorithms if we want to exploit the hardware to its full potential while meeting accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of multiprecision strategies for accelerating this kernel on GPUs. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when multiprecision GMRES will be effective and to choose parameters for a multiprecision iterative refinement solver to achieve better performance. We use an implementation based on the Trilinos library that employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of multiprecision approaches and show that even further improvements are possible by optimizing low-level kernels.
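As a concrete illustration of one such strategy, the sketch below shows mixed-precision iterative refinement in NumPy: the inner solve runs in single precision while residuals and the solution update are accumulated in double. This is a schematic of the general technique, not the Trilinos/Kokkos Kernels implementation; a real solver would use preconditioned GMRES as the inner solve, and the dense stand-in here is only for self-containedness.

```python
import numpy as np

def ir_solve(A, b, inner_solve, tol=1e-12, max_iters=20):
    """Mixed-precision iterative refinement: low-precision corrections,
    double-precision residuals and accumulation."""
    x = np.zeros_like(b, dtype=np.float64)
    A32 = A.astype(np.float32)                 # low-precision operator copy
    for _ in range(max_iters):
        r = b - A @ x                          # residual in double precision
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        d = inner_solve(A32, r.astype(np.float32))  # fp32 inner solve (e.g., GMRES)
        x += d.astype(np.float64)              # accumulate update in double
    return x

# Demo on a well-conditioned random system; the lambda stands in for GMRES.
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200)) + 200 * np.eye(200)
b = rng.standard_normal(200)
x = ir_solve(A, b, lambda A32, r32: np.linalg.solve(A32, r32))
print(np.linalg.norm(b - A @ x))
```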
Optimally-shaped electromagnetic fields have the capacity to coherently control the dynamics of quantum systems and thus offer a promising means for controlling molecular transformations relevant to chemical, biological, and materials applications. Currently, advances in this area are hindered by the prohibitive cost of the quantum dynamics simulations needed to explore the principles and possibilities of molecular control. However, the emergence of nascent quantum-computing devices suggests that efficient simulations of quantum dynamics may be on the horizon. In this article, we study how quantum computers could be employed to design optimally-shaped fields to control molecular systems. We introduce a hybrid algorithm that utilizes a quantum computer for simulating the field-induced quantum dynamics of a molecular system in polynomial time, in combination with a classical optimization approach for updating the field. Qubit encoding methods relevant for molecular control problems are described, and procedures for simulating the quantum dynamics and obtaining the simulation results are discussed. Numerical illustrations are then presented that explicitly treat paradigmatic vibrational and rotational control problems, and also consider how optimally-shaped fields could be used to elucidate the mechanisms of energy transfer in light-harvesting complexes. Resource estimates, as well as a numerical assessment of the impact of hardware noise and the prospects of near-term hardware implementations, are provided for the latter task.
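A schematic of the hybrid loop may help fix ideas: the quantum device evaluates the field-dependent control objective, and a classical optimizer updates the field parameters. Everything below is an assumption-laden stand-in; `quantum_expectation` abstracts the polynomial-time dynamics simulation on the quantum computer, the field is parameterized by a generic coefficient vector, and finite-difference gradient ascent is used purely for illustration.

```python
import numpy as np

def hybrid_optimal_control(quantum_expectation, n_params=16, n_iters=100,
                           lr=0.1, eps=1e-3, seed=0):
    """Classical outer loop for quantum-evaluated optimal control.
    quantum_expectation(theta) -> objective <O>_T for field parameters
    theta, obtained from the (simulated) quantum device."""
    rng = np.random.default_rng(seed)
    theta = 0.1 * rng.standard_normal(n_params)      # e.g., spectral amplitudes
    for _ in range(n_iters):
        grad = np.zeros_like(theta)
        for k in range(n_params):                    # finite-difference gradient
            e = np.zeros_like(theta); e[k] = eps
            grad[k] = (quantum_expectation(theta + e)
                       - quantum_expectation(theta - e)) / (2 * eps)
        theta += lr * grad                           # ascend the control objective
    return theta
```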
We adapt the robust phase estimation algorithm to the evaluation of energy differences between two eigenstates using a quantum computer. This approach does not require controlled unitaries between auxiliary and system registers or even a single auxiliary qubit. As a proof of concept, we calculate the energies of the ground state and low-lying electronic excitations of a hydrogen molecule in a minimal basis on a cloud quantum computer. The denominative robustness of our approach is then quantified in terms of a high tolerance to coherent errors in the state preparation and measurement. Conceptually, we note that all quantum phase estimation algorithms ultimately evaluate eigenvalue differences.
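The core of robust phase estimation can be sketched in a few lines: circuits at exponentially growing repetition number k pin down (k·θ) mod 2π, and each generation narrows the set of candidate phases. In the energy-difference setting, θ plays the role of ΔE·t for evolution time t. The sketch below is illustrative only; `measure_cos_sin` stands in for the hardware estimates of ⟨cos kθ⟩ and ⟨sin kθ⟩, and the error and interval bookkeeping of the full algorithm is simplified away.

```python
import numpy as np

def rpe(measure_cos_sin, n_generations=8):
    """Robust phase estimation sketch.  measure_cos_sin(k) returns
    estimates of (cos(k*theta), sin(k*theta)) from experiment."""
    theta = 0.0
    for j in range(n_generations):
        k = 2 ** j
        c, s = measure_cos_sin(k)
        phi = np.arctan2(s, c) % (2 * np.pi)         # (k * theta) mod 2*pi
        # Pick the branch of phi consistent with the running estimate.
        n = np.round((k * theta - phi) / (2 * np.pi))
        theta = (phi + 2 * np.pi * n) / k
    return theta

# Noiseless stand-in for hardware, with an assumed true phase of 1.234 rad:
print(rpe(lambda k: (np.cos(k * 1.234), np.sin(k * 1.234))))
```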
The stability of low-index platinum surfaces and their electronic properties are investigated with density functional theory, toward the goals of understanding surface structure and electron emission and of identifying precursors to electrical breakdown on nonideal platinum surfaces. Propensity for electron emission can be related to a local work function, which, in turn, is intimately dependent on the local surface structure. The (1×N) missing row reconstruction of the Pt(110) surface is systematically examined. The (1×3) missing row reconstruction is found to be the lowest in energy, with the (1×2) and (1×4) slightly less stable. In the limit of large (1×N) with wider (111) nanoterraces, the energy accurately approaches the asymptotic limit of the infinite Pt(111) surface. This suggests a local energetic stability of narrow (111) nanoterraces on free Pt surfaces that could be a common structural feature in complex surface morphologies, leading to work functions consistent with those measured on thermally grown Pt substrates.
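For context, the work function referenced here is conventionally extracted from a slab calculation as (a standard definition, not specific to this work)

```latex
W = V_{\mathrm{vac}} - E_F ,
```

where V_vac is the planar-averaged electrostatic potential in the vacuum region and E_F is the Fermi level; a local work function replaces V_vac with the potential just outside a particular surface patch, which is why it tracks the local surface structure.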
The first-principles computation of metal surfaces is typically accomplished through slab calculations of finite thickness. Extracting a convergent surface formation energy from slab calculations depends on defining an appropriate bulk reference energy. I describe a method for an independently computed, slab-consistent bulk reference that yields convergent surface formation energies from slab calculations and also provides realistic uncertainties for the magnitude of the unavoidable nonlinear divergence of the surface formation energy with slab thickness. The accuracy is demonstrated on relaxed, unreconstructed low-index aluminum surfaces with slabs of up to 35 layers.
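The quantity at stake can be written explicitly. With E_slab(N) the total energy of an N-layer slab of cross-sectional area A and ε_bulk the bulk energy per layer, the surface formation energy (per surface, for a two-sided slab) is the standard construction

```latex
\sigma(N) \;=\; \frac{E_{\mathrm{slab}}(N) - N\,\varepsilon_{\mathrm{bulk}}}{2A},
\qquad
E_{\mathrm{slab}}(N) \;\approx\; 2A\,\sigma + N\,\varepsilon_{\mathrm{bulk}} \quad (N \to \infty).
```

Any mismatch between ε_bulk and the true large-N slope of E_slab(N) makes σ(N) drift linearly with thickness, which is why a slab-consistent bulk reference is essential; the notation here is generic rather than the paper's.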
Multivariate time series are used in many science and engineering domains, including healthcare, astronomy, and high-performance computing. A recent trend is to use machine learning (ML) to process this complex data, and ML-based frameworks are starting to play a critical role in a variety of applications. However, barriers such as user distrust and difficulty of debugging must be overcome to enable widespread adoption of such frameworks in production systems. To address this challenge, we propose a novel explainability technique, CoMTE, that provides counterfactual explanations for supervised machine learning frameworks on multivariate time series data. Using various machine learning frameworks and data sets, we compare CoMTE with several state-of-the-art explainability methods and show that it outperforms them in comprehensibility and robustness. We also show how CoMTE can be used to debug machine learning frameworks and to gain a better understanding of the underlying multivariate time series data.
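To convey the flavor of a counterfactual explanation on multivariate time series, the sketch below greedily swaps whole variables (channels) from a "distractor" instance of the desired class into the instance being explained until the model's prediction flips; the swapped channels then constitute the explanation. This is an illustrative simplification, not the CoMTE implementation, and the greedy channel order is an assumption.

```python
import numpy as np

def counterfactual(x, distractor, predict, target_label):
    """Greedy counterfactual search sketch.
    x, distractor: arrays of shape (n_channels, n_timesteps);
    predict: classifier mapping such an array to a label."""
    cf = x.copy()
    swapped = []
    for ch in range(x.shape[0]):
        cf[ch] = distractor[ch]        # substitute one whole time series
        swapped.append(ch)
        if predict(cf) == target_label:
            return cf, swapped         # counterfactual found: these channels matter
    return None, swapped               # no counterfactual from this distractor
```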
High-performance computing (HPC) researchers have long envisioned scenarios where application workflows could be improved through the use of programmable processing elements embedded in the network fabric. Recently, vendors have introduced programmable Smart Network Interface Cards (SmartNICs) that enable computations to be offloaded to the edge of the network. There is great interest in both the HPC and high-performance data analytics (HPDA) communities in understanding the roles these devices may play in the data paths of upcoming systems. This paper focuses on characterizing both the networking and computing aspects of NVIDIA’s new BlueField-2 SmartNIC when used in a 100Gb/s Ethernet environment. For the networking evaluation we conducted multiple transfer experiments between processors located at the host, the SmartNIC, and a remote host. These tests illuminate how much effort is required to saturate the network and help estimate the processing headroom available on the SmartNIC during transfers. For the computing evaluation we used the stress-ng benchmark to compare the BlueField-2 to other servers and place realistic bounds on the types of offload operations that are appropriate for the hardware. Our findings indicate that while the BlueField-2 provides a flexible means of processing data at the network’s edge, great care must be taken not to overwhelm the hardware. While the host can easily saturate the network link, the SmartNIC’s embedded processors may not have enough computing resources to sustain more than half the expected bandwidth when using kernel-space packet processing. From a computational perspective, encryption operations, memory operations under contention, and on-card IPC operations on the SmartNIC perform significantly better than on the general-purpose servers used for comparison in our experiments. Therefore, applications that mainly focus on these operations may be good candidates for offloading to the SmartNIC.
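For readers who want the shape of such a transfer experiment, here is a minimal Python socket throughput probe between two endpoints. It is purely illustrative: Python sockets will not approach 100Gb/s, the study itself used dedicated benchmarking tools, and the port number and buffer sizes are arbitrary choices.

```python
import socket, time

def sink(port=5001, bufsize=1 << 20):
    """Receive side: accept one connection and report achieved bandwidth."""
    srv = socket.create_server(("", port))
    conn, _ = srv.accept()
    total, t0 = 0, time.time()
    while (chunk := conn.recv(bufsize)):
        total += len(chunk)
    print(f"{8 * total / (time.time() - t0) / 1e9:.2f} Gb/s")

def source(host, port=5001, seconds=10, bufsize=1 << 20):
    """Send side: stream a fixed buffer for `seconds` to the sink."""
    s = socket.create_connection((host, port))
    buf = bytes(bufsize)
    t_end = time.time() + seconds
    while time.time() < t_end:
        s.sendall(buf)
    s.close()
```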
Using the local moment counter charge (LMCC) method to accurately represent the asymptotic electrostatic boundary conditions within density functional theory supercell calculations, we present a comprehensive analysis of the atomic structure and energy levels of point defects in cubic silicon carbide (3C-SiC). Finding that the classical long-range dielectric screening outside the supercell induced by a charged defect is a significant contributor to the total energy, we describe and validate a modified Jost screening model to evaluate this polarization energy. This leads to bulk-converged defect levels in finite-size supercells. With the LMCC boundary conditions and a standard Perdew-Burke-Ernzerhof (PBE) exchange-correlation functional, the computed defect level spectrum exhibits no band gap problem: the range of defect levels spans ∼2.4 eV, an effective defect band gap that agrees with the experimental band gap. Comparing with previous literature, our LMCC-PBE defect results are in consistent agreement with the hybrid-exchange functional results of Oda et al. [J. Chem. Phys. 139, 124707 (2013)] rather than their PBE results. The difference from their PBE results is attributed to their use of a conventional jellium approximation rather than the more rigorous LMCC approach for handling charged supercell boundary conditions. The difference between standard DFT and hybrid functional results for defect levels thus lies not in a band gap problem but in a boundary condition problem. The LMCC-PBE entirely mitigates the effect of the band gap problem on defect levels. The more computationally economical PBE enables a systematic exploration of 3C-SiC defects, where, most notably, we find that the silicon vacancy undergoes Jahn-Teller-induced distortions from the previously assumed Td symmetry, and that the divacancy, like the silicon vacancy, exhibits a site-shift bistability in p-type conditions.
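For orientation, the classical Jost model estimates the screening (polarization) energy recovered by embedding a charge q, confined to an effective cavity of radius R, in a medium of dielectric constant ε (atomic units):

```latex
E_{\mathrm{pol}} \;=\; -\,\frac{q^{2}}{2R}\left(1 - \frac{1}{\varepsilon}\right).
```

This is the textbook form that the paper's modified model builds on; the choice of effective radius R and the precise modification are details of the paper and are not reproduced here.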
Sparsity, which occurs in both scientific applications and Deep Learning (DL) models, has been a key target of optimization within recent ASIC accelerators due to the potential memory and compute savings. These applications use data stored in a variety of compression formats. We demonstrate that both the compactness of different compression formats and the compute efficiency of the algorithms enabled by them vary across tensor dimensions and degrees of sparsity. Since DL and scientific workloads span all sparsity regimes, there can be numerous format combinations for optimizing memory and compute efficiency. Unfortunately, many proposed accelerators operate on only one or two fixed format combinations. This work proposes hardware extensions to accelerators for supporting numerous format combinations seamlessly and demonstrates ∼4× speedup over performing format conversions in software.
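A back-of-the-envelope storage model illustrates why format compactness varies with shape and sparsity. The sketch below compares dense, COO, and CSR footprints (assuming 4-byte indices and values; both the formats chosen and the sizes are generic examples rather than the set considered in the paper):

```python
def dense_bytes(n_rows, n_cols, val=4):
    return n_rows * n_cols * val

def coo_bytes(nnz, idx=4, val=4):
    return nnz * (2 * idx + val)           # (row, col, value) per nonzero

def csr_bytes(nnz, n_rows, idx=4, val=4):
    return nnz * (idx + val) + (n_rows + 1) * idx   # col+value per nnz, row pointers

# Crossover points shift with both density and tensor dimensions.
n = 1024
for density in (0.5, 0.1, 0.01):
    nnz = int(density * n * n)
    print(f"density={density}: dense={dense_bytes(n, n)}, "
          f"coo={coo_bytes(nnz)}, csr={csr_bytes(nnz, n)}")
```

At 50% density the dense layout is the most compact of the three, while at 1% CSR wins; a tall-skinny tensor shifts the crossover again because the row-pointer overhead scales with the number of rows.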
We present a numerical framework for recovering unknown nonautonomous dynamical systems with time-dependent inputs. To circumvent the difficulty presented by the nonautonomous nature of the system, our method transforms the solution of the system into piecewise integration over a discrete set of time instances. The time-dependent inputs are then locally parameterized using a suitable model, for example polynomial regression, over the pieces determined by the time instances. This transforms the original system into a piecewise parametric system that is locally time invariant. We then design a deep neural network structure to learn the local models. Once the network model is constructed, it can be applied iteratively over time to conduct global system prediction. We provide theoretical analysis of our algorithm and present a number of numerical examples to demonstrate the effectiveness of the method.
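The sketch below illustrates the prediction phase: each time step locally parameterizes the input (a low-degree polynomial fit is one admissible local model), and a learned local flow map advances the state. The `network` callable is a placeholder for the trained deep network; the node count, polynomial degree, and interface are assumptions made for illustration.

```python
import numpy as np

def local_input_params(u, t0, t1, n_nodes=3, degree=2):
    """Locally parameterize the input u(t) on [t0, t1] by polynomial
    regression on a few nodes (u must accept an array of times)."""
    ts = np.linspace(t0, t1, n_nodes)
    return np.polyfit(ts - t0, u(ts), degree)        # local coefficients

def predict(x0, u, t_grid, network):
    """Iterate the learned local (time-invariant) map over the grid.
    network(state, input_params, dt) -> next state."""
    x = np.asarray(x0, dtype=float)
    traj = [x]
    for t0, t1 in zip(t_grid[:-1], t_grid[1:]):
        p = local_input_params(u, t0, t1)
        x = network(x, p, t1 - t0)                   # one local step
        traj.append(x)
    return np.array(traj)
```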