We seek scalable benchmarks for entity resolution problems. Solutions to these problems range from trivial approaches such as string sorting to sophisticated methods such as statistical relational learning. The theoretical and practical complexity of these approaches varies widely, so one of the primary purposes of a benchmark will be to quantify the trade-off between solution quality and runtime. We are motivated by the ubiquitous nature of entity resolution as a fundamental problem faced by any organization that ingests large amounts of noisy text data. A benchmark is typically a rigid specification that provides an objective measure usable for ranking implementations of an algorithm. For example, the TOP500 and HPCG benchmarks rank supercomputers based on their performance on dense and sparse linear algebra problems, respectively. These two benchmarks require participants to report the FLOPS counts attainable on various machines. Our purpose is slightly different. Whereas the supercomputing benchmarks mentioned above hold algorithms constant and aim to rank machines, we are primarily interested in ranking algorithms. As mentioned above, entity resolution problems can be approached in completely different ways. We believe that users of our benchmarks must decide what sort of procedure to run before comparing implementations and architectures. Eventually, we also wish to provide a mechanism for ranking machines while holding the algorithmic approach constant. Our primary contributions are parallel algorithms for computing solution quality measures per entity. We find in some real datasets that many entities are quite easy to resolve while others are difficult, with a heavy skew toward the former case. Therefore, measures such as global confusion matrices, F measures, etc. do not meet our benchmarking needs. We design methods for computing solution quality at the granularity of a single entity in order to know when proposed solutions do well in difficult situations (perhaps justifying extra computational effort) or struggle in easy situations. We report on progress toward a viable benchmark for comparing entity resolution algorithms. Our work is incomplete, but we have designed and prototyped several algorithms to help evaluate the solution quality of competing approaches to these problems. We envision a benchmark in which the objective measure is a ratio of solution quality to runtime.
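To make the per-entity idea concrete, the following is a minimal sketch of one possible per-entity quality measure (best-matching-cluster F1); the report's actual measures and parallel algorithms may differ.

```python
from collections import defaultdict

def per_entity_f1(true_labels, pred_labels):
    """For each ground-truth entity, score the predicted cluster that best
    covers it (a sketch of *one* per-entity measure; the benchmark's own
    definitions may differ). Inputs map record id -> entity/cluster id."""
    truth, pred = defaultdict(set), defaultdict(set)
    for rec, ent in true_labels.items():
        truth[ent].add(rec)
    for rec, cl in pred_labels.items():
        pred[cl].add(rec)
    scores = {}
    for ent, members in truth.items():
        best = 0.0
        for cluster in pred.values():
            overlap = len(members & cluster)
            if overlap == 0:
                continue
            p, r = overlap / len(cluster), overlap / len(members)
            best = max(best, 2 * p * r / (p + r))
        scores[ent] = best   # 1.0 = perfectly resolved; low = hard or missed entity
    return scores

# Toy example: entity "b" is split across two predicted clusters.
truth = {1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
guess = {1: "x", 2: "x", 3: "y", 4: "y", 5: "z"}
print(per_entity_f1(truth, guess))   # entity "a" scores 1.0; the split entity "b" scores ~0.8
```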
This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaptation of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.
A compelling narrative has taken hold as quantum computing explodes into the commercial sector: Quantum computing in 2018 is like classical computing in 1965. In 1965 Gordon Moore wrote his famous paper about integrated circuits, saying: "At present, [minimum cost] is reached when 50 components are used per circuit. But... the complexity for minimum component costs has increased at a rate of roughly a factor of two per year... by 1975, the number of components per integrated circuit for minimum cost will be 65,000." This narrative is both appealing (we want to believe that quantum computing will follow the incredibly successful path of classical computing!) and plausible (2018 saw IBM, Intel, and Google announce 50-qubit integrated chips). But it is also deeply misleading. Here is an alternative: Quantum computing in 2018 is like classical computing in 1938. In 1938, John Atanasoff and Clifford Berry built the very first electronic digital computer. It had no program, and was not Turing-complete. Vacuum tubes — the standard "bit" for 20 years — were still 5 years in the future. ENIAC and the achievement of "computational supremacy" (over hand calculation) wouldn't arrive for 8 years, despite the accelerative effect of WWII. Integrated circuits and the information age were more than 20 years away. Neither of these analogies is perfect. Quantum computing technology is more like 1938, while the level of funding and excitement suggest 1965 (or later!). But the point of the cautionary analogy to 1938 is simple: Quantum computing in 2018 is a research field. It is far too early to establish metrics or benchmarks for performance. The best role for neutral organizations like IEEE is to encourage and shape research into metrics and benchmarks, so as to be ready when they become necessary. This white paper presents the evidence and reasoning for this claim. We explain what it means to say that quantum computing is a "research field", and why metrics and benchmarks for quantum processors also constitute a research field. We discuss the potential for harmful consequences of prematurely establishing standards or frameworks. We conclude by suggesting specific actions that IEEE or similar organizations can take to accelerate the development of good metrics and benchmarks for quantum computing.
Molecular dynamics simulations are carried out to characterize irradiation effects in TiO2 rutile, for wide ranges of temperatures (300-900 K) and primary knock-on atom (PKA) energies (1-10 keV). The number of residual defects decreases with increased temperature and decreased PKA energy, but is independent of PKA type. In the ballistic phase, more oxygen than titanium defects are produced; however, the primary residual defects are titanium vacancies and interstitials. Defect clustering depends on the PKA energy, temperature, and defect production. For some 10 keV PKAs, the largest cluster of vacancies at the peak of the ballistic phase and after annealing has up to ≈1200 and 100 vacancies, respectively. For the 10 keV PKAs at 300 K, the energy storage, primarily in residual Ti vacancies and interstitials, is estimated at 140-310 eV. It decreases with increased temperature to as little as 5-180 eV at 900 K. Selected area electron diffraction patterns and radial distribution functions confirm that although localized amorphous regions form during the ballistic phase, TiO2 regains full crystallinity after annealing.
We study two inexact methods for solutions of random eigenvalue problems in the context of spectral stochastic finite elements. In particular, given a parameter-dependent, symmetric matrix operator, the methods solve for eigenvalues and eigenvectors represented using polynomial chaos expansions. Both methods are based on the stochastic Galerkin formulation of the eigenvalue problem and they exploit its Kronecker-product structure. The first method is an inexact variant of the stochastic inverse subspace iteration [B. Sousedík, H. C. Elman, SIAM/ASA Journal on Uncertainty Quantification 4(1), pp. 163-189, 2016]. The second method is based on an inexact variant of Newton iteration. In both cases, the problems are formulated so that the associated stochastic Galerkin matrices are symmetric, and the corresponding linear problems are solved using preconditioned Krylov subspace methods with several novel hierarchical preconditioners. The accuracy of the methods is compared with that of Monte Carlo and stochastic collocation, and the effectiveness of the methods is illustrated by numerical experiments.
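As a minimal deterministic sketch of the first method's core idea, the code below runs inverse subspace iteration in which the inner solves are performed only approximately (conjugate gradients stopped at a loose tolerance); the stochastic Galerkin structure, Kronecker products, and hierarchical preconditioners of the actual methods are omitted.

```python
import numpy as np

def cg_approx(A, b, tol=1e-2, maxiter=50):
    """Plain conjugate gradients, stopped early at a loose relative tolerance,
    so that the inner solves are deliberately inexact."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def inexact_inverse_subspace(A, k=3, iters=25, inner_tol=1e-2, seed=0):
    """Sketch of inexact inverse subspace iteration for a symmetric positive
    definite A (the deterministic analogue of the stochastic Galerkin
    formulation above): inner solves A y = x are only approximate, yet the
    subspace still converges toward the k smallest eigenpairs."""
    rng = np.random.default_rng(seed)
    X = np.linalg.qr(rng.standard_normal((A.shape[0], k)))[0]
    for _ in range(iters):
        Y = np.column_stack([cg_approx(A, X[:, j], inner_tol) for j in range(k)])
        X = np.linalg.qr(Y)[0]                  # re-orthonormalize the subspace
    T = X.T @ A @ X                             # Rayleigh-Ritz extraction
    evals, V = np.linalg.eigh(T)
    return evals, X @ V

A = np.diag(np.arange(1.0, 101.0))              # toy SPD test matrix
evals, _ = inexact_inverse_subspace(A)
print(np.round(evals, 3))                       # approx. [1., 2., 3.]
```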
Deep neural networks (DNN) now outperform competing methods in many academic and industrial domains. These high-capacity universal function approximators have recently been leveraged by deep reinforcement learning (RL) algorithms to obtain impressive results for many control and decision making problems. During the past three years, research toward pruning, quantization, and compression of DNNs has reduced the mathematical, and therefore time and energy, requirements of DNN-based inference. For example, DNN optimization techniques have been developed which reduce the storage requirements of VGG-16 from 552MB to 11.3MB, while maintaining the full-model accuracy for image classification. Building from DNN optimization results, the computer architecture community is taking increasing interest in exploring DNN hardware accelerator designs. Based on recent deep RL performance, we expect hardware designers to begin considering architectures appropriate for accelerating these algorithms too. However, it is currently unknown how, when, or if the 'noise' introduced by DNN optimization techniques will degrade deep RL performance. This work measures these impacts, using standard OpenAI Gym benchmarks. Our results show that mathematically optimized RL policies can perform comparably to full-precision RL policies while requiring substantially less computation. We also observe that some optimizations are better suited than others to particular problem domains. By beginning to understand the impacts of mathematical optimizations on RL policy performance, this work serves as a starting point toward the development of low-power or high-performance deep RL accelerators.
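As a minimal illustration of the kind of DNN optimization studied here, the sketch below applies PyTorch's post-training dynamic quantization to a small stand-in policy network; the study's actual policies, training procedure, and Gym evaluation loop are not reproduced, and the network shape is an arbitrary assumption.

```python
import torch
import torch.nn as nn

# Stand-in policy network: a small MLP of the kind deep RL methods typically
# learn for low-dimensional control tasks (e.g., CartPole-style observations).
policy = nn.Sequential(
    nn.Linear(4, 64), nn.Tanh(),
    nn.Linear(64, 64), nn.Tanh(),
    nn.Linear(64, 2),
)

# Post-training dynamic quantization of the linear layers to 8-bit integers,
# one example of the "mathematical optimization" techniques discussed above.
quantized = torch.quantization.quantize_dynamic(policy, {nn.Linear}, dtype=torch.qint8)

obs = torch.randn(1, 4)                       # placeholder observation
with torch.no_grad():
    full_logits = policy(obs)
    q_logits = quantized(obs)

# The greedy action choice usually survives the small quantization noise.
print(full_logits, q_logits)
print("same greedy action:", full_logits.argmax().item() == q_logits.argmax().item())
```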
Holes in germanium-rich heterostructures provide a compelling alternative for achieving spin-based qubits compared to traditional approaches such as electrons in silicon. In this project, we addressed the question of whether holes in Ge/SiGe quantum wells can be confined into laterally defined quantum dots and made into qubits. Through this effort, we successfully fabricated and operated multiple single-metal-layer quantum dot devices in Ge/SiGe. For single quantum dots, we measured the capacitances between the quantum dot and the surface electrodes and found that they compare reasonably with the values expected from the electrode dimensions, suggesting that we have formed a lithographic quantum dot. We also compare the results to detailed self-consistent calculations of the expected potential. Finally, we demonstrate, for the first time, a double quantum dot in the Ge/SiGe material system.
This study explores a Bayesian calibration framework for the RAMPAGE alloy potential model for the Cu-Ni and Cu-Zr systems. In RAMPAGE potentials, it is proposed that once calibrated potentials for individual elements are available, the inter-species interactions can be described by fitting a Morse potential for pair interactions with three parameters, while the densities for the embedding function can be scaled by two parameters from the elemental densities. Global sensitivity analysis tools were employed to understand the impact each parameter has on the MD simulation results. A transitional Markov Chain Monte Carlo algorithm was used to generate samples from the multimodal posterior distribution consistent with the discrepancy between MD simulation results and DFT data. For the Cu-Ni system, the posterior predictive tests indicate that the fitted interatomic potential model agrees well with the DFT data, justifying the basic RAMPAGE assumptions. For the Cu-Zr system, where the phase diagram suggests more complicated atomic interactions than in the case of Cu-Ni, the RAMPAGE potential captured only a subset of the DFT data. The resulting posterior distribution for the five model parameters exhibited several modes, with each mode corresponding to specific simulation data and a suboptimal agreement with the DFT results.
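As a sketch of the basic ingredients described above, the snippet below codes a three-parameter Morse pair term and a two-parameter rescaling of an elemental density function; the parameter values and the exact form of the density scaling are illustrative assumptions, not the calibrated RAMPAGE model.

```python
import numpy as np

def morse_pair(r, D, a, r0):
    """Three-parameter Morse pair term: well depth D, stiffness a, and
    equilibrium separation r0. V(r0) = -D and V -> 0 as r -> infinity."""
    x = np.exp(-a * (r - r0))
    return D * (x * x - 2.0 * x)

def scaled_cross_density(rho_elem, r, s_amp, s_len):
    """Hypothetical two-parameter rescaling (amplitude and length) of an
    elemental density function rho_elem(r), sketching how embedding-function
    densities might be reused across species."""
    return s_amp * rho_elem(s_len * r)

r = np.linspace(1.8, 6.0, 200)                      # angstrom-scale separations
v = morse_pair(r, D=0.4, a=1.6, r0=2.6)             # illustrative parameter values
rho = scaled_cross_density(lambda x: np.exp(-x), r, s_amp=0.9, s_len=1.1)
print(v.min(), rho[:3])
```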
In this report, we present preliminary research into nonparametric clustering methods for multi-source imagery data and into quantifying the performance of these models. In many domain areas, data sets do not necessarily follow well-defined and well-known probability distributions, such as the normal, gamma, and exponential. This is especially true when combining data from multiple sources describing a common set of objects (which we call multimodal analysis), where the data in each source can follow different distributions and need to be analyzed in conjunction with one another. This necessitates nonparametric density estimation methods, which allow the data themselves to dictate the form of the estimated distribution. One prominent example of multimodal analysis is multimodal image analysis, in which we analyze multiple images of the same scene of interest taken using different radar systems. We develop methods to exploit the uncertainty information that is inherent in the use of probabilistic models but often not taken advantage of, in order to assess the performance of probabilistic clustering methods used for analyzing multimodal images. This added information helps assess model performance and how much trust decision-makers should have in the obtained analysis results. The developed methods illustrate some ways in which uncertainty can inform decisions that arise when designing and using machine learning models.
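A minimal sketch of the idea, using scikit-learn's Dirichlet-process Gaussian mixture as a stand-in nonparametric clustering model and per-object responsibility entropy as one simple uncertainty measure; the project's models and imagery data are not reproduced, and the two-column toy data below only mimics two sensing sources.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Toy two-source ("multimodal") data: two feature columns per object, as if
# the same objects were observed by two different sensors.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(200, 2)),
    rng.normal([3, 1], 0.8, size=(200, 2)),
])

# Dirichlet-process-style mixture: the number of active components is inferred
# rather than fixed, one simple stand-in for a nonparametric clustering model.
model = BayesianGaussianMixture(
    n_components=10, weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

resp = model.predict_proba(X)                 # per-object posterior responsibilities
entropy = -(resp * np.log(resp + 1e-12)).sum(axis=1)

# High-entropy objects are the ones whose cluster assignment the model is
# least sure about, i.e., the per-object uncertainty worth surfacing to
# decision-makers.
print("mean assignment entropy:", entropy.mean())
print("most uncertain objects:", np.argsort(entropy)[-5:])
```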
Scope and Objectives: Kokkos Support provides cyber resources and conducts training events for current and prospective Kokkos users. In-person training events are organized in various venues and provide both generic Kokkos tutorials, with lectures and exercises, and hands-on work on users' applications.
Sparse matrix-matrix multiplication is a key kernel that has applications in several domains, such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depends on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, KKSPGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two-phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.
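For illustration, here is a minimal row-by-row (Gustavson-style) sparse matrix-matrix multiply with a hash-map accumulator, the kind of data-structure choice such algorithms compare (a dense array or sorted list could be swapped in); KKSPGEMM itself is a portable C++/Kokkos implementation, so this Python version is only schematic.

```python
from collections import defaultdict

def spgemm_csr(A_rowptr, A_cols, A_vals, B_rowptr, B_cols, B_vals):
    """Row-by-row sparse matrix-matrix multiply over CSR inputs using a
    hash-map accumulator for each output row."""
    C_rowptr, C_cols, C_vals = [0], [], []
    for i in range(len(A_rowptr) - 1):
        acc = defaultdict(float)                     # hash-map accumulator
        for jj in range(A_rowptr[i], A_rowptr[i + 1]):
            k, a_ik = A_cols[jj], A_vals[jj]
            for kk in range(B_rowptr[k], B_rowptr[k + 1]):
                acc[B_cols[kk]] += a_ik * B_vals[kk]
        for j in sorted(acc):                        # emit the compressed row
            C_cols.append(j)
            C_vals.append(acc[j])
        C_rowptr.append(len(C_cols))
    return C_rowptr, C_cols, C_vals

# 2x2 example: A = [[1, 2], [0, 3]], B = [[0, 4], [5, 0]]; C = A @ B.
print(spgemm_csr([0, 2, 3], [0, 1, 1], [1.0, 2.0, 3.0],
                 [0, 1, 2], [1, 0], [4.0, 5.0]))
# -> ([0, 2, 3], [0, 1, 0], [10.0, 4.0, 15.0])
```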
ParaView Catalyst is an API for accessing the scalable visualization infrastructure of ParaView in an in-situ context. In-situ visualization allows simulation codes to perform data post-processing operations while the simulation is running. In-situ techniques can reduce data post-processing time, allow computational steering, and increase the resolution and frequency of data output. For a simulation code to use ParaView Catalyst, adapter code needs to be created that interfaces the simulation's data structures to ParaView/VTK data structures. Under ATDM, Catalyst is to be integrated with SPARC, a code used for simulation of unsteady reentry vehicle flow.
Prokopenko, Andrey; Thomas, Stephen; Swirydowicz, Kasia; Ananthan, Shreyas; Hu, Jonathan J.; Williams, Alan B.; Sprague, Michael
The goal of the ExaWind project is to enable predictive simulations of wind farms composed of many MW-scale turbines situated in complex terrain. Predictive simulations will require computational fluid dynamics (CFD) simulations for which the mesh resolves the geometry of the turbines, and captures the rotation and large deflections of blades. Whereas such simulations for a single turbine are arguably petascale class, multi-turbine wind farm simulations will require exascale-class resources. The primary code in the ExaWind project is Nalu, which is an unstructured-grid solver for the acoustically incompressible Navier-Stokes equations, in which mass continuity is maintained through pressure projection. The model consists of the mass-continuity Poisson-type equation for pressure and a momentum equation for the velocity. For such modeling approaches, simulation times are dominated by linear-system setup and solution for the continuity and momentum systems. For the ExaWind challenge problem, the moving meshes greatly affect overall solver costs, as re-initialization of matrices and re-computation of preconditioners is required at every time step. In this Milestone, we examine the effect of threading on solver-stack performance against the flat-MPI results obtained in previous milestones, using Haswell performance data from full-turbine simulations. Whereas the momentum equations are solved only with the Trilinos solvers, we investigate two algebraic-multigrid preconditioners for the continuity equations: Trilinos/MueLu and HYPRE/BoomerAMG. These two packages embody smoothed-aggregation and classical Ruge-Stüben AMG methods, respectively. In our FY18 Q2 report, we described our efforts to improve setup and solve of the continuity equations under flat-MPI parallelism. While significant improvement was demonstrated in the solve phase, setup times remained larger than expected. Starting with the optimized settings described in the Q2 report, we explore here simulation performance where OpenMP threading is employed in the solver stack. For Trilinos, threading is achieved through the Kokkos abstraction layer, whereas HYPRE/BoomerAMG employs OpenMP directly. We examined results for our mid-resolution baseline turbine simulation configuration (229M DOF). Simulations on 2048 Haswell cores explored the effect of decreasing the number of MPI ranks while increasing the number of threads. Both HYPRE and Trilinos exhibited similar overall solution times, and both showed dramatic increases in simulation time in the shift from MPI ranks to OpenMP threads. This increase is attributed to the large amount of work per MPI rank starting at the single-thread configuration. Decreasing MPI ranks while increasing threads may be increasing simulation time due to thread synchronization and start-up overhead contributing to the latency and serial time in the model. These results suggest that an MPI+OpenMP parallel decomposition will be more effective as the amount of computation per MPI rank decreases and the communication latency increases. This idea was demonstrated in a strong-scaling study of our low-resolution baseline model (29M DOF) with the Trilinos-HYPRE configuration. While MPI-only results showed scaling improvement out to about 1536 cores, engaging threading carried scaling improvements out to 4128 cores, roughly 7000 DOF per core. This is an important result, as improved strong scaling is needed for simulations to be executed over sufficiently long simulated durations (i.e., for many timesteps).
In addition to the threading work described above, the team examined solver-performance improvements by exploring communication overhead in the HYPRE GMRES implementation through a communication-optimal GMRES algorithm (CO-GMRES), and by offloading compute-intensive solver actions to GPUs. To those ends, a HYPRE mini-app was developed to allow us to easily test different solver approaches and HYPRE parameter settings without running the entire Nalu code. With GPU acceleration on the Summitdev supercomputer, a 20x speedup was achieved for the overall preconditioner and solver execution time in the mini-app. A study on Haswell processors showed that CO-GMRES provides increasing benefits as the number of MPI ranks increases.
This final report summarizes the results of the Laboratory Directed Research and Development (LDRD) Project Number 212587, entitled "Modeling Charged Defects in Non-Cubic Semiconductors for Radiation Effects Studies in Next Generation Electronic Materials". The goal of this project was to extend a predictive capability for modeling defect level energies using first-principles density functional theory methods (e.g., for radiation effects assessments) to semiconductors with non-cubic crystal structures. Computational methods that proved accurate for predicting defect levels in standard cubic semiconductors were found to have shortcomings when applied to the lowered-symmetry structures prevalent in next generation electronic materials such as SiC, GaN, and Ga2O3, stemming from an error in the treatment of the electrostatic boundary conditions. I describe methods to generalize the local moment countercharge (LMCC) scheme to position a charge in bulk supercell calculations of charged defects, circumventing the problem of measuring a dipole in a periodically replicated bulk calculation.
This SAND report fulfills the final report requirement for the Born Qualified Grand Challenge LDRD. Born Qualified was funded from FY16 to FY18 with a total budget of approximately $13M over its three years of funding. Overall, more than 70 staff, postdocs, and students supported this project over its lifetime. The driver for Born Qualified was using Additive Manufacturing (AM) to change the qualification paradigm for low-volume, high-value, high-consequence, complex parts that are common in high-risk industries such as ND, defense, energy, aerospace, and medical. AM offers the opportunity to transform design, manufacturing, and qualification with its unique capabilities. AM is a disruptive technology, allowing the capability to simultaneously create part and material while tightly controlling and monitoring the manufacturing process at the voxel level, with the inherent flexibility and agility of printing layer by layer. AM enables the possibility of measuring critical material and part parameters during manufacturing, thus changing the way we collect data, assess performance, and accept or qualify parts. It provides an opportunity to shift from the current iterative design-build-test qualification paradigm using traditional manufacturing processes to design-by-predictivity, where requirements are addressed concurrently and rapidly. The new qualification paradigm driven by AM provides the opportunity to predict performance probabilistically, to optimally control the manufacturing process, and to implement accelerated cycles of learning. Exploiting these capabilities to realize a new uncertainty-quantification-driven qualification that is rapid, flexible, and practical is the focus of this effort.
A key component of most large-scale rendering systems is a parallel image compositing algorithm, and the most commonly used compositing algorithms are binary swap and its variants. Although shown to be very efficient, one of the classic limitations of binary swap is that it only works on a number of processes that is a perfect power of 2. Multiple variations of binary swap have been independently introduced to overcome this limitation and handle process counts with factors other than 2. To date, few of these approaches have been directly compared against each other, making it unclear which approach is best. This paper presents a fresh implementation of each of these methods, built in a common software framework so that the approaches to running binary swap with odd factors can be compared directly. The results show that some simple compositing approaches work as well as or better than more complex algorithms that are more difficult to implement.
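As a point of reference, here is a minimal single-process simulation of classic binary swap for power-of-two counts only; the odd-factor variants compared in the paper build on this pattern. The compositing operator is plain addition purely for illustration, standing in for z-buffer or front-to-back "over" compositing.

```python
import numpy as np

def binary_swap(images, composite=np.add):
    """Toy, single-process simulation of binary-swap image compositing.
    `images` is a list of equally sized 1-D arrays, one per simulated rank.
    Returns {rank: (offset, piece)}: the slice of the composited image each
    rank owns at the end."""
    p = len(images)
    assert p > 0 and p & (p - 1) == 0, "classic binary swap needs 2^k ranks"
    n = images[0].size
    # Each simulated rank starts with its full local image covering [0, n).
    owned = {r: (0, n, images[r].astype(float).copy()) for r in range(p)}
    step = 1
    while step < p:
        nxt = {}
        for r in range(p):
            partner = r ^ step                      # exchange partner this round
            lo, hi, mine = owned[r]
            _, _, theirs = owned[partner]           # partner covers the same extent
            mid = (lo + hi) // 2
            if r < partner:                         # keep and composite the lower half
                keep = slice(0, mid - lo)
                nxt[r] = (lo, mid, composite(mine[keep], theirs[keep]))
            else:                                   # keep and composite the upper half
                keep = slice(mid - lo, hi - lo)
                nxt[r] = (mid, hi, composite(mine[keep], theirs[keep]))
        owned = nxt
        step *= 2
    return {r: (lo, piece) for r, (lo, hi, piece) in owned.items()}

# Gathering the pieces (in offset order) reproduces the composite of all images.
imgs = [np.random.rand(16) for _ in range(8)]
pieces = binary_swap(imgs)
full = np.concatenate([piece for _, piece in sorted(pieces.values(), key=lambda t: t[0])])
assert np.allclose(full, np.sum(imgs, axis=0))
```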
File fragment classification is an important step in the task of file carving in digital forensics. In file carving, files must be reconstructed based on their content as a result of their fragmented storage on disk or in memory. Existing methods for classification of file fragments typically use hand-engineered features, such as byte histograms or entropy measures. In this paper, we propose an approach using sparse coding that enables automated feature extraction. Sparse coding, or sparse dictionary learning, is an unsupervised learning algorithm, and is capable of extracting features based simply on how well those features can be used to reconstruct the original data. With respect to file fragments, we learn sparse dictionaries for n-grams, contiguous sequences of bytes, of different sizes. These dictionaries may then be used to estimate n-gram frequencies for a given file fragment, but for significantly larger n-gram sizes than are typically found in existing methods, which suffer from combinatorial explosion. To demonstrate the capability of our sparse coding approach, we used the resulting features to train standard classifiers, such as support vector machines, over multiple file types. Experimentally, we achieved significantly better classification results than existing methods, especially when the features were used to supplement existing hand-engineered features.
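A minimal sketch of such a pipeline, using scikit-learn's dictionary learning as a stand-in for the sparse coding step; the fragment data, dictionary size, n-gram length, and labels below are placeholders rather than the paper's configuration.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.svm import LinearSVC

def byte_ngrams(fragment: bytes, n: int = 4) -> np.ndarray:
    """Slide an n-byte window over a fragment and stack the windows as rows."""
    arr = np.frombuffer(fragment, dtype=np.uint8).astype(float) / 255.0
    return np.lib.stride_tricks.sliding_window_view(arr, n)

# Learn an overcomplete dictionary of n-gram "atoms" from unlabeled fragments.
rng = np.random.default_rng(0)
train_frags = [bytes(rng.integers(0, 256, 512, dtype=np.uint8)) for _ in range(20)]
X = np.vstack([byte_ngrams(f) for f in train_frags])
dico = MiniBatchDictionaryLearning(n_components=64, alpha=1.0, random_state=0).fit(X)

def fragment_features(fragment: bytes) -> np.ndarray:
    """Pool sparse codes over a fragment into a fixed-length feature vector."""
    codes = dico.transform(byte_ngrams(fragment))
    return np.abs(codes).mean(axis=0)          # average activation per atom

# The pooled codes can then feed any standard classifier, e.g., a linear SVM.
labels = np.arange(len(train_frags)) % 2       # placeholder file-type labels
feats = np.vstack([fragment_features(f) for f in train_frags])
clf = LinearSVC().fit(feats, labels)
print(clf.score(feats, labels))
```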
We propose a functional integral framework for the derivation of hierarchies of Landau-Lifshitz-Bloch (LLB) equations that describe the flow toward equilibrium of the first and second moments of the magnetization. The short-scale description is defined by the stochastic Landau-Lifshitz-Gilbert equation, under both Markovian and non-Markovian noise, and takes into account interaction terms that are of practical relevance. Depending on the interactions, different hierarchies of moments are obtained in the corresponding LLB equations. Two closure Ansätze are discussed and tested by numerical methods that are adapted to the symmetries of the problem. Our formalism provides a rigorous bridge between atomistic spin dynamics simulations at short scales and micromagnetic descriptions at larger scales.
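For reference, one common Landau-Lifshitz form of the stochastic LLG equation that defines the short-scale dynamics, shown here with a Markovian white-noise correlator (a sketch; the paper's conventions for units, the noise coupling, and the non-Markovian case differ in detail).

```latex
% Landau-Lifshitz form of the stochastic LLG equation for a unit spin s,
% with Gilbert damping alpha, gyromagnetic ratio gamma, effective field
% H_eff, and Gaussian white noise xi of strength D (Markovian case).
\begin{equation}
  \frac{d\mathbf{s}}{dt}
    = -\frac{\gamma}{1+\alpha^{2}}\,
        \mathbf{s}\times\left(\mathbf{H}_{\mathrm{eff}}+\boldsymbol{\xi}\right)
      -\frac{\gamma\alpha}{1+\alpha^{2}}\,
        \mathbf{s}\times\left[\mathbf{s}\times\left(\mathbf{H}_{\mathrm{eff}}+\boldsymbol{\xi}\right)\right],
  \qquad
  \left\langle \xi_{i}(t)\,\xi_{j}(t')\right\rangle = 2D\,\delta_{ij}\,\delta(t-t').
\end{equation}
```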
Trusting simulation output is crucial for Sandia's mission objectives. We rely on these simulations to perform our high-consequence mission tasks given our treaty obligations. Other science and modelling needs, while they may not be high-consequence, still require the strongest levels of trust to enable using the result as the foundation for both practical applications and future research. To this end, the computing community has developed workflow and provenance systems to aid both in automating simulation and modelling execution and in determining exactly how a given output was created, so that conclusions can be drawn from the data. Current approaches for workflows and provenance systems all operate at the user level and have little to no system-level support, making them fragile, difficult to use, and incomplete solutions. The introduction of container technology is a first step towards encapsulating and tracking artifacts used in creating data and resulting insights, but its current implementation is focused solely on making it easy to deploy an application in an isolated "sandbox" and maintaining a strictly read-only mode to avoid any potential changes to the application. All storage activities still use the system-level shared storage. This project was an initial exploration into extending the container concept to also include storage and to use writable containers, auto-generated by the system, as a way to link the contained data back to the simulation and input deck used to create it.
Reproducibility is an essential ingredient of the scientific enterprise. The ability to reproduce results builds trust that we can rely on the results as foundations for future scientific exploration. Presently, the fields of computational and computing sciences provide two opposing definitions of reproducible and replicable. In computational sciences, reproducible research means authors provide all necessary data and computer codes to run analyses again, so others can re-obtain the results (J. Claerbout et al., 1992). The concept was adopted and extended by several communities, where it was distinguished from replication: collecting new data to address the same question, and arriving at consistent findings (Peng et al. 2006). The Association for Computing Machinery (ACM), representing computer science and industry professionals, recently established a reproducibility initiative, adopting essentially opposite definitions. The purpose of this report is to raise awareness of the opposing definitions and to propose a path to a compatible taxonomy.
This report summarizes the data analysis activities that were performed under the Born Qualified Grand Challenge Project from 2016 to 2018. It is meant to document the characterization of additively manufactured parts and processes for this project, as well as to demonstrate and identify further analyses and data science that could be done relating material processes to microstructure to properties to performance.
This report summarizes the work performed under the Sandia LDRD project "Adverse Event Prediction Using Graph-Augmented Temporal Analysis." The goal of the project was to develop a method for analyzing multiple time-series data streams to identify precursors providing advance warning of the potential occurrence of events of interest. The proposed approach combined temporal analysis of each data stream with reasoning about relationships between data streams using a geospatial-temporal semantic graph. This class of problems is relevant to several important topics of national interest. In the course of this work we developed new temporal analysis techniques, including temporal analysis using Markov Chain Monte Carlo techniques, temporal shift algorithms to refine forecasts, and a version of Ripley's K-function extended to support temporal precursor identification. This report summarizes the project's major accomplishments and gathers the abstracts and references for the publication submissions and reports that were prepared as part of this work. We then describe work in progress that is not yet ready for publication.
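As one plausible reading of a temporally extended Ripley's K-function for precursor identification (a hedged sketch only; the project's definition may differ), the snippet below counts candidate events falling in a lag window immediately before each event of interest and normalizes by the candidate rate, so values well above the lag indicate temporal clustering of candidates just ahead of the events.

```python
import numpy as np

def temporal_precursor_k(event_times, candidate_times, lags, duration):
    """For each lag tau, average the number of candidate events in the window
    [t - tau, t) before each event of interest, divided by the candidate rate.
    Under a homogeneous Poisson null the value is roughly tau."""
    event_times = np.asarray(event_times, dtype=float)
    candidate_times = np.asarray(candidate_times, dtype=float)
    rate = len(candidate_times) / duration
    k = []
    for tau in lags:
        counts = [np.sum((candidate_times >= t - tau) & (candidate_times < t))
                  for t in event_times]
        k.append(np.mean(counts) / rate)
    return np.array(k)

# Synthetic check: inject candidates that cluster shortly before each event.
rng = np.random.default_rng(1)
events = rng.uniform(0, 100, 20)
precursors = np.concatenate([events - rng.uniform(0, 2, 20),   # true precursors
                             rng.uniform(0, 100, 50)])          # background noise
print(temporal_precursor_k(events, precursors, lags=[1, 2, 5, 10], duration=100))
```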
Running visualization and analysis algorithms on ATS-1 platforms is a critical step for supporting ATDM apps at the exascale. We are leveraging VTK-m to port our algorithms to the ATS-specific hardware and ensuring that they run well.
Neural-inspired spike-based computing machines often claim to achieve considerable advantages in terms of energy and time efficiency by using spikes for computation and communication. However, fundamental questions about spike-based computation remain unanswered. For instance, how much advantage do spike-based approaches have over conventional methods, and under what circumstances does spike-based computing provide a comparative advantage? Simply implementing existing algorithms using spikes as the medium of computation and communication is not guaranteed to yield an advantage. Here, we demonstrate that spike-based communication and computation within algorithms can increase throughput, and they can decrease energy cost in some cases. We present several spiking algorithms, including sorting a set of numbers in ascending/descending order, as well as finding the maximum, minimum, or median of a set of numbers. We also provide an example application: a spiking median-filtering approach for image processing providing a low-energy, parallel implementation. The algorithms and analyses presented here demonstrate that spiking algorithms can provide performance advantages and offer efficient computation of fundamental operations useful in more complex algorithms.
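As a toy illustration of one such primitive, the following simulates a spike-timing sort in plain Python: each value is encoded as the firing time of its own input neuron, and reading spikes off in arrival order yields the values in ascending order. This is a serial simulation of the idea, not a neuromorphic implementation, and the time resolution is an assumed parameter.

```python
def spiking_sort(values, resolution=1.0):
    """Discrete-time simulation of a spike-timing sort for non-negative values:
    larger value -> later spike, so the spike arrival order is the sorted order,
    in time proportional to the largest value rather than n log n comparisons."""
    # Encode: neuron i fires at tick round(value / resolution).
    fire_time = {i: int(round(v / resolution)) for i, v in enumerate(values)}
    order, t = [], 0
    while len(order) < len(values):
        # All neurons scheduled for this tick fire "simultaneously"; ties are
        # read out in neuron-index order.
        for i in sorted(i for i, ft in fire_time.items() if ft == t):
            order.append(i)
        t += 1
    return [values[i] for i in order]

data = [3, 1, 4, 1, 5, 9, 2, 6]
assert spiking_sort(data) == sorted(data)
```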
An analysis of microgrids to increase resilience was conducted for the island of Puerto Rico. Critical infrastructure throughout the island was mapped to the key services provided by those sectors to help inform primary and secondary service sources during a major disruption to the electrical grid. Additionally, a resilience metric of burden was developed to quantify community resilience, and a related baseline resilience figure was calculated for the area. To improve resilience, Sandia performed an analysis of where clusters of critical infrastructure are located and used these suggested resilience node locations to create a portfolio of 159 microgrid options throughout Puerto Rico. The team then calculated the impact of these microgrids on the region's ability to provide critical services during an outage, and compared this impact to high-level estimates of cost for each microgrid to generate a set of efficient microgrid portfolios costing in the range of 218-917M dollars. This analysis is a refinement of the analysis delivered on June 01, 2018.
Concurrency and Computation: Practice and Experience
Bernholdt, David E.; Boehm, Swen; Bosilca, George; Venkata, Manjunath G.; Grant, Ryan E.; Naughton, Thomas; Pritchard, Howard P.; Schulz, Martin; Vallee, Geoffroy R.
The Exascale Computing Project (ECP) is currently the primary effort in the United States focused on developing “exascale” levels of computing capabilities, including hardware, software, and applications. In order to obtain a more thorough understanding of how the software projects under the ECP are using, and planning to use, the Message Passing Interface (MPI), and to help guide the work of our own project within the ECP, we created a survey. Of the 97 ECP projects active at the time the survey was distributed, we received 77 responses, 56 of which reported that their projects were using MPI. This paper reports the results of that survey for the benefit of the broader community of MPI developers.