This report details work to study trade-offs in topology and network bandwidth for potential interconnects in the exascale (2021-2022) timeframe. The work was done using multiple interconnect models across two parallel discrete event simulators. Results from each independent simulator are shown and discussed and the areas of agreement and disagreement are explored.
In support of analyst requests for Mobile Guardian Transport studies, researchers at Sandia National Laboratories have expanded data types for the Slycat ensemble-analysis and visualization tool to include 3D surface meshes. This new capability represents a significant advance in our ability to perform detailed comparative analysis of simulation results. Analyzing mesh data rather than images provides greater flexibility for post-processing exploratory analysis.
If quantum information processors are to fulfill their potential, the diverse errors that affect them must be understood and suppressed. But errors typically fluctuate over time, and the most widely used tools for characterizing them assume static error modes and rates. This mismatch can cause unheralded failures, misidentified error modes, and wasted experimental effort. Here, we demonstrate a spectral analysis technique for resolving time dependence in quantum processors. Our method is fast, simple, and statistically sound. It can be applied to time-series data from any quantum processor experiment. We use data from simulations and trapped-ion qubit experiments to show how our method can resolve time dependence when applied to popular characterization protocols, including randomized benchmarking, gate set tomography, and Ramsey spectroscopy. In the experiments, we detect instability and localize its source, implement drift control techniques to compensate for this instability, and then demonstrate that the instability has been suppressed.
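As an illustration of the idea of spectral analysis on time-series outcome data, the following is a minimal sketch, not the authors' method. It assumes binary (0/1) outcomes recorded at equally spaced times, uses a plain discrete Fourier transform as a stand-in for whatever transform the paper employs, and flags frequencies with a Bonferroni-corrected threshold. The function names, the simulated drift, and the significance level are all hypothetical choices made for this example.

```python
import cmath
import math
import random


def simulate_outcomes(n, freq, seed=0):
    """Simulate n binary outcomes whose success probability drifts sinusoidally.

    Hypothetical test data: p(t) = 0.5 + 0.3 * cos(2*pi*freq*t/n).
    """
    rng = random.Random(seed)
    outcomes = []
    for t in range(n):
        p = 0.5 + 0.3 * math.cos(2 * math.pi * freq * t / n)
        outcomes.append(1 if rng.random() < p else 0)
    return outcomes


def power_spectrum(x):
    """Return normalized power at frequency indices 1..n//2 of a 0/1 series.

    The series is standardized using the variance a constant-probability
    (drift-free) process would have, so each power is approximately
    exponentially distributed with mean 1 when there is no drift.
    """
    n = len(x)
    mean = sum(x) / n
    var = mean * (1 - mean)
    if var == 0:
        return []  # all outcomes identical: nothing to analyze
    z = [(v - mean) / math.sqrt(var) for v in x]
    powers = []
    for k in range(1, n // 2 + 1):
        coeff = sum(z[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        powers.append(abs(coeff) ** 2 / n)
    return powers


def significant_frequencies(x, alpha=0.05):
    """Return frequency indices whose power exceeds a Bonferroni-corrected threshold."""
    powers = power_spectrum(x)
    if not powers:
        return []
    # Under the approximate no-drift distribution, P(power > thr) = exp(-thr).
    threshold = -math.log(alpha / len(powers))
    return [k + 1 for k, p in enumerate(powers) if p > threshold]
```

Calling `significant_frequencies(simulate_outcomes(512, 5))` would be expected to report frequency index 5 for this strongly drifting example; a real analysis would need the statistical treatment described in the paper itself.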
Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020
The Message Passing Interface (MPI) standard allows user-level threads to concurrently call into an MPI library. While this feature is currently rarely used, there is considerable interest from developers in adopting it in the near future. There is reason to believe that multithreaded communication may incur additional message processing overheads in terms of number of items searched during demultiplexing and amount of time spent searching because it has the potential to increase the number of messages exchanged and to introduce non-deterministic message ordering. Therefore, understanding the implications of adding multithreading to MPI applications is important for future application development. One strategy for advancing this understanding is through 'low-cost' benchmarks that emulate full communication patterns using fewer resources. For example, while a complete, 'real-world' multithreaded halo exchange requires 9 or 27 nodes, the low-cost alternative needs only two, making it deployable on systems where acquiring resources is difficult because of high utilization (e.g., busy capacity-computing systems), or impossible because the necessary resources do not exist (e.g., testbeds with too few nodes). While such benchmarks have been proposed, the reported results have been limited to a single architecture or derived indirectly through simulation, and no attempt has been made to confirm that a low-cost benchmark accurately captures features of full (non-emulated) exchanges. Moreover, benchmark code has not been made publicly available. The purpose of the study presented in this paper is to quantify how accurately the low-cost benchmark captures the matching behavior of the full, real-world benchmark. In the process, we also advocate for the feasibility and utility of the low-cost benchmark.
We present a 'real-world' benchmark implementing a full multithreaded halo exchange on 9 and 27 nodes, as defined by 5-point and 9-point 2D stencils, and 7-point and 27-point 3D stencils. Likewise, we present a 'low-cost' benchmark that emulates these communication patterns using only two nodes. We then confirm, across multiple architectures, that the low-cost benchmark gives accurate estimates of both the number of items searched during message processing and the time spent processing those messages. Finally, we demonstrate the utility of the low-cost benchmark by using it to profile the performance impact of state-of-the-art Mellanox ConnectX-5 hardware support for offloaded MPI message demultiplexing. To facilitate further research on the effects of multithreaded MPI on message matching behavior, the source of our two benchmarks is to be included in the next release version of the Sandia MPI Micro-Benchmark Suite.
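To make the stencil terminology concrete, the sketch below enumerates the neighbor offsets each stencil implies for a node at the center of a 3x3 (2D) or 3x3x3 (3D) grid. It only counts communication partners and is not taken from the benchmark source; the function name and the `include_diagonals` flag are hypothetical.

```python
from itertools import product


def stencil_neighbors(dim, include_diagonals):
    """Return neighbor offsets for the center node of a 3^dim grid.

    With include_diagonals=False only face neighbors are kept
    (5-point in 2D, 7-point in 3D); with True, every surrounding
    node is a neighbor (9-point in 2D, 27-point in 3D).
    """
    offsets = []
    for off in product((-1, 0, 1), repeat=dim):
        nonzero = sum(1 for o in off if o != 0)
        if nonzero == 0:
            continue  # skip the center node itself
        if include_diagonals or nonzero == 1:
            offsets.append(off)
    return offsets


def nodes_in_full_exchange(dim):
    """Center node plus all surrounding nodes: 9 in 2D, 27 in 3D."""
    return 3 ** dim
```

Under these definitions the center node exchanges halos with 4, 8, 6, and 26 neighbors for the 5-point, 9-point, 7-point, and 27-point stencils respectively, which is consistent with the 9- and 27-node configurations named in the abstract.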
Alemazkoor, Negin; Rachunok, Benjamin; Chavas, Daniel R.; Staid, Andrea S.; Louhghalam, Arghavan; Nateghi, Roshanak; Tootkaboni, Mazdak
Nine in ten major outages in the US have been caused by hurricanes. Long-term outage risk is a function of climate change-triggered shifts in hurricane frequency and intensity; yet projections of both remain highly uncertain. However, outage risk models do not account for the epistemic uncertainties in physics-based hurricane projections under climate change, largely due to the extreme computational complexity. Instead they use simple probabilistic assumptions to model such uncertainties. Here, we propose a transparent and efficient framework to, for the first time, bridge the physics-based hurricane projections and intricate outage risk models. We find that uncertainty in projections of the frequency of weaker storms explains over 95% of the uncertainty in outage projections; thus, reducing this uncertainty will greatly improve outage risk management. We also show that the expected annual fraction of affected customers exhibits large variances, warranting the adoption of robust resilience investment strategies and climate-informed regulatory frameworks.
Partial differential equations (PDEs) are used with huge success to model phenomena across all scientific and engineering disciplines. However, across an equally wide swath, there exist situations in which PDEs fail to adequately model observed phenomena, or are not the best available model for that purpose. On the other hand, in many situations, nonlocal models that account for interaction occurring at a distance have been shown to more faithfully and effectively model observed phenomena that involve possible singularities and other anomalies. Here, we consider a generic nonlocal model, beginning with a short review of its definition, the properties of its solution, its mathematical analysis and of specific concrete examples. We then provide extensive discussions about numerical methods, including finite element, finite difference and spectral methods, for determining approximate solutions of the nonlocal models considered. In that discussion, we pay particular attention to a special class of nonlocal models that are the most widely studied in the literature, namely those involving fractional derivatives. The article ends with brief considerations of several modelling and algorithmic extensions, which serve to show the wide applicability of nonlocal modelling.
Tranchida, Julien G.; Dos Santos, Gonzalo; Aparicio, Romina; Linares, D.; Miranda, E.N.; Pastor, Gustavo M.; Bringa, Eduardo M.
In this paper, the magnetic behavior of bcc iron nanoclusters, with diameters between 2 and 8 nm, is investigated by means of spin dynamics simulations coupled to molecular dynamics, using a distance-dependent exchange interaction. Finite-size effects in the total magnetization as well as the influence of the free surface and the surface/core proportion of the nanoclusters are analyzed in detail for a wide temperature range, going beyond the cluster and bulk Curie temperatures. Comparison is made with experimental data and with theoretical models based on the mean-field Ising model adapted to small clusters, and taking into account the influence of low coordinated spins at free surfaces. Our results for the temperature dependence of the average magnetization per atom M(T), including the thermalization of the translational lattice degrees of freedom, are in very good agreement with available experimental measurements on small Fe nanoclusters. In contrast, significant discrepancies with experiment are observed if the translational degrees of freedom are artificially frozen. The finite-size effects on M(T) are found to be particularly important near the cluster Curie temperature. Simulated magnetization above the Curie temperature scales with cluster size as predicted by models assuming short-range magnetic ordering. Analytical approximations to the magnetization as a function of temperature and size are proposed.
Anzt, Hartwig; Boman, Erik G.; Gates, Mark; Kruger, Scott; Li, Sherry; Loe, Jennifer A.; Osei-Kuffuor, Daniel; Tomov, Stan; Tsai, Yaohung M.; Meier Yang, Ulrike
The use of multiple types of precision in mathematical software has the potential to increase its performance on new heterogeneous architectures. The xSDK project focuses both on the investigation and development of multiprecision algorithms as well as their inclusion into xSDK member libraries. This report summarizes current efforts on including and/or using mixed precision capabilities in the math libraries Ginkgo, heFFTe, hypre, MAGMA, PETSc/TAO, SLATE, SuperLU, and Trilinos, including KokkosKernels. It contains both numerical results from libraries that already provide mixed precision capabilities, as well as descriptions of the strategies to incorporate multiprecision into established libraries.
Neuromorphic computing is a critical future technology for the computing industry, but it has yet to achieve its promise and has struggled to establish a cohesive research community. A large part of the challenge is that full realization of the potential of brain inspiration requires advances in device hardware, computing architectures, and algorithms. This simultaneous development across technology scales is unprecedented in the computing field. This article presents a strategy, framed by market and policy pressures, for moving past these current technological and cultural hurdles to realize the field's full impact across technology. Achieving the full potential of brain-derived algorithms as well as post-complementary metal-oxide-semiconductor (CMOS) scaling neuromorphic hardware requires appropriately balancing the near-term opportunities of deep learning applications with the long-term potential of less understood opportunities in neural computing.
The application of deep learning toward discovery of data-driven models requires careful application of inductive biases to obtain a description of physics which is both accurate and robust. We present here a framework for discovering continuum models from high fidelity molecular simulation data. Our approach applies a neural network parameterization of governing physics in modal space, allowing a characterization of differential operators while providing structure which may be used to impose biases related to symmetry, isotropy, and conservation form. Here, we demonstrate the effectiveness of our framework for a variety of physics, including local and nonlocal diffusion processes and single and multiphase flows. For the flow physics we demonstrate this approach leads to a learned operator that generalizes to system characteristics not included in the training sets, such as variable particle sizes, densities, and concentration.
Here, we describe recent efforts to improve our predictive modeling of rate-dependent behavior at, or near, a phase transition using molecular dynamics simulations. Cadmium sulfide (CdS) is a well-studied material that undergoes a solid-solid phase transition from the wurtzite to the rock salt structure between 3 and 9 GPa. Atomistic simulations are used to investigate the dominant transition mechanisms as a function of orientation, size, and rate. We found that the final rock salt orientations were determined relative to the initial wurtzite orientation, and that they differed between the two initial orientations and the two pressure regimes studied. The CdS solid-solid phase transition is studied for both a bulk single crystal and polymer-encapsulated spherical nanoparticles of various sizes.
Signal arrival-time estimation plays a critical role in a variety of downstream seismic analyses, including location estimation and source characterization. Any arrival-time errors propagate through subsequent data-processing results. In this article, we detail a general framework for refining estimated seismic signal arrival times along with full estimation of their associated uncertainty. Using the standard short-term average/long-term average threshold algorithm to identify a search window, we demonstrate how to refine the pick estimate through two different approaches. In both cases, new waveform realizations are generated through bootstrap algorithms to produce full a posteriori estimates of uncertainty of onset arrival time of the seismic signal. The onset arrival uncertainty estimates provide additional data-derived information from the signal and have the potential to influence seismic analysis along several fronts.
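For readers unfamiliar with the short-term average/long-term average (STA/LTA) threshold algorithm mentioned above, the following is a minimal sketch of a classic trailing-window variant. It assumes squared amplitude as the characteristic function and windows that both end at the current sample; the window lengths, threshold, and function names are hypothetical, and the bootstrap refinement described in the article is not reproduced here.

```python
def sta_lta_ratio(signal, n_sta, n_lta):
    """Return the STA/LTA ratio of squared amplitude at each sample.

    Both windows end at the current sample. Samples before the first
    full long-term window are assigned a ratio of 0.0.
    """
    if not 0 < n_sta < n_lta:
        raise ValueError("require 0 < n_sta < n_lta")
    # Prefix sums of signal energy give each window mean in O(1).
    prefix = [0.0]
    for s in signal:
        prefix.append(prefix[-1] + s * s)
    ratios = [0.0] * len(signal)
    for i in range(n_lta - 1, len(signal)):
        sta = (prefix[i + 1] - prefix[i + 1 - n_sta]) / n_sta
        lta = (prefix[i + 1] - prefix[i + 1 - n_lta]) / n_lta
        ratios[i] = sta / lta if lta > 0 else 0.0
    return ratios


def first_trigger(ratios, threshold):
    """Return the index of the first sample whose ratio exceeds the threshold."""
    for i, r in enumerate(ratios):
        if r > threshold:
            return i
    return None
```

The index returned by `first_trigger` would serve only as a coarse pick that locates a search window; the refinement and uncertainty estimation are the contribution of the article and are outside this sketch.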
Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Multithreaded MPI applications are gaining popularity in scientific and high-performance computing. While the combination of programming models is suited to support current parallel hardware, it moves threading models and their interaction with MPI into focus. With the advent of new threading libraries, the flexibility to select threading implementations of choice is becoming an important usability feature. Open MPI has traditionally avoided componentizing its threading model, relying on code inlining and static initialization to minimize potential impacts on runtime fast paths and synchronization. This paper describes the implementation of a generic threading runtime support in Open MPI using the Opal Modular Component Architecture. This architecture allows the programmer to select a threading library at compile- or run-time, providing both static initialization of threading primitives as well as dynamic instantiation of threading objects. In this work, we present the implementation, define required interfaces, and discuss trade-offs of dynamic and static initialization.
This report summarizes the work performed under the project "Linear Programming in Strongly Polynomial Time." Linear programming (LP) is a classic combinatorial optimization problem heavily used directly and as an enabling subroutine in integer programming (IP). Specifically, IP is the same as LP except that some solution variables must take integer values (e.g., to represent yes/no decisions). Together, LP and IP have many applications in resource allocation, including general logistics, and infrastructure design and vulnerability analysis. The project was motivated by the PI's recent success developing methods to efficiently sample Voronoi vertices (essentially finding nearest neighbors in high-dimensional point sets) in arbitrary dimension. His method seems applicable to exploring the high-dimensional convex feasible space of an LP problem. Although the project did not provably find a strongly-polynomial algorithm, it explored multiple algorithm classes. The new medial simplex algorithms may still lead to solvers with improved provable complexity. We describe medial simplex algorithms and some relevant structural/complexity results. We also designed a novel parallel LP algorithm based on our geometric insights and implemented it in the Spoke-LP code. A major part of the computational step is many independent vector dot products. Our parallel algorithm distributes the problem constraints across processors. Current commercial and high-quality free LP solvers require all problem details to fit onto a single processor or multicore. Our new algorithm might enable the solution of problems too large for any current LP solvers. We describe our new algorithm, give preliminary proof-of-concept experiments, and describe a new generator for arbitrarily large LP instances.
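The remark that the computational step consists of many independent dot products, with constraints distributed across processors, can be illustrated with a small sketch. This is not the Spoke-LP algorithm: it only shows, under the assumption of constraints in the form Ax <= b, how rows can be split into chunks whose slacks are computed independently. Threads stand in for processors, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(row, x):
    """Dot product of one constraint row with the candidate point."""
    return sum(a * v for a, v in zip(row, x))


def chunk_slacks(rows, rhs, x):
    """Slack b_i - a_i . x for each constraint in one chunk."""
    return [b - dot(row, x) for row, b in zip(rows, rhs)]


def distributed_slacks(A, b, x, n_workers=2):
    """Split the constraint rows into chunks and evaluate each chunk independently."""
    size = max(1, (len(A) + n_workers - 1) // n_workers)
    chunks = [(A[i:i + size], b[i:i + size]) for i in range(0, len(A), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves chunk order, so slacks line up with the rows of A.
        parts = list(pool.map(lambda c: chunk_slacks(c[0], c[1], x), chunks))
    return [s for part in parts for s in part]


def is_feasible(A, b, x, n_workers=2):
    """A point is feasible for Ax <= b when every slack is nonnegative."""
    return all(s >= 0 for s in distributed_slacks(A, b, x, n_workers))
```

Because each chunk needs only its own rows plus the current point, no single worker has to hold the full constraint set, which is the property the abstract highlights for very large instances.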
Adams, Brian M.; Bohnhoff, William J.; Dalbey, Keith R.; Ebeida, Mohamed S.; Eddy, John P.; Eldred, Michael S.; Hooper, Russell W.; Hough, Patricia D.; Hu, Kenneth T.; Jakeman, John D.; Khalil, Mohammad; Maupin, Kathryn A.; Monschke, Jason A.; Ridgway, Elliott M.; Rushdi, Ahmad; Seidl, Daniel T.; Stephens, John A.; Winokur, Justin G.
The Dakota toolkit provides a flexible and extensible interface between simulation codes and iterative analysis methods. Dakota contains algorithms for optimization with gradient and nongradient-based methods; uncertainty quantification with sampling, reliability, and stochastic expansion methods; parameter estimation with nonlinear least squares methods; and sensitivity/variance analysis with design of experiments and parameter study methods. These capabilities may be used on their own or as components within advanced strategies such as surrogate-based optimization, mixed integer nonlinear programming, or optimization under uncertainty. By employing object-oriented design to implement abstractions of the key components required for iterative systems analyses, the Dakota toolkit provides a flexible and extensible problem-solving environment for design and performance analysis of computational models on high performance computers. This report serves as a user’s manual for the Dakota software and provides capability overviews and procedures for software execution, as well as a variety of example studies.
Proceedings of IPDRM 2020: 4th Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
As network speeds increase, the overhead of processing incoming messages is becoming onerous enough that many manufacturers now provide network interface cards (NICs) with offload capabilities to handle these overheads. This increase in NIC capabilities creates an opportunity to enable computation on data in-situ on the NIC. These enhanced NICs can be classified into several different categories of SmartNICs. SmartNICs present an interesting opportunity for future runtime software designs. Designing runtime software to be located in the network as opposed to the host level leads to new radical distributed runtime possibilities that were not practical prior to SmartNICs. In the process of transitioning to a radically different runtime software design for SmartNICs there are intermediary steps of migrating current runtime software to be offloaded onto a SmartNIC that also present interesting possibilities. This paper will describe SmartNIC design and how SmartNICs can be leveraged to offload current generation runtime software and lead to future radically different in-network distributed runtime systems.
The Computer Science Research Institute (CSRI) brings university faculty and students to Sandia for focused collaborative research on Department of Energy (DOE) computer and computational science problems. The institute provides an opportunity for university researchers to learn about problems in computer and computational science at DOE laboratories. Participants conduct leading-edge research, interact with scientists and engineers at the laboratories, and help transfer results of their research to programs at the labs. Some specific CSRI research interest areas are: scalable solvers, optimization, adaptivity and mesh refinement, graph-based, discrete, and combinatorial algorithms, uncertainty estimation, mesh generation, dynamic load-balancing, virus and other malicious-code defense, visualization, scalable cluster computers, data-intensive computing, environments for scalable computing, parallel input/output, advanced architectures, and theoretical computer science. The CSRI Summer Program is organized by CSRI and typically includes the organization of a weekly seminar series and the publication of a summer proceedings. In 2020, the CSRI summer program was executed completely virtually; all student interns worked from home, due to the COVID-19 pandemic.
Proceedings of IA3 2020: 10th Workshop on Irregular Applications: Architectures and Algorithms, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Graph coloring is often used in parallelizing scientific computations that run in distributed and multi-GPU environments; it identifies sets of independent data that can be updated in parallel. Many algorithms exist for graph coloring on a single GPU or in distributed memory, but hybrid MPI+GPU algorithms have been unexplored until this work, to the best of our knowledge. We present several MPI+GPU coloring approaches that use implementations of the distributed coloring algorithms of Gebremedhin et al. and the shared-memory algorithms of Deveci et al. The on-node parallel coloring uses implementations in KokkosKernels, which provide parallelization for both multicore CPUs and GPUs. We further extend our approaches to solve for distance-2 coloring, giving the first known distributed and multi-GPU algorithm for this problem. In addition, we propose novel methods to reduce communication in distributed graph coloring. Our experiments show that our approaches operate efficiently on inputs too large to fit on a single GPU and scale up to graphs with 76.7 billion edges running on 128 GPUs.
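To show what "sets of independent data that can be updated in parallel" means, here is a minimal sequential greedy distance-1 coloring. It is a textbook baseline, not the distributed algorithm of Gebremedhin et al., the shared-memory algorithms of Deveci et al., or the KokkosKernels implementation; the adjacency-dictionary input and function names are assumptions made for this example.

```python
def greedy_coloring(adj):
    """Assign each vertex the smallest color not used by an already-colored neighbor.

    adj maps each vertex to an iterable of its neighbors (undirected graph).
    """
    colors = {}
    for v in adj:
        used = {colors[u] for u in adj[v] if u in colors}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors


def is_proper(adj, colors):
    """Check that no edge joins two vertices of the same color."""
    return all(colors[v] != colors[u] for v in adj for u in adj[v])


def independent_sets(colors):
    """Group vertices by color; each group can be updated in parallel."""
    groups = {}
    for v, c in colors.items():
        groups.setdefault(c, []).append(v)
    return groups
```

For a 4-cycle given as `{0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}`, this produces two color classes, `[0, 2]` and `[1, 3]`, so all vertices can be updated in two parallel sweeps.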