Distributed Memory Programming and Multi-GPU Support with Kokkos
International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Arm processors have been explored in HPC for several years; however, there has not yet been a demonstration of viability for supporting large-scale production workloads. In this paper, we offer a retrospective on the process of bringing up Astra, the first petascale supercomputer based on 64-bit Arm processors, and validating its ability to run production HPC applications. Through this process, several gaps in the immature technology were addressed, including software stack enablement, Linux bugs at scale, thermal management issues, power management capabilities, and advanced container support. From this experience, we formulate several lessons learned that contributed to the successful deployment of Astra. These insights can help accelerate the deployment and maturation of other first-of-their-kind HPC technologies. With Astra now supporting many users running a diverse set of production applications at multi-thousand-node scales, we believe this constitutes strong evidence that Arm is a viable technology for even the largest-scale supercomputer deployments.
Research shows that individuals often overestimate their knowledge and performance without realizing they have done so, which can lead to faulty technical outcomes. This phenomenon is known as the Dunning-Kruger effect (Kruger & Dunning, 1999). This research sought to determine if some individuals were more prone to overestimating their performance due to underlying personality and cognitive characteristics. To test our hypothesis, we first collected individual difference measures. Next, we asked participants to estimate their performance on three performance tasks to assess the likelihood of overestimation. We found that some individuals may be more prone to overestimating their performance than others, and that faulty problem-solving abilities and low skill may be to blame. Encouraging individuals to think critically through all options and to consult with others before making a high-consequence decision may reduce overestimation.
Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Multithreaded MPI applications are gaining popularity in scientific and high-performance computing. While the combination of programming models is well suited to current parallel hardware, it brings threading models and their interaction with MPI into focus. With the advent of new threading libraries, the flexibility to select a threading implementation of choice is becoming an important usability feature. Open MPI has traditionally avoided componentizing its threading model, relying on code inlining and static initialization to minimize potential impacts on runtime fast paths and synchronization. This paper describes the implementation of generic threading runtime support in Open MPI using the Opal Modular Component Architecture. This architecture allows the programmer to select a threading library at compile- or run-time, providing both static initialization of threading primitives and dynamic instantiation of threading objects. In this work, we present the implementation, define the required interfaces, and discuss trade-offs of dynamic and static initialization.
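For context on the usability question this work addresses, the following is a minimal sketch of the application-facing side of multithreaded MPI, assuming Python with mpi4py: it requests MPI_THREAD_MULTIPLE at initialization, verifies the provided level, and lets several threads communicate concurrently. It does not use or depict Open MPI's internal Opal MCA threading interfaces.

```python
# Minimal multithreaded-MPI sketch (mpi4py), illustrative only.
import threading

import mpi4py
mpi4py.rc.thread_level = "multiple"   # ask for MPI_THREAD_MULTIPLE at init
from mpi4py import MPI

comm = MPI.COMM_WORLD
assert MPI.Query_thread() == MPI.THREAD_MULTIPLE, "THREAD_MULTIPLE unavailable"

def worker(tag):
    # Each thread does its own point-to-point exchange, keyed by tag.
    if comm.Get_rank() == 0:
        for src in range(1, comm.Get_size()):
            comm.recv(source=src, tag=tag)
    else:
        comm.send(comm.Get_rank(), dest=0, tag=tag)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```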
The Computer Science Research Institute (CSRI) brings university faculty and students to Sandia for focused collaborative research on Department of Energy (DOE) computer and computational science problems. The institute provides an opportunity for university researchers to learn about problems in computer and computational science at DOE laboratories. Participants conduct leading-edge research, interact with scientists and engineers at the laboratories, and help transfer the results of their research to programs at the labs. Specific CSRI research interest areas include: scalable solvers; optimization; adaptivity and mesh refinement; graph-based, discrete, and combinatorial algorithms; uncertainty estimation; mesh generation; dynamic load balancing; virus and other malicious-code defense; visualization; scalable cluster computers; data-intensive computing; environments for scalable computing; parallel input/output; advanced architectures; and theoretical computer science. The CSRI Summer Program typically includes a weekly seminar series and the publication of a summer proceedings volume. In 2020, due to the COVID-19 pandemic, the program was run entirely virtually, with all student interns working from home.
This report summarizes the work performed under the project "Linear Programming in Strongly Polynomial Time." Linear programming (LP) is a classic combinatorial optimization problem used heavily both directly and as an enabling subroutine in integer programming (IP). Specifically, IP is the same as LP except that some solution variables must take integer values (e.g., to represent yes/no decisions). Together, LP and IP have many applications in resource allocation, including general logistics and infrastructure design and vulnerability analysis. The project was motivated by the PI's recent success in developing methods to efficiently sample Voronoi vertices (essentially finding nearest neighbors in high-dimensional point sets) in arbitrary dimension. His method seems applicable to exploring the high-dimensional convex feasible space of an LP problem. Although the project did not produce a provably strongly-polynomial algorithm, it explored multiple algorithm classes. The new medial simplex algorithms may still lead to solvers with improved provable complexity. We describe the medial simplex algorithms and some relevant structural and complexity results. We also designed a novel parallel LP algorithm based on our geometric insights and implemented it in the Spoke-LP code. A major part of the computational step consists of many independent vector dot products, and our parallel algorithm distributes the problem constraints across processors. Current commercial and high-quality free LP solvers require all problem details to fit onto a single processor or multicore node. Our new algorithm might enable the solution of problems too large for any current LP solver. We describe the new algorithm, give preliminary proof-of-concept experiments, and describe a new generator for arbitrarily large LP instances.
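To make the computational kernel concrete, the sketch below (NumPy, with illustrative names and partitioning only, not the Spoke-LP implementation) shows the row-wise distribution of LP constraints: each worker owns a block of rows of the constraint matrix and evaluates its dot products a_i · x independently, after which only small per-block summaries need to be combined.

```python
# Illustrative sketch: distributing LP constraint rows so the per-iteration
# dot products a_i . x are independent work units. Not the Spoke-LP code.
import numpy as np

rng = np.random.default_rng(0)
m, n, workers = 10_000, 50, 4            # constraints, variables, "processors"
A = rng.standard_normal((m, n))          # constraint rows a_i
b = rng.standard_normal(m) + 10.0        # right-hand sides of a_i . x <= b_i
x = np.zeros(n)                          # current iterate

blocks = np.array_split(np.arange(m), workers)   # block row distribution

def local_slacks(block, x):
    # Each worker touches only its own rows: independent dot products.
    return b[block] - A[block] @ x

# One "iteration": combine per-worker results and pick the most violated row.
slacks = np.concatenate([local_slacks(blk, x) for blk in blocks])
worst = int(np.argmin(slacks))
print(f"most violated constraint: row {worst}, slack = {slacks[worst]:.3f}")
```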
Proceedings of MCHPC 2020: Workshop on Memory Centric High Performance Computing, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Many-core systems are beginning to feature a novel large, high-bandwidth intermediate memory as a visible part of the memory hierarchy. This paper discusses how to make use of intermediate memory when composing matrix multiply with transpose to compute $A \cdot A^{T}$. We re-purpose the cache-oblivious approach developed by Frigo et al. and apply it to the composition of a bandwidth-bound kernel (transpose) with a compute-bound kernel (matrix multiply). Particular focus is on matrix shapes far from square that are not usually considered. Our codes are simpler than optimized codes but reasonably close in performance. Perhaps more important, though, is developing a paradigm for constructing other codes that use intermediate memories.
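As a rough illustration of the recursive decomposition (not the paper's codes or their intermediate-memory staging), the NumPy sketch below computes $C = A \cdot A^{T}$ cache-obliviously by splitting the larger matrix dimension until blocks fit a notional fast memory, exploiting the symmetry of the result; the leaf size and the direct off-diagonal product are assumptions made for brevity.

```python
# Cache-oblivious-style sketch for C = A @ A.T via recursive splitting.
import numpy as np

LEAF = 64  # block edge assumed to fit the fast/intermediate memory

def aat(A, C):
    """Accumulate A @ A.T into C by recursive divide and conquer."""
    m, k = A.shape
    if m <= LEAF and k <= LEAF:
        C += A @ A.T                      # small leaf: direct product
    elif m >= k:
        h = m // 2                        # split rows -> four blocks of C
        aat(A[:h], C[:h, :h])
        aat(A[h:], C[h:, h:])
        C[:h, h:] += A[:h] @ A[h:].T      # off-diagonal block (direct here)
        C[h:, :h] = C[:h, h:].T           # symmetry of A @ A.T
    else:
        h = k // 2                        # split the inner dimension
        aat(A[:, :h], C)
        aat(A[:, h:], C)

A = np.random.default_rng(1).standard_normal((500, 120))   # far from square
C = np.zeros((A.shape[0], A.shape[0]))
aat(A, C)
assert np.allclose(C, A @ A.T)
```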
Proceedings of PMBS 2020: Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Performance variation deriving from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impact at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict; modeling assumptions that are stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its implementation in MPI, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation, carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.
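The toy Monte Carlo below illustrates the core observation behind the approach: when a distributed workload synchronizes, each step costs the maximum of the per-rank interval lengths, so modest per-process variation is amplified as the rank count grows. The lognormal noise model and its parameters are assumptions for exposition, not the paper's measured distributions.

```python
# Illustrative only: synchronized steps cost the MAX over ranks.
import numpy as np

rng = np.random.default_rng(42)
steps, base = 1_000, 1.0                  # steps and nominal interval length

for nranks in (1, 64, 4096):
    # Per-rank interval lengths with mild lognormal "noise".
    intervals = base * rng.lognormal(mean=0.0, sigma=0.1, size=(steps, nranks))
    per_step = intervals.max(axis=1)      # synchronization takes the maximum
    overhead = per_step.mean() / base - 1.0
    print(f"{nranks:5d} ranks: mean step {per_step.mean():.3f} "
          f"({overhead * 100:.1f}% above nominal)")
```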
Results in Applied Mathematics
In this paper, we present an optimization-based coupling method for local and nonlocal continuum models. Our approach casts the coupling of the models as a control problem in which the states are the solutions of the nonlocal and local equations, the objective is to minimize their mismatch on the overlap of the local and nonlocal problem domains, and the virtual controls are the nonlocal volume constraint and the local boundary condition. We present the method in the context of Local-to-Nonlocal diffusion coupling. Numerical examples illustrate the theoretical properties of the approach.
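Schematically, and with illustrative notation not taken from the paper, the control problem for the diffusion setting can be written as follows, where $\Omega_o$ is the overlap region, $\mathcal{L}_n$ the nonlocal diffusion operator with interaction volume $\Gamma_n$, and the virtual controls $v_n$ and $v_l$ supply the nonlocal volume constraint and the local boundary condition:

```latex
% Schematic optimization-based local-to-nonlocal coupling (notation illustrative).
\begin{equation*}
  \min_{v_n,\,v_l}\ \tfrac{1}{2}\,\bigl\| u_n - u_l \bigr\|_{L^2(\Omega_o)}^{2}
  \quad\text{subject to}\quad
  \begin{cases}
    -\mathcal{L}_n u_n = f_n \ \text{in } \Omega_n, & u_n = v_n \ \text{on } \Gamma_n,\\[2pt]
    -\Delta u_l = f_l \ \text{in } \Omega_l, & u_l = v_l \ \text{on } \partial\Omega_l .
  \end{cases}
\end{equation*}
```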
Journal of Applied Physics
The lack of a reliable method to evaluate the convergence of molecular dynamics simulations has contributed to discrepancies in different areas of molecular dynamics. In the present work, the method of information entropy is introduced to molecular dynamics for stationarity assessment. The Shannon information entropy formalism is used to monitor the convergence of the atom motion to a steady state in a continuous spatial domain, and it is also used to assess the stationarity of calculated multidimensional fields, such as the temperature field, in a discrete spatial domain. It is demonstrated in this work that monitoring the information entropy of the atom position matrix provides a clear indicator of reaching steady state in radiation damage simulations, non-equilibrium molecular dynamics thermal conductivity computations, and simulations of Poiseuille and Couette flow in nanochannels. A main advantage of the present technique is that it is non-local and relies on fundamental quantities available in all molecular dynamics simulations. Unlike monitoring average temperature, the technique is applicable to simulations that conserve total energy, such as reverse non-equilibrium molecular dynamics thermal conductivity computations, and to simulations where energy dissipates through a boundary, as in radiation damage simulations. The method is applied to simulations of iron using the Tersoff/ZBL splined potential, silicon using the Stillinger-Weber potential, and a Lennard-Jones fluid. Its applicability to both solids and fluids shows that the technique has potential for generalization to other areas of molecular dynamics.
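The sketch below illustrates the underlying quantity with the standard Shannon entropy $H = -\sum_i p_i \ln p_i$ applied to a binned atom-position distribution: as a toy trajectory relaxes toward a steady spatial distribution, H plateaus. The binning, toy dynamics, and plateau test are illustrative choices, not the paper's procedure.

```python
# Monitoring Shannon entropy of a binned position distribution (toy example).
import numpy as np

def shannon_entropy(positions, bins=20, box=1.0):
    """Entropy (natural log) of the 3D position histogram."""
    hist, _ = np.histogramdd(positions, bins=bins, range=[(0.0, box)] * 3)
    p = hist.ravel() / hist.sum()
    p = p[p > 0.0]                        # 0 * log 0 contributes nothing
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(3)
pos = 0.25 + 0.05 * rng.random((5_000, 3))            # clustered initial state
history = []
for _ in range(40):                                    # toy "relaxation"
    pos = (pos + 0.05 * rng.standard_normal(pos.shape)) % 1.0
    history.append(shannon_entropy(pos))

# Crude stationarity check: relative spread of H over the last few frames.
tail = np.array(history[-5:])
print("converged" if np.ptp(tail) / tail.mean() < 0.01 else "still evolving")
```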
This report summarizes work done under the Verification, Validation, and Uncertainty Quantification (VVUQ) thrust area of the North American Energy Resilience Model (NAERM) Program. The specific task of interest described in this report is focused on sensitivity analysis of scenarios involving failures of both wind turbines and thermal generators under extreme cold-weather temperature conditions as would be observed in a Polar Vortex event.
Physics of Plasmas
Understanding the effects of contaminant plasmas generated within the Z machine at Sandia is critical to explaining current loss mechanisms. The plasmas are generated at the accelerator electrode surfaces and include desorbed species found in the surface and substrate of the walls; these desorbed species can become ionized. The timing and location of contaminant species desorbed from the wall surface depend nonlinearly on the local surface temperature. For accurate modeling, it is therefore necessary to use wall-heating models to estimate the amount and timing of material desorption. One of these heating mechanisms is Joule heating. We propose several extended semi-analytic magnetic diffusion heating models for computing surface Joule heating and demonstrate their effects for several representative current histories. We quantitatively assess under what circumstances these extensions to the classical formulas may provide a validatable improvement to the understanding of contaminant desorption timing.
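For orientation only, the classical relations that such heating models build on are the magnetic diffusion equation in a conductor of conductivity $\sigma$ and the associated volumetric Joule heating rate; the paper's extended semi-analytic models are not reproduced here.

```latex
% Classical starting point (not the extended models of the paper):
% magnetic diffusion into a conducting wall and its Joule heating rate.
\begin{align*}
  \frac{\partial \mathbf{B}}{\partial t} &= \frac{1}{\mu_0 \sigma}\,\nabla^2 \mathbf{B},
  &
  q_{\mathrm{Joule}} &= \frac{|\mathbf{J}|^{2}}{\sigma}
    = \frac{\left|\nabla \times \mathbf{B}\right|^{2}}{\mu_0^{2}\,\sigma}.
\end{align*}
```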
Computer Methods in Applied Mechanics and Engineering
This work presents the design of nonlinear stabilization techniques for the finite element discretization of the Euler equations in both steady and transient form; implicit time integration is used in the transient case. A differentiable, local-bounds-preserving method has been developed that combines a Rusanov artificial diffusion operator with a differentiable shock detector. Nonlinear stabilization schemes are usually stiff and highly nonlinear; this issue is mitigated by the differentiability properties of the proposed method. Moreover, in order to further improve nonlinear convergence, we also propose a continuation method for a subset of the stabilization parameters. The resulting method has been successfully applied to steady and transient problems with complex shock patterns. Numerical experiments show that it is able to provide sharp and well-resolved shocks. The importance of differentiability is assessed by comparing the new scheme with its non-differentiable counterpart. Numerical experiments suggest that, for up to moderate nonlinear tolerances, the method exhibits improved robustness and nonlinear convergence behavior for steady problems. For transient problems, we also observe a reduction in computational cost.
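For reference, the dissipation being adapted is that of the classical Rusanov (local Lax-Friedrichs) flux shown below, where $\lambda_{\max}$ denotes the largest local wave speed of the Euler system; the paper's differentiable shock detector and local bounds preservation are not reproduced here.

```latex
% Classical Rusanov flux; its dissipation term is the prototype of the
% artificial diffusion operator referenced in the abstract.
\begin{equation*}
  \widehat{\mathbf{F}}(\mathbf{u}_L,\mathbf{u}_R)
  = \tfrac{1}{2}\bigl(\mathbf{F}(\mathbf{u}_L) + \mathbf{F}(\mathbf{u}_R)\bigr)
  - \tfrac{1}{2}\,\max\bigl(\lambda_{\max}(\mathbf{u}_L),\,\lambda_{\max}(\mathbf{u}_R)\bigr)
    \bigl(\mathbf{u}_R - \mathbf{u}_L\bigr).
\end{equation*}
```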
This report is an institutional record of experiments conducted to explore the performance of a vendor installation of CephFS on the SNL stria cluster. Comparisons among CephFS, the Lustre parallel file system, and NFS were made using the IOR and MDTEST benchmarking tools, a test program that uses the SEACAS/Trilinos IOSS library, and the checkpointing activity performed by the LAMMPS molecular dynamics simulation.
A technique called the splice method for coupling local to peridynamic subregions of a body is described. The method relies on ghost nodes, whose values of displacement are interpolated from nearby physical nodes, to make each subregion visible to the other. In each time step, the nodes in each subregion treat the nodes in the other subregion as boundary conditions. Adaptively changing the subregions is possible through the creation and deletion of ghost nodes. Example problems in 2D and 3D illustrate how the method is used to perform multiscale modeling of fracture and impact events within a larger structure.
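The time-stepping pattern described above can be sketched in a few lines of Python; the nearest-neighbor interpolation, data layout, and the two subregion integrators passed in as callables are placeholders, not the report's implementation.

```python
# Schematic splice-coupling step: refresh ghost nodes, then advance each
# subregion with the other's state frozen as a boundary condition.
import numpy as np

def interpolate(ghost_x, donor_x, donor_u):
    """Nearest-neighbor interpolation of displacement onto ghost nodes
    (a real code would use a proper interpolation stencil)."""
    d = np.linalg.norm(ghost_x[:, None, :] - donor_x[None, :, :], axis=2)
    return donor_u[d.argmin(axis=1)]

def splice_step(local, peri, dt, advance_local, advance_peri):
    """One coupled step; `local` and `peri` are dicts holding node
    coordinates `x`, displacements `u`, and ghost-node arrays."""
    # 1) Each subregion sees the other only through its ghost nodes.
    local["ghost_u"] = interpolate(local["ghost_x"], peri["x"], peri["u"])
    peri["ghost_u"] = interpolate(peri["ghost_x"], local["x"], local["u"])
    # 2) Advance independently, ghosts acting as boundary conditions.
    advance_local(local, dt)    # classical (local) update
    advance_peri(peri, dt)      # peridynamic update
```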
Quantum Science and Technology
PyGSTi is a Python software package for assessing and characterizing the performance of quantum computing processors. It can be used as a standalone application or as a library to perform a wide variety of quantum characterization, verification, and validation (QCVV) protocols on as-built quantum processors. Using multiple examples, we outline pyGSTi's structure and what it can do. We cover its main characterization protocols with end-to-end implementations, including gate set tomography, randomized benchmarking on one or many qubits, and several specialized techniques. We also discuss and demonstrate how power users can customize pyGSTi and leverage its components to create specialized QCVV protocols and solve user-specific problems.
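As a flavor of one protocol family the package covers, the snippet below illustrates the generic randomized-benchmarking analysis: fitting the survival-probability decay $P(m) = A\,p^{m} + B$ over sequence length m and converting the decay into an average error rate. It uses only NumPy/SciPy on simulated data and deliberately does not depict pyGSTi's own API.

```python
# Generic RB decay fit on simulated data (concept illustration, not pyGSTi).
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(7)
lengths = np.array([1, 2, 4, 8, 16, 32, 64, 128])
true_A, true_B, true_p = 0.5, 0.5, 0.985
probs = true_A * true_p**lengths + true_B
data = rng.binomial(1000, probs) / 1000.0      # 1000 shots per length

def rb_decay(m, A, B, p):
    return A * p**m + B

(A, B, p), _ = curve_fit(rb_decay, lengths, data, p0=(0.5, 0.5, 0.99))
d = 2                                          # single-qubit dimension
r = (d - 1) / d * (1.0 - p)                    # standard RB average error rate
print(f"decay p = {p:.4f}, average error per Clifford r = {r:.2e}")
```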