Publications

Results 1101–1150 of 9,998

Search results

Cache Oblivious Strategies to Exploit Multi-Level Memory on Manycore Systems

Proceedings of MCHPC 2020: Workshop on Memory Centric High Performance Computing, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Butcher, Neil A.; Olivier, Stephen L.; Kogge, Peter M.

Many-core systems are beginning to feature novel large, high-bandwidth intermediate memory as a visible part of the memory hierarchy. This paper discusses how to make use of intermediate memory when composing matrix multiply with transpose to compute $A A^T$. We re-purpose the cache-oblivious approach developed by Frigo et al. and apply it to the composition of a bandwidth-bound kernel (transpose) with a compute-bound kernel (matrix multiply). Particular focus is on regions of matrix shapes far from square that are not usually considered. Our codes are simpler than optimized codes, but reasonably close in performance. Perhaps more important is the development of a paradigm for constructing other codes that use intermediate memories.

Distributed Memory Graph Coloring Algorithms for Multiple GPUs

Proceedings of IA3 2020: 10th Workshop on Irregular Applications: Architectures and Algorithms, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Bogle, Ian; Boman, Erik G.; Devine, Karen; Rajamanickam, Sivasankaran; Slota, George M.

Graph coloring is often used in parallelizing scientific computations that run in distributed and multi-GPU environments; it identifies sets of independent data that can be updated in parallel. Many algorithms exist for graph coloring on a single GPU or in distributed memory, but hybrid MPI+GPU algorithms have been unexplored until this work, to the best of our knowledge. We present several MPI+GPU coloring approaches that use implementations of the distributed coloring algorithms of Gebremedhin et al. and the shared-memory algorithms of Deveci et al. The on-node parallel coloring uses implementations in KokkosKernels, which provide parallelization for both multicore CPUs and GPUs. We further extend our approaches to solve for distance-2 coloring, giving the first known distributed and multi-GPU algorithm for this problem. In addition, we propose novel methods to reduce communication in distributed graph coloring. Our experiments show that our approaches operate efficiently on inputs too large to fit on a single GPU and scale up to graphs with 76.7 billion edges running on 128 GPUs.

Interface Flux Recovery coupling method for the ocean–atmosphere system

Results in Applied Mathematics

Sockwell, Kenneth C.; Peterson, Kara J.; Kuberry, Paul; Bochev, Pavel; Trask, Nathaniel A.

Component coupling is a crucial part of climate models, such as DOE's E3SM (Caldwell et al., 2019). A common coupling strategy in climate models is for their components to exchange flux data from the previous time-step. This approach effectively performs a single step of an iterative solution method for the monolithic coupled system, which may lead to instabilities and loss of accuracy. In this paper we formulate an Interface-Flux-Recovery (IFR) coupling method which improves upon the conventional coupling techniques in climate models. IFR starts from a monolithic formulation of the coupled discrete problem and then uses a Schur complement to obtain an accurate approximation of the flux across the interface between the model components. This decouples the individual components and allows one to solve them independently by using schemes that are optimized for each component. To demonstrate the feasibility of the method, we apply IFR to a simplified ocean–atmosphere model for heat-exchange coupled through the so-called bulk condition, common in ocean–atmosphere systems. We then solve this model on matching and non-matching grids to estimate numerically the convergence rates of the IFR coupling scheme.

Radd runtimes: Radical and different distributed runtimes with smartnics

Proceedings of IPDRM 2020: 4th Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Grant, Ryan; Schonbein, Whit; Levy, Scott

As network speeds increase, the overhead of processing incoming messages is becoming onerous enough that many manufacturers now provide network interface cards (NICs) with offload capabilities to handle these overheads. This increase in NIC capabilities creates an opportunity to enable computation on data in-situ on the NIC. These enhanced NICs can be classified into several different categories of SmartNICs. SmartNICs present an interesting opportunity for future runtime software designs. Designing runtime software to be located in the network as opposed to the host level leads to new radical distributed runtime possibilities that were not practical prior to SmartNICs. In the process of transitioning to a radically different runtime software design for SmartNICs there are intermediary steps of migrating current runtime software to be offloaded onto a SmartNIC that also present interesting possibilities. This paper will describe SmartNIC design and how SmartNICs can be leveraged to offload current generation runtime software and lead to future radically different in-network distributed runtime systems.

Extrapolation of thermal conductivity in non-equilibrium molecular dynamics simulations to bulk scale

International Communications in Heat and Mass Transfer

Talaat, Khaled; El-Genk, Mohamed S.; Cowen, Benjamin

Predictions of the bulk-scale thermal conductivity of solids using non-equilibrium molecular dynamics (MD) simulations have relied on the linear extrapolation of the thermal resistivity versus the reciprocal of the system length in the simulations. Several studies have reported deviation of the extrapolation from linearity near the micro-scale, raising a concern about its applicability to large systems. To investigate this issue, the present work conducted extensive MD simulations of silicon with two different potentials (EDIP and Tersoff-II) for unprecedented length scales up to 10.3 μm and simulation times up to 530 ns. For large systems ≥0.35 μm in size, the non-linearity of the extrapolation of the reciprocal of the thermal conductivity is mostly due to ignoring the dependence of the thermal conductivity on temperature. To account for this dependence, the present analysis fixes the temperature range for determining the gradient used to calculate the thermal conductivity values. However, short systems ≤0.23 μm in size show significant non-linearity in the thermal conductivity values calculated using a temperature window of 500 ± 10 K from the simulation results with the EDIP potential. Since these system sizes are shorter than the phonon mean free path in EDIP (~0.22 μm), the non-linearity may be attributed to phonon transport. For the MD simulations with the Tersoff-II potential, there is no significant non-linearity in the calculated thermal conductivity values for systems ranging in size from 0.05 to 5.4 μm.

Novel Geometric Operations for Linear Programming

Ebeida, Mohamed S.; Abdelkader, Ahmed; Amenta, Nina; Kouri, Drew P.; Parekh, Ojas D.; Phillips, Cynthia A.; Winovich, Nickolas

This report summarizes the work performed under the project "Linear Programming in Strongly Polynomial Time." Linear programming (LP) is a classic combinatorial optimization problem heavily used directly and as an enabling subroutine in integer programming (IP). Specifically, IP is the same as LP except that some solution variables must take integer values (e.g., to represent yes/no decisions). Together, LP and IP have many applications in resource allocation, including general logistics, infrastructure design, and vulnerability analysis. The project was motivated by the PI's recent success developing methods to efficiently sample Voronoi vertices (essentially finding nearest neighbors in high-dimensional point sets) in arbitrary dimension. His method seems applicable to exploring the high-dimensional convex feasible space of an LP problem. Although the project did not provably find a strongly-polynomial algorithm, it explored multiple algorithm classes. The new medial simplex algorithms may still lead to solvers with improved provable complexity. We describe medial simplex algorithms and some relevant structural/complexity results. We also designed a novel parallel LP algorithm based on our geometric insights and implemented it in the Spoke-LP code. A major part of the computational step consists of many independent vector dot products. Our parallel algorithm distributes the problem constraints across processors. Current commercial and high-quality free LP solvers require all problem details to fit onto a single processor or multicore. Our new algorithm might enable the solution of problems too large for any current LP solvers. We describe our new algorithm, give preliminary proof-of-concept experiments, and describe a new generator for arbitrarily large LP instances.

CSRI Summer Proceedings 2020

Rushdi, Ahmad

The Computer Science Research Institute (CSRI) brings university faculty and students to Sandia for focused collaborative research on Department of Energy (DOE) computer and computational science problems. The institute provides an opportunity for university researchers to learn about problems in computer and computational science at DOE laboratories. Participants conduct leading-edge research, interact with scientists and engineers at the laboratories, and help transfer results of their research to programs at the labs. Some specific CSRI research interest areas are: scalable solvers, optimization, adaptivity and mesh refinement, graph-based, discrete, and combinatorial algorithms, uncertainty estimation, mesh generation, dynamic load-balancing, virus and other malicious-code defense, visualization, scalable cluster computers, data-intensive computing, environments for scalable computing, parallel input/output, advanced architectures, and theoretical computer science. The CSRI Summer Program is organized by CSRI and typically includes the organization of a weekly seminar series and the publication of a summer proceedings. In 2020, the CSRI summer program was executed completely virtually; all student interns worked from home, due to the COVID-19 pandemic.

Dakota, A Multilevel Parallel Object-Oriented Framework for Design Optimization, Parameter Estimation, Uncertainty Quantification, and Sensitivity Analysis: Version 6.13 User's Manual

Adams, Brian M.; Bohnhoff, William J.; Dalbey, Keith R.; Ebeida, Mohamed S.; Eddy, John P.; Eldred, Michael S.; Hooper, Russell W.; Hough, Patricia D.; Hu, Kenneth T.; Jakeman, John D.; Khalil, Mohammad; Maupin, Kathryn A.; Monschke, Jason A.; Ridgway, Elliott M.; Rushdi, Ahmad; Seidl, Daniel T.; Stephens, John A.; Winokur, Justin G.

The Dakota toolkit provides a flexible and extensible interface between simulation codes and iterative analysis methods. Dakota contains algorithms for optimization with gradient and nongradient-based methods; uncertainty quantification with sampling, reliability, and stochastic expansion methods; parameter estimation with nonlinear least squares methods; and sensitivity/variance analysis with design of experiments and parameter study methods. These capabilities may be used on their own or as components within advanced strategies such as surrogate-based optimization, mixed integer nonlinear programming, or optimization under uncertainty. By employing object-oriented design to implement abstractions of the key components required for iterative systems analyses, the Dakota toolkit provides a flexible and extensible problem-solving environment for design and performance analysis of computational models on high performance computers. This report serves as a user’s manual for the Dakota software and provides capability overviews and procedures for software execution, as well as a variety of example studies.

A performance-portable nonhydrostatic atmospheric dycore for the energy exascale earth system model running at cloud-resolving resolutions

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Bertagna, Luca; Guba, Oksana; Taylor, Mark A.; Foucar, James G.; Larkin, Jeff; Bradley, Andrew M.; Rajamanickam, Sivasankaran; Salinger, Andrew G.

We present an effort to port the nonhydrostatic atmosphere dynamical core of the Energy Exascale Earth System Model (E3SM) to efficiently run on a variety of architectures, including conventional CPU, many-core CPU, and GPU. We specifically target cloud-resolving resolutions of 3 km and 1 km. To express on-node parallelism we use the C++ library Kokkos, which allows us to achieve a performance portable code in a largely architecture-independent way. Our C++ implementation is at least as fast as the original Fortran implementation on IBM Power9 and Intel Knights Landing processors, proving that the code refactor did not compromise the efficiency on CPU architectures. On the other hand, when using the GPUs, our implementation is able to achieve 0.97 Simulated Years Per Day, running on the full Summit supercomputer. To the best of our knowledge, this is the most achieved to date by any global atmosphere dynamical core running at such resolutions.

Chronicles of astra: Challenges and lessons from the first petascale arm supercomputer

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Bays, Nathan R.; Younge, Andrew J.; Hammond, Simon; Curry, Matthew; Aguilar, Michael J.; Hoekstra, Robert J.; Brightwell, Ronald B.

Arm processors have been explored in HPC for several years; however, there has not yet been a demonstration of viability for supporting large-scale production workloads. In this paper, we offer a retrospective on the process of bringing up Astra, the first Petascale supercomputer based on 64-bit Arm processors, and validating its ability to run production HPC applications. Through this process, several technology gaps arising from immaturity were addressed, including software stack enablement, Linux bugs at scale, thermal management issues, power management capabilities, and advanced container support. From this experience, we formulate several lessons learned that contributed to the successful deployment of Astra. These insights can help accelerate the deployment and maturation of other first-seen HPC technologies. With Astra now supporting many users running a diverse set of production applications at multi-thousand node scales, we believe this constitutes strong supporting evidence that Arm is a viable technology for even the largest-scale supercomputer deployments.

Formulation, analysis and computation of an optimization-based local-to-nonlocal coupling method

Results in Applied Mathematics

Bochev, Pavel; D'Elia, Marta

In this paper, we present an optimization-based coupling method for local and nonlocal continuum models. Our approach couches the coupling of the models into a control problem where the states are the solutions of the nonlocal and local equations, the objective is to minimize their mismatch on the overlap of the local and nonlocal problem domains, and the virtual controls are the nonlocal volume constraint and the local boundary condition. We present the method in the context of Local-to-Nonlocal diffusion coupling. Numerical examples illustrate the theoretical properties of the approach.

Method of information entropy for convergence assessment of molecular dynamics simulations

Journal of Applied Physics

Talaat, Khaled; Cowen, Benjamin; Anderoglu, Osman

The lack of a reliable method to evaluate the convergence of molecular dynamics simulations has contributed to discrepancies in different areas of molecular dynamics. In the present work, the method of information entropy is introduced to molecular dynamics for stationarity assessment. The Shannon information entropy formalism is used to monitor the convergence of the atom motion to a steady state in a continuous spatial domain and is also used to assess the stationarity of calculated multidimensional fields such as the temperature field in a discrete spatial domain. It is demonstrated in this work that monitoring the information entropy of the atom position matrix provides a clear indicator of reaching steady state in radiation damage simulations, non-equilibrium molecular dynamics thermal conductivity computations, and simulations of Poiseuille and Couette flow in nanochannels. A main advantage of the present technique is that it is non-local and relies on fundamental quantities available in all molecular dynamics simulations. Unlike monitoring average temperature, the technique is applicable to simulations that conserve total energy such as reverse non-equilibrium molecular dynamics thermal conductivity computations and to simulations where energy dissipates through a boundary as in radiation damage simulations. The method is applied to simulations of iron using the Tersoff/ZBL splined potential, silicon using the Stillinger-Weber potential, and to Lennard-Jones fluid. Its applicability to both solids and fluids shows that the technique has potential for generalization to other areas in molecular dynamics.

On differentiable local bounds preserving stabilization for Euler equations

Computer Methods in Applied Mechanics and Engineering

Shadid, John N.

This work presents the design of nonlinear stabilization techniques for the finite element discretization of Euler equations in both steady and transient form. Implicit time integration is used in the case of the transient form. A differentiable local bounds preserving method has been developed, which combines a Rusanov artificial diffusion operator and a differentiable shock detector. Nonlinear stabilization schemes are usually stiff and highly nonlinear. This issue is mitigated by the differentiability properties of the proposed method. Moreover, in order to further improve the nonlinear convergence, we also propose a continuation method for a subset of the stabilization parameters. The resulting method has been successfully applied to steady and transient problems with complex shock patterns. Numerical experiments show that it is able to provide sharp and well-resolved shocks. The importance of the differentiability is assessed by comparing the new scheme with its non-differentiable counterpart. Numerical experiments suggest that, for up to moderate nonlinear tolerances, the method exhibits improved robustness and nonlinear convergence behavior for steady problems. In the case of transient problems, we also observe a reduction in the computational cost.
