A Communication- and Memory-Aware Model for Load Balancing Tasks
Abstract not provided.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Contact mechanics, or the modeling of the impenetrability of solid objects, is fundamental to computational solid mechanics (CSM) applications yet is oftentimes the most challenging in terms of computational efficiency and performance. These challenges arise from the irregularity and highly dynamic nature of contact simulation, particularly with algorithms designed for distributed memory architectures. First among these challenges is the inherent load imbalance when distributing contact load across compute nodes. This imbalance is highly problem dependent, and relates to the surface area of contact manifolds and the volume around them, rather than the distribution of the mesh over compute nodes, meaning the application load can vary drastically over different phases. The dynamic nature of contact problems motivates the use of distributed asynchronous many-tasking (AMT) frameworks to efficiently handle irregular workloads. In this paper, we present our work on distBVH, a distributed contact solution using the DARMA/vt library for asynchronous tasking that is also capable of running on-node Kokkos-based kernels. We explore how distBVH addresses the various challenges of CSM contact problems. We evaluate the use of many of DARMA/vt’s dynamic load balancers and demonstrate how our load balancing approach can provide significant performance improvements on various computational solid mechanics benchmarks. Additionally, we show how our approach can take advantage of DARMA/vt for tasking and efficient on-node kernels using Kokkos to scale over hundreds of processing elements.
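As a small illustration of the load imbalance discussed in this abstract, the sketch below computes one common imbalance metric, the maximum per-rank load divided by the mean, minus one. The function name `load_imbalance` and the per-rank numbers are hypothetical and are not taken from distBVH or DARMA/vt.

```python
from statistics import mean


def load_imbalance(rank_loads):
    """Return max/mean - 1 for a list of per-rank loads.

    0.0 means perfectly balanced; larger values mean the slowest rank
    carries proportionally more work than the average rank.
    """
    avg = mean(rank_loads)
    if avg == 0:
        return 0.0
    return max(rank_loads) / avg - 1.0


# Hypothetical per-rank contact work in two phases of one simulation.
# The total work is the same, but in phase_b the contact surface is
# concentrated on a single rank.
phase_a = [10.0, 10.0, 10.0, 10.0]
phase_b = [34.0, 2.0, 2.0, 2.0]

imbalance_a = load_imbalance(phase_a)
imbalance_b = load_imbalance(phase_b)
```

Because the metric depends only on how work is spread over ranks, it can change sharply between phases even when the mesh distribution stays fixed.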
This report presents our work to model the workloads of a linear electromagnetic application based on the method of moments in the frequency domain, with the aim of effectively load balancing the matrix assembly. This application is particularly challenging to load balance due to its lack of persistent iterative behavior, its operation under tight memory constraints (where the matrix may fill 80% of memory on each node), and the algorithmic complexity of the computational method. This report describes the first step in our work to apply an inspector-executor approach to load balancing, in which key workload parameters are exposed during the inspector phase and a pre-trained model is applied to predict relative task weights for the load balancer.
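The inspector-executor idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the report's implementation: the task parameters (`rows`, `cols`), the linear stand-in for the pre-trained model, and the greedy placement heuristic are all hypothetical, and the sketch ignores memory constraints.

```python
import heapq


def predict_weight(params, coeffs=(1.0, 0.0)):
    """Hypothetical stand-in for a pre-trained model: a * rows * cols + b."""
    a, b = coeffs
    return a * params["rows"] * params["cols"] + b


def inspect(tasks):
    """Inspector phase: read each task's parameters and predict a relative
    weight without running the task itself."""
    return {tid: predict_weight(p) for tid, p in tasks.items()}


def assign(weights, num_ranks):
    """Greedy placement (heaviest task first, onto the least loaded rank).

    This heuristic is an assumption chosen for brevity.
    """
    heap = [(0.0, r) for r in range(num_ranks)]
    heapq.heapify(heap)
    placement = {}
    for tid, w in sorted(weights.items(), key=lambda kv: (-kv[1], kv[0])):
        load, rank = heapq.heappop(heap)
        placement[tid] = rank
        heapq.heappush(heap, (load + w, rank))
    return placement


# Hypothetical matrix-block tasks exposed by the inspector.
tasks = {
    "t0": {"rows": 400, "cols": 400},
    "t1": {"rows": 300, "cols": 300},
    "t2": {"rows": 200, "cols": 200},
    "t3": {"rows": 200, "cols": 100},
    "t4": {"rows": 100, "cols": 100},
}
weights = inspect(tasks)
placement = assign(weights, num_ranks=2)

# Executor phase (not shown) would run each task on placement[task_id].
rank_loads = [0.0, 0.0]
for tid, rank in placement.items():
    rank_loads[rank] += weights[tid]
```

Only relative weights matter to the placement step, which is why a model that predicts relative rather than absolute cost can be sufficient for this purpose.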
The goal of this report is to provide insight into the development of vt-tv, a C++ HPC visualization tool designed for the analysis of load-balancing metrics in the DARMA toolkit. In particular, it delves into the tool's modular data model and diverse usage scenarios, emphasizing adaptability and efficiency.
Proceedings - IEEE International Conference on Cluster Computing, ICCC
This paper explores dynamic load balancing algorithms used by asynchronous many-task (AMT), or 'task-based', programming models to optimize task placement for scientific applications with dynamic workload imbalances. AMT programming models use overdecomposition of the computational domain. Overdecomposition provides a natural mechanism for domain developers to expose concurrency and break their computational domain into pieces that can be remapped to different hardware. This paper explores fully distributed load balancing strategies that have shown great promise for exascale-level computing but are challenging to reason about theoretically and to implement effectively. We present a novel theoretical analysis of a gossip-based load balancing protocol and use it to build an efficient implementation with fast convergence rates and high load balancing quality. We demonstrate our algorithm in a next-generation plasma physics application (EMPIRE) that induces time-varying workload imbalance due to spatial non-uniformity in particle density across the domain. Our highly scalable, novel load balancing algorithm achieves over a 3x speedup (particle work) compared to a bulk-synchronous MPI implementation without load balancing.
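To make the general idea of gossip-based load balancing concrete, here is a simplified, sequential simulation. It is a sketch under assumptions and does not reproduce the paper's protocol or its analysis: the names `gossip_round` and `gossip_balance`, the fanout and round counts, and the transfer criterion are all hypothetical. Underloaded ranks spread their load information to random peers, and overloaded ranks then move tasks to underloaded ranks they have heard about.

```python
import random


def gossip_round(known, fanout, rng):
    """One round: every rank that knows something forwards it to `fanout`
    randomly chosen peers."""
    n = len(known)
    updates = [dict(k) for k in known]
    for src in range(n):
        if not known[src]:
            continue
        others = [r for r in range(n) if r != src]
        for dst in rng.sample(others, min(fanout, n - 1)):
            updates[dst].update(known[src])
    return updates


def gossip_balance(tasks_per_rank, rounds=3, fanout=2, seed=0):
    rng = random.Random(seed)
    tasks = [list(t) for t in tasks_per_rank]
    loads = [sum(t) for t in tasks]
    avg = sum(loads) / len(loads)
    # Underloaded ranks seed the gossip with their own load.
    known = [{r: loads[r]} if loads[r] < avg else {} for r in range(len(loads))]
    for _ in range(rounds):
        known = gossip_round(known, fanout, rng)
    # Overloaded ranks shed tasks to underloaded ranks they learned about.
    # Simplification: this sequential simulation checks up-to-date loads,
    # whereas a real distributed run would only see gossiped estimates.
    for src in range(len(tasks)):
        for task in sorted(tasks[src], reverse=True):
            if loads[src] <= avg:
                break
            candidates = [
                r for r in known[src] if r != src and loads[r] + task <= avg
            ]
            if not candidates:
                continue
            dst = min(candidates, key=lambda r: loads[r])
            tasks[src].remove(task)
            tasks[dst].append(task)
            loads[src] -= task
            loads[dst] += task
    return tasks


# Hypothetical task weights on four ranks; rank 0 is heavily overloaded.
before = [[5, 5, 5, 5, 4], [2], [2], [2]]
after = gossip_balance(before)
```

A task is moved only if the receiver stays at or below the average load, so in this sketch the maximum rank load can never increase and the total work is preserved.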
World Congress in Computational Mechanics and ECCOMAS Congress
Software development for high-performance scientific computing continues to evolve in response to increased parallelism and the advent of on-node accelerators, in particular GPUs. While these hardware advancements have the potential to significantly reduce turnaround times, they also present implementation and design challenges for engineering codes. We investigate the use of two strategies to mitigate these challenges: the Kokkos library for performance portability across disparate architectures, and the DARMA/vt library for asynchronous many-task scheduling. We investigate the application of Kokkos within the NimbleSM finite element code and the LAMÉ constitutive model library. We explore the performance of DARMA/vt applied to NimbleSM contact mechanics algorithms. Software engineering strategies are discussed, followed by performance analyses of relevant solid mechanics simulations which demonstrate the promise of Kokkos and DARMA/vt for accelerated engineering simulators.
Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
We present the execution model of Virtual Transport (VT), a new Asynchronous Many-Task (AMT) runtime system that provides unprecedented integration and interoperability with MPI. We have developed VT in conjunction with large production applications to provide a highly incremental, high-value path to AMT adoption in the dominant ecosystem of MPI applications, libraries, and developers. Our aim is that the 'MPI+X' model of hybrid parallelism can smoothly extend to become 'MPI+VT+X'. We illustrate a set of design and implementation techniques that have been useful in building VT. We believe that these ideas and the code embodying them will be useful to others building similar systems, and perhaps provide insight into how MPI might evolve to better support them. We motivate our approach with two applications that are adopting VT and have begun to benefit from increased asynchrony and dynamic load balancing.
The goal of this report is to provide a comprehensive status report of the research & development conducted in the context of the DARMA project by the end of the first quarter of fiscal year 2020. It follows in particular [LBS+19] and [PL19].
The goal of this report is to illustrate the use of Sandia's Automatic Report Generator (ARG) when applied to an electrostatic simulation case run with Sandia's EMPIRE code. It documents the results of a hackathon session held at the March 19-22 DOE Workshop Workflow and Hackathon in Livermore, where the co-authors demonstrated ARG's flexibility by extending it to several aspects of such a simulation in less than a day's worth of work. The Explorator component of ARG automatically picks up the case's input deck, thereby determining the data components that the Generator and Assembler components are currently able to document: meta-data, input deck, mesh, and solution fields. The ARG is not yet capable of documenting the particles file created by the simulation, which will require further work.
In this report we discuss load-balancing research and results in the context of the DARMA/VT project.
We begin by presenting an overview of the general philosophy that is guiding the novel DARMA developments, followed by a brief reminder about the background of this project. We finally present the FY19 design requirements. As the Exascale era arises, DARMA is uniquely positioned at the forefront of asynchronous many-task (AMT) research and development (R&D) to explore emerging programming model paradigms for next-generation HPC applications at Sandia, across NNSA labs, and beyond. The DARMA project explores how to fundamentally shift the expression (PM) and execution (EM) of massively concurrent HPC scientific algorithms to be more asynchronous, resilient to executional aberrations in heterogeneous/unpredictable environments, and data-dependency conscious, thereby enabling an intelligent, dynamic, and self-aware runtime to guide execution.