A Communication- and Memory-Aware Model for Load Balancing Tasks
Abstract not provided.
Abstract not provided.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Contact mechanics, or the modeling of the impenetrability of solid objects, is fundamental to computational solid mechanics (CSM) applications, yet it is often the most challenging component in terms of computational efficiency and performance. These challenges arise from the irregular and highly dynamic nature of contact simulation, particularly with algorithms designed for distributed-memory architectures. First among these challenges is the inherent load imbalance when distributing contact work across compute nodes. This imbalance is highly problem dependent and relates to the surface area of contact manifolds and the volume around them, rather than to the distribution of the mesh over compute nodes, meaning the application load can vary drastically across different phases. The dynamic nature of contact problems motivates the use of distributed asynchronous many-task (AMT) frameworks to efficiently handle irregular workloads. In this paper, we present our work on distBVH, a distributed contact solution that uses the DARMA/vt library for asynchronous tasking and is also capable of running on-node Kokkos-based kernels. We explore how distBVH addresses the various challenges of CSM contact problems. We evaluate many of DARMA/vt's dynamic load balancers and demonstrate how our load balancing approach can provide significant performance improvements on various computational solid mechanics benchmarks. Additionally, we show how our approach can take advantage of DARMA/vt for tasking and of efficient on-node Kokkos kernels to scale over hundreds of processing elements.
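The abstract does not show distBVH's kernels; as a rough illustration of the kind of on-node Kokkos kernel such a pipeline relies on, the following minimal sketch computes one axis of an axis-aligned bounding box over hypothetical vertex coordinates with a single Kokkos parallel reduction. All names, sizes, and data here are illustrative assumptions, not distBVH code.

```cpp
// Minimal sketch (not distBVH's actual code): compute the x-extent of a
// bounding box over contact-surface vertices with one Kokkos reduction,
// the kind of on-node kernel a BVH build relies on.
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int n = 1000;               // hypothetical vertex count
    Kokkos::View<double*> x("x", n);  // x-coordinates of vertices
    // Fill with placeholder data on the device.
    Kokkos::parallel_for("fill", n, KOKKOS_LAMBDA(const int i) {
      x(i) = 0.001 * i;
    });
    double xmin, xmax;
    // One reduction pass yields both bounds of the box along this axis.
    Kokkos::parallel_reduce("bounds", n,
      KOKKOS_LAMBDA(const int i, double& lmin, double& lmax) {
        if (x(i) < lmin) lmin = x(i);
        if (x(i) > lmax) lmax = x(i);
      },
      Kokkos::Min<double>(xmin), Kokkos::Max<double>(xmax));
    std::printf("x-extent: [%g, %g]\n", xmin, xmax);
  }
  Kokkos::finalize();
  return 0;
}
```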
The goal of this report is to provide insight into the development of vt-tv, a C++ HPC visualization tool designed for the analysis of load-balancing metrics in the DARMA toolkit. In particular, it delves into the tool's modular data model and diverse usage scenarios, emphasizing adaptability and efficiency.
Abstract not provided.
Abstract not provided.
Proceedings - IEEE International Conference on Cluster Computing, ICCC
This paper explores dynamic load balancing algorithms used by asynchronous many-task (AMT), or 'task-based', programming models to optimize task placement for scientific applications with dynamic workload imbalances. AMT programming models use overdecomposition of the computational domain. Overdecomposition provides a natural mechanism for domain developers to expose concurrency and break their computational domain into pieces that can be remapped to different hardware. This paper explores fully distributed load balancing strategies that have shown great promise for exascale-level computing but are challenging to reason about theoretically and to implement effectively. We present a novel theoretical analysis of a gossip-based load balancing protocol and use it to build an efficient implementation with fast convergence rates and high load balancing quality. We demonstrate our algorithm in a next-generation plasma physics application (EMPIRE) that induces time-varying workload imbalance due to spatial non-uniformity in particle density across the domain. Our novel, highly scalable load balancing algorithm achieves over a 3x speedup in particle work compared to a bulk-synchronous MPI implementation without load balancing.
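The protocol itself is not given in the abstract; the following minimal serial sketch illustrates only the general gossip idea such protocols build on, pairwise randomized averaging of load estimates that converges toward the global mean, which ranks can then use to decide where work should move. The rank count, peer selection, and round count are illustrative assumptions, not the paper's parameters.

```cpp
// Minimal serial sketch of gossip-style load averaging (not the paper's
// exact protocol): each of P simulated ranks repeatedly averages its load
// estimate with one random peer; pairwise averaging conserves the total,
// so all estimates converge to the global mean load.
#include <cstdio>
#include <random>
#include <vector>

int main() {
  const int P = 64, rounds = 40;  // hypothetical rank count and round count
  std::mt19937 gen(42);
  std::uniform_int_distribution<int> pick(0, P - 1);
  std::vector<double> est(P);
  for (int r = 0; r < P; ++r)
    est[r] = (r % 8 == 0) ? 8.0 : 1.0;  // imbalanced initial loads
  for (int t = 0; t < rounds; ++t) {
    for (int r = 0; r < P; ++r) {       // one gossip exchange per rank
      const int peer = pick(gen);
      const double avg = 0.5 * (est[r] + est[peer]);
      est[r] = est[peer] = avg;
    }
  }
  std::printf("rank 0 estimate of mean load: %g\n", est[0]);
  return 0;
}
```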
Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
We present the execution model of Virtual Transport (VT), a new Asynchronous Many-Task (AMT) runtime system that provides unprecedented integration and interoperability with MPI. We have developed VT in conjunction with large production applications to provide a highly incremental, high-value path to AMT adoption in the dominant ecosystem of MPI applications, libraries, and developers. Our aim is that the 'MPI+X' model of hybrid parallelism can smoothly extend to become 'MPI+VT+X'. We illustrate a set of design and implementation techniques that have been useful in building VT. We believe that these ideas and the code embodying them will be useful to others building similar systems, and perhaps provide insight into how MPI might evolve to better support them. We motivate our approach with two applications that are adopting VT and have begun to benefit from increased asynchrony and dynamic load balancing.
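The abstract does not detail VT's mechanics, and the sketch below deliberately avoids inventing vt API; instead it shows, in plain MPI, the generic interop pattern that an AMT runtime layered on MPI implies: a single per-rank progress loop that polls outstanding MPI operations and drains a local task queue, so communication and task execution interleave. The task queue and its contents are hypothetical.

```cpp
// Schematic sketch in plain MPI (not vt's actual API): a unified progress
// loop alternates between testing an outstanding MPI operation and running
// queued tasks, so neither blocks the other on a rank.
#include <mpi.h>
#include <cstdio>
#include <deque>
#include <functional>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  std::deque<std::function<void()>> queue;  // hypothetical local task queue
  queue.push_back([&] { std::printf("rank %d: task ran\n", rank); });

  MPI_Request req;
  MPI_Ibarrier(MPI_COMM_WORLD, &req);       // stand-in for any async MPI op

  int done = 0;
  while (!done || !queue.empty()) {         // unified progress loop
    if (!done) MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    if (!queue.empty()) { queue.front()(); queue.pop_front(); }
  }
  MPI_Finalize();
  return 0;
}
```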
Abstract not provided.
The goal of this report is to provide a comprehensive status report of the research & development conducted in the context of the DARMA project as of the end of the first quarter of fiscal year 2020. It follows, in particular, [LBS+19] and [PL19].
Credibility of end-to-end CompSim (Computational Simulation) models and their agile execution requires an expressive framework to describe, communicate, and execute the complex computational tool chains representing the model. All stakeholders, from system engineering and customers through model developers and V&V partners, need views and functionalities of the workflow representing the model in a manner that is natural to their discipline. In the milestone and in this report we define a workflow as a network of computational simulation activities executed autonomously on a distributed set of computational platforms. The FY19 ASC L2 Milestone (6802) for the Integrated Workflow (IWF) project was designed to integrate and improve existing capabilities or develop new functionalities to provide a wide range of stakeholders a coherent and intuitive platform capable of defining and executing CompSim modeling from analysis workflow definition to complex ensemble calculations. The main goal of the milestone was to advance the integrated workflow capabilities to support weapon system analysts with a production deployment in FY20. Ensemble calculations supporting program decisions include sensitivity analysis, optimization, and uncertainty quantification. The goal of the L2 milestone, aligned with the ultimate goal of the IWF project, is to foster a cultural and technical shift toward an integrated CompSim capability based on automated workflows. Specific deliverables were defined in five broad categories: 1) infrastructure, including development of a distributed-computing workflow capability; 2) integration of Dakota (Sandia's sensitivity, optimization, and UQ engine) with SAW (Sandia Analysis Workbench); 3) ARG (Automatic Report Generator, introspecting analysis artifacts and generating human-readable, extensible, and archivable reports); 4) libraries and repositories aiding capability reuse; and 5) exemplars to support training, capturing best practices, and stress testing of the platform. A set of exemplars was defined to represent typical weapon system qualification CompSim projects. Analyzing the required capabilities and using the findings to plan implementation ensured optimal allocation of development resources focused on production deployment after the L2 is completed. It was recognized early that end-to-end modeling applications pose a considerable number of diverse risks, and a formal risk tracking process was implemented. The project leveraged products, capabilities, and development tasks of IWF partners. SAW, Dakota, Cubit, Sierra, Slycat, and NGA (NexGen Analytics, a small business) contributed to the integrated platform developed during this milestone effort. New products delivered include: a) NGW (Next Generation Workflow) for robust workflow definition and execution, b) Dakota wizards, editor, and results visualization, and c) the automatic report generator ARG. User engagement was initiated early in the development process, eliciting concrete requirements and actionable feedback to ensure that the integrated CompSim capability will have high user acceptance and impact. The current integrated capabilities have been demonstrated and are continually being tested with a set of exemplars ranging from training scenarios to computationally demanding uncertainty analyses. The integrated workflow platform has been deployed on both the SRN (Sandia Restricted Network) and the SCN (Sandia Classified Network).
Computational platforms where the system has been demonstrated span from Windows (Creo, the CAD platform chosen by Sandia) to the Trinity HPC system (Sierra and CTH solvers). Follow-up work will focus on deployment at SNL and other sites in the nuclear enterprise (LLNL, KCNSC), along with training and consulting support to democratize the analysis agility, process health, and knowledge management benefits the NGW platform provides.
Abstract not provided.
Abstract not provided.
Abstract not provided.
The goal of this report is to illustrate the use of Sandia's Automatic Report Generator (ARG) when applied to an electrostatic simulation case run with Sandia's EMPIRE code. It documents the results of a hackathon session at the March 19-22 DOE Workflow and Hackathon Workshop held in Livermore, where the co-authors demonstrated ARG's flexibility by extending it to several aspects of such a simulation in less than a day's worth of work. The Explorator component of ARG automatically picks up the case's input deck, thereby determining the data components that the Generator and Assembler components are currently able to document: metadata, input deck, mesh, and solution fields. ARG is not yet capable of documenting the particles file created by the simulation, which will require further work.
In this report we discuss load-balancing research and results in the context of the DARMA/VT project.
We begin by presenting an overview of the general philosophy guiding the novel DARMA developments, followed by a brief reminder of the background of this project. We finally present the FY19 design requirements. As the exascale era arrives, DARMA is uniquely positioned at the forefront of asynchronous many-task (AMT) research and development (R&D) to explore emerging programming model paradigms for next-generation HPC applications at Sandia, across the NNSA labs, and beyond. The DARMA project explores how to fundamentally shift the expression (PM) and execution (EM) of massively concurrent HPC scientific algorithms to be more asynchronous, resilient to executional aberrations in heterogeneous/unpredictable environments, and data-dependency conscious, thereby enabling an intelligent, dynamic, and self-aware runtime to guide execution.
This report is a sequel to [PC18], in which we provided detailed installation and testing instructions for Sandia's Automatic Report Generator (ARG), currently under development, for both Linux and macOS target platforms. In the current report, we extend these instructions to the case of Windows systems.
Abstract not provided.
Abstract not provided.
Abstract not provided.
In this report we propose a new, extensible and flexible methodology to describe the structure of documents for the Automatic Report Generator (ARG) currently being developed at Sandia.
This report is a sequel to [PB16], in which we provided a first progress report on research and development towards a scalable, asynchronous many-task, in situ statistical analysis engine using the Legion runtime system. This earlier work included a prototype implementation of a proposed solution, using a proxy mini-application as a surrogate for a full-scale scientific simulation code. The first scalability studies were conducted with the above on modestly-sized experimental clusters. In contrast, in the current work we have integrated our in situ analysis engines with a full-size scientific application (S3D, using the Legion-SPMD model), and have conducted numerical tests on the largest computational platform currently available for DOE science applications. We also provide details regarding the design and development of a light-weight asynchronous collectives library. We describe how this library is utilized within our SPMD-Legion S3D workflow, and compare the data aggregation technique deployed herein to the approach taken within our previous work.
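The report's collectives library is not shown here; as a minimal stand-in, under the assumption that the aggregation reduces per-rank partial sums, the following sketch uses MPI's standard nonblocking collective to overlap aggregation with local analysis work. The buffer contents are illustrative.

```cpp
// Minimal sketch (not the report's library): asynchronously aggregate
// per-rank partial sums with a nonblocking collective, overlapping local
// analysis work while the reduction progresses.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double local[2]  = {1.0 * rank, 2.0 * rank};  // hypothetical partial sums
  double global[2] = {0.0, 0.0};
  MPI_Request req;
  MPI_Iallreduce(local, global, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

  // ... communication-free local analysis work could proceed here ...

  MPI_Wait(&req, MPI_STATUS_IGNORE);            // complete the aggregation
  if (rank == 0)
    std::printf("aggregated: %g %g\n", global[0], global[1]);
  MPI_Finalize();
  return 0;
}
```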
As we look ahead to next-generation high performance computing platforms, the placement and movement of data is becoming the key limiting factor for both performance and energy efficiency. Furthermore, the increased quantities of data the systems are capable of generating, in conjunction with the insufficient rate of improvement in the supporting I/O infrastructure, are forcing applications away from off-line post-processing of data and towards techniques based on in situ analysis and visualization. Together, these challenges are shaping how we will both design and develop effective, performant, and energy-efficient software. In particular, they highlight the need for data and data-centric operations to be fundamental in the reasoning about, and optimization of, scientific workflows on extreme-scale architectures.
Computational Statistics
Formulas for incremental or parallel computation of second-order central moments have long been known, and recent extensions of these formulas to univariate and multivariate moments of arbitrary order have been developed. Such formulas are of key importance in scenarios where incremental results are required and in parallel and distributed systems where communication costs are high. We survey these recent results and improve upon them with arbitrary-order, numerically stable one-pass formulas, which we further extend with weighted and compound variants. We also develop a generalized correction factor for standard two-pass algorithms that enables the maintenance of accuracy over nearly the full representable range of the input, avoiding the need for extended-precision arithmetic. We then empirically examine algorithm correctness for pairwise update formulas up to order four, as well as condition numbers and relative error bounds for eight different central moment formulas, each up to degree six, to address the trade-offs between numerical accuracy and speed of the various algorithms. Finally, we demonstrate the use of the most elaborate of these formulas, employing compound moments in a practical large-scale scientific application.
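As a concrete instance of the pairwise-update idea surveyed here, restricted to the second central moment, the following sketch combines two partial summaries (n, mean, M2) in constant time using the classical formulas; the struct layout and test values are illustrative.

```cpp
// Minimal sketch of pairwise update up to the second central moment:
// two partial summaries combine in O(1), which is what makes incremental
// and parallel/distributed moment computation cheap.
#include <cstdio>

struct Moments {
  double n, mean, M2;  // count, mean, and sum of squared deviations
};

// Combine two partial summaries; numerically stable for one-pass use.
Moments combine(const Moments& a, const Moments& b) {
  Moments c;
  c.n = a.n + b.n;
  const double delta = b.mean - a.mean;
  c.mean = a.mean + delta * b.n / c.n;
  c.M2 = a.M2 + b.M2 + delta * delta * a.n * b.n / c.n;
  return c;
}

int main() {
  // Summaries of {1,2,3} and {4,5}; the result matches {1,...,5} exactly:
  // n=5, mean=3, variance=2.
  const Moments a{3.0, 2.0, 2.0}, b{2.0, 4.5, 0.5};
  const Moments c = combine(a, b);
  std::printf("n=%g mean=%g var=%g\n", c.n, c.mean, c.M2 / c.n);
  return 0;
}
```

The same pattern generalizes to moments of higher order and to the weighted and compound variants the abstract describes, at the cost of additional cross terms in the combination formula.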