Publications

Results 26–50 of 126

ECP Milestone Deliverable Memo 2.3.1.04.15

Wilke, Jeremiah J.

The DARMA many-task framework provides asynchronous communication and load balancing functionality. This functionality is embedded in standard, modern C++ through template wrapper classes similar to futures. DARMA codes previously could not interoperate with MPI. A new C++ interface and extended semantics now allow quiescence of DARMA kernels and transfer of data ownership back into MPI, allowing (1) isolated DARMA kernels to be inserted in a larger MPI code or (2) reuse of MPI libraries, such as solvers, within a DARMA application.

ECP Milestone Memo for 2.3.1.04.16

Wilke, Jeremiah J.

The DARMA many-task framework provides asynchronous communication and load balancing functionality. This functionality is embedded in standard, modern C++ through template wrapper classes similar to futures. DARMA codes previously could not interoperate with Kokkos or OpenMP, since each runtime assumed sole ownership of thread resources. The most recent version now allows Kokkos/OpenMP to be used within DARMA tasks, enabling better performance through thread-level parallelism or the use of accelerators.

DARMA-Kokkos Data and Execution Interoperability WBS 2.3.1.04 Milestone 16 (ECP Milestone Report)

Wilke, Jeremiah J.

DARMA (Distributed Asynchronous Resilient Models for Applications) is a runtime library supporting the Sandia ATDM (Advanced Technology Development and Mitigation) program. The main application drivers fall within the ECP milestone 2.2.5.03 ADNN03-ASC ATDM SNL Application, which includes applications that require load balancing and asynchronous communication for high performance. The DARMA runtime infrastructure has been modified to be compatible with Kokkos/OpenMP parallelization within tasks, a critical requirement for high performance in the Sandia ATDM applications. DARMA development has occurred in parallel with a verification milestone for ATDM in FY18. For FY19, DARMA should impact ATDM by enabling dynamic load balancing and communication through only incremental changes to the existing verified MPI codes. DARMA can now support the intra-kernel thread parallelization in the parent MPI apps, allowing DARMA to be easily added without rewriting individual math kernels. The results presented here demonstrate DARMA performance for an MPI mini-app.

DARMA-MPI Interoperability WBS 2.3.1.04 Milestone 15 (ECP Milestone Report)

Wilke, Jeremiah J.

DARMA (Distributed Asynchronous Resilient Models for Applications) is a runtime library developed as part of the Sandia ATDM (Advanced Technology Development and Mitigation) program. DARMA supports applications within 2.2.5.03 ADNN03-ASC ATDM SNL Application, which includes a number of applications featuring dynamic physics that require load balancing and asynchronous communication for high performance. We have implemented a modern C++ programming model that can enable dynamic, asynchronous communication on top of existing data structures from a serial or MPI code. DARMA development has occurred in parallel with a verification milestone for ATDM in FY18. For FY19, DARMA will impact ATDM by enabling the performance benefits of a dynamic runtime through only incremental changes to existing verified MPI codes. The results presented here demonstrate the DARMA development process for an MPI mini-app, showing a 3–4x improvement in performance for a challenging problem relative to the parent MPI code and coming within 25 percent of the theoretically optimal performance achievable from a perfect, fine-grained load balancer for most cases.

Supercomputer in a Laptop: Distributed Application and Runtime Development via Architecture Simulation

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Knight, Samuel K.; Kenny, Joseph P.; Wilke, Jeremiah J.

Architecture simulation can aid in predicting and understanding application performance, particularly for proposed hardware or large system designs that do not exist. In network design studies for high-performance computing, most simulators focus on the dominant message passing (MPI) model. Currently, many simulators build and maintain their own simulator-specific implementations of MPI. This approach has several drawbacks. Rather than reusing an existing MPI library, simulator developers must implement all semantics, collectives, and protocols. Additionally, alternative runtimes like GASNet cannot be simulated without again building a simulator-specific version. It would be far more sustainable and flexible to maintain lower-level layers like uGNI or IB-verbs and reuse the production runtime code. Directly building and running production communication runtimes inside a simulator poses technical challenges, however. We discuss these challenges and show how they are overcome via the macroscale components for the Structural Simulation Toolkit (SST), leveraging a basic source-to-source tool to automatically adapt production code for simulation. SST is able to encapsulate and virtualize thousands of MPI ranks in a single simulator process, providing a “supercomputer in a laptop” environment. We demonstrate the approach for the production GASNet runtime over uGNI running inside SST. We then discuss the capabilities enabled, including investigating performance with tunable delays, deterministic debugging of race conditions, and distributed debugging with serial debuggers.

Compiler-assisted source-to-source skeletonization of application models for system simulation

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Wilke, Jeremiah J.; Kenny, Joseph P.; Knight, Samuel K.; Rumley, Sebastien

Performance modeling of networks through simulation requires application endpoint models that inject traffic into the simulation models. Endpoint models today for system-scale studies consist mainly of post-mortem trace replay, but these off-line simulations may lack flexibility and scalability. On-line simulations instead run so-called skeleton applications: reduced versions of an application that generate traffic the same as or similar to the full application. These skeleton apps have advantages in flexibility and scalability, but they often must be custom written for the simulator itself. Auto-skeletonization of existing application source code via compiler tools would provide endpoint models with minimal development effort. These source-to-source transformations have been only narrowly explored. We introduce a pragma language and a corresponding Clang-driven source-to-source compiler that performs auto-skeletonization based on the provided pragma annotations. We describe the compiler toolchain, validate the generated skeletons, and show scalability of the generated simulation models beyond 100K endpoints for example MPI applications. Overall, we assert that our proposed auto-skeletonization approach and the flexible skeletons it produces can be an important tool in realizing balanced exascale interconnect designs.

The pitfalls of provisioning exascale networks: A trace replay analysis for understanding communication performance

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Kenny, Joseph P.; Sargsyan, Khachik S.; Knight, Samuel K.; Michelogiannakis, George; Wilke, Jeremiah J.

Data movement is considered the main performance concern for exascale, including both on-node memory and off-node network communication. Indeed, many application traces show significant time spent in MPI calls, potentially indicating that faster networks must be provisioned for scalability. However, equating MPI times with network communication delays ignores synchronization delays and software overheads independent of network hardware. Using point-to-point protocol details, we explore the decomposition of MPI time into communication, synchronization, and software-stack components using architecture simulation. Detailed validation using Bayesian inference is used to identify the sensitivity of performance to specific latency/bandwidth parameters for different network protocols and to quantify the associated uncertainties. The inference, combined with trace replay, shows that synchronization and MPI software stack overhead are at least as important as the network itself in determining time spent in communication routines.
