Publications Search

A Communication- and Memory-Aware Model for Load Balancing Tasks

Lifflander, Jonathan J.; Pebay, Philippe P.; Slattengren, Nicole L.; Pebay, Pierre L.; Pfeiffer, Robert A.; Kotulski, Joseph D.; Mcgovern, Sean T.

Abstract not provided.

TYPE Conference Paper YEAR 2024

OSTI

Modeling Workloads of a Linear Electromagnetic Code for Load Balancing Matrix Assembly

Lifflander, Jonathan J.; Pebay, Pierre L.; Mcgovern, Sean T.; Slattengren, Nicole L.

This report presents our work to model the workloads of a linear electromagnetic application based on the method of moments in the frequency domain, with the goal of effectively load balancing the matrix assembly. This application is particularly challenging to load balance due to its lack of persistent iterative behavior, its operation under tight memory constraints (where the matrix may fill 80% of memory on each node), and the algorithmic complexity of the computational method. This report describes the first step in our work to apply an inspector-executor approach to load balancing, in which key parameters are exposed during the inspector phase and a pre-trained model is applied to predict relative task weights for the load balancer.

TYPE SAND Report YEAR 2023

DOI OSTI

LBAF-Viz: A New Application and Library to Visualize Computational Load and Communication Graphs

Morales, Nicolas; Pebay, Philippe P.; Wrobel, Marcin; Slattengren, Nicole L.; Lifflander, Jonathan J.

Abstract not provided.

TYPE Conference Presentation YEAR 2022

DOI OSTI

Optimizing Distributed Load Balancing for Workloads with Time-Varying Imbalance

Lifflander, Jonathan J.; Slattengren, Nicole L.; Pebay, Philippe P.; Miller, Philip; Rizzi, Francesco; Bettencourt, Matthew T.

Abstract not provided.

TYPE Conference Paper YEAR 2022

DOI OSTI

Optimizing Distributed Load Balancing for Workloads with Time-Varying Imbalance

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Lifflander, Jonathan J.; Slattengren, Nicole L.; Pebay, Philippe P.; Miller, Philip; Rizzi, Francesco; Bettencourt, Matthew T.

This paper explores dynamic load balancing algorithms used by asynchronous many-task (AMT), or 'task-based', programming models to optimize task placement for scientific applications with dynamic workload imbalances. AMT programming models use overdecomposition of the computational domain. Overdecomposition provides a natural mechanism for domain developers to expose concurrency and break their computational domain into pieces that can be remapped to different hardware. This paper explores fully distributed load balancing strategies that have shown great promise for exascale-level computing but are challenging to reason about theoretically and to implement effectively. We present a novel theoretical analysis of a gossip-based load balancing protocol and use it to build an efficient implementation with fast convergence rates and high load balancing quality. We demonstrate our algorithm in a next-generation plasma physics application (EMPIRE) that induces time-varying workload imbalance due to spatial non-uniformity in particle density across the domain. Our highly scalable, novel load balancing algorithm achieves over a 3x speedup (particle work) compared to a bulk-synchronous MPI implementation without load balancing.

TYPE Conference Paper YEAR 2021

DOI OSTI Scopus

Design and Implementation Techniques for an MPI-Oriented AMT Runtime

Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Lifflander, Jonathan J.; Miller, Philip; Slattengren, Nicole L.; Morales, Nicolas; Stickney, Paul; Pebay, Philippe P.

We present the execution model of Virtual Transport (VT), a new Asynchronous Many-Task (AMT) runtime system that provides unprecedented integration and interoperability with MPI. We have developed VT in conjunction with large production applications to provide a highly incremental, high-value path to AMT adoption in the dominant ecosystem of MPI applications, libraries, and developers. Our aim is that the 'MPI+X' model of hybrid parallelism can smoothly extend to become 'MPI+VT+X'. We illustrate a set of design and implementation techniques that have been useful in building VT. We believe that these ideas and the code embodying them will be useful to others building similar systems, and perhaps provide insight into how MPI might evolve to better support them. We motivate our approach with two applications that are adopting VT and have begun to benefit from increased asynchrony and dynamic load balancing.

TYPE Conference Poster YEAR 2020

OSTI Scopus

Design and Implementation Techniques for an MPI-Oriented AMT Runtime

Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Lifflander, Jonathan J.; Miller, Philip; Slattengren, Nicole L.; Morales, Nicolas; Stickney, Paul; Pebay, Philippe P.

We present the execution model of Virtual Transport (VT), a new Asynchronous Many-Task (AMT) runtime system that provides unprecedented integration and interoperability with MPI. We have developed VT in conjunction with large production applications to provide a highly incremental, high-value path to AMT adoption in the dominant ecosystem of MPI applications, libraries, and developers. Our aim is that the 'MPI+X' model of hybrid parallelism can smoothly extend to become 'MPI+VT+X'. We illustrate a set of design and implementation techniques that have been useful in building VT. We believe that these ideas and the code embodying them will be useful to others building similar systems, and perhaps provide insight into how MPI might evolve to better support them. We motivate our approach with two applications that are adopting VT and have begun to benefit from increased asynchrony and dynamic load balancing.

TYPE Conference Paper YEAR 2020

OSTI Scopus

Composing Asynchrony Communication and Resilience

Paul, Sri R.; Hayashi, Akihiro-Ex; Slattengren, Nicole L.; Kolla, Hemanth; Bak, Seonmyeong-Ex; Whitlock, Matthew J.; Mayo, Jackson R.; Teranishi, Keita; Sarkar, Vivek-Ex; Grossman, Max-Ex

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Integrating DARMA into EMPIRE for AMT-Based Particle Load Balancing

Slattengren, Nicole L.; Lifflander, Jonathan J.; Pebay, Philippe P.; Clay, Robert L.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Dynamic Task-based Load Balancing using DARMA

Slattengren, Nicole L.; Lifflander, Jonathan J.; Pebay, Philippe P.; Clay, Robert L.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Enabling Resilience in Asynchronous Many-Task Programming Models

Paul, Sri R.; Hayashi, Akihiro; Slattengren, Nicole L.; Kolla, Hemanth; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita; Mayo, Jackson R.; Sarkar, Vivek

Abstract not provided.

TYPE Conference Poster YEAR 2019

DOI OSTI

Scalable Efficient Fault Tolerance in Asynchronous Many Task (AMT) Programming Models

Teranishi, Keita; Kolla, Hemanth; Slattengren, Nicole L.; Mayo, Jackson R.; Paul, Sri R.; Hayashi, Akihiro; Sarkar, Vivek; Bak, Seonmyeong

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Enabling Resilience in Asynchronous Many-Task Programming Models

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Paul, Sri R.; Hayashi, Akihiro; Slattengren, Nicole L.; Kolla, Hemanth; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita; Mayo, Jackson R.; Sarkar, Vivek

Resilience is an imminent issue for next-generation platforms due to projected increases in soft/transient failures as part of the inherent trade-offs among performance, energy, and costs in system design. In this paper, we introduce a comprehensive approach to enabling application-level resilience in Asynchronous Many-Task (AMT) programming models with a focus on remedying Silent Data Corruption (SDC) that can often go undetected by the hardware and OS. Our approach makes it possible for the application programmer to declaratively express resilience attributes with minimal code changes, and to delegate the complexity of efficiently supporting resilience to our runtime system. We have created a prototype implementation of our approach as an extension to the Habanero C/C++ library (HClib), where different resilience techniques including task replay, task replication, algorithm-based fault tolerance (ABFT), and checkpointing are available. Our experimental results show that task replay incurs lower overhead than task replication when an appropriate error checking function is provided. Further, task replay matches the low overhead of ABFT. Our results also demonstrate the ability to combine different resilience schemes. To evaluate the effectiveness of our resilience mechanisms in the presence of errors, we injected synthetic errors at different error rates (1.0% and 10.0%) and found a modest increase in execution times. In summary, the results show that our approach supports efficient and scalable recovery, and that our approach can be used to influence the design of future AMT programming models and runtime systems that aim to integrate first-class support for user-level resilience.

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

DARMA-EMPIRE Integration and Performance Assessment – Interim Report

Lifflander, Jonathan J.; Bettencourt, Matthew T.; Slattengren, Nicole L.; Templet, Gary J.; Miller, Phil; Perrinel, Meriadeg; Rizzi, Francesco; Pebay, Philippe P.

We begin by presenting an overview of the general philosophy that is guiding the novel DARMA developments, followed by a brief reminder about the background of this project. We finally present the FY19 design requirements. As the Exascale era arises, DARMA is uniquely positioned at the forefront of asynchronous many-task (AMT) research and development (R&D) to explore emerging programming model paradigms for next-generation HPC applications at Sandia, across NNSA labs, and beyond. The DARMA project explores how to fundamentally shift the expression (PM) and execution (EM) of massively concurrent HPC scientific algorithms to be more asynchronous, resilient to executional aberrations in heterogeneous/unpredictable environments, and data-dependency conscious, thereby enabling an intelligent, dynamic, and self-aware runtime to guide execution.

TYPE SAND Report YEAR 2018

DOI OSTI

Analysis of Local Recovery Resilience Model for Asynchronous Many Task Parallel Programming Models

Teranishi, Keita; Kolla, Hemanth; Slattengren, Nicole L.; Whitlock, Matthew J.; Mayo, Jackson R.; Clay, Robert L.; Paul, Sri R.; Hayashi, Akihiro; Sarkar, Vivek

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

ASC CSSE Level 2 Milestone #6362: Resilient Asynchronous Many Task Programming Model

Teranishi, Keita; Kolla, Hemanth; Slattengren, Nicole L.; Whitlock, Matthew J.; Mayo, Jackson R.; Clay, Robert L.; Paul, Sri R.; Hayashi, Akihiro; Sarkar, Vivek

This report is an outcome of the ASC CSSE Level 2 Milestone 6362: Analysis of Resilient Asynchronous Many-Task (AMT) Programming Model. It comprises a summary and in-depth analysis of resilience schemes adapted to the AMT programming model. Herein, performance trade-offs of a resilient-AMT programming model are assessed through two approaches: (1) an analytical model realized by discrete event simulations and (2) empirical evaluation of benchmark programs representing regular and irregular workloads of explicit partial differential equation solvers. As part of this effort, an AMT execution simulator and a prototype resilient-AMT programming framework have been developed. The former permits us to hypothesize the performance behavior of a resilient-AMT model, and has undergone a verification and validation (V&V) process. The latter allows empirical evaluation of the performance of resilience schemes under emulated program failures and enabled the aforementioned V&V process. The outcome indicates that (1) resilience techniques implemented within an AMT framework allow efficient and scalable recovery under frequent failures, that (2) the abstraction of task and data instances in the AMT programming model enables readily usable Application Program Interfaces (APIs) for resilience, and that (3) this abstraction enables predicting the performance of resilient-AMT applications with a simple simulation infrastructure. This outcome will provide guidance for the design of the AMT programming model and runtime systems, user-level resilience support, and application development for ASC's next generation platforms (NGPs).

TYPE SAND Report YEAR 2018

DOI OSTI

ASC ATDM Level 2 Milestone #6015: Asynchronous Many-Task Software Stack Demonstration

Bennett, Janine C.; Bettencourt, Matthew T.; Clay, Robert L.; Edwards, Harold C.; Glass, Micheal W.; Hollman, David S.; Kolla, Hemanth; Lifflander, Jonathan J.; Littlewood, David J.; Markosyan, Aram; Moore, Stan G.; Olivier, Stephen L.; Phipps, Eric T.; Rizzi, Francesco; Slattengren, Nicole L.; Sunderland, Daniel; Wilke, Jeremiah

This report is an outcome of the ASC ATDM Level 2 Milestone 6015: Asynchronous Many-Task Software Stack Demonstration. It comprises a summary and in-depth analysis of DARMA and a DARMA-compliant Asynchronous Many-Task (AMT) runtime software stack. Herein, performance and productivity of the overall approach are assessed on benchmarks and proxy applications representative of the Sandia ATDM applications. As part of the effort to assess the perceived strengths and weaknesses of AMT models compared to more traditional methods, experiments were performed on ATS-1 (Advanced Technology Systems) test bed machines and Trinity. In addition to productivity and performance assessments, this report includes findings on the generality of DARMA's backend API as well as findings on interoperability with node-level and network-level system libraries. Together, this information provides a clear understanding of the strengths and limitations of the DARMA approach in the context of Sandia's ATDM codes, to guide our future research and development in this area.

TYPE SAND Report YEAR 2017

DOI OSTI

Tackling UQ in DARMA a Programming Model for Task-Based Execution at Extreme-Scale

Rizzi, Francesco; Phipps, Eric T.; Hollman, David S.; Lifflander, Jonathan J.; Wilke, Jeremiah; Markosyan, Aram; Kolla, Hemanth; Slattengren, Nicole L.; Teranishi, Keita; Stewart, James; Clay, Robert L.; Bennett, Janine C.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Evaluating the Charm++ Runtime's Ability to Cope with Performance Heterogeneity

Gamell, Marc; Teranishi, Keita; Knight, Samuel; Sjaardema, Gregory D.; Kolla, Hemanth; Wilke, Jason; Slattengren, Nicole L.; Ferreira, Kurt; Bennett, Janine C.; Jain, Nikhil; Kale, Laxmikant

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Toward Resilient Task Parallel PDE Solvers

Teranishi, Keita; Gamell, Marc; Slattengren, Nicole L.; Parashar, Manish

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Status of the DARMA Asynchronous Many Task Abstraction Layer

Bennett, Janine C.; Lifflander, Jonathan J.; Hollman, David S.; Wilke, Jeremiah; Kolla, Hemanth; Markosyan, Aram; Slattengren, Nicole L.; Clay, Robert L.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Metaprogramming-Enabled Parallel Execution of Apparently Sequential C++ Code

Hollman, David S.; Bennett, Janine C.; Kolla, Hemanth; Lifflander, Jonathan J.; Wilke, Jeremiah; Slattengren, Nicole L.

Abstract not provided.

TYPE Conference Poster YEAR 2016

DOI OSTI

Metaprogramming-Enabled Parallel Execution of Apparently Sequential C++ Code

Wilke, Jeremiah; Hollman, David S.; Bennett, Janine C.; Lifflander, Jonathan J.; Kolla, Hemanth; Slattengren, Nicole L.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

DARMA 0.3.0-alpha Specification

Wilke, Jeremiah; Hollman, David S.; Slattengren, Nicole L.; Lifflander, Jonathan; Kolla, Hemanth; Rizzi, Francesco; Teranishi, Keita; Bennett, Janine C.

In this document, we provide the specifications for DARMA (Distributed Asynchronous Resilient Models and Applications), a co-design research vehicle for asynchronous many-task (AMT) programming models that serves to: 1) insulate applications from runtime system and hardware idiosyncrasies, 2) improve AMT runtime programmability by co-designing an application programmer interface (API) directly with application developers, 3) synthesize application co-design activities into meaningful requirements for runtime systems, and 4) facilitate AMT design space characterization and definition, accelerating the development of AMT best practices.

TYPE SAND Report YEAR 2016

DOI OSTI
