Publications Search

Measuring Thread Timing to Assess the Feasibility of Early-Bird Message Delivery Across Systems and Scales

Concurrency and Computation: Practice and Experience

Schonbein, Whit; Dosanjh, Matthew G.F.; Levy, Scott; Marts, W.P.; Bridges, Patrick G.

Early-bird communication is a communication/computation overlap technique that leverages fine-grained communication to improve application run-time. Communication is divided such that each individual thread can initiate transmission of its portion of the data upon completion rather than waiting for a dedicated communication phase. The benefit of early-bird communication depends on the completion timing of the individual threads. On the one hand, if all threads complete at nearly the same time, the overheads of sending multiple messages accumulate, leading to performance that is worse than if a single message had been sent. On the other hand, if thread completions are spread out in time, those that complete earlier can send data while others continue working, leading to performance that is better than if a single message had been sent. The challenge is that the completion times are currently unknown and can vary based on application, problem size, system software, and underlying hardware. In this paper, we address this gap by measuring and evaluating the potential overlap afforded by early-bird communication for a selection of proxy applications. These measurements help us understand whether a given application could benefit from early-bird communication. We present our technique for gathering this data and evaluate data collected from three proxy applications: MiniFE, MiniMD, and MiniQMC. Each application is run on three systems with distinct CPU architectures and is strong-scaled across three run sizes. To characterize the behavior of these workloads, we study the trends of thread timings at both a macro level, that is, across all threads in all runs of an application, and a micro level, that is, within a single process of a single run. We observe that our tested applications exhibit significantly different thread arrival distributions. The machine used also had a significant impact, with the window of potential overlap varying by as much as an order of magnitude.

TYPE Journal Article YEAR 2025

Implementing One Sided Partitioned Communication in Open MPI

Dosanjh, Matthew G.F.

This report introduces partitioned communication, a new MPI 4.0 interface that enables early-bird communication by overlapping communication and computation. By partitioning messages into smaller sub-messages, MPI can start partial data transfers early. Performance studies show that the RMA implementation outperforms the Persistent implementation, despite some constraints. This report details a new opt-in RMA implementation that offers a high-performance option for partitioned communication at the cost of some additional limitations.

TYPE SAND Report YEAR 2024

CMB: A Configurable Messaging Benchmark to Explore Fine-Grained Communication

Marts, William P.; Schonbein, Whit; Dosanjh, Matthew G.F.; Levy, Scott; Bridges, Patrick

Abstract not provided.

TYPE Conference Paper YEAR 2024

Partitioned communication and the future of application design

Dosanjh, Matthew G.F.

Abstract not provided.

TYPE Conference Presentation YEAR 2024

Measuring Thread Timing to Assess the Feasibility of Early-bird Message Delivery

ACM International Conference Proceeding Series

Marts, William P.; Dosanjh, Matthew G.F.; Schonbein, Whit; Levy, Scott; Bridges, Patrick G.

Early-bird communication is a communication/computation overlap technique that combines fine-grained communication with partitioned communication to improve application run-time. Communication is divided among the compute threads such that each individual thread can initiate transmission of its portion of the data as soon as it is complete rather than waiting for all of the threads. However, the benefit of early-bird communication depends on the completion timing of the individual threads. In this paper, we measure and evaluate the potential overlap, that is, the idle time each thread experiences between finishing its computation and the final thread finishing. These measurements help us understand whether a given application could benefit from early-bird communication. We present our technique for gathering this data and evaluate data collected from three proxy applications: MiniFE, MiniMD, and MiniQMC. To characterize the behavior of these workloads, we study the thread timings at both a macro level, i.e., across all threads across all runs of an application, and a micro level, i.e., within a single process of a single run. We observe that these applications exhibit significantly different behavior. While MiniFE and MiniQMC appear to be well-suited for early-bird communication because of their wider thread distribution and more frequent laggard threads, the behavior of MiniMD may limit its ability to leverage early-bird communication.
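The potential-overlap metric defined in this abstract can be illustrated with a minimal Python sketch. The function name `potential_overlap` and the sample timings are hypothetical and are not taken from the paper; the sketch only assumes that one completion timestamp per thread is available.

```python
def potential_overlap(completion_times):
    """Return, for each thread, the idle time between its own completion
    and the completion of the last thread (the 'potential overlap')."""
    last = max(completion_times)
    return [last - t for t in completion_times]


# Hypothetical completion timestamps (seconds) for three threads.
overlaps = potential_overlap([1.0, 2.5, 4.0])
print(overlaps)  # [3.0, 1.5, 0.0]
```

A wide spread of values suggests more room for early finishers to send their data while laggards are still computing; values near zero suggest little to gain.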

TYPE Conference Proceeding YEAR 2023

Modeling and Benchmarking the Potential Benefit of Early-Bird Transmission in Fine-Grained Communication

ACM International Conference Proceeding Series

Schonbein, Whit; Levy, Scott; Dosanjh, Matthew G.F.; Marts, William P.; Reid, Elizabeth; Grant, Ryan E.

Traditional point-to-point communication sends data only after the entirety of the data is available. This includes situations where multiple actors (e.g., threads) contribute to the send buffer. As a result, cases where the completion times of these actors are widely distributed may be lost opportunities for optimization because data ready to be sent is waiting to be transmitted. Fine-grained communication exposes these opportunities by allowing buffers to be divided into elements that can then be sent independently (see e.g., Partitioned Communication in Message Passing Interface v4.0). While some research has been directed at exploring the utility of such 'early-bird' transmission, the overall search space for finding the best performing actor completion timings and element counts is large. In this work, we present an abstract model of fine-grained communication based on the LogGP model and a complementary benchmark. We use the model to explore actor completion timing scenarios and identify trends in communication behavior based on factors such as overall message size and delay between actor completions. We evaluate the benchmarks on three systems utilizing distinct network technologies and show that: (i) smaller numbers of elements are able to exploit most of the benefit of early-bird communication, (ii) performance benefit will depend non-trivially on application behavior, and (iii) benefits are highly network-dependent.
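To make the trade-off concrete, the following Python sketch compares a single send against early-bird sends using LogGP-style costs. It is a simplified illustration, not the model from the paper: it assumes equal-sized elements, one element per actor, and a single network interface that serializes sends. The class `LogGP`, the function names, and all parameter values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogGP:
    L: float  # network latency (s)
    o: float  # per-message CPU overhead (s)
    g: float  # minimum gap between consecutive messages (s)
    G: float  # gap per byte for long messages (s/byte)


def message_time(p, nbytes):
    """LogGP end-to-end time for one message: o + (k - 1) * G + L + o."""
    return p.o + (nbytes - 1) * p.G + p.L + p.o


def single_send_finish(p, completions, total_bytes):
    """One message, sent only after the slowest actor completes."""
    return max(completions) + message_time(p, total_bytes)


def early_bird_finish(p, completions, total_bytes):
    """Each actor sends an equal share as soon as it completes.
    Assumption: sends are serialized on one network interface."""
    share = total_bytes // len(completions)
    nic_free = 0.0
    finish = 0.0
    for t in sorted(completions):
        start = max(t, nic_free)               # wait for the interface
        inject = p.o + (share - 1) * p.G       # time spent injecting
        nic_free = start + max(inject, p.g)    # respect the message gap
        finish = max(finish, start + inject + p.L + p.o)
    return finish


params = LogGP(L=1e-6, o=0.5e-6, g=1e-6, G=1e-9)
spread = [0.0, 1e-3, 2e-3, 3e-3]   # actors finish far apart
together = [3e-3] * 4              # actors finish simultaneously
for name, times in (("spread", spread), ("together", together)):
    print(name,
          single_send_finish(params, times, 4_000_000),
          early_bird_finish(params, times, 4_000_000))
```

Under these assumed parameters, widely spaced completions favor early-bird sends, while simultaneous completions make them slightly slower than a single send because each extra message pays its own overhead.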

TYPE Conference Presentation YEAR 2023

SNL ATDM Software Ecosystem Then and Now: Operating Systems and On-Node Runtime

Olivier, Stephen L.; Brightwell, Ronald B.; Dosanjh, Matthew G.F.; Ferreira, Kurt B.; Levy, Scott; Bays, Nathan R.; Younge, Andrew J.

Abstract not provided.

TYPE Presentation YEAR 2022

Effective Multithread Communication for Next Generation Scientific Applications

Dosanjh, Matthew G.F.; Marts, William P.; Ciesko, Jan

Abstract not provided.

TYPE Conference Presentation YEAR 2022

How To Leverage New MPI Features for Exascale Applications

Dosanjh, Matthew G.F.; Marts, William P.; Ciesko, Jan; Prichard, Howard

Abstract not provided.

TYPE Conference Presentation YEAR 2022

MPI Sessions, Persistent Collectives, and Partitioned Communication

Dosanjh, Matthew G.F.; Prichard, Howard

Abstract not provided.

TYPE Conference Presentation YEAR 2022

SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime

Olivier, Stephen L.; Brightwell, Ronald B.; Dosanjh, Matthew G.F.; Ferreira, Kurt B.; Levy, Scott; Bays, Nathan R.; Younge, Andrew J.

Abstract not provided.

TYPE Presentation YEAR 2022

MiniMod: A Modular Miniapplication Benchmarking Framework for HPC

Marts, William P.; Dosanjh, Matthew G.F.; Schonbein, Whit; Levy, Scott; Grant, Ryan; Bridges, Patrick

Abstract not provided.

TYPE Conference Paper YEAR 2021

Design of a Portable Implementation of Partitioned Point-to-Point Communication Primitives

Worley, Andrew; Schafer, Derek; Bangalore, Purushotham V.; Dosanjh, Matthew G.F.; Grant, Ryan; Skjellum, Anthony; Ghafoor, Sheikh

Abstract not provided.

TYPE Conference Paper YEAR 2021

Partitioned Collective Communication

Proceedings of ExaMPI 2021: Workshop on Exascale MPI, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis

Holmes, Daniel J.; Skjellum, Anthony; Jaeger, Julien; Grant, Ryan E.; Schafer, Derek; Bangalore, Purushotham V.; Dosanjh, Matthew G.F.; Bienz, Amanda

Partitioned point-to-point communication and persistent collective communication were both recently standardized in MPI-4.0. Each offers performance and scalability advantages over MPI-3.1-based communication when planned transfers are feasible in an MPI application. Their merger into a generalized, persistent collective communication with partitions is a logical next step, with significant advantages for performance portability. Non-trivial decisions about the syntax and semantics of such operations need to be addressed, including the scope of knowledge of partitioning choices by members of the communicator's group(s). This paper introduces and motivates proposed interfaces for partitioned collective communication. Partitioned collectives will be particularly useful for multithreaded, accelerator-offloaded, and/or hardware-collective-enhanced MPI implementations driving suitable applications, as well as for pipelined collective communication (e.g., partitioned allreduce) with single consumers and producers per MPI process. These operations also provide load imbalance mitigation. Halo exchange codes arising from regular and irregular grid/mesh applications are a key candidate class of applications for this functionality. Generalizations of the lightweight notification procedures MPI_Parrived and MPI_Pready are considered. Generalizations of MPIX_Pbuf_prepare, a procedure proposed for MPI-4.1 for point-to-point partitioned communication, are also considered and shown in the context of supporting ready-mode send semantics for the operations. The option of providing local and incomplete modes for initialization procedures is mentioned (which could also apply to persistent collective operations); these semantics interact with the MPIX_Pbuf_prepare concept and the progress rule. Last, future work is outlined, indicating prerequisites for formal consideration for the MPI-5 standard.

TYPE Conference Paper YEAR 2021

RVMA: Remote Virtual Memory Access

Grant, Ryan; Levenhagen, Michael; Dosanjh, Matthew G.F.; Widener, Patrick

Abstract not provided.

TYPE Conference Paper YEAR 2020

Low-cost MPI Multithreaded Message Matching Benchmarking

Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020

Schonbein, Whit; Levy, Scott; Marts, William P.; Dosanjh, Matthew G.F.; Grant, Ryan

The Message Passing Interface (MPI) standard allows user-level threads to concurrently call into an MPI library. While this feature is currently rarely used, there is considerable interest from developers in adopting it in the near future. There is reason to believe that multithreaded communication may incur additional message processing overheads in terms of number of items searched during demultiplexing and amount of time spent searching because it has the potential to increase the number of messages exchanged and to introduce non-deterministic message ordering. Therefore, understanding the implications of adding multithreading to MPI applications is important for future application development. One strategy for advancing this understanding is through 'low-cost' benchmarks that emulate full communication patterns using fewer resources. For example, while a complete, 'real-world' multithreaded halo exchange requires 9 or 27 nodes, the low-cost alternative needs only two, making it deployable on systems where acquiring resources is difficult because of high utilization (e.g., busy capacity-computing systems), or impossible because the necessary resources do not exist (e.g., testbeds with too few nodes). While such benchmarks have been proposed, the reported results have been limited to a single architecture or derived indirectly through simulation, and no attempt has been made to confirm that a low-cost benchmark accurately captures features of full (non-emulated) exchanges. Moreover, benchmark code has not been made publicly available. The purpose of the study presented in this paper is to quantify how accurately the low-cost benchmark captures the matching behavior of the full, real-world benchmark. In the process, we also advocate for the feasibility and utility of the low-cost benchmark.
We present a 'real-world' benchmark implementing a full multithreaded halo exchange on 9 and 27 nodes, as defined by 5-point and 9-point 2D stencils, and 7-point and 27-point 3D stencils. Likewise, we present a 'low-cost' benchmark that emulates these communication patterns using only two nodes. We then confirm, across multiple architectures, that the low-cost benchmark gives accurate estimates of both number of items searched during message processing, and time spent processing those messages. Finally, we demonstrate the utility of the low-cost benchmark by using it to profile the performance impact of state-of-the-art Mellanox ConnectX-5 hardware support for offloaded MPI message demultiplexing. To facilitate further research on the effects of multithreaded MPI on message matching behavior, the source of our two benchmarks is to be included in the next release version of the Sandia MPI Micro-Benchmark Suite.
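The stencil shapes named in this abstract determine how many neighbors each process exchanges halos with. The Python sketch below enumerates neighbor offsets for each shape; the function `stencil_neighbors` is a hypothetical helper and not part of the benchmark. The neighbor counts are consistent with the 9-node and 27-node figures if one assumes a 3x3 or 3x3x3 arrangement of nodes around a center node.

```python
from itertools import product


def stencil_neighbors(dims, full):
    """Return neighbor offsets for a halo exchange in `dims` dimensions.
    full=False keeps only face neighbors (5-point 2D, 7-point 3D);
    full=True also keeps edge and corner neighbors (9-point 2D, 27-point 3D)."""
    offsets = []
    for off in product((-1, 0, 1), repeat=dims):
        nonzero = sum(1 for c in off if c != 0)
        if nonzero == 0:
            continue  # skip the center point itself
        if full or nonzero == 1:
            offsets.append(off)
    return offsets


for dims, full, label in ((2, False, "5-point 2D"), (2, True, "9-point 2D"),
                          (3, False, "7-point 3D"), (3, True, "27-point 3D")):
    print(label, len(stencil_neighbors(dims, full)), "neighbors")
```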

TYPE Conference Paper YEAR 2020

Low-cost MPI Multithreaded Message Matching Benchmarking

Schonbein, Whit; Grant, Ryan; Levy, Scott; Dosanjh, Matthew G.F.; Marts, William P.

Abstract not provided.

TYPE Conference Presentation YEAR 2020

MPI Partitioned Communication

Grant, Ryan; Dosanjh, Matthew G.F.

Abstract not provided.

TYPE Conference Poster YEAR 2020

Tail queues: A multi-threaded matching architecture

Concurrency and Computation: Practice and Experience

Dosanjh, Matthew G.F.; Grant, Ryan; Schonbein, Whit; Bridges, Patrick G.

As we approach exascale, computational parallelism will have to drastically increase in order to meet throughput targets. Many-core architectures have exacerbated this problem by trading reduced clock speeds, core complexity, and computation throughput for increasing parallelism. This presents two major challenges for communication libraries such as MPI: the library must leverage the performance advantages of thread level parallelism and avoid the scalability problems associated with increasing the number of processes to that scale. Hybrid programming models, such as MPI+X, have been proposed to address these challenges. MPI_THREAD_MULTIPLE is MPI's thread safe mode. While there has been work to optimize it, it largely remains non-performant in most implementations. While current applications avoid MPI multithreading due to performance concerns, it is expected to be utilized in future applications. One of the major synchronous data structures required by MPI is the matching engine. In this paper, we present a parallel matching algorithm that can improve MPI matching for multithreaded applications. We then perform a feasibility study to demonstrate the performance benefit of the technique.
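For readers unfamiliar with message matching, the Python sketch below shows a conventional serial, list-based matching engine with a posted-receive queue and an unexpected-message queue. It is a baseline illustration only and is not the parallel algorithm presented in the paper. The class `MatchEngine`, its methods, and the wildcard constants are hypothetical names; communicators and thread safety are omitted for brevity.

```python
ANY_SOURCE = -1  # wildcard, standing in for MPI_ANY_SOURCE
ANY_TAG = -1     # wildcard, standing in for MPI_ANY_TAG


class MatchEngine:
    def __init__(self):
        self.posted = []      # posted receives: (source, tag)
        self.unexpected = []  # unmatched arrivals: (source, tag, payload)
        self.searched = 0     # total queue entries inspected

    @staticmethod
    def _matches(recv_src, recv_tag, msg_src, msg_tag):
        return (recv_src in (ANY_SOURCE, msg_src)
                and recv_tag in (ANY_TAG, msg_tag))

    def post_recv(self, source, tag):
        """Search unexpected messages in arrival order; queue the receive if none match."""
        for i, (src, t, payload) in enumerate(self.unexpected):
            self.searched += 1
            if self._matches(source, tag, src, t):
                del self.unexpected[i]
                return payload
        self.posted.append((source, tag))
        return None

    def arrive(self, source, tag, payload):
        """Search posted receives in posting order; queue the message if none match."""
        for i, (rsrc, rtag) in enumerate(self.posted):
            self.searched += 1
            if self._matches(rsrc, rtag, source, tag):
                del self.posted[i]
                return True
        self.unexpected.append((source, tag, payload))
        return False


engine = MatchEngine()
engine.post_recv(0, 5)
engine.post_recv(ANY_SOURCE, 7)
engine.arrive(1, 7, "x")                # matches the wildcard receive
engine.arrive(2, 9, "y")                # no match, becomes unexpected
print(engine.post_recv(2, ANY_TAG))     # y
print(engine.searched)                  # 4
```

Because both queues are searched linearly and in order, concurrent threads must serialize on them, which is one reason long queues and multithreaded access make matching expensive.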

TYPE Journal Article YEAR 2020

INCA: In-Network Compute Assistance

Schonbein, Whit; Grant, Ryan; Dosanjh, Matthew G.F.; Arnold, Dorian

Abstract not provided.

TYPE Conference Poster YEAR 2019

The Case for Modular Generalizable Proxy Applications for Systems Software Research

Marts, William P.; Dosanjh, Matthew G.F.; Grant, Ryan; Bridges, Patrick

Abstract not provided.

TYPE Conference Poster YEAR 2019

Receive-Side Partitioned Communication

Dosanjh, Matthew G.F.; Grant, Ryan

This report describes the implementation and experimentation of receive-side partitioned communication for the Message Passing Interface (MPI).

TYPE SAND Report YEAR 2019

MPI Tag Matching Performance on ConnectX and ARM

Marts, William P.; Dosanjh, Matthew G.F.; Schonbein, Whit; Grant, Ryan; Bridges, Patrick

Abstract not provided.

TYPE Conference Poster YEAR 2019

Fuzzy matching: Hardware accelerated MPI communication middleware

Proceedings - 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019

Dosanjh, Matthew G.F.; Schonbein, Whit; Grant, Ryan; Bridges, Patrick G.; Gazimirsaeed, S.M.; Afsahi, Ahmad

Contemporary parallel scientific codes often rely on message passing for inter-process communication. However, inefficient coding practices or multithreading (e.g., via MPI_THREAD_MULTIPLE) can severely stress the underlying message processing infrastructure, resulting in potentially unacceptable impacts on application performance. In this article, we propose and evaluate a novel method for addressing this issue: 'Fuzzy Matching'. This approach has two components. First, it exploits the fact that most server-class CPUs include vector operations to parallelize message matching. Second, based on a survey of point-to-point communication patterns in representative scientific applications, the method further increases parallelization by allowing matches based on 'partial truth', i.e., by identifying probable rather than exact matches. We evaluate the impact of this approach on memory usage and performance on Knights Landing and Skylake processors. At scale (262,144 Intel Xeon Phi cores), the method shows up to 1.13 GiB of memory savings per node in the MPI library, and improvement in matching time of 95.9%; smaller-scale runs show run-time improvements of up to 31.0% for full applications, and up to 6.1% for optimized proxy applications.

TYPE Conference Poster YEAR 2019
