Publications

10 Results

Search results

Jump to search filters

Modeling and Benchmarking the Potential Benefit of Early-Bird Transmission in Fine-Grained Communication

ACM International Conference Proceeding Series

Schonbein, William W.; Levy, Scott L.; Dosanjh, Matthew D.; Marts, William P.; Reid, Elizabeth; Grant, Ryan E.

Traditional point-to-point communication sends data only after the entirety of the data is available. This includes situations where multiple actors (e.g., threads) contribute to the send buffer. As a result, cases where the completion times of these actors are widely distributed may be lost opportunities for optimization because data ready to be sent is waiting to be transmitted. Fine-grained communication exposes these opportunities by allowing buffers to be divided into elements that can then be sent independently (see e.g., Partitioned Communication in Message Passing Interface v4.0). While some research has been directed at exploring the utility of such 'early-bird' transmission, the overall search space for finding the best performing actor completion timings and element counts is large. In this work, we present an abstract model of fine-grained communication based on the LogGP model and a complementary benchmark. We use the model to explore actor completion timing scenarios and identify trends in communication behavior based on factors such as overall message size and delay between actor completions. We evaluate the benchmarks on three systems utilizing distinct network technologies and show that: (i) smaller numbers of elements are able to exploit most of the benefit of early-bird communication, (ii) performance benefit will depend non-trivially on application behavior, and (iii) benefits are highly network-dependent.

More Details

Measuring Thread Timing to Assess the Feasibility of Early-bird Message Delivery

ACM International Conference Proceeding Series

Marts, William P.; Dosanjh, Matthew D.; Schonbein, William W.; Levy, Scott L.; Bridges, Patrick G.

Early-bird communication is a communication/computation overlap technique that combines fine-grained communication with partitioned communication to improve application run-time. Communication is divided among the compute threads such that each individual thread can initiate transmission of its portion of the data as soon as it is complete rather than waiting for all of the threads. However, the benefit of early-bird communication depends on the completion timing of the individual threads. In this paper, we measure and evaluate the potential overlap, the idle time each thread experiences between finishing their computation and the final thread finishing. These measurements help us understand whether a given application could benefit from early-bird communication. We present our technique for gathering this data and evaluate data collected from three proxy applications: MiniFE, MiniMD, and MiniQMC. To characterize the behavior of these workloads, we study the thread timings at both a macro level, i.e., across all threads across all runs of an application, and a micro level, i.e., within a single process of a single run. We observe that these applications exhibit significantly different behavior. While MiniFE and MiniQMC appear to be well-suited for early-bird communication because of their wider thread distribution and more frequent laggard threads, the behavior of MiniMD may limit its ability to leverage early-bird communication.

More Details

Low-cost MPI Multithreaded Message Matching Benchmarking

Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020

Schonbein, William W.; Levy, Scott L.; Marts, William P.; Dosanjh, Matthew D.; Grant, Ryan E.

The Message Passing Interface (MPI) standard allows user-level threads to concurrently call into an MPI library. While this feature is currently rarely used, there is considerable interest from developers in adopting it in the near future. There is reason to believe that multithreaded communication may incur additional message processing overheads in terms of number of items searched during demultiplexing and amount of time spent searching because it has the potential to increase the number of messages exchanged and to introduce non-deterministic message ordering. Therefore, understanding the implications of adding multithreading to MPI applications is important for future application development.One strategy for advancing this understanding is through 'low-cost' benchmarks that emulate full communication patterns using fewer resources. For example, while a complete, 'real-world' multithreaded halo exchange requires 9 or 27 nodes, the low-cost alternative needs only two, making it deployable on systems where acquiring resources is difficult because of high utilization (e.g., busy capacity-computing systems), or impossible because the necessary resources do not exist (e.g., testbeds with too few nodes). While such benchmarks have been proposed, the reported results have been limited to a single architecture or derived indirectly through simulation, and no attempt has been made to confirm that a low-cost benchmark accurately captures features of full (non-emulated) exchanges. Moreover, benchmark code has not been made publicly available.The purpose of the study presented in this paper is to quantify how accurately the low-cost benchmark captures the matching behavior of the full, real-world benchmark. In the process, we also advocate for the feasibility and utility of the low-cost benchmark. We present a 'real-world' benchmark implementing a full multithreaded halo exchange on 9 and 27 nodes, as defined by 5-point and 9-point 2D stencils, and 7-point and 27-point 3D stencils. Likewise, we present a 'low-cost' benchmark that emulates these communication patterns using only two nodes. We then confirm, across multiple architectures, that the low-cost benchmark gives accurate estimates of both number of items searched during message processing, and time spent processing those messages. Finally, we demonstrate the utility of the low-cost benchmark by using it to profile the performance impact of state-of-The-Art Mellanox ConnectX-5 hardware support for offloaded MPI message demultiplexing. To facilitate further research on the effects of multithreaded MPI on message matching behavior, the source of our two benchmarks is to be included in the next release version of the Sandia MPI Micro-Benchmark Suite.

More Details
10 Results
10 Results