Publications Search

MiniMod: A Modular Miniapplication Benchmarking Framework for HPC

Marts, William P.; Dosanjh, Matthew G.; Schonbein, William W.; Levy, Scott L.N.; Grant, Ryan; Bridges, Patrick

Abstract not provided.

More Details

TYPE Conference Paper YEAR 2021

DOI OSTI

Design of a Portable Implementation of Partitioned Point-to-Point Communication Primitives

Worley, Andrew; Schafer, Derek; Bangalore, Purushotham V.; Dosanjh, Matthew G.; Grant, Ryan; Skjellum, Anthony; Ghafoor, Ghafoor

Abstract not provided.

More Details

TYPE Conference Paper YEAR 2021

DOI OSTI

Partitioned Collective Communication

Proceedings of ExaMPI 2021: Workshop on Exascale MPI, Held in conjunction with SC 2021: The International Conference for High Performance Computing, Networking, Storage and Analysis

Holmes, Daniel J.; Skjellum, Anthony; Jaeger, Julien; Grant, Ryan E.; Schafer, Derek; Bangalore, Purushotham V.; Dosanjh, Matthew G.; Bienz, Amanda

Partitioned point-To-point communication and persistent collective communication were both recently standardized in MPI-4.0. Each offers performance and scalability advantages over MPI-3.1-based communication when planned transfers are feasible in an MPI application. Their merger into a generalized, persistent collective communication with partitions is a logical next step, with significant advantages for performance portability. Non-Trivial decisions about the syntax and semantics of such operations need to be addressed, including scope of knowledge of partitioning choices by members of the communicator's group(s). This paper introduces and motivates proposed interfaces for partitioned collective communication. Partitioned collectives will be particularly useful for multithreaded, accelerator-offloaded, and/or hardware-collective-enhanced MPI implementations driving suitable applications, as well as for pipelined collective communication (e.g., partitioned allreduce) with single consumers and producers per MPI process. These operations also provide load imbalance mitigation. Halo exchange codes arising from regular and irregular grid/mesh applications are a key candidate class of applications for this functionality. Generalizations of lightweight notification procedures MPI-Parrived and MPI-Pready are considered. Generalization of MPIX-Pbuf-prepare, a procedure proposed for MPI-4.1 for point-To-point partitioned communication, are also considered, shown in context of supporting ready-mode send semantics for the operations. The option of providing local and incomplete modes for initialization procedures is mentioned (which could also apply to persistent collective operations); these semantics interact with the MPIX-Pbuf-prepare concept and the progress rule. Last, future work is outlined, indicating prerequisites for formal consideration for the MPI-5 standard.

More Details

TYPE Conference Paper YEAR 2021

OSTI Scopus

RVMA: Remote Virtual Memory Access

Grant, Ryan; Levenhagen, Michael; Dosanjh, Matthew G.; Widener, Patrick

Abstract not provided.

More Details

TYPE Conference Paper YEAR 2020

DOI OSTI

Low-cost MPI Multithreaded Message Matching Benchmarking

Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020

Schonbein, William W.; Levy, Scott L.N.; Marts, William P.; Dosanjh, Matthew G.; Grant, Ryan

The Message Passing Interface (MPI) standard allows user-level threads to concurrently call into an MPI library. While this feature is currently rarely used, there is considerable interest from developers in adopting it in the near future. There is reason to believe that multithreaded communication may incur additional message processing overheads in terms of number of items searched during demultiplexing and amount of time spent searching because it has the potential to increase the number of messages exchanged and to introduce non-deterministic message ordering. Therefore, understanding the implications of adding multithreading to MPI applications is important for future application development.One strategy for advancing this understanding is through 'low-cost' benchmarks that emulate full communication patterns using fewer resources. For example, while a complete, 'real-world' multithreaded halo exchange requires 9 or 27 nodes, the low-cost alternative needs only two, making it deployable on systems where acquiring resources is difficult because of high utilization (e.g., busy capacity-computing systems), or impossible because the necessary resources do not exist (e.g., testbeds with too few nodes). While such benchmarks have been proposed, the reported results have been limited to a single architecture or derived indirectly through simulation, and no attempt has been made to confirm that a low-cost benchmark accurately captures features of full (non-emulated) exchanges. Moreover, benchmark code has not been made publicly available.The purpose of the study presented in this paper is to quantify how accurately the low-cost benchmark captures the matching behavior of the full, real-world benchmark. In the process, we also advocate for the feasibility and utility of the low-cost benchmark. We present a 'real-world' benchmark implementing a full multithreaded halo exchange on 9 and 27 nodes, as defined by 5-point and 9-point 2D stencils, and 7-point and 27-point 3D stencils. Likewise, we present a 'low-cost' benchmark that emulates these communication patterns using only two nodes. We then confirm, across multiple architectures, that the low-cost benchmark gives accurate estimates of both number of items searched during message processing, and time spent processing those messages. Finally, we demonstrate the utility of the low-cost benchmark by using it to profile the performance impact of state-of-The-Art Mellanox ConnectX-5 hardware support for offloaded MPI message demultiplexing. To facilitate further research on the effects of multithreaded MPI on message matching behavior, the source of our two benchmarks is to be included in the next release version of the Sandia MPI Micro-Benchmark Suite.

More Details

TYPE Conference Paper YEAR 2020

DOI OSTI Scopus

Low-cost MPI Multithreaded Message Matching Benchmarking

Schonbein, William W.; Grant, Ryan; Levy, Scott L.N.; Dosanjh, Matthew G.; Marts, William P.

Abstract not provided.

More Details

TYPE Conference Presentation YEAR 2020

DOI OSTI

MPI Partitioned Communication

Grant, Ryan; Dosanjh, Matthew G.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2020

OSTI

Tail queues: A multi-threaded matching architecture

Concurrency and Computation: Practice and Experience

Dosanjh, Matthew G.; Grant, Ryan; Schonbein, William W.; Bridges, Patrick G.

As we approach exascale, computational parallelism will have to drastically increase in order to meet throughput targets. Many-core architectures have exacerbated this problem by trading reduced clock speeds, core complexity, and computation throughput for increasing parallelism. This presents two major challenges for communication libraries such as MPI: the library must leverage the performance advantages of thread level parallelism and avoid the scalability problems associated with increasing the number of processes to that scale. Hybrid programming models, such as MPI+X, have been proposed to address these challenges. MPI THREAD MULTIPLE is MPI's thread safe mode. While there has been work to optimize it, it largely remains non-performant in most implementations. While current applications avoid MPI multithreading due to performance concerns, it is expected to be utilized in future applications. One of the major synchronous data structures required by MPI is the matching engine. In this paper, we present a parallel matching algorithm that can improve MPI matching for multithreaded applications. We then perform a feasibility study to demonstrate the performance benefit of the technique.

More Details

TYPE Journal Article YEAR 2020

DOI OSTI Scopus

The Case for Modular Generalizable Proxy Applications for Systems Software Research

Marts, William P.; Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

INCA: In-Network Compute Assistance

Schonbein, William W.; Grant, Ryan; Dosanjh, Matthew G.; Arnold, Dorian

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI

INCA: In-Network Compute Assistance

Schonbein, William W.; Grant, Ryan; Dosanjh, Matthew G.; Arnold, Dorian

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI

MPI Tag Matching Performance on ConnectX and ARM

Marts, William P.; Dosanjh, Matthew G.; Schonbein, William W.; Grant, Ryan; Bridges, Patrick

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI

Fuzzy matching: Hardware accelerated MPI communication middleware

Proceedings - 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019

Dosanjh, Matthew G.; Schonbein, William W.; Grant, Ryan; Bridges, Patrick G.; Gazimirsaeed, S.M.; Afsahi, Ahmad

Contemporary parallel scientific codes often rely on message passing for inter-process communication. However, inefficient coding practices or multithreading (e.g., via MPI-THREAD-MULTIPLE) can severely stress the underlying message processing infrastructure, resulting in potentially un-acceptable impacts on application performance. In this article, we propose and evaluate a novel method for addressing this issue: 'Fuzzy Matching'. This approach has two components. First, it exploits the fact most server-class CPUs include vector operations to parallelize message matching. Second, based on a survey of point-to-point communication patterns in representative scientific applications, the method further increases parallelization by allowing matches based on 'partial truth', i.e., by identifying probable rather than exact matches. We evaluate the impact of this approach on memory usage and performance on Knight's Landing and Skylake processors. At scale (262,144 Intel Xeon Phi cores), the method shows up to 1.13 GiB of memory savings per node in the MPI library, and improvement in matching time of 95.9%; smaller-scale runs show run-time improvements of up to 31.0% for full applications, and up to 6.1% for optimized proxy applications.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

Using simulation to examine the effect of MPI message matching costs on application performance

Parallel Computing

Levy, Scott L.N.; Ferreira, Kurt; Schonbein, Whit; Grant, Ryan; Dosanjh, Matthew G.

Attaining high performance with MPI applications requires efficient message matching to minimize message processing overheads and the latency these overheads introduce into application communication. In this paper, we use a validated simulation-based approach to examine the relationship between MPI message matching performance and application time-to-solution. Specifically, we examine how the performance of several important HPC workloads is affected by the time required for matching. Our analysis yields several important contributions: (i) the performance of current workloads is unlikely to be significantly affected by MPI matching unless match queue operations get much slower or match queues get much longer; (ii) match queue designs that provide sublinear performance as a function of queue length are unlikely to yield much benefit unless match queue lengths increase dramatically; and (iii) we provide guidance on how long the mean time per match attempt may be without significantly affecting application performance. The results and analysis in this paper provide valuable guidance on the design and development of MPI message match queues.

More Details

TYPE Journal Article YEAR 2019

DOI OSTI Scopus

Fuzzy Matching: Accelerating MPI Matching with Vector Comparisons Based on Partial Truth

Dosanjh, Matthew G.; Schonbein, William W.; Grant, Ryan; Bridges, Patrick; Gazimirsaeed, Mahdieh; Asafi, Ahmad

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

The upcoming storm: The implications of increasing core count on scalable system software

Advances in Parallel Computing

Dosanjh, Matthew G.; Grant, Ryan; Hjelm, Nathan; Levy, Scott L.N.; Schonbein, William W.

As clock speeds have stagnated, the number of cores in a node has been drastically increased to improve processor throughput. Most scalable system software was designed and developed for single-threaded environments. Multithreaded environments become increasingly prominent as application developers optimize their codes to leverage the full performance of the processor; however, these environments are incompatible with a number of assumptions that have driven scalable system software development. This paper will highlight a case study of this mismatch focusing on MPI message matching. MPI message matching has been designed and optimized for traditional serial execution. The reduced determinism in the order of MPI calls can significantly reduce the performance of MPI message matching, potentially overtaking time-per-iteration targets of many applications. Different proposed techniques attempt to address these issues and enable multithreaded MPI usage. These approaches highlight a number of tradeoffs that make adapting MPI message matching complex. This case study and its proposed solutions highlight a number of general concepts that need to be leveraged in the design of next generation scaleable system software.

More Details

TYPE Book YEAR 2019

OSTI Scopus

Finepoints: Partitioned multithreaded MPI communication

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Grant, Ryan; Dosanjh, Matthew G.; Levenhagen, Michael; Brightwell, Ronald B.; Skjellum, Anthony

The MPI multithreading model has been historically difficult to optimize; the interface that it provides for threads was designed as a process-level interface. This model has led to implementations that treat function calls as critical regions and protect them with locks to avoid race conditions. We hypothesize that an interface designed specifically for threads can provide superior performance than current approaches and even outperform single-threaded MPI. In this paper, we describe a design for partitioned communication in MPI that we call finepoints. First, we assess the existing communication models for MPI two-sided communication and then introduce finepoints as a hybrid of MPI models that has the best features of each existing MPI communication model. In addition, “partitioned communication” created with finepoints leverages new network hardware features that cannot be exploited with current MPI point-to-point semantics, making this new approach both innovative and useful both now and in the future. To demonstrate the validity of our hypothesis, we implement a finepoints library and show improvements against a state-of-the-art multithreaded optimized Open MPI implementation on a Cray XC40 with an Aries network. Our experiments demonstrate upÂ to a 12 × reduction in wait time for completion of send operations. This new model is shown working on a nuclear reactor physics neutron-transport proxy-application, providing upÂ to 26.1% improvement in communication time and upÂ to 4.8% improvement in runtime over the best performing MPI communication mode, single-threaded MPI.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

The case for semi-permanent cache occupancy

ACM International Conference Proceeding Series

Dosanjh, Matthew G.; Ghazimirsaeed, S.M.; Grant, Ryan; Schonbein, William W.; Levenhagen, Michael; Bridges, Patrick G.; Afsahi, Ahmad

The performance critical path for MPI implementations relies on fast receive side operation, which in turn requires fast list traversal. The performance of list traversal is dependent on data-locality; whether the data is currently contained in a close-to-core cache due to its temporal locality or if its spacial locality allows for predictable pre-fetching. In this paper, we explore the effects of data locality on the MPI matching problem by examining both forms of locality. First, we explore spacial locality, by combining multiple entries into a single linked list element, we can control and modify this form of locality. Secondly, we explore temporal locality by utilizing a new technique called “hot caching”, a process that creates a thread to periodically access certain data, increasing its temporal locality. In this paper, we show that by increasing data locality, we can improve MPI performance on a variety of architectures up to 4x for micro-benchmarks and up to 2x for an application.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus

The Upcoming Storm: The Implications of Increasing Core Count on Scalable System Software

Dosanjh, Matthew G.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

Measuring Multithreaded Message Matching Misery

Schonbein, William W.; Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI

The Case for Semi-Permanent Cache Occupancy

Dosanjh, Matthew G.; Ghazimirsaeed, S.M.; Grant, Ryan; Schonbein, William W.; Levenhagen, Michael; Bridges, Patrick G.; Afsahi, Ahmad

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI

Improving MPI Multi-threaded RMA Performance

Hjelm, Nathan; Dosanjh, Matthew G.; Groves, Taylor; Grant, Ryan; Brightwell, Ronald B.; Bridges, Patrick; Arnold, Dorian

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI

Measuring Multithreaded Message Matching Misery

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Schonbein, William W.; Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick G.

MPI usage patterns are changing as applications move towards fully-multithreaded runtimes. However, the impact of these patterns on MPI message matching is not well-studied. In particular, MPI’s mechanic for receiver-side data placement, message matching, can be impacted by increased message volume and nondeterminism incurred by multithreading. While there has been significant developer interest and work to provide an efficient MPI interface for multithreaded access, there has not been a study showing how these patterns affect messaging patterns and matching behavior. In this paper, we present a framework for studying the effects of multithreading on MPI message matching. This framework allows us to explore the implications of different common communication patterns and thread-level decompositions. We present a study of these impacts on the architecture of two of the Top 10 supercomputers (NERSC’s Cori and LANL’s Trinity). This data provides a baseline to evaluate reasonable matching engine queue lengths, search depths, and queue drain times under the multithreaded model. Furthermore, the study highlights surprising results on the challenge posed by message matching for multithreaded application performance.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus