CMB: A Configurable Messaging Benchmark to Explore Fine-Grained Communication

Marts, William P.; Schonbein, William W.; Dosanjh, Matthew G.; Levy, Scott L.N.; Bridges, Patrick

Abstract not provided.

TYPE Conference Paper YEAR 2024

Leveraging High-Performance Data Transfer to Offload Data Management Tasks to SmartNICs

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Levy, Scott L.N.; Schonbein, William W.; Ulmer, Craig

Network interface controllers (NICs) with general-purpose compute capabilities ('SmartNICs') present an opportunity for reducing host application overheads by offloading non-critical tasks to the NIC. In addition to moving computation, offloading requires that associated data is also transferred to the NIC. To meet this need, we introduce a high-performance, general-purpose data movement service that facilitates the offloading of tasks to SmartNICs: The SmartNIC Data Movement Service (SDMS). SDMS provides near-line-rate transfer bandwidths between the host and NIC. Moreover, SDMS's In-transit Data Placement (IDP) feature can reduce (or even eliminate) the cost of serializing data on the NIC by performing the necessary data formatting during the transfer. To illustrate these capabilities, we provide an in-depth case study using SDMS to offload data management operations related to Apache Arrow, a popular data format standard. For single-column tables, SDMS can achieve more than 87% of baseline throughput for data buffers that are 128 KiB or larger (and more than 95% of baseline throughput for buffers that are 1 MiB or larger) while also nearly eliminating the host and SmartNIC overhead associated with Arrow operations.

TYPE Conference Paper YEAR 2024

“Smarter” NICs for faster algorithms [Slides]

Karamati, Sara; Young, Jeffrey L.; Vuduc, Rich; Hemmert, Karl S.; Schonbein, William W.; Siefert, Christopher; Levy, Scott L.N.; Hughes, Clayton

The basic building block of a distributed-memory cluster or supercomputer is a node. Each node includes a host, which is a processor (xPU) + memory hierarchy. The host can communicate with other hosts via its NIC (network interface controller). A network connects the nodes. The nodes may be arranged in some topology, which determines the network’s carrying capacity and cost.

TYPE Other Report YEAR 2023

Measuring Thread Timing to Assess the Feasibility of Early-bird Message Delivery

ACM International Conference Proceeding Series

Marts, William P.; Dosanjh, Matthew G.; Schonbein, William W.; Levy, Scott L.N.; Bridges, Patrick G.

Early-bird communication is a communication/computation overlap technique that combines fine-grained communication with partitioned communication to improve application run-time. Communication is divided among the compute threads such that each individual thread can initiate transmission of its portion of the data as soon as it is complete rather than waiting for all of the threads. However, the benefit of early-bird communication depends on the completion timing of the individual threads. In this paper, we measure and evaluate the potential overlap, i.e., the idle time each thread experiences between finishing its computation and the final thread finishing. These measurements help us understand whether a given application could benefit from early-bird communication. We present our technique for gathering this data and evaluate data collected from three proxy applications: MiniFE, MiniMD, and MiniQMC. To characterize the behavior of these workloads, we study the thread timings at both a macro level, i.e., across all threads across all runs of an application, and a micro level, i.e., within a single process of a single run. We observe that these applications exhibit significantly different behavior. While MiniFE and MiniQMC appear to be well-suited for early-bird communication because of their wider thread distribution and more frequent laggard threads, the behavior of MiniMD may limit its ability to leverage early-bird communication.
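
The potential-overlap metric described in this abstract can be illustrated with a minimal sketch. It assumes that each thread's finish time within one compute region has already been recorded. The function name and the sample timings are hypothetical and are not taken from the paper's measurement tooling.

```python
from statistics import mean


def potential_overlap(finish_times):
    """Per-thread idle time: the gap between each thread's finish
    and the finish of the last (laggard) thread in the region."""
    last = max(finish_times)
    return [last - t for t in finish_times]


# Hypothetical finish times (seconds) for four compute threads.
finish_times = [1.0, 1.5, 2.0, 4.0]
idle = potential_overlap(finish_times)
print(idle)        # idle time per thread; the laggard's entry is 0.0
print(mean(idle))  # average time available for early-bird overlap
```

A wide spread of idle times suggests room for early-bird transmission. Values near zero for every thread suggest little to gain.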

TYPE Conference Proceeding YEAR 2023

Modeling and Benchmarking the Potential Benefit of Early-Bird Transmission in Fine-Grained Communication

ACM International Conference Proceeding Series

Schonbein, William W.; Levy, Scott L.N.; Dosanjh, Matthew G.; Marts, William P.; Reid, Elizabeth; Grant, Ryan E.

Traditional point-to-point communication sends data only after the entirety of the data is available. This includes situations where multiple actors (e.g., threads) contribute to the send buffer. As a result, cases where the completion times of these actors are widely distributed may be lost opportunities for optimization because data ready to be sent is waiting to be transmitted. Fine-grained communication exposes these opportunities by allowing buffers to be divided into elements that can then be sent independently (see e.g., Partitioned Communication in Message Passing Interface v4.0). While some research has been directed at exploring the utility of such 'early-bird' transmission, the overall search space for finding the best performing actor completion timings and element counts is large. In this work, we present an abstract model of fine-grained communication based on the LogGP model and a complementary benchmark. We use the model to explore actor completion timing scenarios and identify trends in communication behavior based on factors such as overall message size and delay between actor completions. We evaluate the benchmarks on three systems utilizing distinct network technologies and show that: (i) smaller numbers of elements are able to exploit most of the benefit of early-bird communication, (ii) performance benefit will depend non-trivially on application behavior, and (iii) benefits are highly network-dependent.
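
The trade-off this abstract describes can be explored with a rough LogGP-style sketch. This is not the paper's model. It assumes equal-sized partitions, a sender that injects one message at a time, and the standard LogGP cost of `o + (k - 1) * G + L + o` for a `k`-byte message. All names and parameter values are hypothetical.

```python
from typing import NamedTuple


class LogGP(NamedTuple):
    L: int  # network latency
    o: int  # per-message CPU overhead (send or receive)
    g: int  # minimum gap between consecutive message injections
    G: int  # gap per byte for long messages


def single_send_time(ready, total_bytes, p):
    """Send the whole buffer once the last actor has finished."""
    start = max(ready)
    return start + p.o + (total_bytes - 1) * p.G + p.L + p.o


def early_bird_time(ready, total_bytes, p):
    """Send each actor's partition as soon as it is ready (assumed model)."""
    part = total_bytes // len(ready)   # assumes an even split
    inject = p.o + (part - 1) * p.G    # time the sender is busy per partition
    next_free = 0
    done = 0
    for t in sorted(ready):
        start = max(t, next_free)      # wait for the data and for the sender
        next_free = start + max(p.g, inject)
        done = max(done, start + inject + p.L + p.o)
    return done


params = LogGP(L=1000, o=500, g=500, G=1)  # hypothetical values, in ns
spread = [0, 1000, 2000, 50000]            # one laggard actor
print(single_send_time(spread, 4000, params), early_bird_time(spread, 4000, params))
print(single_send_time([0] * 4, 4000, params), early_bird_time([0] * 4, 4000, params))
```

Under these assumptions the partitioned send finishes earlier when one actor lags far behind. When all actors finish together, the extra per-message overhead makes it slower.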

TYPE Conference Presentation YEAR 2023

The Portals 4.3 Network Programming Interface

Schonbein, William W.; Barrett, Brian W.; Brightwell, Ronald B.; Grant, Ryan E.; Hemmert, Karl S.; Foulk, James W.; Underwood, Keith; Riesen, Rolf; Hoefler, Torsten; Barbe, Mathieu; Suraty Filho, Luiz H.; Ratchov, Alexandre; Maccabe, Arthur B.

This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaptation of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

TYPE SAND Report YEAR 2022

'Smarter' NICs for faster molecular dynamics: a case study

Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022

Karamati, Sara; Hughes, Clayton; Hemmert, Karl S.; Grant, Ryan E.; Schonbein, William W.; Levy, Scott L.N.; Conte, Thomas M.; Young, Jeffrey; Vuduc, Richard W.

This work evaluates the benefits of using a 'smart' network interface card (SmartNIC) as a compute accelerator for the example of the MiniMD molecular dynamics proxy application. The accelerator is NVIDIA's BlueField-2 card, which includes an 8-core Arm processor along with a small amount of DRAM and storage. We test the networking and data movement performance of these cards compared to a standard Intel server host using microbenchmarks and MiniMD. In MiniMD, we identify two distinct classes of computation, namely core computation and maintenance computation, which are executed in sequence. We restructure the algorithm and code to weaken this dependence and increase task parallelism, thereby making it possible to increase utilization of the BlueField-2 concurrently with the host. We evaluate our implementation on a cluster consisting of 16 dual-socket Intel Broadwell host nodes with one BlueField-2 per host-node. Our results show that while the overall compute performance of BlueField-2 is limited, using them with a modified MiniMD algorithm allows for up to 20% speedup over the host CPU baseline with no loss in simulation accuracy.

TYPE Conference Proceeding YEAR 2022

From naïve to smart: leveraging offloaded capabilities to enable intelligent NICs

Schonbein, William W.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

MiniMod: A Modular Miniapplication Benchmarking Framework for HPC

Marts, William P.; Dosanjh, Matthew G.; Schonbein, William W.; Levy, Scott L.N.; Grant, Ryan; Bridges, Patrick

Abstract not provided.

TYPE Conference Paper YEAR 2021

Co-design of System Software for Compute Accelerators and SmartNICs

Grant, Ryan; Levy, Scott L.N.; Schonbein, William W.

Abstract not provided.

TYPE Conference Paper YEAR 2021

Low-cost MPI Multithreaded Message Matching Benchmarking

Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020

Schonbein, William W.; Levy, Scott L.N.; Marts, William P.; Dosanjh, Matthew G.; Grant, Ryan

The Message Passing Interface (MPI) standard allows user-level threads to concurrently call into an MPI library. While this feature is currently rarely used, there is considerable interest from developers in adopting it in the near future. There is reason to believe that multithreaded communication may incur additional message processing overheads in terms of number of items searched during demultiplexing and amount of time spent searching because it has the potential to increase the number of messages exchanged and to introduce non-deterministic message ordering. Therefore, understanding the implications of adding multithreading to MPI applications is important for future application development. One strategy for advancing this understanding is through 'low-cost' benchmarks that emulate full communication patterns using fewer resources. For example, while a complete, 'real-world' multithreaded halo exchange requires 9 or 27 nodes, the low-cost alternative needs only two, making it deployable on systems where acquiring resources is difficult because of high utilization (e.g., busy capacity-computing systems), or impossible because the necessary resources do not exist (e.g., testbeds with too few nodes). While such benchmarks have been proposed, the reported results have been limited to a single architecture or derived indirectly through simulation, and no attempt has been made to confirm that a low-cost benchmark accurately captures features of full (non-emulated) exchanges. Moreover, benchmark code has not been made publicly available. The purpose of the study presented in this paper is to quantify how accurately the low-cost benchmark captures the matching behavior of the full, real-world benchmark. In the process, we also advocate for the feasibility and utility of the low-cost benchmark. We present a 'real-world' benchmark implementing a full multithreaded halo exchange on 9 and 27 nodes, as defined by 5-point and 9-point 2D stencils, and 7-point and 27-point 3D stencils. Likewise, we present a 'low-cost' benchmark that emulates these communication patterns using only two nodes. We then confirm, across multiple architectures, that the low-cost benchmark gives accurate estimates of both number of items searched during message processing, and time spent processing those messages. Finally, we demonstrate the utility of the low-cost benchmark by using it to profile the performance impact of state-of-the-art Mellanox ConnectX-5 hardware support for offloaded MPI message demultiplexing. To facilitate further research on the effects of multithreaded MPI on message matching behavior, the source of our two benchmarks is to be included in the next release version of the Sandia MPI Micro-Benchmark Suite.
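
The stencil names in this abstract determine how many neighbors the center process exchanges halos with. The following sketch enumerates them. It is an illustration under the usual definition of these stencils, not code from the benchmark, and the function name is hypothetical.

```python
from itertools import product


def stencil_neighbors(dims, full):
    """Neighbor offsets of the center process in a halo exchange.

    full=False keeps face neighbors only (5-point 2D, 7-point 3D).
    full=True also keeps edge and corner neighbors (9-point 2D, 27-point 3D).
    """
    offsets = []
    for off in product((-1, 0, 1), repeat=dims):
        nonzero = sum(1 for c in off if c != 0)
        if nonzero == 0:
            continue  # skip the center itself
        if full or nonzero == 1:
            offsets.append(off)
    return offsets


for dims, full, name in [(2, False, "5-point 2D"), (2, True, "9-point 2D"),
                         (3, False, "7-point 3D"), (3, True, "27-point 3D")]:
    print(name, len(stencil_neighbors(dims, full)), "neighbors")
```

The center plus its 8 neighbors in the 9-point 2D case gives 9 processes, and the center plus 26 neighbors in the 27-point 3D case gives 27. These totals are consistent with the node counts quoted in the abstract.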

TYPE Conference Paper YEAR 2020

Low-cost MPI Multithreaded Message Matching Benchmarking

Schonbein, William W.; Grant, Ryan; Levy, Scott L.N.; Dosanjh, Matthew G.; Marts, William P.

Abstract not provided.

TYPE Conference Presentation YEAR 2020

Intelligent Networks for High Performance Computing

Schonbein, William W.

Abstract not provided.

TYPE Presentation YEAR 2020

RaDD Runtimes: Radical and Different Distributed Runtimes with SmartNICs

Grant, Ryan; Schonbein, William W.; Levy, Scott L.N.

Abstract not provided.

TYPE Conference Presentation YEAR 2020

Tail queues: A multi-threaded matching architecture

Concurrency and Computation: Practice and Experience

Dosanjh, Matthew G.; Grant, Ryan; Schonbein, William W.; Bridges, Patrick G.

As we approach exascale, computational parallelism will have to drastically increase in order to meet throughput targets. Many-core architectures have exacerbated this problem by trading reduced clock speeds, core complexity, and computation throughput for increasing parallelism. This presents two major challenges for communication libraries such as MPI: the library must leverage the performance advantages of thread-level parallelism and avoid the scalability problems associated with increasing the number of processes to that scale. Hybrid programming models, such as MPI+X, have been proposed to address these challenges. MPI_THREAD_MULTIPLE is MPI's thread-safe mode. While there has been work to optimize it, it largely remains non-performant in most implementations. While current applications avoid MPI multithreading due to performance concerns, it is expected to be utilized in future applications. One of the major synchronous data structures required by MPI is the matching engine. In this paper, we present a parallel matching algorithm that can improve MPI matching for multithreaded applications. We then perform a feasibility study to demonstrate the performance benefit of the technique.

TYPE Journal Article YEAR 2020

Intelligent High-Performance Networks Via INCA

Grant, Ryan; Schonbein, William W.

Abstract not provided.

TYPE Presentation YEAR 2019

INCA: In-Network Compute Assistance

Schonbein, William W.; Grant, Ryan; Dosanjh, Matthew G.; Arnold, Dorian

Abstract not provided.

TYPE Conference Poster YEAR 2019

MPI Tag Matching Performance on ConnectX and ARM

Marts, William P.; Dosanjh, Matthew G.; Schonbein, William W.; Grant, Ryan; Bridges, Patrick

Abstract not provided.

TYPE Conference Poster YEAR 2019

Fuzzy matching: Hardware accelerated MPI communication middleware

Proceedings - 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019

Dosanjh, Matthew G.; Schonbein, William W.; Grant, Ryan; Bridges, Patrick G.; Ghazimirsaeed, S.M.; Afsahi, Ahmad

Contemporary parallel scientific codes often rely on message passing for inter-process communication. However, inefficient coding practices or multithreading (e.g., via MPI_THREAD_MULTIPLE) can severely stress the underlying message processing infrastructure, resulting in potentially unacceptable impacts on application performance. In this article, we propose and evaluate a novel method for addressing this issue: 'Fuzzy Matching'. This approach has two components. First, it exploits the fact that most server-class CPUs include vector operations to parallelize message matching. Second, based on a survey of point-to-point communication patterns in representative scientific applications, the method further increases parallelization by allowing matches based on 'partial truth', i.e., by identifying probable rather than exact matches. We evaluate the impact of this approach on memory usage and performance on Knights Landing and Skylake processors. At scale (262,144 Intel Xeon Phi cores), the method shows up to 1.13 GiB of memory savings per node in the MPI library, and improvement in matching time of 95.9%; smaller-scale runs show run-time improvements of up to 31.0% for full applications, and up to 6.1% for optimized proxy applications.

TYPE Conference Poster YEAR 2019

Fuzzy Matching: Accelerating MPI Matching with Vector Comparisons Based on Partial Truth

Dosanjh, Matthew G.; Schonbein, William W.; Grant, Ryan; Bridges, Patrick; Ghazimirsaeed, Mahdieh; Afsahi, Ahmad

Abstract not provided.

TYPE Conference Poster YEAR 2019

The upcoming storm: The implications of increasing core count on scalable system software

Advances in Parallel Computing

Dosanjh, Matthew G.; Grant, Ryan; Hjelm, Nathan; Levy, Scott L.N.; Schonbein, William W.

As clock speeds have stagnated, the number of cores in a node has been drastically increased to improve processor throughput. Most scalable system software was designed and developed for single-threaded environments. Multithreaded environments become increasingly prominent as application developers optimize their codes to leverage the full performance of the processor; however, these environments are incompatible with a number of assumptions that have driven scalable system software development. This paper will highlight a case study of this mismatch focusing on MPI message matching. MPI message matching has been designed and optimized for traditional serial execution. The reduced determinism in the order of MPI calls can significantly reduce the performance of MPI message matching, potentially overtaking time-per-iteration targets of many applications. Different proposed techniques attempt to address these issues and enable multithreaded MPI usage. These approaches highlight a number of tradeoffs that make adapting MPI message matching complex. This case study and its proposed solutions highlight a number of general concepts that need to be leveraged in the design of next generation scalable system software.

TYPE Book YEAR 2019

The case for semi-permanent cache occupancy

ACM International Conference Proceeding Series

Dosanjh, Matthew G.; Ghazimirsaeed, S.M.; Grant, Ryan; Schonbein, William W.; Levenhagen, Michael; Bridges, Patrick G.; Afsahi, Ahmad

The performance critical path for MPI implementations relies on fast receive side operation, which in turn requires fast list traversal. The performance of list traversal is dependent on data locality: whether the data is currently contained in a close-to-core cache due to its temporal locality, or whether its spatial locality allows for predictable pre-fetching. In this paper, we explore the effects of data locality on the MPI matching problem by examining both forms of locality. First, we explore spatial locality: by combining multiple entries into a single linked list element, we can control and modify this form of locality. Second, we explore temporal locality by utilizing a new technique called “hot caching”, a process that creates a thread to periodically access certain data, increasing its temporal locality. In this paper, we show that by increasing data locality, we can improve MPI performance on a variety of architectures up to 4x for micro-benchmarks and up to 2x for an application.
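
The idea of combining multiple entries into a single linked list element can be shown as a data-structure sketch. It illustrates the layout only. Python does not expose the cache behavior the paper measures, wildcard matching is omitted, and every name here is hypothetical rather than drawn from any MPI implementation.

```python
class _Node:
    """One list element holding several match entries."""

    def __init__(self):
        self.entries = []  # (source, tag, payload) tuples, in arrival order
        self.next = None


class UnrolledMatchList:
    def __init__(self, entries_per_node=4):
        self.entries_per_node = entries_per_node
        self.head = _Node()
        self.tail = self.head

    def append(self, source, tag, payload):
        # Start a new element only when the tail element is full.
        if len(self.tail.entries) == self.entries_per_node:
            node = _Node()
            self.tail.next = node
            self.tail = node
        self.tail.entries.append((source, tag, payload))

    def match(self, source, tag):
        """Remove and return the oldest payload matching (source, tag)."""
        node = self.head
        while node is not None:
            for i, (s, t, payload) in enumerate(node.entries):
                if s == source and t == tag:
                    del node.entries[i]
                    return payload
            node = node.next
        return None


q = UnrolledMatchList(entries_per_node=2)
q.append(0, 1, "a")
q.append(1, 1, "b")
q.append(0, 1, "c")
print(q.match(0, 1), q.match(0, 1), q.match(0, 1))
```

Packing more entries into each element means fewer pointer hops per traversal. In a language with contiguous storage, that is the spatial-locality effect the abstract refers to.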

TYPE Conference Poster YEAR 2018