Publications Search

Low-cost MPI Multithreaded Message Matching Benchmarking

Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020

Schonbein, Whit; Levy, Scott; Marts, William P.; Dosanjh, Matthew G.F.; Grant, Ryan

The Message Passing Interface (MPI) standard allows user-level threads to concurrently call into an MPI library. While this feature is currently rarely used, there is considerable interest from developers in adopting it in the near future. There is reason to believe that multithreaded communication may incur additional message processing overheads in terms of number of items searched during demultiplexing and amount of time spent searching because it has the potential to increase the number of messages exchanged and to introduce non-deterministic message ordering. Therefore, understanding the implications of adding multithreading to MPI applications is important for future application development. One strategy for advancing this understanding is through 'low-cost' benchmarks that emulate full communication patterns using fewer resources. For example, while a complete, 'real-world' multithreaded halo exchange requires 9 or 27 nodes, the low-cost alternative needs only two, making it deployable on systems where acquiring resources is difficult because of high utilization (e.g., busy capacity-computing systems), or impossible because the necessary resources do not exist (e.g., testbeds with too few nodes). While such benchmarks have been proposed, the reported results have been limited to a single architecture or derived indirectly through simulation, and no attempt has been made to confirm that a low-cost benchmark accurately captures features of full (non-emulated) exchanges. Moreover, benchmark code has not been made publicly available. The purpose of the study presented in this paper is to quantify how accurately the low-cost benchmark captures the matching behavior of the full, real-world benchmark. In the process, we also advocate for the feasibility and utility of the low-cost benchmark.
We present a 'real-world' benchmark implementing a full multithreaded halo exchange on 9 and 27 nodes, as defined by 5-point and 9-point 2D stencils, and 7-point and 27-point 3D stencils. Likewise, we present a 'low-cost' benchmark that emulates these communication patterns using only two nodes. We then confirm, across multiple architectures, that the low-cost benchmark gives accurate estimates of both the number of items searched during message processing and the time spent processing those messages. Finally, we demonstrate the utility of the low-cost benchmark by using it to profile the performance impact of state-of-the-art Mellanox ConnectX-5 hardware support for offloaded MPI message demultiplexing. To facilitate further research on the effects of multithreaded MPI on message matching behavior, the source of our two benchmarks is to be included in the next release version of the Sandia MPI Micro-Benchmark Suite.

TYPE Conference Paper YEAR 2020

DOI OSTI Scopus

RaDD runtimes: Radical and different distributed runtimes with SmartNICs

Proceedings of IPDRM 2020: 4th Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Grant, Ryan; Schonbein, Whit; Levy, Scott

As network speeds increase, the overhead of processing incoming messages is becoming onerous enough that many manufacturers now provide network interface cards (NICs) with offload capabilities to handle these overheads. This increase in NIC capabilities creates an opportunity to enable computation on data in-situ on the NIC. These enhanced NICs can be classified into several different categories of SmartNICs. SmartNICs present an interesting opportunity for future runtime software designs. Designing runtime software to be located in the network as opposed to the host level leads to new radical distributed runtime possibilities that were not practical prior to SmartNICs. In the process of transitioning to a radically different runtime software design for SmartNICs there are intermediary steps of migrating current runtime software to be offloaded onto a SmartNIC that also present interesting possibilities. This paper will describe SmartNIC design and how SmartNICs can be leveraged to offload current generation runtime software and lead to future radically different in-network distributed runtime systems.

TYPE Conference Paper YEAR 2020

OSTI Scopus

Low-cost MPI Multithreaded Message Matching Benchmarking

Schonbein, Whit; Grant, Ryan; Levy, Scott; Dosanjh, Matthew G.F.; Marts, William P.

Abstract not provided.

TYPE Conference Presentation YEAR 2020

DOI OSTI

RaDD Runtimes: Radical and Different Distributed Runtimes with SmartNICs

Grant, Ryan; Schonbein, Whit; Levy, Scott

Abstract not provided.

TYPE Conference Presentation YEAR 2020

DOI OSTI

Evaluating MPI Message Size Summary Statistics

Levy, Scott; Ferreira, Kurt B.

Abstract not provided.

TYPE Conference Proceeding YEAR 2020

OSTI

FY20 CSSE L2 Milestone 7186

Templet Jr., Gary J.; Glickman, Matthew R.; Kordenbrock, Todd; Levy, Scott; Lofstead, Gerald (Jay) F.; Mauldin, Jeff; Otahal, Thomas J.; Ulmer, Craig; Widener, Patrick; Oldfield, Ron

Abstract not provided.

TYPE Presentation YEAR 2020

OSTI

Data Services for Visualization and Analysis - ASC Level II Milestone (7186)

Templet Jr., Gary J.; Glickman, Matthew R.; Kordenbrock, Todd; Levy, Scott; Lofstead, Gerald (Jay) F.; Mauldin, Jeff; Otahal, Thomas J.; Ulmer, Craig; Widener, Patrick; Oldfield, Ron

A new in transit Data Service is presented and compared to the traditional file-based workflow and the newly refactored in situ Catalyst workflow. Each workflow is enabled by the IOSS mesh interface equipped with data management layers for Exodus and CGNS (file-based), Catalyst (in situ), and FAODEL (in transit). FAODEL is a distributed object store that can transmit data across MPI allocations. Catalyst is a ParaView-based visualization capability developed as part of the CSSE Data Services effort. The workflows considered here take SPARC data into Catalyst for visualization post-processing. Although still in unoptimized form, we show that the in transit approach is a viable alternative to file-based and in situ workflows and offers several advantages to both simulation and post-processing developers. Since IOSS is a mature interface with wide adoption across Sandia and externally, each workflow can be reconfigured to use different simulations that generate mesh data and post-processing tools that consume it.

TYPE SAND Report YEAR 2020

DOI OSTI

The Case for Explicit Reuse Semantics for RDMA Communication

Levy, Scott; Widener, Patrick; Ulmer, Craig; Kordenbrock, Todd

Abstract not provided.

TYPE Conference Poster YEAR 2020

DOI OSTI

Space-Efficient Reed-Solomon Encoding to Detect and Correct Pointer Corruption

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Levy, Scott; Ferreira, Kurt B.

Concern about memory errors has been widespread in high-performance computing (HPC) for decades. These concerns have led to significant research on detecting and correcting memory errors to improve performance and provide strong guarantees about the correctness of the memory contents of scientific simulations. However, power concerns and changes in memory architectures threaten the viability of current approaches to protecting memory (e.g., Chipkill). Returning to less protective error-correcting codes (ECC), e.g., single-error correction, double-error detection (SECDED), may increase the frequency of memory errors, including silent data corruption (SDC). SDC has the potential to silently cause applications to produce incorrect results and mislead domain scientists. We propose an approach for exploiting unnecessary bits in pointer values to support encoding the pointer with a Reed-Solomon code. Encoding the pointer allows us to provide strong capabilities for correcting and detecting corruption of pointer values. In this paper, we provide a detailed description of how we can exploit unnecessary pointer bits to store Reed-Solomon parity symbols. We evaluate the performance impacts of this approach and examine the effectiveness of the approach against corruption. Our results demonstrate that encoding and decoding are fast (less than 45 per event) and that the protection it provides is robust (the rate of miscorrection is less than 5% even for significant corruption). The data and analysis presented in this paper demonstrate the power of our approach. It is fast, tunable, requires no additional per-pointer storage resources, and provides robust protection against pointer corruption.

TYPE Conference Poster YEAR 2020

OSTI Scopus

Evaluating tradeoffs between MPI message matching offload hardware capacity and performance

ACM International Conference Proceeding Series

Levy, Scott; Ferreira, Kurt B.

Although its demise has been frequently predicted, the Message Passing Interface (MPI) remains the dominant programming model for scientific applications running on high-performance computing (HPC) systems. MPI specifies powerful semantics for interprocess communication that have enabled scientists to write applications for simulating important physical phenomena. However, these semantics have also presented several significant challenges. For example, the existence of wildcard values has made the efficient enforcement of MPI message matching semantics challenging. Significant research has been dedicated to accelerating MPI message matching. One common approach has been to offload matching to dedicated hardware. One of the challenges that hardware designers have faced is knowing how to size hardware structures to accommodate outstanding match requests. Applications that exceed the capacity of specialized hardware typically must fall back to storing match requests in bulk memory, e.g., DRAM on the host processor. In this paper, we examine the implications of hardware matching and develop guidance on sizing hardware matching structures to strike a balance between minimizing expensive dedicated hardware resources and overall matching performance. By examining the message matching behavior of several important HPC workloads, we show that when specialized hardware matching is not dramatically faster than matching in memory, the offload hardware's match queue capacity can be reduced without significantly increasing match time. On the other hand, effectively exploiting the benefits of very fast specialized matching hardware requires sufficient storage resources to ensure that every search completes in the specialized hardware. The data and analysis in this paper provide important guidance for designers of MPI message matching hardware.

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

Using simulation to examine the effect of MPI message matching costs on application performance

Parallel Computing

Levy, Scott; Ferreira, Kurt B.; Schonbein, Whit; Grant, Ryan; Dosanjh, Matthew G.F.

Attaining high performance with MPI applications requires efficient message matching to minimize message processing overheads and the latency these overheads introduce into application communication. In this paper, we use a validated simulation-based approach to examine the relationship between MPI message matching performance and application time-to-solution. Specifically, we examine how the performance of several important HPC workloads is affected by the time required for matching. Our analysis yields several important contributions: (i) the performance of current workloads is unlikely to be significantly affected by MPI matching unless match queue operations get much slower or match queues get much longer; (ii) match queue designs that provide sublinear performance as a function of queue length are unlikely to yield much benefit unless match queue lengths increase dramatically; and (iii) we provide guidance on how long the mean time per match attempt may be without significantly affecting application performance. The results and analysis in this paper provide valuable guidance on the design and development of MPI message match queues.

TYPE Journal Article YEAR 2019

DOI OSTI Scopus

Lessons learned from memory errors observed over the lifetime of Cielo

Levy, Scott; Ferreira, Kurt B.; Siddiqua, Taniya; Debardeleben, Nathan; Sridharan, Vilas; Baseman, Elisabeth

Abstract not provided.

TYPE Conference Poster YEAR 2019

DOI OSTI

Hardware MPI message matching: Insights into MPI matching behavior to inform design

Concurrency and Computation: Practice and Experience

Ferreira, Kurt B.; Grant, Ryan; Levenhagen, Michael; Levy, Scott; Groves, Taylor

This paper explores key differences of MPI match lists for several important United States Department of Energy (DOE) applications and proxy applications. This understanding is critical in determining the most promising hardware matching design for any given high-speed network. The results of MPI match list studies for the major open-source MPI implementations, MPICH and Open MPI, are presented, and we modify an MPI simulator, LogGOPSim, to provide match list statistics. These results are discussed in the context of several different potential design approaches to MPI matching-capable hardware. The data illustrate the requirements for different hardware designs in terms of performance and memory capacity. Finally, this paper's contributions are the collection and analysis of data to help inform hardware designers of common MPI requirements and highlight the difficulties in determining these requirements by only examining a single MPI implementation.

TYPE Journal Article YEAR 2019

DOI OSTI

The upcoming storm: The implications of increasing core count on scalable system software

Advances in Parallel Computing

Dosanjh, Matthew G.F.; Grant, Ryan; Hjelm, Nathan; Levy, Scott; Schonbein, Whit

As clock speeds have stagnated, the number of cores in a node has been drastically increased to improve processor throughput. Most scalable system software was designed and developed for single-threaded environments. Multithreaded environments are becoming increasingly prominent as application developers optimize their codes to leverage the full performance of the processor; however, these environments are incompatible with a number of assumptions that have driven scalable system software development. This paper will highlight a case study of this mismatch focusing on MPI message matching. MPI message matching has been designed and optimized for traditional serial execution. The reduced determinism in the order of MPI calls can significantly reduce the performance of MPI message matching, potentially overtaking time-per-iteration targets of many applications. Different proposed techniques attempt to address these issues and enable multithreaded MPI usage. These approaches highlight a number of tradeoffs that make adapting MPI message matching complex. This case study and its proposed solutions highlight a number of general concepts that need to be leveraged in the design of next-generation scalable system software.

TYPE Book YEAR 2019

OSTI Scopus

Mediating Data Center Storage Diversity in HPC Applications with FAODEL

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick; Ulmer, Craig; Levy, Scott; Kordenbrock, Todd; Templet, Gary J.

Composition of computational science applications into both ad hoc pipelines for analysis of collected or generated data and into well-defined and repeatable workflows is becoming increasingly popular. Meanwhile, dedicated high performance computing storage environments are rapidly becoming more diverse, with both significant amounts of non-volatile memory storage and mature parallel file systems available. At the same time, computational science codes are being coupled to data analysis tools which are not filesystem-oriented. In this paper, we describe how the FAODEL data management service can expose different available data storage options and mediate among them in both application- and FAODEL-directed ways. These capabilities allow applications to exploit their knowledge of the different types of data they may exchange during a workflow execution, and also provide FAODEL with mechanisms to proactively tune data storage behavior when appropriate. We describe the implementation of these capabilities in FAODEL and how they are used by applications, and present preliminary performance results demonstrating the potential benefits of our approach.

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

SNL ATDM Software Ecosystem

Olivier, Stephen L.; Brightwell, Ronald B.; Bays, Nathan R.; Younge, Andrew J.; Evans, Noah; Levy, Scott; Ferreira, Kurt B.; Grant, Ryan

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

Using simulation to examine the effect of MPI message matching costs on application performance

ACM International Conference Proceeding Series

Levy, Scott; Ferreira, Kurt B.

Attaining high performance with MPI applications requires efficient message matching to minimize message processing overheads and the latency these overheads introduce into application communication. In this paper, we use a validated simulation-based approach to examine the relationship between MPI message matching performance and application time-to-solution. Specifically, we examine how the performance of several important HPC workloads is affected by the time required for matching. Our analysis yields several important contributions: (i) the performance of current workloads is unlikely to be significantly affected by MPI matching unless match queue operations get much slower or match queues get much longer; (ii) match queue designs that provide sublinear performance as a function of queue length are unlikely to yield much benefit unless match queue lengths increase dramatically; and (iii) we provide guidance on how long the mean time per match attempt may be without significantly affecting application performance. The results and analysis in this paper provide valuable guidance on the design and development of MPI message match queues.

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus

ASC ATDM Level 2 Milestone #6358: Assess Status of Next Generation Components and Physics Models in EMPIRE

Bettencourt, Matthew T.; Kramer, Richard M.J.; Cartwright, Keith L.; Phillips, Edward; Ober, Curtis C.; Pawlowski, Roger; Swan, Matthew S.; Tezaur, Irina K.; Phipps, Eric T.; Conde, Sidafa; Cyr, Eric C.; Ulmer, Craig; Kordenbrock, Todd; Levy, Scott; Templet, Gary J.; Hu, Jonathan J.; Lin, Paul T.; Glusa, Christian; Siefert, Christopher; Glass, Micheal W.

This report documents the outcome from the ASC ATDM Level 2 Milestone 6358: Assess Status of Next Generation Components and Physics Models in EMPIRE. This Milestone is an assessment of the EMPIRE (ElectroMagnetic Plasma In Realistic Environments) application and three software components. The assessment focuses on the electromagnetic and electrostatic particle-in-cell solutions for EMPIRE and its associated solver, time integration, and checkpoint-restart components. This information provides a clear understanding of the current status of the EMPIRE application and will help to guide future work in FY19 in order to ready the application for the ASC ATDM L1 Milestone in FY20. It is clear from this assessment that performance of the linear solver will have to be a focus in FY19.

TYPE SAND Report YEAR 2018

DOI OSTI

Characterizing MPI matching via trace-based simulation

Parallel Computing

Ferreira, Kurt B.; Levy, Scott; Bays, Nathan R.; Grant, Ryan

With the increased scale expected on future leadership-class systems, detailed information about the resource usage and performance of MPI message matching provides important insights into how to maintain application performance on next-generation systems. However, obtaining MPI message matching performance data is often not possible without significant effort. A common approach is to instrument an MPI implementation to collect relevant statistics. While this approach can provide important data, collecting matching data at runtime perturbs the application's execution, including its matching performance, and is highly dependent on the MPI library's matchlist implementation. In this paper, we introduce a trace-based simulation approach to obtain detailed MPI message matching performance data for MPI applications without perturbing their execution. Using a number of key parallel workloads and microbenchmarks, we demonstrate that this simulator approach can rapidly and accurately characterize matching behavior. Specifically, we use our simulator to collect several important statistics about the operation of the MPI posted and unexpected queues. For example, we present data about search lengths and the duration that messages spend in the queues waiting to be matched. Data gathered using this simulation-based approach have significant potential to aid hardware designers in determining resource allocation for MPI matching functions and provide application and middleware developers with insight into the scalability issues associated with MPI message matching.

TYPE Conference Poster YEAR 2018

OSTI Scopus

Open science on Trinity's Knights Landing partition: An analysis of user job data

ACM International Conference Proceeding Series

Levy, Scott; Bays, Nathan R.; Ferreira, Kurt B.

High-performance computing (HPC) systems are critically important to the objectives of universities, national laboratories, and commercial companies. Because of the cost of deploying and maintaining these systems, ensuring their efficient use is imperative. Job scheduling and resource management are critically important to the efficient use of HPC systems. As a result, significant research has been conducted on how to effectively schedule user jobs on HPC systems. Developing and evaluating job scheduling algorithms, however, requires a detailed understanding of how users request resources on HPC systems. In this paper, we examine a corpus of job data that was collected on Trinity, a leadership-class supercomputer. During the stabilization period of its Intel Xeon Phi (Knights Landing) partition, it was made available to users outside of a classified environment for the Trinity Open Science Phase 2 campaign. We collected information from the resource manager about each user job that was run during this Open Science period. In this paper, we examine the jobs contained in this dataset. Our analysis reveals several important characteristics of the jobs submitted during the Open Science period and provides critical insight into the use of one of the most powerful supercomputers in existence. Specifically, these data provide important guidance for the design, development, and evaluation of job scheduling and resource management algorithms.

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus

Lessons Learned from Errors Observed over the Lifetime of Cielo

Levy, Scott; Ferreira, Kurt B.; Debardeleben, Nathan; Siddiqua, Taniya; Sridharan, Vilas; Baseman, Elisabeth

Abstract not provided.

TYPE Conference Poster YEAR 2018

DOI OSTI

Faodel: Data Management for Next-Generation Application Workflows

Ulmer, Craig; Mukherjee, Shyamali; Templet, Gary J.; Levy, Scott; Lofstead, Gerald (Jay) F.; Widener, Patrick; Lawson, Margaret

Abstract not provided.

TYPE Conference Poster YEAR 2018

DOI OSTI

SNL ATDM: I/O and Data Management

Ulmer, Craig; Kordenbrock, Todd; Lawson, Margaret; Levy, Scott; Lofstead, Gerald (Jay) F.; Mukherjee, Shyamali; Sjaardema, Gregory D.; Templet, Gary J.; Ward, Harry L.; Widener, Patrick

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

It’s not the heat, it’s the humidity: Scheduling resilience activity at scale

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick; Ferreira, Kurt B.; Levy, Scott

Maintaining the performance of high-performance computing (HPC) applications with the expected increase in failures is a major challenge for next-generation extreme-scale systems. With increasing scale, resilience activities (e.g. checkpointing) are expected to become more diverse, less tightly synchronized, and more computationally intensive. Few existing studies, however, have examined how decisions about scheduling resilience activities impact application performance. In this work, we examine the relationship between the duration and frequency of resilience activities and application performance. Our study reveals several key findings: (i) the aggregate amount of time consumed by resilience activities is not an effective metric for predicting application performance; (ii) the duration of the interruptions due to resilience activities has the greatest influence on application performance; shorter, but more frequent, interruptions are correlated with better application performance; and (iii) the differential impact of resilience activities across applications is related to the applications’ inter-collective frequencies; the performance of applications that perform infrequent collective operations scales better in the presence of resilience activities than the performance of applications that perform more frequent collective operations. This initial study demonstrates the importance of considering how resilience activities are scheduled. We provide critical analysis and direct guidance on how the resilience challenges of future systems can be met while minimizing the impact on application performance.

TYPE Conference Poster YEAR 2018

OSTI Scopus

EMPRESS: Extensible Metadata PRovider for Extreme-scale Scientific Simulations

Lawson, Margaret; Lofstead, Gerald (Jay) F.; Levy, Scott; Widener, Patrick; Ulmer, Craig; Mukherjee, Shyamali; Templet, Gary J.; Kordenbrock, Todd

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI
