Publications Search

Analyzing the impact of overlap, offload, and independent progress for MPI

Proposed for publication in the International Journal of High Performance Computing Applications.

Brightwell, Ronald B.; Riesen, Rolf; Underwood, Keith

The overlap of computation and communication has long been considered to be a significant performance benefit for applications. Similarly, the ability of the Message Passing Interface (MPI) to make independent progress (that is, to make progress on outstanding communication operations while not in the MPI library) is also believed to yield performance benefits. Using an intelligent network interface to offload the work required to support overlap and independent progress is thought to be an ideal solution, but the benefits of this approach have not been studied in depth at the application level. This lack of analysis is complicated by the fact that most MPI implementations do not sufficiently support overlap or independent progress. Recent work has demonstrated a quantifiable advantage for an MPI implementation that uses offload to provide overlap and independent progress. The study is conducted on two different platforms with each having two MPI implementations (one with and one without independent progress). Thus, identical network hardware and virtually identical software stacks are used. Furthermore, one platform, ASCI Red, allows further separation of features such as overlap and offload. Thus, this paper extends previous work by further qualifying the source of the performance advantage: offload, overlap, or independent progress.

More Details

TYPE Journal Article YEAR 2005

OSTI

A hardware acceleration unit for MPE queue processing

Hemmert, Karl S.; Brightwell, Ronald B.; Rodrigues, Arun; Murphy, Richard C.; Underwood, Keith

Abstract not provided.

More Details

TYPE Conference YEAR 2004

OSTI

Advanced parallel programming models research and development opportunities

Brightwell, Ronald B.; Wen, Zhaofang W.

There is currently a large research and development effort within the high-performance computing community on advanced parallel programming models. This research can potentially have an impact on parallel applications, system software, and computing architectures in the next several years. Given Sandia's expertise and unique perspective in these areas, particularly on very large-scale systems, there are many areas in which Sandia can contribute to this effort. This technical report provides a survey of past and present parallel programming model research projects and provides a detailed description of the Partitioned Global Address Space (PGAS) programming model. The PGAS model may offer several improvements over the traditional distributed memory message passing model, which is the dominant model currently being used at Sandia. This technical report discusses these potential benefits and outlines specific areas where Sandia's expertise could contribute to current research activities. In particular, we describe several projects in the areas of high-performance networking, operating systems and parallel runtime systems, compilers, application development, and performance evaluation.

More Details

TYPE SAND Report YEAR 2004

OSTI DOI

Experiences implementing sorting algorithms in Unified Parallel C

Proposed for publication in IEEE Transactions on Parallel and Distributed Systems.

Brightwell, Ronald B.; Brown, Jonathan L.; Wen, Zhaofang W.

Abstract not provided.

More Details

TYPE Journal Article YEAR 2004

OSTI

A NIC-offload implementation of portals for quadrics QsNet

Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference YEAR 2004

OSTI

On characterizing the network resource usage of MPI applications

Proposed for publication in the Journal of Parallel and Distributed Computing.

Brightwell, Ronald B.; Phelps, Sue C.; Underwood, Keith

Abstract not provided.

More Details

TYPE Journal Article YEAR 2004

OSTI

Implications of a PIM architectural model for MPI

Underwood, Keith; Brightwell, Ronald B.; Underwood, Keith

Memory may be the only system component that is more commoditized than a microprocessor. To simultaneously exploit this and address the impending memory wall, processing in memory (PIM) research efforts are considering ways to move processing into memory without significantly increasing the cost of the memory. As such, PIM devices may become the basis for future commodity clusters. Although these PIM devices may leverage new computational paradigms such as hardware support for multi-threading and traveling threads, they must provide support for legacy programming models if they are to supplant commodity clusters. This paper presents a prototype implementation of MPI over a traveling thread mechanism called parcels. A performance analysis indicates that the direct hardware support of a traveling thread model can lead to an efficient, lightweight MPI implementation.

More Details

TYPE Conference YEAR 2003

OSTI

Measuring MPI latency variance

Riesen, Rolf; Riesen, Rolf; Brightwell, Ronald B.; Maccabe, Arthur B.

Abstract not provided.

More Details

TYPE Conference YEAR 2003

OSTI

Design, implementation, and performance of MPI on Portals 3.0

International Journal of High Performance Computing Applications

Brightwell, Ronald B.; Riesen, Rolf; Maccabe, Arthur B.

This paper describes an implementation of the Message Passing Interface (MPI) on the Portals 3.0 data movement layer. Portals 3.0 provides low-level building blocks that are flexible enough to support higher-level message passing layers, such as MPI, very efficiently. Portals 3.0 is also designed to allow for programmable network interface cards to offload message processing from the host processor, allowing for the ability to overlap computation and MPI communication. We describe the basic building blocks in Portals 3.0, show how they can be put together to implement MPI, and describe the protocols of our MPI implementation. We look at several key operations within the implementation and describe the effects that a Portals 3.0 implementation has on scalability and performance. We also present preliminary performance results from our implementation for Myrinet.

More Details

TYPE Journal Article YEAR 2003

Scopus OSTI

Evaluation of an eager protocol optimization for MPI

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Brightwell, Ronald B.; Underwood, Keith

Nearly all implementations of the Message Passing Interface (MPI) employ a two-level protocol for point-to-point messages. Short messages are sent eagerly to optimize for latency, and long messages are typically implemented using a rendezvous mechanism. In a rendezvous implementation, the sender must first send a request and receive an acknowledgment before the data can be transferred. While there are several possible reasons for using this strategy for long messages, most implementations are forced to use a rendezvous strategy due to operating system and/or network limitations. In this paper, we compare an implementation that uses a rendezvous protocol for long messages with an implementation that adds an eager optimization for long messages. We discuss implementation issues and provide a performance comparison for several micro-benchmarks. We also present a new micro-benchmark that may provide better insight into how these different protocols effect application performance. Results for this new benchmark indicate that, for larger messages, a significant number of receives must be pre-posted in order for an eager protocol optimization to out-perform a rendezvous protocol. © Springer-Verlag Berlin Heidelberg 2003.

More Details

TYPE Conference YEAR 2003

Scopus OSTI

An MPI tool to measure application sensitivity to variation in communication parameters

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

León, Edgar A.; Maccabe, Arthur B.; Brightwell, Ronald B.

This work describes an apparatus which can be used to vary communication performance parameters for MPI applications, and provides a tool to analyze the impact of communication performance on parallel applications. Our tool is based on Myrinet (along with GM). We use an extension of the LogP model to allow greater flexibility in determining the parameter(s) to which parallel applications may be sensitive. We show that individual communication parameters can be independently controlled within a small percentage error. We also present the results of using our tool on a suite of parallel benchmarks. © Springer-Verlag Berlin Heidelberg 2003.

More Details

TYPE Conference YEAR 2003

Scopus OSTI

On the appropriateness of commodity operating systems for large-scale, balanced computing systems

Brightwell, Ronald B.; Brightwell, Ronald B.; Maccabe, Arthur B.; Riesen, Rolf

Abstract not provided.

More Details

TYPE Conference YEAR 2003

OSTI

Programming Paradigms for Massively Parallel Computers: LDRD Project Final Report

Brightwell, Ronald B.

This technical report presents the initial proposal and renewable proposals for an LDRD project whose intended goal was to enable applications to take full advantage of the hardware available on Sandia's current and future massively parallel supercomputers by analyzing various ways of combining distributed-memory and shared-memory programming models. Despite Sandia's enormous success with distributed-memory parallel machines and the message-passing programming model, clusters of shared-memory processors appeared to be the massively parallel architecture of the future at the time this project was proposed. They had hoped to analyze various hybrid programming models for their effectiveness and characterize the types of application to which each model was well-suited. The report presents the initial research proposal and subsequent continuation proposals that highlight the proposed work and summarize the accomplishments.

More Details

TYPE Report YEAR 2001

OSTI DOI

Scalability limitations of VIA-based technologies in supporting MPI

Brightwell, Ronald B.; Maccabe, Arthur B.

This paper analyzes the scalability limitations of networking technologies based on the Virtual Interface Architecture (VIA) in supporting the runtime environment needed for an implementation of the Message Passing Interface. The authors present an overview of the important characteristics of VIA and an overview of the runtime system being developed as part of the Computational Plant (Cplant) project at Sandia National Laboratories. They discuss the characteristics of VIA that prevent implementations based on this system to meet the scalability and performance requirements of Cplant.

More Details

TYPE Conference YEAR 2000

OSTI

Scalability and Performance of a Large Linux Cluster

Journal of Parallel and Distributed Computing

Brightwell, Ronald B.; Plimpton, Steven J.

In this paper the authors present performance results from several parallel benchmarks and applications on a 400-node Linux cluster at Sandia National Laboratories. They compare the results on the Linux cluster to performance obtained on a traditional distributed-memory massively parallel processing machine, the Intel TeraFLOPS. They discuss the characteristics of these machines that influence the performance results and identify the key components of the system software that they feel are important to allow for scalability of commodity-based PC clusters to hundreds and possibly thousands of processors.

More Details

TYPE Journal Article YEAR 2000

OSTI