Publications Search

An evaluation of Open MPI's matching transport layer on the Cray XT

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Graham, Richard L.; Brightwell, Ronald B.; Barrett, Brian; Bosilca, George; Pješivac-Grbović, Jelena

Open MPI was initially designed to support a wide variety of high-performance networks and network programming interfaces. Recently, Open MPI was enhanced to support networks that have full support for MPI matching semantics. Previous Open MPI efforts focused on networks that require the MPI library to manage message matching, which is sub-optimal for some networks that inherently support matching. We describe a new matching transport layer in Open MPI, present results of micro-benchmarks and several applications on the Cray XT platform, and compare performance of the new and the existing transport layers, as well as the vendor-supplied implementation of MPI. © Springer-Verlag Berlin Heidelberg 2007.

TYPE Conference YEAR 2007

Scopus OSTI

Investigations on InfiniBand: Efficient network buffer utilization at scale

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Shipman, Galen M.; Brightwell, Ronald B.; Barrett, Brian; Squyres, Jeffrey M.; Bloch, Gil

The default messaging model for the OpenFabrics "Verbs" API is to consume receive buffers in order - regardless of the actual incoming message size - leading to inefficient registered memory usage. For example, many small messages can consume large amounts of registered memory. This paper introduces a new transport protocol in Open MPI, implemented using the existing OpenFabrics Verbs API, that exhibits efficient registered memory utilization. Several real-world applications were run at scale with the new protocol; results show that global network resource utilization efficiency increases, allowing increased scalability - and larger problem sizes - on clusters, which can increase application performance in some cases. © Springer-Verlag Berlin Heidelberg 2007.

TYPE Conference YEAR 2007

Scopus OSTI
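
The registered-memory cost described in the InfiniBand abstract above can be illustrated with a little arithmetic. This is only a sketch under stated assumptions: the function name, the fixed buffer size, and the message sizes are hypothetical, and the model (one fixed-size receive buffer consumed per incoming message) is a simplification of the behaviour the abstract describes, not the paper's protocol.

```python
def registered_memory_utilization(message_sizes, buffer_size):
    """Fraction of consumed registered memory that holds payload.

    Assumed model (hypothetical): every incoming message consumes one
    whole pre-posted receive buffer of `buffer_size` bytes, whatever
    the message's actual size.
    """
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    if any(size < 0 or size > buffer_size for size in message_sizes):
        raise ValueError("each message must fit in one buffer")
    if not message_sizes:
        return 1.0  # nothing consumed, so nothing wasted
    payload = sum(message_sizes)
    consumed = len(message_sizes) * buffer_size
    return payload / consumed


# 1000 messages of 64 bytes landing in 4096-byte buffers
utilization = registered_memory_utilization([64] * 1000, 4096)
```

Under these assumed numbers only a small fraction of the consumed registered memory carries payload, which is the inefficiency the abstract points to.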

A simple synchronous distributed-memory algorithm for the HPCC RandomAccess benchmark

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Plimpton, Steven J.; Brightwell, Ronald B.; Vaughan, Courtenay T.; Underwood, Keith D.

The RandomAccess benchmark as defined by the High Performance Computing Challenge (HPCC) tests the speed at which a machine can update the elements of a table spread across global system memory, as measured in billions (giga) of updates per second (GUPS). The parallel implementation provided by HPCC typically performs poorly on distributed-memory machines, due to updates requiring numerous small point-to-point messages between processors. We present an alternative algorithm which treats the collection of P processors as a hypercube, aggregating data so that larger messages are sent, and routing individual datums through dimensions of the hypercube to their destination processor. The algorithm's computation (the GUP count) scales linearly with P while its communication overhead scales as log2(P), thus enabling better performance on large numbers of processors. The new algorithm achieves a GUPS rate of 19.98 on 8192 processors of Sandia's Red Storm machine, compared to 1.02 for the HPCC-provided algorithm on 10350 processors. We also illustrate how GUPS performance varies with the benchmark's specification of its "look-ahead" parameter. As expected, parallel performance degrades for small look-ahead values, and improves dramatically for large values. © 2006 IEEE.

TYPE Conference YEAR 2006

OSTI Scopus
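
The hypercube routing described in the RandomAccess abstract above can be sketched as a single-process simulation. This is a minimal illustration, not the authors' implementation: the function name and the (destination, payload) data layout are assumptions, P is assumed to be a power of two, and real message passing is replaced by moving items between Python lists.

```python
def hypercube_route(updates_per_proc):
    """Route (dest_rank, payload) pairs to their destination ranks.

    updates_per_proc[r] is the list of updates that start on rank r.
    In step d, each rank exchanges one aggregated message with the
    partner whose rank differs in bit d, forwarding every update whose
    destination differs from the current rank in that bit.
    """
    p = len(updates_per_proc)
    if p == 0 or p & (p - 1):
        raise ValueError("number of processors must be a power of two")
    dims = p.bit_length() - 1  # log2(P) hypercube dimensions
    current = [list(items) for items in updates_per_proc]
    messages_sent = 0
    for d in range(dims):
        bit = 1 << d
        keep = [[] for _ in range(p)]
        outgoing = [[] for _ in range(p)]
        for rank in range(p):
            for dest, payload in current[rank]:
                if (dest ^ rank) & bit:
                    outgoing[rank].append((dest, payload))
                else:
                    keep[rank].append((dest, payload))
        for rank in range(p):
            # One aggregated message per rank per dimension.
            keep[rank ^ bit].extend(outgoing[rank])
            messages_sent += 1
        current = keep
    return current, messages_sent


# Hypothetical workload: 8 ranks, 4 updates each
updates = [[((r * 3 + k) % 8, (r, k)) for k in range(4)] for r in range(8)]
routed, messages = hypercube_route(updates)
```

Each rank sends log2(P) aggregated messages in total, regardless of how many individual updates it holds, which matches the log2(P) communication scaling stated in the abstract.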

An Infrastructure for Characterizing the Sensitivity of Parallel Applications to OS Noise

Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

A Simple Synchronous Distributed-Memory Algorithm for the HPCC RandomAccess Benchmark

Underwood, Keith D.; Plimpton, Steven J.; Brightwell, Ronald B.; Vaughan, Courtenay T.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

The Portals 3.3 message passing interface document revision 2.1

Riesen, Rolf; Brightwell, Ronald B.; Pedretti, Kevin T.T.

Abstract not provided.

TYPE SAND Report YEAR 2006

DOI OSTI

Implications of application usage characteristics for collective communication offload

International Journal of High Performance Computing and Networking

Brightwell, Ronald B.; Goudy, Sue P.; Rodrigues, Arun; Underwood, Keith D.

The global, synchronous nature of some collective operations implies that they will become the bottleneck when scaling to hundreds of thousands of nodes. One approach improves collective performance using a programmable network interface to directly implement collectives. While these implementations improve micro-benchmark performance, accelerating applications will require deeper understanding of application behaviour. We describe several characteristics of applications that impact collective communication performance. We analyse network resource usage data to guide the design of collective offload engines and their associated programming interfaces. In particular, we provide an analysis of the potential benefit of non-blocking collective communication operations for MPI. © 2006 Inderscience Enterprises Ltd.

TYPE Journal Article YEAR 2006

Scopus OSTI

Measuring MPI send and receive overhead and application availability in high performance network interfaces

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Doerfler, Douglas W.; Brightwell, Ronald B.

In evaluating new high-speed network interfaces, the usual metrics of latency and bandwidth are commonly measured and reported. There are numerous other message passing characteristics that can have a dramatic effect on application performance that should be analyzed when evaluating a new interconnect. One such metric is overhead, which dictates the network's ability to allow the application to perform non-message passing work while a transfer is taking place. A method for measuring overhead, and hence calculating application availability, is presented. Results for several next-generation network interfaces are also presented. © Springer-Verlag Berlin Heidelberg 2006.

TYPE Conference YEAR 2006

OSTI Scopus
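
The relationship between overhead and application availability mentioned in the abstract above can be sketched as follows. The abstract does not give the formula, so the definition used here (availability is the fraction of the transfer time during which the host is free for other work) is an assumption, as are the function and parameter names.

```python
def application_availability(overhead_s, transfer_time_s):
    """Assumed definition: 1 - overhead / total transfer time.

    overhead_s      -- host time spent in message passing for one transfer
    transfer_time_s -- wall-clock time for that transfer to complete
    """
    if transfer_time_s <= 0:
        raise ValueError("transfer_time_s must be positive")
    if not 0 <= overhead_s <= transfer_time_s:
        raise ValueError("overhead_s must lie between 0 and transfer_time_s")
    return 1.0 - overhead_s / transfer_time_s


# Hypothetical numbers: 1 time unit of overhead in a 4-unit transfer
availability = application_availability(1.0, 4.0)
```

With this assumed definition, an interface that fully offloads a transfer approaches an availability of 1, while one that keeps the host busy for the whole transfer gives 0.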

Enhancing NIC performance for MPI using processing-in-memory

Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005

Rodrigues, Arun; Murphy, Richard; Brightwell, Ronald B.; Underwood, Keith D.

Processing-in-Memory (PIM) technology encompasses a range of research leveraging a tight coupling of memory and processing. The most unique features of the technology are extremely wide paths to memory, extremely low memory latency, and wide functional units. Many PIM researchers are also exploring extremely fine-grained multi-threading capabilities. This paper explores a mechanism for leveraging these features of PIM technology to enhance commodity architectures in a seemingly mundane way: accelerating MPI. Modern network interfaces leverage simple processors to offload portions of the MPI semantics, particularly the management of posted receive and unexpected message queues. Without adding cost or increasing clock frequency, using PIMs in the network interface can enhance performance. The results are a significant decrease in latency and increase in small message bandwidth, particularly when long queues are present.

TYPE Conference YEAR 2005

Scopus OSTI

Analyzing the impact of overlap, offload, and independent progress for MPI

Proposed for publication in the International Journal of High Performance Computing Applications.

Brightwell, Ronald B.; Riesen, Rolf; Underwood, Keith D.

The overlap of computation and communication has long been considered to be a significant performance benefit for applications. Similarly, the ability of the Message Passing Interface (MPI) to make independent progress (that is, to make progress on outstanding communication operations while not in the MPI library) is also believed to yield performance benefits. Using an intelligent network interface to offload the work required to support overlap and independent progress is thought to be an ideal solution, but the benefits of this approach have not been studied in depth at the application level. This lack of analysis is complicated by the fact that most MPI implementations do not sufficiently support overlap or independent progress. Recent work has demonstrated a quantifiable advantage for an MPI implementation that uses offload to provide overlap and independent progress. The study is conducted on two different platforms with each having two MPI implementations (one with and one without independent progress). Thus, identical network hardware and virtually identical software stacks are used. Furthermore, one platform, ASCI Red, allows further separation of features such as overlap and offload. Thus, this paper extends previous work by further qualifying the source of the performance advantage: offload, overlap, or independent progress.

TYPE Journal Article YEAR 2005

OSTI

A hardware acceleration unit for MPI queue processing

Hemmert, Karl S.; Brightwell, Ronald B.; Rodrigues, Arun; Murphy, Richard C.; Underwood, Keith D.

Abstract not provided.

TYPE Conference YEAR 2004

OSTI

Advanced parallel programming models research and development opportunities

Brightwell, Ronald B.; Wen, Zhaofang

There is currently a large research and development effort within the high-performance computing community on advanced parallel programming models. This research can potentially have an impact on parallel applications, system software, and computing architectures in the next several years. Given Sandia's expertise and unique perspective in these areas, particularly on very large-scale systems, there are many areas in which Sandia can contribute to this effort. This technical report provides a survey of past and present parallel programming model research projects and provides a detailed description of the Partitioned Global Address Space (PGAS) programming model. The PGAS model may offer several improvements over the traditional distributed memory message passing model, which is the dominant model currently being used at Sandia. This technical report discusses these potential benefits and outlines specific areas where Sandia's expertise could contribute to current research activities. In particular, we describe several projects in the areas of high-performance networking, operating systems and parallel runtime systems, compilers, application development, and performance evaluation.

TYPE SAND Report YEAR 2004

DOI OSTI

Experiences implementing sorting algorithms in Unified Parallel C

Proposed for publication in IEEE Transactions on Parallel and Distributed Systems.

Brightwell, Ronald B.; Brown, Jonathan L.; Wen, Zhaofang

Abstract not provided.

TYPE Journal Article YEAR 2004

OSTI

A NIC-offload implementation of Portals for Quadrics QsNet

Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2004

OSTI

On characterizing the network resource usage of MPI applications

Proposed for publication in the Journal of Parallel and Distributed Computing.

Brightwell, Ronald B.; Phelps, Sue C.; Underwood, Keith D.

Abstract not provided.

TYPE Journal Article YEAR 2004

OSTI

Implications of a PIM architectural model for MPI

Underwood, Keith D.; Brightwell, Ronald B.

Memory may be the only system component that is more commoditized than a microprocessor. To simultaneously exploit this and address the impending memory wall, processing in memory (PIM) research efforts are considering ways to move processing into memory without significantly increasing the cost of the memory. As such, PIM devices may become the basis for future commodity clusters. Although these PIM devices may leverage new computational paradigms such as hardware support for multi-threading and traveling threads, they must provide support for legacy programming models if they are to supplant commodity clusters. This paper presents a prototype implementation of MPI over a traveling thread mechanism called parcels. A performance analysis indicates that the direct hardware support of a traveling thread model can lead to an efficient, lightweight MPI implementation.

TYPE Conference YEAR 2003

OSTI

Evaluation of an eager protocol optimization for MPI

Brightwell, Ronald B.; Underwood, Keith D.

Abstract not provided.

TYPE Conference YEAR 2003

OSTI

Measuring MPI latency variance

Riesen, Rolf; Brightwell, Ronald B.; Maccabe, Arthur B.

Abstract not provided.

TYPE Conference YEAR 2003

OSTI

An MPI tool to measure application sensitivity to variation in communication parameters

Brightwell, Ronald B.; Maccabe, Arthur B.

Abstract not provided.

TYPE Conference YEAR 2003

OSTI

Architectural specification for massively parallel computers: an experience and measurement-based approach

Proposed for publication in the Special Issue of Concurrency and Computation: Practice and Experience - The High Performance Architectural Challenge: Mass Market Versus Proprietary Components.

Brightwell, Ronald B.; Camp, William J.; Cole, Benjamin; Debenedictis, Erik; Leland, Robert W.; Tomkins, James L.

Abstract not provided.

TYPE Journal Article YEAR 2003

OSTI

Design, implementation, and performance of MPI on Portals 3.0

Proposed for publication in the International Journal of High Performance Computing Applications - Special Issue: Best Papers of EuroPVMMPI 2002.

Brightwell, Ronald B.; Riesen, Rolf; Maccabe, Arthur B.

Abstract not provided.

TYPE Journal Article YEAR 2002

OSTI

On the appropriateness of commodity operating systems for large-scale, balanced computing systems

Brightwell, Ronald B.; Maccabe, Arthur B.; Riesen, Rolf

Abstract not provided.

TYPE Conference YEAR 2002

OSTI

Programming Paradigms for Massively Parallel Computers: LDRD Project Final Report

Brightwell, Ronald B.

This technical report presents the initial proposal and renewal proposals for an LDRD project whose intended goal was to enable applications to take full advantage of the hardware available on Sandia's current and future massively parallel supercomputers by analyzing various ways of combining distributed-memory and shared-memory programming models. Despite Sandia's enormous success with distributed-memory parallel machines and the message-passing programming model, clusters of shared-memory processors appeared to be the massively parallel architecture of the future at the time this project was proposed. The project aimed to analyze various hybrid programming models for their effectiveness and to characterize the types of application to which each model was well-suited. The report presents the initial research proposal and subsequent continuation proposals that highlight the proposed work and summarize the accomplishments.

TYPE Report YEAR 2001

DOI OSTI

Scalability limitations of VIA-based technologies in supporting MPI

Brightwell, Ronald B.; Maccabe, Arthur B.

This paper analyzes the scalability limitations of networking technologies based on the Virtual Interface Architecture (VIA) in supporting the runtime environment needed for an implementation of the Message Passing Interface. The authors present an overview of the important characteristics of VIA and an overview of the runtime system being developed as part of the Computational Plant (Cplant) project at Sandia National Laboratories. They discuss the characteristics of VIA that prevent implementations based on this system from meeting the scalability and performance requirements of Cplant.

TYPE Conference YEAR 2000

OSTI

Scalability and Performance of a Large Linux Cluster

Journal of Parallel and Distributed Computing

Brightwell, Ronald B.; Plimpton, Steven J.

In this paper the authors present performance results from several parallel benchmarks and applications on a 400-node Linux cluster at Sandia National Laboratories. They compare the results on the Linux cluster to performance obtained on a traditional distributed-memory massively parallel processing machine, the Intel TeraFLOPS. They discuss the characteristics of these machines that influence the performance results and identify the key components of the system software that they feel are important to allow for scalability of commodity-based PC clusters to hundreds and possibly thousands of processors.

TYPE Journal Article YEAR 2000

OSTI
