Publications

Results 151–175 of 190
Skip to search filters

HPC application fault-tolerance using transparent redundant computation

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron A.; Brightwell, Ronald B.; Laros, James H.; Pedretti, Kevin P.

As the core count of HPC machines continue to grow in size, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability which uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.

More Details

Instrumentation and analysis of MPI queue times on the seaStar high-performance network

Proceedings - International Conference on Computer Communications and Networks, ICCCN

Brightwell, Ronald B.; Pedretti, Kevin P.; Ferreira, Kurt

Understanding the communication behavior and network resource usage of parallel applications is critical to achieving high performance and scalability on systems with tens of thousands of network endpoints. The need for better understanding is not only driven by the desire to identify potential performance optimization opportunities for current networks, but is also a necessity for designing next-generation networking hardware. In this paper, we describe our approach to instrumenting the SeaStar interconnect on the Cray XT series of massively parallel processing machines to gather low-level network timing data. This data provides a new perspective on performance evaluation, both in terms of evaluating the resource usage patterns of applications as well as evaluating different implementation strategies in the network protocol stack. © 2008 IEEE.

More Details

A prototype implementation of MPI for SMARTMAP

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Brightwell, Ronald B.

Recently the Catamount lightweight kernel was extended to support direct access shared memory between processes running on the same compute node. This extension, called SMARTMAP, allows each process read/write access to another process' memory by extending the virtual address mapping. Simple virtual address bit manipulation can be used to access the same virtual address in a different process' address space. This paper describes a prototype implementation of MPI that uses SMARTMAP for intra-node message passing. SMARTMAP has several advantages over POSIX shared memory techniques for implementing MPI. We present performance results comparing MPI using SMARTMAP to the existing MPI transport layer on a quad-core Cray XT platform. © 2008 Springer-Verlag Berlin Heidelberg.

More Details

Evaluating NIC hardware requirements to achieve high message rate PGAS support on multi-core processors

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC'07

Underwood, Keith; Levenhagen, Michael J.; Brightwell, Ronald B.

Partitioned global address space (PGAS) programming models have been identified as one of the few viable approaches for dealing with emerging many-core systems. These models tend to generate many small messages, which requires specific support from the network interface hardware to enable efficient execution. In the past, Cray included E-registers on the Cray T3E to support the SHMEM API; however, with the advent of multi-core processors, the balance of computation to communication capabilities has shifted toward computation. This paper explores the message rates that are achievable with multi-core processors and simplified PGAS support on a more conventional network interface. For message rate tests, we find that simple network interface hardware is more than sufficient. We also find that even typical data distributions, such as cyclic or block-cyclic, do not need specialized hardware support. Finally, we assess the impact of such support on the well known RandomAccess benchmark. (c) 2007 ACM.

More Details

An evaluation of open MPI's matching transport layer on the cray XT

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Graham, Richard L.; Brightwell, Ronald B.; Barrett, Brian; Bosilca, George; Pješivac-Grbović, Jelena

Open MPI was initially designed to support a wide variety of high-performance networks and network programming interfaces. Recently, Open MPI was enhanced to support networks that have full support for MPI matching semantics. Previous Open MPI efforts focused on networks that require the MPI library to manage message matching, which is sub-optimal for some networks that inherently support matching. We describes a new matching transport layer in Open MPI, present results of micro-benchmarks and several applications on the Cray XT platform, and compare performance of the new and the existing transport layers, as well as the vendor-supplied implementation of MPI. © Springer-Verlag Berlin Heidelberg 2007.

More Details

Investigations on InfiniBand: Efficient network buffer utilization at scale

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Shipman, Galen M.; Brightwell, Ronald B.; Barrett, Brian; Squyres, Jeffrey M.; Bloch, Gil

The default messaging model for the OpenFabrics "Verbs" API is to consume receive buffers in order - regardless of the actual incoming message size - leading to inefficient registered memory usage. For example, many small messages can consume large amounts of registered memory. This paper introduces a new transport protocol in Open MPI implemented using the existing OpenFabrics Verbs API that exhibits efficient registered memory utilization. Several real-world applications were run at scale with the new protocol; results show that global network resource utilization efficiency increases, allowing increased scalability - and larger problem sizes - on clusters which can increase application performance in some cases. © Springer-Verlag Berlin Heidelberg 2007.

More Details

Implications of application usage characteristics for collective communication offload

International Journal of High Performance Computing and Networking

Brightwell, Ronald B.; Goudy, Sue P.; Rodrigues, Arun; Underwood, Keith D.

The global, synchronous nature of some collective operations implies that they will become the bottleneck when scaling to hundreds of thousands of nodes. One approach improves collective performance using a programmable network interface to directly implement collectives. While these implementations improve micro-benchmark performance, accelerating applications will require deeper understanding of application behaviour. We describe several characteristics of applications that impact collective communication performance. We analyse network resource usage data to guide the design of collective offload engines and their associated programming interfaces. In particular, we provide an analysis of the potential benefit of non-blocking collective communication operations for MPI. © 2006 Inderscience Enterprises Ltd.

More Details

Measuring MPI send and receive overhead and application availability in high performance network interfaces

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Doerfler, Douglas W.; Brightwell, Ronald B.

In evaluating new high-speed network interfaces, the usual metrics of latency and bandwidth are commonly measured and reported. There are numerous other message passing characteristics that can have a dramatic effect on application performance that should be analyzed when evaluating a new interconnect. One such metric is overhead, which dictates the networks ability to allow the application to perform non-message passing work while a transfer is taking place. A method for measuring overhead, and hence calculating application availability, is presented. Results for several next-generation network interfaces are also presented. © Springer-Verlag Berlin Heidelberg 2006.

More Details

Enhancing NIC performance for MPI using processing-in-memory

Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005

Rodrigues, Arun; Murphy, Richard; Brightwell, Ronald B.; Underwood, Keith D.

Processing-in-Memory (PIM) technology encompasses a range of research leveraging a tight coupling of memory and processing. The most unique features of the technology are extremely wide paths to memory, extremely low memory latency, and wide functional units. Many PIM researchers are also exploring extremely fine-grained multi-threading capabilities. This paper explores a mechanism for leveraging these features of PIM technology to enhance commodity architectures in a seemingly mundane way: accelerating MPI. Modern network interfaces leverage simple processors to offload portions of the MPI semantics, particularly the management of posted receive and unexpected message queues. Without adding cost or increasing clock frequency, using PIMs in the network interface can enhance performance. The results are a significant decrease in latency and increase in small message bandwidth, particularly when long queues are present.

More Details

Architectural specification for massively parallel computers: An experience and measurement-based approach

Concurrency and Computation: Practice and Experience

Brightwell, Ronald B.; Camp, William; Cole, Benjamin; DeBenedictis, Erik; Leland, Robert; Tomkins, James; Maccabe, Arthur B.

In this paper, we describe the hardware and software architecture of the Red Storm system developed at Sandia National Laboratories. We discuss the evolution of this architecture and provide reasons for the different choices that have been made. We contrast our approach of leveraging high-volume, mass-market commodity processors to that taken for the Earth Simulator. We present a comparison of benchmarks and application performance that support our approach. We also project the performance of Red Storm and the Earth Simulator. This projection indicates that the Red Storm architecture is a much more cost-effective approach to massively parallel computing. Published in 2005 by John Wiley & Sons, Ltd.

More Details
Results 151–175 of 190
Results 151–175 of 190