Page 4 – Center for Computing Research (CCR)

rMPI : increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron A.; Laros, James H.; Pedretti, Kevin P.; Stearley, Jon S.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

Transparent redundant computing with MPI

Brightwell, Ronald B.; Ferreira, Kurt

Extreme-scale parallel systems will require alternative methods for applications to maintain current levels of uninterrupted execution. Redundant computation is one approach to consider, if the benefits of increased resiliency outweigh the cost of consuming additional resources. We describe a transparent redundancy approach for MPI applications and detail two different implementations that provide the ability to tolerate a range of failure scenarios, including loss of application processes and connectivity.We compare these two approaches and show performance results from micro-benchmarks that bound worst-case message passing performance degradation.We propose several enhancements that could lower the overhead of providing resiliency through redundancy.

More Details

TYPE Conference YEAR 2010

OSTI

Reliability modeling of redundancy for HPC systems

Ferreira, Kurt; Riesen, Rolf; Robinson, David G.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

Modeling Job Interrupts on Redundant Computing HPC Systems

Ferreira, Kurt; Riesen, Rolf; Robinson, David G.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

System Software Research for Extreme-Scale Computing

Oldfield, Ron A.; Brightwell, Ronald B.; Pedretti, Kevin P.; Riesen, Rolf; Ferreira, Kurt; Kelly, Suzanne M.; Laros, James H.

Abstract not provided.

More Details

TYPE Presentation YEAR 2010

OSTI

Increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Oldfield, Ron A.; Stearley, Jon S.; Laros, James H.; Pedretti, Kevin P.; Brightwell, Ronald B.

Petaflops systems will have tens to hundreds of thousands of compute nodes which increases the likelihood of faults. Applications use checkpoint/restart to recover from these faults, but even under ideal conditions, applications running on more than 30,000 nodes will likely spend more than half of their total run time saving checkpoints, restarting, and redoing work that was lost. We created a library that performs redundant computations on additional nodes allocated to the application. An active node and its redundant partner form a node bundle which will only fail, and cause an application restart, when both nodes in the bundle fail. The goal of this library is to learn whether this can be done entirely at the user level, what requirements this library places on a Reliability, Availability, and Serviceability (RAS) system, and what its impact on performance and run time is. We find that our redundant MPI layer library imposes a relatively modest performance penalty for applications, but that it greatly reduces the number of applications interrupts. This reduction in interrupts leads to huge savings in restart and rework time. For large-scale applications the savings compensate for the performance loss and the additional nodes required for redundant computations.

More Details

TYPE SAND Report YEAR 2009

OSTI DOI

HPC application fault-tolerance using transparent redundant computation

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron A.; Brightwell, Ronald B.; Laros, James H.; Pedretti, Kevin P.

As the core count of HPC machines continue to grow in size, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability which uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.

More Details

TYPE Conference YEAR 2009

OSTI

An extensible operating system design for large-scale parallel machines

Riesen, Rolf; Ferreira, Kurt

Running untrusted user-level code inside an operating system kernel has been studied in the 1990's but has not really caught on. We believe the time has come to resurrect kernel extensions for operating systems that run on highly-parallel clusters and supercomputers. The reason is that the usage model for these machines differs significantly from a desktop machine or a server. In addition, vendors are starting to add features, such as floating-point accelerators, multicore processors, and reconfigurable compute elements. An operating system for such machines must be adaptable to the requirements of specific applications and provide abstractions to access next-generation hardware features, without sacrificing performance or scalability.

More Details

TYPE SAND Report YEAR 2009

OSTI DOI

An Extensible Operating System Design for Large-Scale Parallel Machines

Riesen, Rolf; Ferreira, Kurt

Abstract not provided.

More Details

TYPE Conference YEAR 2009

OSTI

Instrumentation and analysis of MPI queue times on the seaStar high-performance network

Proceedings - International Conference on Computer Communications and Networks, ICCCN

Brightwell, Ronald B.; Pedretti, Kevin P.; Ferreira, Kurt

Understanding the communication behavior and network resource usage of parallel applications is critical to achieving high performance and scalability on systems with tens of thousands of network endpoints. The need for better understanding is not only driven by the desire to identify potential performance optimization opportunities for current networks, but is also a necessity for designing next-generation networking hardware. In this paper, we describe our approach to instrumenting the SeaStar interconnect on the Cray XT series of massively parallel processing machines to gather low-level network timing data. This data provides a new perspective on performance evaluation, both in terms of evaluating the resource usage patterns of applications as well as evaluating different implementation strategies in the network protocol stack. © 2008 IEEE.

More Details

TYPE Conference YEAR 2008

Scopus OSTI