Publications

24 Results

Reliability modeling of redundant computation for HPC systems

Robinson, David G.; Ferreira, Kurt; Riesen, Rolf

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Fault-tolerance for exascale systems

Ferreira, Kurt; Riesen, Rolf

Periodic, coordinated checkpointing to disk is the most prevalent fault-tolerance method used in modern large-scale, capability-class, high-performance computing (HPC) systems. Previous work has shown that as the system grows in size, the inherent synchronization of coordinated checkpoint/restart (CR) limits application scalability; at large node counts the application spends most of its time checkpointing instead of executing useful work. Furthermore, a single component failure forces an application restart from the last correct checkpoint. Suggested alternatives to coordinated CR include uncoordinated CR with message logging, redundant computation, and RAID-inspired, in-memory distributed checkpointing schemes. Each of these alternatives has differing overheads that depend on both the scale and the communication characteristics of the application. In this work, using the Structural Simulation Toolkit (SST) simulator, we compare the performance characteristics of each of these resilience methods for a number of HPC application patterns on a number of proposed exascale machines. The results of this work provide valuable guidance on the most efficient resilience methods for exascale systems.

TYPE Conference YEAR 2010

OSTI

Cache injection for parallel applications

Riesen, Rolf; Ferreira, Kurt

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

rMPI: increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron A.; Laros, James H.; Pedretti, Kevin P.; Stearley, Jon S.; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Reliability modeling of redundancy for HPC systems

Ferreira, Kurt; Riesen, Rolf; Robinson, David G.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Modeling Job Interrupts on Redundant Computing HPC Systems

Ferreira, Kurt; Riesen, Rolf; Robinson, David G.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

System Software Research for Extreme-Scale Computing

Oldfield, Ron A.; Brightwell, Ronald B.; Pedretti, Kevin P.; Riesen, Rolf; Ferreira, Kurt; Kelly, Suzanne M.; Laros, James H.

Abstract not provided.

TYPE Presentation YEAR 2010

OSTI

HPC application fault-tolerance using transparent redundant computation

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron A.; Brightwell, Ronald B.; Laros, James H.; Pedretti, Kevin P.

As the core count of HPC machines continues to grow, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability which uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.

TYPE Conference YEAR 2009

OSTI

An extensible operating system design for large-scale parallel machines

Riesen, Rolf; Ferreira, Kurt

Running untrusted user-level code inside an operating system kernel was studied in the 1990s but never really caught on. We believe the time has come to resurrect kernel extensions for operating systems that run on highly parallel clusters and supercomputers. The reason is that the usage model for these machines differs significantly from that of a desktop machine or a server. In addition, vendors are starting to add features such as floating-point accelerators, multicore processors, and reconfigurable compute elements. An operating system for such machines must be adaptable to the requirements of specific applications and provide abstractions to access next-generation hardware features, without sacrificing performance or scalability.

TYPE SAND Report YEAR 2009

OSTI DOI

An Extensible Operating System Design for Large-Scale Parallel Machines

Riesen, Rolf; Ferreira, Kurt

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Simulating a Cluster on a Cluster

Abstract not provided.

TYPE Conference YEAR 2008

OSTI

Designing and Implementing Lightweight Kernels for Capability Computing

Concurrency and Computation: Practice and Experience

Riesen, Rolf; Brightwell, Ronald B.; Ferreira, Kurt

Abstract not provided.

TYPE Journal Article YEAR 2008

OSTI

Modeling the Impact of Checkpoints on Next-Generation Systems

Oldfield, Ron A.; Riesen, Rolf

Abstract not provided.

TYPE Conference YEAR 2007

OSTI

Scalable Collection of Large MPI Traces on Red Storm

Abstract not provided.

TYPE Conference YEAR 2007

OSTI

Modeling the impact of checkpoints on next-generation systems

Abstract not provided.

TYPE Conference YEAR 2007

OSTI

Seshat collects MPI traces: Extended abstract

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract not provided.

TYPE Conference YEAR 2007

Scopus OSTI

Supercomputing System Design Through Simulation

Abstract not provided.

TYPE Presentation YEAR 2006

OSTI

The Portals 3.3 message passing interface document revision 2.1

Riesen, Rolf; Brightwell, Ronald B.; Pedretti, Kevin P.

Abstract not provided.

TYPE SAND Report YEAR 2006

OSTI DOI

A lightweight approach to file system development

Oldfield, Ron A.; Riesen, Rolf; Ward, Harry L.; Lawry, William L.

Abstract not provided.

TYPE Conference YEAR 2005

OSTI

Analyzing the impact of overlap, offload, and independent progress for MPI

Proposed for publication in the International Journal of High Performance Computing Applications.

Brightwell, Ronald B.; Riesen, Rolf; Underwood, Keith

The overlap of computation and communication has long been considered to be a significant performance benefit for applications. Similarly, the ability of the Message Passing Interface (MPI) to make independent progress (that is, to make progress on outstanding communication operations while not in the MPI library) is also believed to yield performance benefits. Using an intelligent network interface to offload the work required to support overlap and independent progress is thought to be an ideal solution, but the benefits of this approach have not been studied in depth at the application level. This lack of analysis is complicated by the fact that most MPI implementations do not sufficiently support overlap or independent progress. Recent work has demonstrated a quantifiable advantage for an MPI implementation that uses offload to provide overlap and independent progress. The study is conducted on two different platforms, each with two MPI implementations (one with and one without independent progress). Thus, identical network hardware and virtually identical software stacks are used. Furthermore, one platform, ASCI Red, allows further separation of features such as overlap and offload. Thus, this paper extends previous work by further qualifying the source of the performance advantage: offload, overlap, or independent progress.

TYPE Journal Article YEAR 2005

OSTI

Simple, scalable protocols for high-performance local networks

Riesen, Rolf; Maccabe, Arthur B.

Abstract not provided.

TYPE Conference YEAR 2003

OSTI

Measuring MPI latency variance

Riesen, Rolf; Brightwell, Ronald B.; Maccabe, Arthur B.

Abstract not provided.

TYPE Conference YEAR 2003

OSTI

Design, implementation, and performance of MPI on Portals 3.0

International Journal of High Performance Computing Applications

Brightwell, Ronald B.; Riesen, Rolf; Maccabe, Arthur B.

This paper describes an implementation of the Message Passing Interface (MPI) on the Portals 3.0 data movement layer. Portals 3.0 provides low-level building blocks that are flexible enough to support higher-level message passing layers, such as MPI, very efficiently. Portals 3.0 is also designed to allow programmable network interface cards to offload message processing from the host processor, making it possible to overlap computation and MPI communication. We describe the basic building blocks in Portals 3.0, show how they can be put together to implement MPI, and describe the protocols of our MPI implementation. We look at several key operations within the implementation and describe the effects that a Portals 3.0 implementation has on scalability and performance. We also present preliminary performance results from our implementation for Myrinet.

TYPE Journal Article YEAR 2003

Scopus OSTI

On the appropriateness of commodity operating systems for large-scale, balanced computing systems

Brightwell, Ronald B.; Maccabe, Arthur B.; Riesen, Rolf

Abstract not provided.

TYPE Conference YEAR 2003

OSTI
