Publications Search

Petaflops systems will have tens to hundreds of thousands of compute nodes, which increases the likelihood of faults. Applications use checkpoint/restart to recover from these faults, but even under ideal conditions, applications running on more than 30,000 nodes will likely spend more than half of their total run time saving checkpoints, restarting, and redoing work that was lost. We created a library that performs redundant computations on additional nodes allocated to the application. An active node and its redundant partner form a node bundle, which fails, and causes an application restart, only when both nodes in the bundle fail. The goal of this library is to learn whether this can be done entirely at the user level, what requirements this library places on a Reliability, Availability, and Serviceability (RAS) system, and what its impact on performance and run time is. We find that our redundant MPI layer library imposes a relatively modest performance penalty on applications, but that it greatly reduces the number of application interrupts. This reduction in interrupts leads to large savings in restart and rework time. For large-scale applications, the savings compensate for the performance loss and the additional nodes required for redundant computations.
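
The node-bundle argument in the abstract above can be illustrated with a small back-of-the-envelope calculation. The following is a minimal sketch, not the library described in the report: it assumes node failures are independent, that each node fails with the same probability over the interval of interest, and that failed nodes are not repaired during that interval. The function name and the numbers in the usage section are hypothetical and chosen only for illustration.

```python
import math


def job_interrupt_probability(node_failure_prob, active_nodes, redundant=False):
    """Probability that a job is interrupted at least once.

    Assumptions (not from the report): failures are independent, every node
    has the same failure probability, and nothing is repaired mid-interval.
    """
    if redundant:
        # A bundle (active node + redundant partner) fails only if both fail.
        unit_failure = node_failure_prob ** 2
    else:
        # Without redundancy, any single node failure interrupts the job.
        unit_failure = node_failure_prob
    # 1 - (1 - q)^n, computed with log1p/expm1 for accuracy when q is tiny.
    return -math.expm1(active_nodes * math.log1p(-unit_failure))


if __name__ == "__main__":
    p = 1e-4      # hypothetical per-node failure probability
    n = 30_000    # number of active nodes (bundles, in the redundant case)
    print("no redundancy  :", job_interrupt_probability(p, n))
    print("with redundancy:", job_interrupt_probability(p, n, redundant=True))
```

Under these assumptions the redundant configuration uses twice as many nodes, but the chance of an interrupt drops sharply, which is the trade-off the abstract describes.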

TYPE SAND Report YEAR 2009

DOI OSTI

Investigating Methods of Supporting Dynamically Linked Executables on High Performance Computing Platforms

Laros, James H.; Kelly, Suzanne M.; Levenhagen, Michael; Pedretti, Kevin T.T.

Shared libraries have become ubiquitous and are used to achieve great resource efficiencies on many platforms. The same properties that enable efficiencies on time-shared computers and convenience on small clusters prove to be great obstacles to scalability on large clusters and High Performance Computing platforms. In addition, Light Weight operating systems such as Catamount have historically not supported the use of shared libraries, specifically because they hinder scalability. In this report we outline the methods we investigated for supporting shared libraries on High Performance Computing platforms that use Light Weight kernels. The considerations necessary to evaluate utility in this area are many and sometimes conflicting. While our initial path forward has been determined based on this evaluation, we consider this effort ongoing and remain prepared to re-evaluate any technology that might provide a scalable solution. This report is an evaluation of a range of possible methods of supporting dynamically linked executables on capability-class High Performance Computing platforms. Efforts are ongoing, and extensive testing at scale is necessary to evaluate performance. While performance is a critical driving factor, supporting whatever method is used in a production environment is an equally important and challenging task.

TYPE SAND Report YEAR 2009

DOI OSTI

HPC application fault-tolerance using transparent redundant computation

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron; Brightwell, Ronald B.; Laros, James H.; Pedretti, Kevin P.

As the core count of HPC machines continues to grow, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability which uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.

TYPE Conference YEAR 2009

OSTI

Experiences with IO Performance Analysis on Red Storm

Laros, James H.

Abstract not provided.

TYPE Conference YEAR 2007

OSTI

Red Storm IO Performance Analysis

Laros, James H.; Ward, Harry L.; Kelly, Suzanne M.; Kellogg, Brian R.; Tomkins, James L.

Abstract not provided.

TYPE Conference YEAR 2007

OSTI

A Minimal Linux Environment for High Performance Computing Systems (Presentation)

Laros, James H.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

A software and hardware architecture for a modular, portable, extensible reliability availability and serviceability system

Laros, James H.

Abstract not provided.

TYPE Conference YEAR 2005

OSTI

Future directions in cluster system software

Laros, James H.

Abstract not provided.

TYPE Conference YEAR 2004

OSTI

Implementing scalable disk-less clusters using the Network File System (NFS)

Laros, James H.; Ward, Harry L.

This paper describes a methodology for implementing disk-less cluster systems using the Network File System (NFS) that scales to thousands of nodes. This method has been successfully deployed and is currently in use on several production systems at Sandia National Labs. This paper outlines our methodology and implementation, discusses hardware and software considerations in detail, and presents cluster configurations with performance numbers for various management operations such as booting.

TYPE Conference YEAR 2003

OSTI

The Cluster Integration Toolkit (CIT) : an extensible, portable, scalable cluster management software implementation

Laros, James H.; Ward, Harry L.; Dauchy, Nathan W.; Vasak, James S.; Klundt, Ruth A.; Laguna, Glenn A.; Epperson, Marcus; Stearley, Jon S.

Abstract not provided.

TYPE Conference YEAR 2003

OSTI
