Checkpoint Compression for Extreme Scale Fault Tolerance

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Using triggered operations to offload rendezvous messages

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Barrett, Brian; Brightwell, Ronald B.; Hemmert, Karl S.; Wheeler, Kyle B.

Historically, MPI implementations have had to choose between eager messaging protocols that require buffering and rendezvous protocols that sacrifice overlap and strong independent progress in some scenarios. The typical choice is to use an eager protocol for short messages and switch to a rendezvous protocol for long messages. If overlap and progress are desired, some implementations offer the option of using a thread. We propose an approach that leverages triggered operations to implement a long message rendezvous protocol that provides strong progress guarantees. The results indicate that a triggered operation based rendezvous can achieve better overlap than a traditional rendezvous implementation and less wasted bandwidth than an eager long protocol. © 2011 Springer-Verlag Berlin Heidelberg.

TYPE Conference YEAR 2011

OSTI Scopus
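The triggered-operation rendezvous described in the abstract above can be illustrated with a small simulation. This is a minimal sketch, not the Portals API or the authors' implementation: `Counter`, `trigger_at`, and `increment` are hypothetical names standing in for a network-interface event counter and an operation armed to fire when that counter reaches a threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Counter:
    """Hypothetical stand-in for a network-interface event counter."""
    value: int = 0
    pending: List[Tuple[int, Callable[[], None]]] = field(default_factory=list)

    def trigger_at(self, threshold: int, op: Callable[[], None]) -> None:
        """Arm `op` to run once the counter reaches `threshold`."""
        if self.value >= threshold:
            op()
        else:
            self.pending.append((threshold, op))

    def increment(self) -> None:
        """Count one event and fire every armed operation that is now due."""
        self.value += 1
        ready = [op for t, op in self.pending if t <= self.value]
        self.pending = [(t, op) for t, op in self.pending if t > self.value]
        for op in ready:
            op()


def demo():
    sender_buffer = bytes(range(16))   # long message stays at the sender
    receive_buffer = bytearray(16)
    log = []
    header_arrived = Counter()

    def pull_payload() -> None:
        # Models the "get" that fetches the payload from the sender.
        receive_buffer[:] = sender_buffer
        log.append("get fired")

    # The receiver arms the get before the sender's header arrives, then
    # goes back to computing; no host thread has to poll for the header.
    header_arrived.trigger_at(1, pull_payload)
    log.append("receiver computing")

    # Arrival of the rendezvous header bumps the counter and fires the get.
    header_arrived.increment()
    return bytes(receive_buffer), log
```

Calling `demo()` returns the transferred payload and a log showing that the get ran only after the header event, while the receiver was free to compute in between.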

Libhashckpt: Hash-based incremental checkpointing using GPU's

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Ferreira, Kurt; Riesen, Rolf; Brightwell, Ronald B.; Bridges, Patrick; Arnold, Dorian

Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability applications have custom checkpoint strategies to minimize state and reduce checkpoint time. One well-known optimization to traditional checkpoint/restart is incremental checkpointing, which has a number of known limitations. To address these limitations, we introduce libhashckpt, a hybrid incremental checkpointing solution that uses both page protection and hashing on GPUs to determine changes in application data with very low overhead. Using real capability workloads, we show the merit of this technique for a certain class of HPC applications. © 2011 Springer-Verlag Berlin Heidelberg.

TYPE Conference YEAR 2011

Scopus OSTI
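The hashing half of the libhashckpt idea above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's code: it hashes on the CPU with `hashlib` rather than on a GPU, omits page protection entirely, and the names `page_hashes`, `incremental_checkpoint`, and the 4096-byte `PAGE_SIZE` are hypothetical.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size in bytes


def page_hashes(memory: bytes, page_size: int = PAGE_SIZE):
    """Return one SHA-256 digest per page of `memory`."""
    return [
        hashlib.sha256(memory[start:start + page_size]).digest()
        for start in range(0, len(memory), page_size)
    ]


def incremental_checkpoint(memory: bytes, previous_hashes, page_size: int = PAGE_SIZE):
    """Return (changed_pages, current_hashes).

    changed_pages maps a page index to that page's bytes for every page
    whose digest differs from the previous checkpoint or is new.
    """
    current = page_hashes(memory, page_size)
    changed = {}
    for index, digest in enumerate(current):
        if index >= len(previous_hashes) or previous_hashes[index] != digest:
            start = index * page_size
            changed[index] = memory[start:start + page_size]
    return changed, current
```

The first call, with an empty list of previous hashes, saves every page; later calls save only the pages whose contents changed since the hashes were last recorded.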

Keeping checkpoint/restart viable for exascale systems

Ferreira, Kurt; Oldfield, Ron; Stearley, Jon S.; Laros, James H.; Pedretti, Kevin T.T.; Brightwell, Ronald B.

Next-generation exascale systems, those capable of performing a quintillion (10^18) operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults, like checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches on a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.

TYPE SAND Report YEAR 2011

DOI OSTI
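The relationship between checkpoint interval, commit time, and mean time to failure mentioned in the abstract above can be made concrete with Young's first-order approximation for the interval, sqrt(2 × commit time × mean time to failure). This formula is a well-known rule of thumb brought in here for illustration; the report's own models may differ, and the function name and sample numbers are assumptions.

```python
import math


def young_interval(commit_time: float, mean_time_to_failure: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Both arguments and the result share one time unit (for example seconds).
    This is an illustrative rule of thumb, not the report's model.
    """
    return math.sqrt(2.0 * commit_time * mean_time_to_failure)


# Illustrative numbers only: a 50 s commit time and a 10,000 s system
# mean time to failure.
baseline = young_interval(50.0, 10_000.0)

# A longer effective mean time to failure (as replication aims for)
# stretches the interval between checkpoints.
with_longer_mttf = young_interval(50.0, 40_000.0)
```

Under this approximation, quadrupling the effective mean time to failure doubles the interval between checkpoints, while a shorter commit time reduces the cost paid at each checkpoint.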

The Impact of Injection Bandwidth Performance on Application Scalability

Pedretti, Kevin; Brightwell, Ronald B.; Doerfler, Douglas W.; Hemmert, Karl S.; Laros, James H.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

An Intra-Node Implementation of OpenSHMEM Using Virtual Address Space Mapping

Brightwell, Ronald B.; Pedretti, Kevin

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Why Nobody Should Care About Operating Systems for Exascale

Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

An Overview of Portals 4

Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Cooperative Application/OS DRAM Fault Recovery

Hoemmen, Mark F.; Ferreira, Kurt; Heroux, Michael A.; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Enhanced Support for PGAS Communication in Portals

Barrett, Brian; Brightwell, Ronald B.; Hemmert, Karl S.; Pedretti, Kevin T.T.; Wheeler, Kyle B.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Enabling Flexible Collective Communication Offload with Triggered Operations

Hemmert, Karl S.; Barrett, Brian; Brightwell, Ronald B.; Levenhagen, Michael

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

rMPI : increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Oldfield, Ron; Stearley, Jon S.; Laros, James H.; Pedretti, Kevin T.T.; Brightwell, Ronald B.

As High-End Computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are unsuitable at these scales due to excessive overheads predicted to more than double an application's time to solution. Redundant computation, long used in distributed and mission critical systems, has been suggested as an alternative to checkpoint-restart on its own. In this paper we describe the rMPI library, which enables portable and transparent redundant computation for MPI applications. We detail the design of the library as well as two replica consistency protocols, outline the overheads of this library at scale on a number of real-world applications, and finally outline the significant increase in an application's time to solution at extreme scale as well as show the scenarios in which redundant computation makes sense.

TYPE SAND Report YEAR 2011

DOI OSTI

Evaluating the Viability of Process Replication Reliability for Exascale Systems

Ferreira, Kurt; Stearley, Jon S.; Laros, James H.; Oldfield, Ron; Pedretti, Kevin T.T.; Brightwell, Ronald B.; Bridges, Patrick G.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Porting Portals to OFED

Brightwell, Ronald B.

Abstract not provided.

TYPE Report YEAR 2011

OSTI

A Perspective on Operating and Runtime Systems for Exascale Computing

Brightwell, Ronald B.

Abstract not provided.

TYPE Report YEAR 2011

OSTI

Systems Software

Kelly, Suzanne M.; Brightwell, Ronald B.; Ballance, Robert A.

Abstract not provided.

TYPE Presentation YEAR 2011

OSTI

Gibraltar RAID - 2011 R&D 100 Awards Entry Form

Curry, Matthew L.; Ward, Harry L.; Brightwell, Ronald B.

Abstract not provided.

TYPE Presentation YEAR 2011

OSTI

System Software Report from ASC/CSSE-FOUS Exascale Planning

Minnich, Ronald G.; Brightwell, Ronald B.; Kelly, Suzanne M.; Ballance, Robert A.

Abstract not provided.

TYPE Presentation YEAR 2011

OSTI

Redundant computing for exascale systems

Ferreira, Kurt; Stearley, Jon S.; Oldfield, Ron; Laros, James H.; Pedretti, Kevin T.T.; Brightwell, Ronald B.

Exascale systems will have hundreds of thousands of compute nodes and millions of components, which increases the likelihood of faults. Today, applications use checkpoint/restart to recover from these faults. Even under ideal conditions, applications running on more than 50,000 nodes will spend more than half of their total running time saving checkpoints, restarting, and redoing work that was lost. Redundant computing is a method that allows an application to continue working even when failures occur. Instead of each failure causing an application interrupt, multiple failures can be absorbed by the application until redundancy is exhausted. In this paper we present a method to analyze the benefits of redundant computing, present simulation results of the cost, and compare it to other proposed methods for fault resilience.

TYPE SAND Report YEAR 2010

DOI OSTI
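The idea in the abstract above, that failures are absorbed until redundancy is exhausted, can be sketched with a toy simulation. It is not the paper's analysis method: the function name is hypothetical, failures are assumed to hit physical nodes uniformly at random, and failed nodes are assumed never to be repaired.

```python
import random


def failures_until_interrupt(num_ranks: int, replicas: int, rng: random.Random) -> int:
    """Count node failures until some rank has lost all of its replicas.

    Assumes each rank runs on `replicas` physical nodes, nodes fail in a
    uniformly random order, and failed nodes are not repaired.
    """
    # One list entry per physical node, labeled with the rank it hosts.
    nodes = [rank for rank in range(num_ranks) for _ in range(replicas)]
    rng.shuffle(nodes)  # random order in which the physical nodes fail

    alive = [replicas] * num_ranks
    for count, rank in enumerate(nodes, start=1):
        alive[rank] -= 1
        if alive[rank] == 0:
            # Redundancy for this rank is exhausted: the application
            # is interrupted on this failure.
            return count
    raise AssertionError("unreachable: some rank always loses every replica")


rng = random.Random(0)
without_redundancy = failures_until_interrupt(1000, 1, rng)
with_two_replicas = failures_until_interrupt(1000, 2, rng)
```

With a single replica per rank, the very first failure interrupts the application. With two replicas, the interrupt comes only when a failure hits the partner of an already-failed node, so at least two failures are always absorbed.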

LDRD final report : a lightweight operating system for multi-core capability class supercomputers

Pedretti, Kevin T.T.; Levenhagen, Michael; Ferreira, Kurt; Brightwell, Ronald B.; Kelly, Suzanne M.; Bridges, Patrick G.

The two primary objectives of this LDRD project were to create a lightweight kernel (LWK) operating system (OS) designed to take maximum advantage of multi-core processors, and to leverage the virtualization capabilities in modern multi-core processors to create a more flexible and adaptable LWK environment. The most significant technical accomplishments of this project were the development of the Kitten lightweight kernel, the co-development of the SMARTMAP intra-node memory mapping technique, and the development and demonstration of a scalable virtualization environment for HPC. Each of these topics is presented in this report by the inclusion of a published or submitted research paper. The results of this project are being leveraged by several ongoing and new research projects.

TYPE SAND Report YEAR 2010

DOI OSTI

Opportunities and approaches for system software in supporting application/architecture co-design

Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Challenges for High-Performance Networking for Exascale Computing

2010 Proceedings of 19th International Conference on Computer Communications and Networks

Brightwell, Ronald B.

Abstract not provided.

TYPE Presentation YEAR 2010

OSTI DOI

Palacios and kitten: New high performance operating systems for scalable virtualized and native supercomputing

Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010

Lange, John; Pedretti, Kevin T.T.; Hudson, Trammell; Dinda, Peter; Cui, Zheng; Xia, Lei; Bridges, Patrick; Gocke, Andy; Jaconette, Steven; Levenhagen, Michael; Brightwell, Ronald B.

Palacios is a new open-source VMM under development at Northwestern University and the University of New Mexico that enables applications executing in a virtualized environment to achieve scalable high performance on large machines. Palacios functions as a modularized extension to Kitten, a high performance operating system being developed at Sandia National Laboratories to support large-scale supercomputing applications. Together, Palacios and Kitten provide a thin layer over the hardware to support full-featured virtualized environments alongside Kitten's lightweight native environment. Palacios supports existing, unmodified applications and operating systems by using the hardware virtualization technologies in recent AMD and Intel processors. Additionally, Palacios leverages Kitten's simple memory management scheme to enable low-overhead pass-through of native devices to a virtualized environment. We describe the design, implementation, and integration of Palacios and Kitten. Our benchmarks show that Palacios provides near native (within 5%), scalable performance for virtualized environments running important parallel applications. This new architecture provides an incremental path for applications to use supercomputers, running specialized lightweight host operating systems, that is not significantly performance-compromised. © 2010 IEEE.

TYPE Conference YEAR 2010

Scopus OSTI