Publications

Results 76–95 of 95
Skip to search filters

Redundant computing for exascale systems

Ferreira, Kurt; Stearley, Jon S.; Oldfield, Ron A.; Laros, James H.; Pedretti, Kevin P.; Brightwell, Ronald B.

Exascale systems will have hundred thousands of compute nodes and millions of components which increases the likelihood of faults. Today, applications use checkpoint/restart to recover from these faults. Even under ideal conditions, applications running on more than 50,000 nodes will spend more than half of their total running time saving checkpoints, restarting, and redoing work that was lost. Redundant computing is a method that allows an application to continue working even when failures occur. Instead of each failure causing an application interrupt, multiple failures can be absorbed by the application until redundancy is exhausted. In this paper we present a method to analyze the benefits of redundant computing, present simulation results of the cost, and compare it to other proposed methods for fault resilience.

More Details

Managing variability in the IO performance of petascale storage systems

Oldfield, Ron A.

Significant challenges exist for achieving peak or even consistent levels of performance when using IO systems at scale. They stem from sharing IO system resources across the processes of single large-scale applications and/or multiple simultaneous programs causing internal and external interference, which in turn, causes substantial reductions in IO performance. This paper presents interference effects measurements for two different file systems at multiple supercomputing sites. These measurements motivate developing a 'managed' IO approach using adaptive algorithms varying the IO system workload based on current levels and use areas. An implementation of these methods deployed for the shared, general scratch storage system on Oak Ridge National Laboratory machines achieves higher overall performance and less variability in both a typical usage environment and with artificially introduced levels of 'noise'. The latter serving to clearly delineate and illustrate potential problems arising from shared system usage and the advantages derived from actively managing it.

More Details

On the path to exascale

International Journal of Distributed Systems and Technologies

Alvin, Kenneth F.; Barrett, Brian B.; Brightwell, Ronald B.; Dosanjh, Sudip S.; Geist, Al; Hemmert, Karl S.; Heroux, Michael; Kothe, Doug; Murphy, Richard C.; Nichols, Jeff; Oldfield, Ron A.; Rodrigues, Arun; Vetter, Jeffrey S.

There is considerable interest in achieving a 1000 fold increase in supercomputing power in the next decade, but the challenges are formidable. In this paper, the authors discuss some of the driving science and security applications that require Exascale computing (a million, trillion operations per second). Key architectural challenges include power, memory, interconnection networks and resilience. The paper summarizes ongoing research aimed at overcoming these hurdles. Topics of interest are architecture aware and scalable algorithms, system simulation, 3D integration, new approaches to system-directed resilience and new benchmarks. Although significant progress is being made, a broader international program is needed.

More Details

Advanced I/O for large-scale scientific applications

Oldfield, Ron A.

As scientific simulations scale to use petascale machines and beyond, the data volumes generated pose a dual problem. First, with increasing machine sizes, the careful tuning of IO routines becomes more and more important to keep the time spent in IO acceptable. It is not uncommon, for instance, to have 20% of an application's runtime spent performing IO in a 'tuned' system. Careful management of the IO routines can move that to 5% or even less in some cases. Second, the data volumes are so large, on the order of 10s to 100s of TB, that trying to discover the scientifically valid contributions requires assistance at runtime to both organize and annotate the data. Waiting for offline processing is not feasible due both to the impact on the IO system and the time required. To reduce this load and improve the ability of scientists to use the large amounts of data being produced, new techniques for data management are required. First, there is a need for techniques for efficient movement of data from the compute space to storage. These techniques should understand the underlying system infrastructure and adapt to changing system conditions. Technologies include aggregation networks, data staging nodes for a closer parity to the IO subsystem, and autonomic IO routines that can detect system bottlenecks and choose different approaches, such as splitting the output into multiple targets, staggering output processes. Such methods must be end-to-end, meaning that even with properly managed asynchronous techniques, it is still essential to properly manage the later synchronous interaction with the storage system to maintain acceptable performance. Second, for the data being generated, annotations and other metadata must be incorporated to help the scientist understand output data for the simulation run as a whole, to select data and data features without concern for what files or other storage technologies were employed. All of these features should be attained while maintaining a simple deployment for the science code and eliminating the need for allocation of additional computational resources.

More Details

Lightweight storage and overlay networks for fault tolerance

Oldfield, Ron A.

The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands to millions of processors, In such environments, it is critical to have fault-tolerance mechanisms, including checkpoint/restart, that scale with the size of applications and the percentage of the system on which the applications execute. For application-driven, periodic checkpoint operations, the state-of-the-art does not provide a scalable solution. For example, on today's massive-scale systems that execute applications which consume most of the memory of the employed compute nodes, checkpoint operations generate I/O that consumes nearly 80% of the total I/O usage. Motivated by this observation, this project aims to improve I/O performance for application-directed checkpoints through the use of lightweight storage architectures and overlay networks. Lightweight storage provide direct access to underlying storage devices. Overlay networks provide caching and processing capabilities in the compute-node fabric. The combination has potential to signifcantly reduce I/O overhead for large-scale applications. This report describes our combined efforts to model and understand overheads for application-directed checkpoints, as well as implementation and performance analysis of a checkpoint service that uses available compute nodes as a network cache for checkpoint operations.

More Details

Increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Oldfield, Ron A.; Stearley, Jon S.; Laros, James H.; Pedretti, Kevin P.; Brightwell, Ronald B.

Petaflops systems will have tens to hundreds of thousands of compute nodes which increases the likelihood of faults. Applications use checkpoint/restart to recover from these faults, but even under ideal conditions, applications running on more than 30,000 nodes will likely spend more than half of their total run time saving checkpoints, restarting, and redoing work that was lost. We created a library that performs redundant computations on additional nodes allocated to the application. An active node and its redundant partner form a node bundle which will only fail, and cause an application restart, when both nodes in the bundle fail. The goal of this library is to learn whether this can be done entirely at the user level, what requirements this library places on a Reliability, Availability, and Serviceability (RAS) system, and what its impact on performance and run time is. We find that our redundant MPI layer library imposes a relatively modest performance penalty for applications, but that it greatly reduces the number of applications interrupts. This reduction in interrupts leads to huge savings in restart and rework time. For large-scale applications the savings compensate for the performance loss and the additional nodes required for redundant computations.

More Details

HPC application fault-tolerance using transparent redundant computation

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron A.; Brightwell, Ronald B.; Laros, James H.; Pedretti, Kevin P.

As the core count of HPC machines continue to grow in size, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability which uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.

More Details
Results 76–95 of 95
Results 76–95 of 95