Publications Search

Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Levy, Scott L.N.; Ferreira, Kurt; Bridges, Patrick G.

Aggregating millions of hardware components to construct an exascale computing platform will pose significant resilience challenges. In addition to slowdowns associated with detected errors, silent errors are likely to further degrade application performance. Moreover, silent data corruption (SDC) has the potential to undermine the integrity of the results produced by important scientific applications. In this paper, we propose an application-independent mechanism to efficiently detect and correct SDC in read-mostly memory, where SDC may be most likely to occur. We use memory protection mechanisms to maintain compressed backups of application memory. We detect SDC by identifying changes in memory contents that occur without explicit write operations. We demonstrate that, for several applications, our approach can potentially protect a significant fraction of application memory pages from SDC with modest overheads. Moreover, our proposed technique can be straightforwardly combined with many other approaches to provide a significant bulwark against SDC.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Modeling Concurrent Point-to-Point Communication Cost in MPI Performance Models

Farmer, Shane; Skjellum, Anthony; Bridges, Patrick G.; Dosanjh, Matthew G.; Grant, Ryan; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Improving Application Resilience to Memory Errors with Lightweight Compression

Levy, Scott L.N.; Ferreira, Kurt; Bridges, Patrick G.

Abstract not provided.

TYPE Conference Poster YEAR 2016

DOI OSTI

How I learned to stop worrying and love in situ analytics: Leveraging latent synchronization in MPI collective algorithms

ACM International Conference Proceeding Series

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick G.; Mondragon, Oscar H.

Scientific workloads running on current extreme-scale systems routinely generate tremendous volumes of data for postprocessing. This data movement has become a serious issue due to its energy cost and the fact that I/O bandwidths have not kept pace with data generation rates. In situ analytics is an increasingly popular alternative in which post-simulation processing is embedded into an application, running as part of the same MPI job. This can reduce data movement costs but introduces a new potential source of interference for the application. Using a validated simulation-based approach, we investigate how best to mitigate the interference from time-shared in situ tasks for a number of key extreme-scale workloads. This paper makes a number of contributions. First, we show that the independent scheduling of in situ analytics tasks can significantly degrade application performance, with slowdowns exceeding 1000%. Second, we demonstrate that the degree of synchronization found in many modern collective algorithms is sufficient to significantly reduce the overheads of this interference to less than 10% in most cases. Finally, we show that many applications already frequently invoke collective operations that use these synchronizing MPI algorithms. Therefore, the synchronization introduced by these MPI collective algorithms can be leveraged to efficiently schedule analytics tasks with minimal changes to existing applications. This paper provides critical analysis and guidance for MPI users and developers on the importance of scheduling in situ analytics tasks. It shows the degree of synchronization needed to mitigate the performance impacts of these time-shared coupled codes and demonstrates how that synchronization can be realized in an extreme-scale environment using modern collective algorithms.

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

Improving Application Resilience to Memory Errors with Lightweight Compression

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Levy, Scott L.N.; Ferreira, Kurt; Bridges, Patrick G.

In next-generation extreme-scale systems, application performance will be limited by memory performance characteristics. The first exascale system is projected to contain many petabytes of memory. In addition to the sheer volume of the memory required, device trends, such as shrinking feature sizes and reduced supply voltages, have the potential to increase the frequency of memory errors. As a result, resilience to memory errors is a key challenge. In this paper, we evaluate the viability of using memory compression to repair detectable uncorrectable errors (DUEs) in memory. We develop a software library, evaluate its performance and demonstrate that it is able to significantly compress memory of HPC applications. Further, we show that exploiting compressed memory pages to correct memory errors can significantly improve application performance on next-generation systems.

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

Understanding Performance Interference in Next-Generation HPC Systems

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick G.; Mondragon, Oscar H.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

XASM: A Cross-Enclave Composition Mechanism for Exascale System Software

Evans, Noah; Foulk, James W.; Lange, John R.; Kocoloski, Brian; Bridges, Patrick G.; Lang, Michael

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

A cross-enclave composition mechanism for exascale system software

Proceedings of the 6th International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2016 - In conjunction with HPDC 2016

Evans, Noah; Foulk, James W.; Kocoloski, Brian; Lange, John R.; Lang, Michael; Bridges, Patrick G.

As supercomputers move to exascale, the number of cores per node continues to increase, but the I/O bandwidth between nodes is increasing more slowly. This leads to computational power outstripping I/O bandwidth. This growth, in turn, encourages moving as much of an HPC workflow as possible onto the node in order to minimize data movement. One particular method of application composition, enclaves, co-locates different operating systems and runtimes on the same node where they communicate by in situ communication mechanisms. In this work, we describe a mechanism for communicating between composed applications. We implement a mechanism using Copy-on-Write cooperating with XEMEM shared memory to provide consistent, implicitly unsynchronized communication across enclaves. We then evaluate this mechanism using a composed application and analytics between the Kitten Lightweight Kernel and Linux on top of the Hobbes Operating System and Runtime. These results show a 3% overhead compared to an application running in isolation, demonstrating the viability of this approach.

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick G.; Mondragon, Oscar

Abstract not provided.

TYPE Presentation YEAR 2016

DOI OSTI

RMA-MT: A Benchmark Suite for Assessing MPI Multi-threaded RMA Performance

Dosanjh, Matthew G.; Groves, Taylor L.; Grant, Ryan; Brightwell, Ronald B.; Bridges, Patrick G.

Abstract not provided.

TYPE Conference Poster YEAR 2016

DOI OSTI

SHMEM-MT: A benchmark suite for assessing multi-threaded SHMEM performance

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Weeks, Hans; Dosanjh, Matthew G.; Bridges, Patrick G.; Grant, Ryan

Abstract not provided.

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

Similarity Engine: Using Content Similarity to Improve Memory Resilience

Levy, Scott L.N.; Ferreira, Kurt; Bridges, Patrick G.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Scheduling In-Situ Analytics in Next-generation Applications

Mondragon, Oscar H.; Bridges, Patrick G.; Ferreira, Kurt; Widener, Patrick; Levy, Scott L.N.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Preparing for Exascale: Modeling MPI for Many-Core Systems using Fine-Grain Queues

Bridges, Patrick G.; Dosanjh, Matthew G.; Grant, Ryan; Farmer, Shane; Skjellum, Anthony; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference Poster YEAR 2015

DOI OSTI

Re-evaluating Network Onload vs. Offload for the Many-Core Era

Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick G.; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference Poster YEAR 2015

DOI OSTI

System-Level Support for Composition of Applications

Lofstead, Gerald F.; Foulk, James W.; Kocoloski, Brian; Lange, John R.; Abbasi, Hasan; Bernholdt, David; Jones, Terry; Dayal, Jai; Evans, Noah; Lang, Michael; Bridges, Patrick G.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Re-evaluating Network Onload vs. Offload for the Many-Core Era

Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick G.; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Re-evaluating Network Onload vs. Offload for the Many-Core Era

Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick G.; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference Poster YEAR 2014

OSTI

Minimal-overhead virtualization of a large scale supercomputer

ACM SIGPLAN Notices

Lange, John R.; Pedretti, Kevin P.; Dinda, Peter; Bae, Chang; Bridges, Patrick G.; Soltero, Philip; Merritt, Alexander

Virtualization has the potential to dramatically increase the usability and reliability of high performance computing (HPC) systems. However, this potential will remain unrealized unless overheads can be minimized. This is particularly challenging on large scale machines that run carefully crafted HPC OSes supporting tightly-coupled, parallel applications. In this paper, we show how careful use of hardware and VMM features enables the virtualization of a large-scale HPC system, specifically a Cray XT4 machine, with ≤5% overhead on key HPC applications, microbenchmarks, and guests at scales of up to 4096 nodes. We describe three techniques essential for achieving such low overhead: passthrough I/O, workload-sensitive selection of paging mechanisms, and carefully controlled preemption. These techniques are forms of symbiotic virtualization, an approach on which we elaborate. Copyright © 2011 ACM.

TYPE Conference YEAR 2011

Scopus OSTI

VM-based Slack Emulation of Large-scale Systems

Bridges, Patrick G.; Pedretti, Kevin

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Evaluating the Viability of Process Replication Reliability for Exascale Systems

Ferreira, Kurt; Stearley, Jon S.; Laros, James H.; Oldfield, Ron; Pedretti, Kevin P.; Brightwell, Ronald B.; Bridges, Patrick G.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Opportunities for Leveraging OS Virtualization in High-End Supercomputing

Pedretti, Kevin; Bridges, Patrick G.

Abstract not provided.

TYPE Presentation YEAR 2010

OSTI

Opportunities for leveraging OS virtualization in high-end supercomputing

Pedretti, Kevin P.; Bridges, Patrick G.

This paper examines potential motivations for incorporating virtualization support in the system software stacks of high-end capability supercomputers. We advocate that this will increase the flexibility of these platforms significantly and enable new capabilities that are not possible with current fixed software stacks. Our results indicate that compute, virtual memory, and I/O virtualization overheads are low and can be further mitigated by utilizing well-known techniques such as large paging and VMM bypass. Furthermore, since the addition of virtualization support does not affect the performance of applications using the traditional native environment, there is essentially no disadvantage to its addition.

TYPE Conference YEAR 2010

OSTI
