Publications Search

With the increased scale expected on future leadership-class systems, detailed information about the resource usage and performance of MPI message matching provides important insights into how to maintain application performance on next-generation systems. However, obtaining MPI message matching performance data is often not possible without significant effort. A common approach is to instrument an MPI implementation to collect relevant statistics. While this approach can provide important data, collecting matching data at runtime perturbs the application’s execution, including its matching performance, and is highly dependent on the MPI library’s matchlist implementation. In this paper, we introduce a trace-based simulation approach to obtain detailed MPI message matching performance data for MPI applications without perturbing their execution. Using a number of key parallel workloads, we demonstrate that this simulator approach can rapidly and accurately characterize matching behavior. Specifically, we use our simulator to collect several important statistics about the operation of the MPI posted and unexpected queues. For example, we present data about search lengths and the duration that messages spend in the queues waiting to be matched. Data gathered using this simulation-based approach have significant potential to aid hardware designers in determining resource allocation for MPI matching functions and provide application and middleware developers with insight into the scalability issues associated with MPI message matching.

TYPE Journal Article YEAR 2017

DOI OSTI Scopus

Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Levy, Scott L.N.; Ferreira, Kurt; Bridges, Patrick G.

Aggregating millions of hardware components to construct an exascale computing platform will pose significant resilience challenges. In addition to slowdowns associated with detected errors, silent errors are likely to further degrade application performance. Moreover, silent data corruption (SDC) has the potential to undermine the integrity of the results produced by important scientific applications. In this paper, we propose an application-independent mechanism to efficiently detect and correct SDC in read-mostly memory, where SDC may be most likely to occur. We use memory protection mechanisms to maintain compressed backups of application memory. We detect SDC by identifying changes in memory contents that occur without explicit write operations. We demonstrate that, for several applications, our approach can potentially protect a significant fraction of application memory pages from SDC with modest overheads. Moreover, our proposed technique can be straightforwardly combined with many other approaches to provide a significant bulwark against SDC.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

ATDM Data Warehouse: Data Management Services for Exascale Computing

Ulmer, Craig; Oldfield, Ron; Kordenbrock, Todd; Levy, Scott L.N.; Lofstead, Gerald F.; Mukherjee, Shyamali; Templet, Gary J.; Widener, Patrick

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Lifetime memory reliability data from the field

2017 IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2017

Siddiqua, Taniya; Sridharan, Vilas; Raasch, Steven E.; Debardeleben, Nathan; Ferreira, Kurt; Levy, Scott L.N.; Baseman, Elisabeth; Guan, Qiang

In order to provide high system resilience, it is important to understand the nature of the faults that occur in the field. This study analyzes fault rates from a production system that has been monitored for five years, capturing data for the entire operational lifetime of the system. The data show that devices in this system did not show any sign of aging during the monitoring period, suggesting that the lifetime of a system may be longer than five years. In DRAM, the relative incidence of fault modes changed insignificantly over the system's lifetime: the relative rate of each fault mode at the end of the system's lifetime was within 1.4 percentage points of the rate observed during the first year. SRAM caches in the system exhibited different fault modes, including cache-way faults and single-bit faults. Overall, this study provides insights on how fault modes and types in a system evolve over the system's lifetime.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Horseshoes and hand grenades: The case for approximate coordination in local checkpointing protocols

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick; Ferreira, Kurt; Levy, Scott L.N.

Fault tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. While fully uncoordinated approaches have been shown to have significant delays, the degree of synchronization required to keep overheads low has not yet been significantly addressed. In this paper, we use a simulation-based approach to show the impact of synchronization on local checkpoint activity. Specifically, we show that the degree of synchronization needed to keep the impacts of local checkpointing low is attainable with current technology for a number of key production HPC workloads. Our work provides a critical analysis and comparison of synchronization and local checkpointing. This enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics available.

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

ATDM Data Warehouse

Ulmer, Craig; Kordenbrock, Todd; Levy, Scott L.N.; Lofstead, Gerald F.; Mukherjee, Shyamali; Sjaardema, Gregory D.; Templet, Gary J.; Widener, Patrick; Oldfield, Ron

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Improving Application Resilience to Memory Errors with Lightweight Compression

Levy, Scott L.N.; Ferreira, Kurt; Bridges, Patrick G.

Abstract not provided.

TYPE Conference Poster YEAR 2016

DOI OSTI

How I learned to stop worrying and love in situ analytics: Leveraging latent synchronization in MPI collective algorithms

ACM International Conference Proceeding Series

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick G.; Mondragon, Oscar H.

Scientific workloads running on current extreme-scale systems routinely generate tremendous volumes of data for postprocessing. This data movement has become a serious issue due to its energy cost and the fact that I/O bandwidths have not kept pace with data generation rates. In situ analytics is an increasingly popular alternative in which post-simulation processing is embedded into an application, running as part of the same MPI job. This can reduce data movement costs but introduces a new potential source of interference for the application. Using a validated simulation-based approach, we investigate how best to mitigate the interference from time-shared in situ tasks for a number of key extreme-scale workloads. This paper makes a number of contributions. First, we show that the independent scheduling of in situ analytics tasks can significantly degrade application performance, with slowdowns exceeding 1000%. Second, we demonstrate that the degree of synchronization found in many modern collective algorithms is sufficient to significantly reduce the overheads of this interference to less than 10% in most cases. Finally, we show that many applications already frequently invoke collective operations that use these synchronizing MPI algorithms. Therefore, the synchronization introduced by these MPI collective algorithms can be leveraged to efficiently schedule analytics tasks with minimal changes to existing applications. This paper provides critical analysis and guidance for MPI users and developers on the importance of scheduling in situ analytics tasks. It shows the degree of synchronization needed to mitigate the performance impacts of these time-shared coupled codes and demonstrates how that synchronization can be realized in an extreme-scale environment using modern collective algorithms.

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

Improving DRAM Fault Characterization through Machine Learning

Proceedings - 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN-W 2016

Baseman, Elisabeth; Debardeleben, Nathan; Ferreira, Kurt; Levy, Scott L.N.; Raasch, Steven; Sridharan, Vilas; Siddiqua, Taniya; Guan, Qiang

As high-performance computing systems continue to grow in scale and complexity, the study of faults and errors is critical to the design of future systems and mitigation schemes. Fault modes in system DRAM are a frequently-investigated key aspect of memory reliability. While current schemes require offline analysis for proper classification, current state-of-the-art mitigation techniques require accurate online prediction for optimal performance. In this work, we explore the predictive performance of an online machine learning-based approach in classifying DRAM fault modes from two leadership-class supercomputing facilities. Our results compare the predictive performance of this online approach with the current rule-based approach based on expert knowledge, finding a 12% predictive performance improvement. We also investigate the universality of our classifiers by evaluating predictive performance using training data from disparate computing systems to achieve a 7% improvement in predictive performance. Our work provides a critical analysis of this online learning technique and can benefit system designers to help inform best practices for dealing with reliability on future systems.

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

Improving Application Resilience to Memory Errors with Lightweight Compression

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Levy, Scott L.N.; Ferreira, Kurt; Bridges, Patrick G.

In next-generation extreme-scale systems, application performance will be limited by memory performance characteristics. The first exascale system is projected to contain many petabytes of memory. In addition to the sheer volume of the memory required, device trends, such as shrinking feature sizes and reduced supply voltages, have the potential to increase the frequency of memory errors. As a result, resilience to memory errors is a key challenge. In this paper, we evaluate the viability of using memory compression to repair detectable uncorrectable errors (DUEs) in memory. We develop a software library, evaluate its performance and demonstrate that it is able to significantly compress memory of HPC applications. Further, we show that exploiting compressed memory pages to correct memory errors can significantly improve application performance on next-generation systems.

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

Understanding Performance Interference in Next-Generation HPC Systems

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick G.; Mondragon, Oscar H.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

An examination of the impact of failure distribution on coordinated checkpoint/restart

FTXS 2016 - Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale

Levy, Scott L.N.; Ferreira, Kurt

Fault tolerance is a key challenge to building the first exascale system. To understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process. Failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties. In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times are less tightly grouped (i.e., as the standard deviation increases) the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has a modest impact on application performance even when failures are drawn from a different distribution. Our work provides critical analysis and guidance to the process of analyzing failure data in the context of coordinated checkpoint/restart. Specifically, the data presented in this paper helps to distinguish cases where the failure distribution has a strong influence on application performance from those cases when the failure distribution has relatively little impact.

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

How I Learned to Stop Worrying and Love In Situ Analytics: Leveraging latent synchronization in MPI collective algorithms

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick; Mondragon, Oscar

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick G.; Mondragon, Oscar

Abstract not provided.

TYPE Presentation YEAR 2016

DOI OSTI

Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems

Levy, Scott L.N.

High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the first supercomputer capable of executing more than an exaflop, 10¹⁸ floating point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extreme-scale systems. Coordinated checkpointing is currently the most widely-used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains effective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniques include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and software-based memory fault correction. In this thesis, we examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, we evaluate the potential impact of rollback avoidance on these systems. We then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, we examine the feasibility of using this technique to protect against memory faults in kernel memory.

TYPE Thesis or Dissertation YEAR 2016

OSTI

On noise and the performance benefit of nonblocking collectives

International Journal of High Performance Computing Applications

Widener, Patrick; Levy, Scott L.N.; Ferreira, Kurt; Hoefler, Torsten

Relaxed synchronization offers the potential for maintaining application scalability, by allowing many processes to make independent progress when some processes suffer delays. Yet the benefits of this approach for important parallel workloads have not been investigated in detail. In this paper, we use a validated simulation approach to explore the noise-mitigation effects of idealized nonblocking collectives, in workloads where these collectives are a major contributor to total execution time. Although nonblocking collectives are unlikely to provide significant noise mitigation to applications in the low operating system noise environments expected in next-generation high-performance computing systems, we show that they can potentially improve application runtime with respect to other noise types.

TYPE Journal Article YEAR 2016

DOI OSTI Scopus

Similarity Engine: Using Content Similarity to Improve Memory Resilience

Levy, Scott L.N.; Ferreira, Kurt; Bridges, Patrick G.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Scheduling In-Situ Analytics in Next-generation Applications

Mondragon, Oscar H.; Bridges, Patrick G.; Ferreira, Kurt; Widener, Patrick; Levy, Scott L.N.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Similarity Engine: Using Content Similarity to Improve Memory Resilience

Levy, Scott L.N.; Ferreira, Kurt; Bridges, Patrick G.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Canaries in a coal mine: Using application-level checkpoints to detect memory failures

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick; Ferreira, Kurt; Levy, Scott L.N.; Fabian, Nathan

Memory failures in future extreme-scale applications are a significant concern in the high-performance computing community and have attracted much research attention. We contend in this paper that using application checkpoint data to detect memory failures has potential benefits and is preferable to examining application memory. To support this contention, we describe the application of machine learning techniques to evaluate the veracity of checkpoint data. Our preliminary results indicate that supervised decision tree machine learning approaches can effectively detect corruption in restart files, suggesting that future extreme-scale applications and systems may benefit from incorporating such approaches in order to cope with memory failures.

TYPE Conference Poster YEAR 2015

OSTI Scopus

Using Machine Learning to Optimize Uncoordinated Checkpointing Performance

Ferreira, Kurt; Levy, Scott L.N.; Widener, Patrick; Arnold, Dorian

Abstract not provided.

TYPE Conference Poster YEAR 2014

OSTI
