Publications Search

Improving Application Resilience to Memory Errors with Lightweight Compression

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Levy, Scott L.N.; Ferreira, Kurt; Bridges, Patrick G.

In next-generation extreme-scale systems, application performance will be limited by memory performance characteristics. The first exascale system is projected to contain many petabytes of memory. In addition to the sheer volume of the memory required, device trends, such as shrinking feature sizes and reduced supply voltages, have the potential to increase the frequency of memory errors. As a result, resilience to memory errors is a key challenge. In this paper, we evaluate the viability of using memory compression to repair detectable uncorrectable errors (DUEs) in memory. We develop a software library, evaluate its performance and demonstrate that it is able to significantly compress memory of HPC applications. Further, we show that exploiting compressed memory pages to correct memory errors can significantly improve application performance on next-generation systems.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

Understanding Performance Interference in Next-Generation HPC Systems

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick G.; Mondragon, Oscar H.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

An examination of the impact of failure distribution on coordinated checkpoint/restart

FTXS 2016 - Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale

Levy, Scott L.N.; Ferreira, Kurt

Fault tolerance is a key challenge to building the first exascale system. To understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process. Failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties. In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times are less tightly grouped (i.e., as the standard deviation increases) the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has a modest impact on application performance even when failures are drawn from a different distribution. Our work provides critical analysis and guidance to the process of analyzing failure data in the context of coordinated checkpoint/restart. Specifically, the data presented in this paper helps to distinguish cases where the failure distribution has a strong influence on application performance from those cases when the failure distribution has relatively little impact.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

How I Learned to Stop Worrying and Love In Situ Analytics:Leveraging latent synchronization in MPI collective algorithms

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick; Mondragon, Oscar

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick G.; Mondragon, Oscar

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

DOI OSTI

Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems

Levy, Scott L.N.

High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the rst supercomputer capable executing more than an exa op, 10¹⁸ oating point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extremescale systems. Coordinated checkpointing is currently the most widely-used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains e ective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniqes include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and softwarebased memory fault correction. In this thesis, we examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, we evaluate the potential impact of rollback avoidance on these systems. We then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, we examine the feasibility of using this technique to protect against memory faults in kernel memory.

More Details

TYPE Thesis or Dissertation YEAR 2016

OSTI

On noise and the performance benefit of nonblocking collectives

International Journal of High Performance Computing Applications

Widener, Patrick; Levy, Scott L.N.; Ferreira, Kurt; Hoefler, Torsten

Relaxed synchronization offers the potential for maintaining application scalability, by allowing many processes to make independent progress when some processes suffer delays. Yet the benefits of this approach for important parallel workloads have not been investigated in detail. In this paper, we use a validated simulation approach to explore the noise-mitigation effects of idealized nonblocking collectives, in workloads where these collectives are a major contributor to total execution time. Although nonblocking collectives are unlikely to provide significant noise mitigation to applications in the low operating system noise environments expected in next-generation high-performance computing systems, we show that they can potentially improve application runtime with respect to other noise types.

More Details

TYPE Journal Article YEAR 2016

DOI OSTI Scopus