Publications

Results 26–50 of 68

Physics-Informed Machine Learning for DRAM Error Modeling

2018 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2018

Baseman, Elisabeth; Debardeleben, Nathan; Blanchard, Sean; Moore, Juston; Tkachenko, Olena; Ferreira, Kurt B.; Siddiqua, Taniya; Sridharan, Vilas

As the scale of high performance computing facilities approaches the exascale era, gaining a detailed understanding of hardware failures becomes important. In particular, the extreme memory capacity of modern supercomputers means that data corruption errors which were statistically negligible at smaller scales will become more prevalent. In order to understand hardware faults and mitigate their adverse effects on exascale workloads, we must learn from the behavior of current hardware. In this work, we investigate the predictability of DRAM errors using field data from two recently decommissioned supercomputers: Cielo, at Los Alamos National Laboratory, and Hopper, at Lawrence Berkeley National Laboratory. Due to the volume and complexity of the field data, we apply statistical machine learning to predict the probability of DRAM errors at previously unaccessed locations. We compare the predictive performance of six machine learning algorithms, and find that a model incorporating physical knowledge of DRAM spatial structure outperforms purely statistical methods. Our findings both support the expected physical behavior of DRAM hardware and provide a mechanism for real-time error prediction. We demonstrate real-world feasibility by training an error model on one supercomputer and effectively predicting errors on another. Our methods demonstrate the importance of spatial locality over temporal locality in DRAM errors, and show that relatively simple statistical models are effective at predicting future errors based on historical data, allowing proactive error mitigation.
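
A minimal sketch of the kind of physics-informed feature construction the abstract describes: instead of treating each DRAM location independently, the features below count prior errors sharing a location's bank, row, or column, so a statistical model can exploit spatial locality. The feature set and toy data are illustrative assumptions, not the paper's exact method.

```python
# Sketch: physics-informed spatial features for DRAM error prediction
# (illustrative feature set, not the paper's exact model).
from collections import Counter

def spatial_features(loc, history):
    """Counts of prior errors sharing this location's bank, row, or column.
    `loc` and each history entry are (bank, row, col) tuples."""
    bank, row, col = loc
    banks = Counter(b for b, _, _ in history)
    rows = Counter((b, r) for b, r, _ in history)
    cols = Counter((b, c) for b, _, c in history)
    return [banks[bank], rows[(bank, row)], cols[(bank, col)]]

# Toy history of observed error locations on one DIMM.
history = [(0, 12, 3), (0, 12, 7), (0, 12, 9), (1, 40, 2)]

# An unaccessed location on the already-faulty row scores far higher than
# a fresh one, matching the expected physics of row/column faults.
print(spatial_features((0, 12, 5), history))  # [3, 3, 0]
print(spatial_features((2, 8, 1), history))   # [0, 0, 0]
```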

It’s not the heat, it’s the humidity: Scheduling resilience activity at scale

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick W.; Ferreira, Kurt B.; Levy, Scott L.

Maintaining the performance of high-performance computing (HPC) applications with the expected increase in failures is a major challenge for next-generation extreme-scale systems. With increasing scale, resilience activities (e.g. checkpointing) are expected to become more diverse, less tightly synchronized, and more computationally intensive. Few existing studies, however, have examined how decisions about scheduling resilience activities impact application performance. In this work, we examine the relationship between the duration and frequency of resilience activities and application performance. Our study reveals several key findings: (i) the aggregate amount of time consumed by resilience activities is not an effective metric for predicting application performance; (ii) the duration of the interruptions due to resilience activities has the greatest influence on application performance; shorter, but more frequent, interruptions are correlated with better application performance; and (iii) the differential impact of resilience activities across applications is related to the applications’ inter-collective frequencies; the performance of applications that perform infrequent collective operations scales better in the presence of resilience activities than the performance of applications that perform more frequent collective operations. This initial study demonstrates the importance of considering how resilience activities are scheduled. We provide critical analysis and direct guidance on how the resilience challenges of future systems can be met while minimizing the impact on application performance.
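
A small Monte Carlo sketch of the paper's central comparison: hold the aggregate resilience time per rank fixed and vary how it is split into interruptions. Because ranks synchronize at collectives, each compute interval costs the maximum interference any rank saw, so short, frequent interruptions tend to cost far less than long, rare ones. All parameters here are invented for illustration, not taken from the paper's simulation framework.

```python
# Sketch: fixed aggregate resilience time, varying interruption granularity
# (illustrative parameters, not the paper's simulation framework).
import random
from collections import Counter

def slowdown(n_ranks, n_intervals, n_interrupts, duration, trials=50):
    """Mean extra time when each rank suffers `n_interrupts` interruptions
    of `duration` seconds, placed uniformly at random among the compute
    intervals between collectives."""
    total = 0.0
    for _ in range(trials):
        # hits[rank] maps interval index -> interruptions landing there
        hits = [Counter(random.randrange(n_intervals)
                        for _ in range(n_interrupts))
                for _ in range(n_ranks)]
        # every collective waits for the slowest rank in its interval
        total += sum(max(h[i] for h in hits) * duration
                     for i in range(n_intervals))
    return total / trials

random.seed(1)
# Same 10 s aggregate per rank, split two ways; the long, rare
# interruptions typically cost several times more overall.
print(slowdown(64, 100, 1000, 0.01))  # many short interruptions
print(slowdown(64, 100, 10, 1.0))     # few long interruptions
```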

Characterizing MPI matching via trace-based simulation

ACM International Conference Proceeding Series

Ferreira, Kurt B.; Levy, Scott L.; Laros, James H.; Grant, Ryan E.

With the increased scale expected on future leadership-class systems, detailed information about the resource usage and performance of MPI message matching provides important insights into how to maintain application performance on next-generation systems. However, obtaining MPI message matching performance data is often not possible without significant effort. A common approach is to instrument an MPI implementation to collect relevant statistics. While this approach can provide important data, collecting matching data at runtime perturbs the application’s execution, including its matching performance, and is highly dependent on the MPI library’s match-list implementation. In this paper, we introduce a trace-based simulation approach to obtain detailed MPI message matching performance data for MPI applications without perturbing their execution. Using a number of key parallel workloads, we demonstrate that this simulator approach can rapidly and accurately characterize matching behavior. Specifically, we use our simulator to collect several important statistics about the operation of the MPI posted and unexpected queues. For example, we present data about search lengths and the duration that messages spend in the queues waiting to be matched. Data gathered using this simulation-based approach have significant potential to aid hardware designers in determining resource allocation for MPI matching functions and provide application and middleware developers with insight into the scalability issues associated with MPI message matching.
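
A toy version of the simulator's core data structures: the posted-receive queue and the unexpected-message queue, driven by a trace of receive postings and message arrivals. Each lookup walks its queue in order, so the number of entries examined is exactly the search-length statistic the abstract mentions. The trace format and matching on (source, tag) pairs alone are simplifications for illustration.

```python
# Sketch: MPI message matching driven by a trace (simplified illustration;
# real matching also involves communicators and ordering rules).
posted, unexpected = [], []
search_lengths = []  # entries examined per matching operation

def matches(want, have):
    return want in (have, -1)  # -1 stands for MPI_ANY_SOURCE / MPI_ANY_TAG

def arrive(src, tag):
    """Incoming message: search the posted-receive queue in order."""
    for steps, (s, t) in enumerate(posted, start=1):
        if matches(s, src) and matches(t, tag):
            search_lengths.append(steps)
            posted.remove((s, t))
            return
    search_lengths.append(len(posted))   # searched the whole queue
    unexpected.append((src, tag))        # no match: queue as unexpected

def post_recv(src, tag):
    """Posted receive: search the unexpected-message queue first."""
    for steps, (s, t) in enumerate(unexpected, start=1):
        if matches(src, s) and matches(tag, t):
            search_lengths.append(steps)
            unexpected.remove((s, t))
            return
    search_lengths.append(len(unexpected))
    posted.append((src, tag))

# Replay a tiny trace of ('arrive' | 'recv', source, tag) events.
trace = [('arrive', 3, 7), ('recv', 1, 7), ('recv', 3, 7), ('arrive', 1, 7)]
for op, src, tag in trace:
    (arrive if op == 'arrive' else post_recv)(src, tag)
print(search_lengths)  # [0, 1, 1, 1]
```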

Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Levy, Scott L.; Ferreira, Kurt B.; Bridges, Patrick G.

Aggregating millions of hardware components to construct an exascale computing platform will pose significant resilience challenges. In addition to slowdowns associated with detected errors, silent errors are likely to further degrade application performance. Moreover, silent data corruption (SDC) has the potential to undermine the integrity of the results produced by important scientific applications. In this paper, we propose an application-independent mechanism to efficiently detect and correct SDC in read-mostly memory, where SDC may be most likely to occur. We use memory protection mechanisms to maintain compressed backups of application memory. We detect SDC by identifying changes in memory contents that occur without explicit write operations. We demonstrate that, for several applications, our approach can potentially protect a significant fraction of application memory pages from SDC with modest overheads. Moreover, our proposed technique can be straightforwardly combined with many other approaches to provide a significant bulwark against SDC.
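
The detection idea can be sketched in a few lines: keep a compressed snapshot of each read-mostly page, and treat any divergence from the snapshot that occurs without an explicit write as silent corruption, repairable from the backup. A real implementation would track legitimate writes with memory-protection hardware (e.g., write faults from mprotect); the sketch below simply assumes the page is known to be read-mostly.

```python
# Sketch: detect and correct SDC in a read-mostly page via a compressed
# backup (assumes explicit writes are tracked elsewhere, e.g., by mprotect).
import zlib

PAGE = 4096
backups = {}

def protect(page_id, data):
    """Snapshot a read-mostly page; the backup costs only its compressed size."""
    backups[page_id] = zlib.compress(bytes(data))

def scrub(page_id, data):
    """Compare current contents against the snapshot. A mismatch with no
    intervening write is silent corruption; restore from the backup."""
    good = zlib.decompress(backups[page_id])
    if bytes(data) == good:
        return False   # page is intact
    data[:] = good     # correct the corruption in place
    return True

page = bytearray(PAGE)    # a highly compressible application page
protect(0, page)
page[123] ^= 0x04         # simulate a silent single-bit flip
print(scrub(0, page))     # True: corruption detected and repaired
print(page[123])          # 0: original contents restored
```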

Automating DRAM Fault Mitigation by Learning from Experience

Proceedings - 47th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops, DSN-W 2017

Baseman, Elisabeth; Debardeleben, Nathan; Ferreira, Kurt B.; Sridharan, Vilas; Siddiqua, Taniya; Tkachenko, Olena

Current practice for mitigating DRAM hardware faults is to simply discard the entire faulty DIMM. However, this becomes increasingly expensive and wasteful as the price of memory hardware increases and memory moves physically closer to processing units. Accurately characterizing memory faults in real time in order to pre-empt future potentially catastrophic failures is crucial to conserving resources by blacklisting small affected regions of memory rather than discarding an entire hardware component. We further evaluate and extend a machine learning method for DRAM fault characterization introduced in prior work by Baseman et al. at Los Alamos National Laboratory. We report on the usefulness of a variety of training sets, using a set of production-relevant metrics to evaluate the method on data from a leadership-class supercomputing facility. We observe an increase in the percent of faults successfully mitigated as well as a decrease in the percent of wasted blacklisted pages, regardless of training set, when using the learned algorithm as compared to a human-expert, deterministic, rule-based approach.
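
The two production-relevant metrics named above are easy to state precisely. A hedged sketch follows, with an invented record format: each fault is the set of pages it eventually affects, and the mitigation policy emits a set of blacklisted pages per fault.

```python
# Sketch: the two evaluation metrics from the abstract
# (the record format and example numbers are invented for illustration).

def mitigation_metrics(faults, blacklists):
    """faults: per-fault sets of pages the fault eventually affects.
    blacklists: the pages a policy blacklisted for each fault.
    A fault is mitigated if every affected page was blacklisted; a
    blacklisted page is wasted if the fault never touched it."""
    mitigated = sum(f <= b for f, b in zip(faults, blacklists))
    listed = sum(len(b) for b in blacklists)
    wasted = sum(len(b - f) for f, b in zip(faults, blacklists))
    return mitigated / len(faults), wasted / listed

faults = [{1, 2}, {7}, {9, 10}]
rule_based = [{1, 2, 3, 4}, {7, 8}, {9}]       # blanket regions, one miss
learned = [{1, 2}, {7, 8}, {9, 10, 11}]        # targeted predictions
print(mitigation_metrics(faults, rule_based))  # approx. (0.67, 0.43)
print(mitigation_metrics(faults, learned))     # approx. (1.0, 0.29)
```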

Lifetime memory reliability data from the field

2017 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2017

Siddiqua, Taniya; Sridharan, Vilas; Raasch, Steven E.; Debardeleben, Nathan; Ferreira, Kurt B.; Levy, Scott L.; Baseman, Elisabeth; Guan, Qiang

In order to provide high system resilience, it is important to understand the nature of the faults that occur in the field. This study analyzes fault rates from a production system that has been monitored for five years, capturing data for the entire operational lifetime of the system. The data show that devices in this system did not exhibit any sign of aging during the monitoring period, suggesting that the lifetime of a system may be longer than five years. In DRAM, the relative incidence of fault modes changed insignificantly over the system's lifetime: the relative rate of each fault mode at the end of the system's lifetime was within 1.4 percentage points of the rate observed during the first year. SRAM caches in the system exhibited several different fault modes, including cache-way faults and single-bit faults. Overall, this study provides insight into how fault modes and types in a system evolve over the system's lifetime.

Horseshoes and hand grenades: The case for approximate coordination in local checkpointing protocols

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick W.; Ferreira, Kurt B.; Levy, Scott L.

Fault-tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. While fully uncoordinated approaches have been shown to incur significant delays, the degree of synchronization required to keep overheads low has not yet been adequately addressed. In this paper, we use a simulation-based approach to show the impact of synchronization on local checkpoint activity. Specifically, we show that the degree of synchronization needed to keep the impacts of local checkpointing low is attainable with current technology for a number of key production HPC workloads. Our work provides a critical analysis and comparison of synchronization and local checkpointing. This enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics available.
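
A rough sketch of the quantity under study: each rank starts a local checkpoint of duration C at a nominal time plus a random offset of up to ±σ, and the collective that follows completes only when the last rank finishes. The added delay beyond C grows with σ, so approximate coordination (small σ) is nearly as good as perfect coordination. The uniform jitter model and all parameters are illustrative assumptions, not the paper's simulation.

```python
# Sketch: extra collective delay vs. degree of checkpoint synchronization
# (the uniform jitter model and parameters are illustrative assumptions).
import random

def extra_delay(n_ranks, ckpt_secs, sigma, trials=2000):
    """Mean delay beyond the checkpoint itself at the next collective,
    when each rank starts its checkpoint at offset U(-sigma, sigma)."""
    total = 0.0
    for _ in range(trials):
        finish = [random.uniform(-sigma, sigma) + ckpt_secs
                  for _ in range(n_ranks)]
        total += max(finish) - ckpt_secs
    return total / trials

random.seed(1)
for sigma in (0.0, 0.1, 1.0, 10.0):
    print(f"sigma={sigma:5}: extra {extra_delay(1024, 5.0, sigma):.2f} s")
# Loosely synchronized starts (small sigma) keep the extra wait near zero;
# fully uncoordinated starts pay roughly the full spread at every checkpoint.
```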

How I learned to stop worrying and love in situ analytics: Leveraging latent synchronization in MPI collective algorithms

ACM International Conference Proceeding Series

Levy, Scott L.; Ferreira, Kurt B.; Widener, Patrick W.; Bridges, Patrick G.; Mondragon, Oscar H.

Scientific workloads running on current extreme-scale systems routinely generate tremendous volumes of data for postprocessing. This data movement has become a serious issue due to its energy cost and the fact that I/O bandwidths have not kept pace with data generation rates. In situ analytics is an increasingly popular alternative in which post-simulation processing is embedded into an application, running as part of the same MPI job. This can reduce data movement costs but introduces a new potential source of interference for the application. Using a validated simulation-based approach, we investigate how best to mitigate the interference from time-shared in situ tasks for a number of key extreme-scale workloads. This paper makes a number of contributions. First, we show that the independent scheduling of in situ analytics tasks can significantly degrade application performance, with slowdowns exceeding 1000%. Second, we demonstrate that the degree of synchronization found in many modern collective algorithms is sufficient to significantly reduce the overheads of this interference to less than 10% in most cases. Finally, we show that many applications already frequently invoke collective operations that use these synchronizing MPI algorithms. Therefore, the synchronization introduced by these MPI collective algorithms can be leveraged to efficiently schedule analytics tasks with minimal changes to existing applications. This paper provides critical analysis and guidance for MPI users and developers on the importance of scheduling in situ analytics tasks. It shows the degree of synchronization needed to mitigate the performance impacts of these time-shared coupled codes and demonstrates how that synchronization can be realized in an extreme-scale environment using modern collective algorithms.
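
The scheduling insight can be sketched in a few lines of mpi4py: instead of letting the OS time-slice an analytics task at arbitrary points, run a bounded analytics slice immediately after each synchronizing collective, so every rank is delayed at the same point and no rank stalls waiting on a straggler. The analytics stand-in and time budget below are invented for illustration.

```python
# Sketch: schedule in situ analytics right after a synchronizing collective
# (mpi4py; the analytics kernel and budget are invented placeholders).
from mpi4py import MPI
import random
import time

comm = MPI.COMM_WORLD
work_queue = []

def analytics_slice(budget_s):
    """Drain queued analytics work for at most `budget_s` seconds."""
    deadline = time.monotonic() + budget_s
    while work_queue and time.monotonic() < deadline:
        sum(work_queue.pop())  # stand-in for a real analysis kernel

for step in range(20):
    data = [random.random() for _ in range(10_000)]  # stand-in timestep
    work_queue.append(data)
    # The allreduce synchronizes every rank; running the analytics slice
    # immediately afterward delays all ranks together, rather than letting
    # independently scheduled analytics stall later collectives at random.
    comm.allreduce(max(data), op=MPI.MAX)
    analytics_slice(budget_s=0.05)
```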

Improving DRAM Fault Characterization through Machine Learning

Proceedings - 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop, DSN-W 2016

Baseman, Elisabeth; Debardeleben, Nathan; Ferreira, Kurt B.; Levy, Scott L.; Raasch, Steven; Sridharan, Vilas; Siddiqua, Taniya; Guan, Qiang

As high-performance computing systems continue to grow in scale and complexity, the study of faults and errors is critical to the design of future systems and mitigation schemes. Fault modes in system DRAM are a frequently investigated key aspect of memory reliability. While existing schemes require offline analysis for proper classification, current state-of-the-art mitigation techniques require accurate online prediction for optimal performance. In this work, we explore the predictive performance of an online machine learning-based approach in classifying DRAM fault modes from two leadership-class supercomputing facilities. Our results compare the predictive performance of this online approach with the current rule-based approach based on expert knowledge, finding a 12% predictive performance improvement. We also investigate the universality of our classifiers by evaluating predictive performance using training data from disparate computing systems, achieving a 7% improvement in predictive performance. Our work provides a critical analysis of this online learning technique and can help system designers establish best practices for dealing with reliability on future systems.
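
A hedged sketch of the classification setting: given features summarizing the errors a DRAM device has produced so far, predict its fault mode online. The features, labels, and model below are illustrative stand-ins, not the method evaluated in the paper.

```python
# Sketch: DRAM fault-mode classification (features, labels, and model are
# illustrative stand-ins; the paper's method may differ).
from sklearn.ensemble import RandomForestClassifier

# Per-device features: [distinct rows hit, distinct columns hit,
#                       distinct banks hit, total error count]
X_train = [
    [1, 1, 1, 3],     # repeated errors at one address
    [1, 40, 1, 80],   # one row, many columns
    [35, 1, 1, 60],   # many rows, one column
    [30, 28, 4, 90],  # errors spread across banks
]
y_train = ["single-bit", "row", "column", "multi-bank"]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# A device that has so far hit many columns within a single row:
print(clf.predict([[1, 22, 1, 31]]))  # most plausibly a row fault
```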

Improving Application Resilience to Memory Errors with Lightweight Compression

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Levy, Scott L.; Ferreira, Kurt B.; Bridges, Patrick G.

In next-generation extreme-scale systems, application performance will be limited by memory performance characteristics. The first exascale system is projected to contain many petabytes of memory. In addition to the sheer volume of the memory required, device trends, such as shrinking feature sizes and reduced supply voltages, have the potential to increase the frequency of memory errors. As a result, resilience to memory errors is a key challenge. In this paper, we evaluate the viability of using memory compression to repair detectable uncorrectable errors (DUEs) in memory. We develop a software library, evaluate its performance and demonstrate that it is able to significantly compress memory of HPC applications. Further, we show that exploiting compressed memory pages to correct memory errors can significantly improve application performance on next-generation systems.
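
The viability argument rests on HPC application pages compressing well, and the quantity the library must deliver is easy to measure directly: the per-page compression ratio. The page contents below are illustrative stand-ins; the paper's library and compressor may differ.

```python
# Sketch: per-page compressibility, which determines whether a compressed
# backup can repair a DUE at acceptable cost (illustrative page contents).
import array
import os
import zlib

PAGE = 4096

def page_ratio(buf):
    """Compressed-to-original size ratio for one page."""
    return len(zlib.compress(bytes(buf), 1)) / len(buf)

zero_page = bytes(PAGE)                           # untouched allocation
mesh_page = array.array('d', [300.0 + 0.001 * i   # smooth field values
                              for i in range(PAGE // 8)]).tobytes()
noise_page = os.urandom(PAGE)                     # incompressible bound

for name, page in [("zeros", zero_page), ("mesh", mesh_page),
                   ("noise", noise_page)]:
    print(f"{name:6s} ratio = {page_ratio(page):.3f}")
# Pages with ratios well below 1 leave room to keep a backup copy that can
# reconstruct the page when a detectable uncorrectable error strikes it.
```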

Mini-Ckpts: Surviving OS failures in persistent memory

Proceedings of the International Conference on Supercomputing

Fiala, David; Mueller, Frank; Ferreira, Kurt B.; Engelmann, Christian

Concern is growing in the high-performance computing (HPC) community about the reliability of future extreme-scale systems. Current efforts have focused on application fault tolerance rather than the operating system (OS), despite recent studies suggesting that failures in OS memory may be more likely. The OS is critical to the correct and efficient operation of the node and the processes it governs, and the parallel nature of HPC applications means that any single node failure generally forces all of an application's processes to terminate due to tight communication in HPC. Therefore, the OS itself must be capable of tolerating failures in a robust system. In this work, we introduce mini-ckpts, a framework which enables application survival despite the occurrence of a fatal OS failure or crash. mini-ckpts achieves this tolerance by ensuring that the critical data describing a process is preserved in persistent memory prior to the failure. Following the failure, the OS is rejuvenated via a warm reboot and the application continues execution, effectively making the failure and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three and six seconds and has a failure-free overhead of 3-5% for a number of key HPC workloads. In contrast to current fault-tolerance methods, this work ensures that the operating and runtime systems can continue in the presence of faults. This is a much finer-grained and more dynamic method of fault tolerance than the current coarse-grained, application-centric methods. Handling faults at this level has the potential to greatly reduce overheads and enables mitigation of additional faults.

An examination of the impact of failure distribution on coordinated checkpoint/restart

FTXS 2016 - Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale

Levy, Scott L.; Ferreira, Kurt B.

Fault tolerance is a key challenge to building the first exascale system. To understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process. Failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties. In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times are less tightly grouped (i.e., as the standard deviation increases) the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has a modest impact on application performance even when failures are drawn from a different distribution. Our work provides critical analysis and guidance to the process of analyzing failure data in the context of coordinated checkpoint/restart. Specifically, the data presented in this paper helps to distinguish cases where the failure distribution has a strong influence on application performance from those cases when the failure distribution has relatively little impact.
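
For context, the exponential-failure assumption mentioned above typically enters through the first-order Young/Daly approximation for the optimal checkpoint interval, where δ is the time to write a checkpoint and M is the system mean time between failures. This is the standard formula, not necessarily the exact model used in the paper:

```latex
% Young/Daly first-order optimal checkpoint interval
\tau_{\mathrm{opt}} \approx \sqrt{2\,\delta M}
% Example: \delta = 5 min and M = 24 h = 1440 min give
% \tau_{\mathrm{opt}} \approx \sqrt{2 \cdot 5 \cdot 1440} = 120 min.
```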
