Page 3 – Center for Computing Research (CCR)

Early experiences with node-level power capping on the cray XC40 platform

Proceedings of E2SC 2015: 3rd International Workshop on Energy Efficient Supercomputing - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis

Pedretti, Kevin P.; Olivier, Stephen L.; Ferreira, Kurt B.; Shipman, Galen; Shu, Wei

Power consumption of extreme-scale supercomputers has become a key performance bottleneck. Yet current practices do not leverage power management opportunities, instead running at maximum power. This is not sustainable. Future systems will need to manage power as a critical resource, directing it to where it has greatest benefit. Power capping is one mechanism for managing power budgets, however its behavior is not well understood. This paper presents an empirical evaluation of several key HPC workloads running under a power cap on a Cray XC40 system, and provides a comparison of this technique with p-state control, demonstrating the performance differences of each. These results show: 1.) Maximum performance requires ensuring the cap is not reached; 2.) Performance slowdown under a cap can be attributed to cascading delays which result in unsynchronized performance variability across nodes; and, 3.) Due to lag in reaction time, considerable time is spent operating above the set cap. This work provides a timely and much needed comparison of HPC application performance under a power cap and attempts to enable users and system administrators to understand how to best optimize application performance on power-constrained HPC systems.

More Details

TYPE Conference Poster YEAR 2015

Scopus OSTI DOI

A checkpoint compression study for high-performance computing systems

International Journal of High Performance Computing Applications

Ibtesham, Dewan; Ferreira, Kurt B.; Arnold, Dorian

As high-performance computing systems continue to increase in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes and, thereby, reducing CR protocol overhead leading to improved application performance. In this article, we further explore compression-based CR optimization by exploring its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. Our results highlights are that: (1) compression is a very viable CR optimization; (2) generic, text-based compression algorithms appear to perform near optimally for checkpoint data compression and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost effective.

More Details

TYPE Journal Article YEAR 2015

Scopus OSTI DOI

Early Experiences with Node-Level Power Capping on the Cray XC40 Platform

Pedretti, Kevin P.; Olivier, Stephen L.; Ferreira, Kurt B.; Shipman, Galen S.; Shu, Wei S.

Power consumption of extreme-scale supercomputers has become a key performance bottleneck. Yet current practices do not leverage power management opportunities, instead running at ''maximum power''. This is not sustainable. Future systems will need to manage power as a critical resource, directing where it has greatest benefit. Power capping is one mechanism for managing power budgets, however its behavior is not well understood. This paper presents an empirical evaluation of several key HPC workloads running under a power cap on a Cray XC40 system, and provides a comparison of this technique with p-state control, demonstrating the performance differences of each. These results show: 1. Maximum performance requires ensuring the cap is not reached; 2. Performance slowdown under a cap can be attributed to cascading delays which result in unsynchronized performance variability across nodes; and, 3. Due to lag in reaction time, considerable time is spent operating above the set cap. This work provides a timely and much needed comparison of HPC application performance under a power cap and attempts to enable users and system administrators to understand how to best optimize application performance on power-constrained HPC systems.

More Details

TYPE Conference Poster YEAR 2015

OSTI DOI

A principled approach to HPC event monitoring

FTXS 2015 - Proceedings of the 2015 Workshop on Fault Tolerance for HPC at eXtreme Scale, Part of HPDC 2015

Goudarzi, Alireza; Arnold, Dorian; Stefanovic, Darko; Ferreira, Kurt B.; Feldman, Guy

As high-performance computing (HPC) systems become larger and more complex, fault tolerance becomes a greater concern. At the same time, the data volume collected to help in understanding and mitigating hardware and software faults and failures also becomes prohibitively large. We argue that the HPC community must adopt more systematic approaches to system event logging as opposed to the current, ad hoc, strategies based on practitioner intuition and experience. Specifically, we show that event correlation and prediction can increase our understanding of fault behavior and can become critical components of effective fault tolerance strategies. While event correlation and prediction have been used in HPC contexts, we offer new insights about their potential capabilities. Using event logs from the computer failure data repository (cfdr) (1) we use cross and partial correlations to observe conditional correlations in HPC event data; (2) we use information theory to understand the fundamental predictive power of HPC failure data; (3) we study neural networks for failure prediction; and (4) finally, we use principal component analysis to understand to what extent dimensionality reduction can apply to HPC event data. This work results in the following insights that can inform HPC event monitoring: ad hoc correlations or ones based on direct correlations can be deficient or even misleading; highly accurate failure prediction may only require small windows of failure event history; and principal component analysis can significantly reduce HPC event data without loss of relevant information.

More Details

TYPE Conference Poster YEAR 2015

Scopus OSTI

Panel: What is a Lightweight Kernel?

Riesen, Rolf R.; Maccabe, Barney M.; Gerofi, Balazs G.; Lombard, David L.; Lange, John L.; Pedretti, Kevin P.; Ferreira, Kurt B.; Lang, Mike L.; Keppel, Pardo K.; Wisniewski, Robert W.; Brightwell, Ronald B.; Inglett, Todd I.; Park, Yoonho P.; Ishikawa, Yutaka I.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Exploring MPI Application Performance Under Power Capping on the Cray XC40 Platform

Pedretti, Kevin P.; Olivier, Stephen L.; Ferreira, Kurt B.; Shipman, Galen S.; Shu, Wei S.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Similarity Engine: Using Content Similarity to Improve Memory Resilience

Levy, Scott L.; Ferreira, Kurt B.; Bridges, Patrick G.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Dynamic Task Scheduling to Mitigate System Performance Variability

Shipman, Galen S.; McCormick, Patrick M.; Pedretti, Kevin P.; Olivier, Stephen L.; Ferreira, Kurt B.; Chen, Jacqueline H.; Sankaran, Ramanan S.; Treichler, Sean T.; Aiken, Alex A.; Bauer, Michael B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Revisiting Checkpointing for Exascale-Class Systems

Ferreira, Kurt B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Memory errors in modern systems: The good, the bad, and the ugly

International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS

Sridharan, Vilas; DeBardeleben, Nathan; Blanchard, Sean; Ferreira, Kurt B.; Stearley, Jon; Shalf, John; Gurumurthi, Sudhanva

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently-deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Finally, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.

More Details

TYPE Journal Article YEAR 2015

Scopus OSTI DOI

Canaries in a coal mine: Using application-level checkpoints to detect memory failures

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick W.; Ferreira, Kurt B.; Levy, Scott; Fabian, Nathan D.

Memory failures in future extreme scale applications are a significant concern in the high-performance computing community and have attracted much research attention. We contend in this paper that using application checkpoint data to detect memory failures has potential benefits and is preferable to examining application memory. To support this contention, we describe the application of machine learning techniques to evaluate the veracity of checkpoint data. Our preliminary results indicate that supervised decision tree machine learning approaches can effectively detect corruption in restart files, suggesting that future extreme-scale applications and systems may benefit from incorporating such approaches in order to cope with memory failues.

More Details

TYPE Conference Poster YEAR 2015

Scopus OSTI

Using Machine Learning to Optimize Uncoordinated Checkpointing Performance

Ferreira, Kurt B.; Levy, Scott L.; Widener, Patrick W.; Arnold, Dorian A.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2014

OSTI

Fault Survivability of Lightweight Operating Systems for exascale

Ferreira, Kurt B.

Concern is growing in the High-Performance Computing community regarding the reliability of proposed exascale systems. Current research has shown that the expected reliability of these machines will greatly reduce their scalability. In constrast to current fault tolerance methods whose reliability focus is only the application, this project investigates the benefits integrating reliability mechcanisms in the operating system and runtime, as well as the appli- cation. More specifically, this project has three broad contributions in the field: First, using failure logs from current leadership-class high-performance computing systems, we outline the failures common on these large-scale systems. Second, we describe a novel memory pro- tection mechcanism capable of protecting common observed failures that uses the similarity inherrant in many OS and applications state, thereby reducing overheads. Finally, using an analogy with OS jitter, we develop a highly effecient simulator capable predicting the performance of resilience methods at the scales expected for future extreme-scale systems.

More Details

TYPE SAND Report YEAR 2014

OSTI DOI

Publications