Publications

Results 51–68 of 68

On noise and the performance benefit of nonblocking collectives

International Journal of High Performance Computing Applications

Widener, Patrick W.; Levy, Scott L.; Ferreira, Kurt B.; Hoefler, Torsten

Relaxed synchronization offers the potential for maintaining application scalability by allowing many processes to make independent progress when some processes suffer delays. Yet the benefits of this approach for important parallel workloads have not been investigated in detail. In this paper, we use a validated simulation approach to explore the noise-mitigation effects of idealized nonblocking collectives in workloads where these collectives are a major contributor to total execution time. Although nonblocking collectives are unlikely to provide significant noise mitigation to applications in the low operating system noise environments expected in next-generation high-performance computing systems, we show that they can potentially improve application runtime with respect to other noise types.

More Details
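
The core idea in the entry above, that a nonblocking collective lets a rank keep doing useful work while noise-delayed peers catch up, can be illustrated with a minimal sketch. The snippet below uses mpi4py and NumPy as an assumption for illustration; it is not the validated simulation framework used in the paper.

```python
# Minimal sketch of overlapping computation with a nonblocking collective.
# Assumes mpi4py and NumPy are available; illustrative only, not the
# simulation approach used in the paper.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
local = np.random.rand(1_000_000)
partial = np.array([local.sum()])
total = np.empty(1)

# Start the reduction, then keep making independent progress while slower
# (or noise-delayed) ranks catch up, instead of blocking at the collective.
req = comm.Iallreduce(partial, total, op=MPI.SUM)
independent_work = (local * local).sum()   # progress that does not need 'total'
req.Wait()

if comm.rank == 0:
    print(f"global sum = {total[0]:.3f}, extra work done while waiting = {independent_work:.3f}")
```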

Early experiences with node-level power capping on the Cray XC40 platform

Proceedings of E2SC 2015: 3rd International Workshop on Energy Efficient Supercomputing - Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis

Laros, James H.; Olivier, Stephen L.; Ferreira, Kurt B.; Shipman, Galen; Shu, Wei

Power consumption of extreme-scale supercomputers has become a key performance bottleneck. Yet current practices do not leverage power management opportunities, instead running at maximum power. This is not sustainable. Future systems will need to manage power as a critical resource, directing it to where it has the greatest benefit. Power capping is one mechanism for managing power budgets; however, its behavior is not well understood. This paper presents an empirical evaluation of several key HPC workloads running under a power cap on a Cray XC40 system and compares this technique with p-state control, demonstrating the performance differences of each. These results show: (1) maximum performance requires ensuring the cap is not reached; (2) performance slowdown under a cap can be attributed to cascading delays that result in unsynchronized performance variability across nodes; and (3) due to lag in reaction time, considerable time is spent operating above the set cap. This work provides a timely and much-needed comparison of HPC application performance under a power cap and aims to enable users and system administrators to understand how best to optimize application performance on power-constrained HPC systems.

More Details
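
For readers unfamiliar with node-level power caps, the sketch below shows how a package power limit and its averaging window can be inspected through the generic Linux powercap (RAPL) sysfs interface. The paths are an assumption for illustration; the XC40 study used Cray's own power-management tooling, not this interface.

```python
# Hedged sketch: inspect a package-level power cap via the Linux powercap
# (RAPL) sysfs interface. Illustrative only; the Cray XC40 exposes its own
# vendor-specific power-management controls.
from pathlib import Path

RAPL = Path("/sys/class/powercap/intel-rapl:0")

def read_microunits(name: str) -> int:
    """Read an integer micro-watt / micro-second value from the RAPL domain."""
    return int((RAPL / name).read_text())

if RAPL.exists():
    cap_w = read_microunits("constraint_0_power_limit_uw") / 1e6
    window_s = read_microunits("constraint_0_time_window_us") / 1e6
    print(f"package cap: {cap_w:.1f} W averaged over {window_s:.3f} s")
else:
    print("no intel-rapl powercap domain exposed on this node")
```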

A checkpoint compression study for high-performance computing systems

International Journal of High Performance Computing Applications

Ferreira, Kurt B.; Arnold, Dorian; Ibtesham, Dewan

As high-performance computing systems continue to increase in size and complexity, higher failure rates and increased overheads for checkpoint/restart (CR) protocols have raised concerns about the practical viability of CR protocols for future systems. Previously, compression has proven to be a viable approach for reducing checkpoint data volumes and, thereby, reducing CR protocol overhead, leading to improved application performance. In this article, we further explore compression-based CR optimization by examining its baseline performance and scaling properties, evaluating whether improved compression algorithms might lead to even better application performance, and comparing checkpoint compression against and alongside other software- and hardware-based optimizations. The highlights of our results are that: (1) compression is a very viable CR optimization; (2) generic, text-based compression algorithms appear to perform near-optimally for checkpoint data compression, and faster compression algorithms will not lead to better application performance; (3) compression-based optimizations fare well against and alongside other software-based optimizations; and (4) while hardware-based optimizations outperform software-based ones, they are not as cost effective.

More Details
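
The mechanics of compression-based checkpointing are simple: serialize application state, run it through a generic compressor, and write the smaller result. The sketch below shows that pattern with Python's pickle and zlib on a hypothetical state dictionary; it is illustrative only and is not the CR protocol or workloads evaluated in the article.

```python
# Minimal sketch of compression-based checkpointing: serialize state,
# compress with a generic byte-level compressor, and record the savings.
# The application state here is hypothetical.
import pickle
import zlib
import numpy as np

state = {"iteration": 1200, "field": np.zeros((2048, 2048))}  # hypothetical app state

raw = pickle.dumps(state)
compressed = zlib.compress(raw, 6)

with open("checkpoint.z", "wb") as f:
    f.write(compressed)

print(f"raw {len(raw)/1e6:.1f} MB -> compressed {len(compressed)/1e6:.3f} MB "
      f"({len(raw)/len(compressed):.1f}x smaller)")
```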

Memory errors in modern systems: The good, the bad, and the ugly

International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS

Sridharan, Vilas; Debardeleben, Nathan; Blanchard, Sean; Ferreira, Kurt B.; Gurumurthi, Sudhanva; Shalf, John

Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when deployed in high-performance computing systems and data centers containing tens of thousands of nodes. Therefore, it is critical to understand the efficacy of current hardware resilience techniques to determine whether they will be suitable for future systems. In this paper, we present a study of DRAM and SRAM faults and errors from the field. We use data from two leadership-class high-performance computer systems to analyze the reliability impact of hardware resilience schemes that are deployed in current systems. Our study has several key findings about the efficacy of many currently-deployed reliability techniques such as DRAM ECC, DDR address/command parity, and SRAM ECC and parity. We also perform a methodological study, and find that counting errors instead of faults, a common practice among researchers and data center operators, can lead to incorrect conclusions about system reliability. Finally, we use our data to project the needs of future large-scale systems. We find that SRAM faults are unlikely to pose a significantly larger reliability threat in the future, while DRAM faults will be a major concern and stronger DRAM resilience schemes will be needed to maintain acceptable failure rates similar to those found on today's systems.

More Details
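
The methodological point above, that counting errors instead of faults can mislead, comes down to the fact that a single faulty DRAM location may log many correctable errors. The sketch below makes that distinction concrete on a hypothetical error log (the log format and values are invented for illustration, not taken from the study's field data).

```python
# Sketch of the fault-vs-error distinction: one faulty DRAM location can
# produce many logged correctable errors, so raw error counts overstate
# hardware unreliability. Log entries here are hypothetical.
from collections import Counter

# (node, dimm, bank, row, column) for each logged correctable error
error_log = [
    ("n001", 3, 2, 0x1A2B, 0x40),
    ("n001", 3, 2, 0x1A2B, 0x40),   # same cell, repeated error
    ("n001", 3, 2, 0x1A2B, 0x40),
    ("n017", 0, 5, 0x0044, 0x08),
]

errors = len(error_log)
faults = len(Counter(error_log))    # distinct faulty locations

print(f"{errors} errors, but only {faults} underlying faults")
```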

Canaries in a coal mine: Using application-level checkpoints to detect memory failures

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick W.; Ferreira, Kurt B.; Levy, Scott L.; Fabian, Nathan D.

Memory failures in future extreme-scale applications are a significant concern in the high-performance computing community and have attracted much research attention. We contend in this paper that using application checkpoint data to detect memory failures has potential benefits and is preferable to examining application memory. To support this contention, we describe the application of machine learning techniques to evaluate the veracity of checkpoint data. Our preliminary results indicate that supervised decision tree machine learning approaches can effectively detect corruption in restart files, suggesting that future extreme-scale applications and systems may benefit from incorporating such approaches in order to cope with memory failures.

More Details
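
The supervised decision-tree approach described above amounts to labeling checkpoints as clean or corrupted and training a classifier on per-checkpoint features. The sketch below illustrates that shape of solution with scikit-learn on synthetic, hypothetical features (e.g., mean, variance, and drift from the previous checkpoint); it is not the feature set or data used in the paper.

```python
# Hedged sketch of supervised decision-tree detection of checkpoint
# corruption. Features and labels are synthetic and hypothetical.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
clean = rng.normal(loc=[0.0, 1.0, 0.1], scale=0.05, size=(200, 3))
corrupt = rng.normal(loc=[0.0, 1.0, 0.1], scale=0.05, size=(200, 3))
corrupt[:, 2] += rng.uniform(0.5, 2.0, size=200)     # corrupted restarts drift more

X = np.vstack([clean, corrupt])
y = np.array([0] * 200 + [1] * 200)                  # 0 = clean, 1 = corrupted

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print("training accuracy:", clf.score(X, y))
```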

Fault Survivability of Lightweight Operating Systems for Exascale

Ferreira, Kurt B.

Concern is growing in the high-performance computing community regarding the reliability of proposed exascale systems. Current research has shown that the expected reliability of these machines will greatly reduce their scalability. In contrast to current fault tolerance methods, whose reliability focus is only the application, this project investigates the benefits of integrating reliability mechanisms into the operating system and runtime, as well as the application. More specifically, this project makes three broad contributions to the field: First, using failure logs from current leadership-class high-performance computing systems, we outline the failures common on these large-scale systems. Second, we describe a novel memory protection mechanism capable of protecting against commonly observed failures, which exploits the similarity inherent in much OS and application state, thereby reducing overheads. Finally, using an analogy with OS jitter, we develop a highly efficient simulator capable of predicting the performance of resilience methods at the scales expected for future extreme-scale systems.

More Details
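
The similarity idea mentioned in the project description above, that much OS and application state is duplicated and can therefore be protected cheaply, can be sketched by hashing fixed-size memory pages and keeping one redundant copy per unique content. The snippet below is purely illustrative and is not the project's actual protection mechanism; the memory image is hypothetical.

```python
# Hedged sketch: hash fixed-size memory "pages" so that identical pages
# (common in OS and application state) share a single protected copy.
# Illustrative only; not the project's actual mechanism.
import hashlib

PAGE = 4096
memory = b"\x00" * (PAGE * 6) + b"pattern" * 1024    # hypothetical memory image

pages = [memory[i:i + PAGE] for i in range(0, len(memory), PAGE)]
unique = {hashlib.sha256(p).hexdigest(): p for p in pages}

print(f"{len(pages)} pages, {len(unique)} unique copies to protect")
```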