Publications

Results 1–25 of 113
Skip to search filters

A comparison of power management mechanisms: P-States vs. node-level power cap control

Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018

Pedretti, Kevin P.; Grant, Ryan E.; Laros, James H.; Levenhagen, Michael J.; Olivier, Stephen L.; Ward, Harry L.; Younge, Andrew J.

Large-scale HPC systems increasingly incorporate sophisticated power management control mechanisms. While these mechanisms are potentially useful for performing energy and/or power-aware job scheduling and resource management (EPA JSRM), greater understanding of their operation and performance impact on real-world applications is required before they can be applied effectively in practice. In this paper, we compare static p-state control to static node-level power cap control on a Cray XC system. Empirical experiments are performed to evaluate node-to-node performance and power usage variability for the two mechanisms. We find that static p-state control produces more predictable and higher performance characteristics than static node-level power cap control at a given power level. However, this performance benefit is at the cost of less predictable power usage. Static node-level power cap control produces predictable power usage but with more variable performance characteristics. Our results are not intended to show that one mechanism is better than the other. Rather, our results demonstrate that the mechanisms are complementary to one another and highlight their potential for combined use in achieving effective EPA JSRM solutions.

More Details

An Adaptive Core-Specific Runtime for Energy Efficiency

Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017

Bhalachandra, Sridutt; Porterfield, Allan; Olivier, Stephen L.; Prins, Jan F.

Energy efficiency in high performance computing (HPC) will be critical to limit operating costs and carbon footprints in future supercomputing centers. Energy efficiency of a computation can be improved by reducing time to completion without a substantial increase in power drawn or by reducing power with a little increase in time to completion. We present an Adaptive Core-specific Runtime (ACR) that dynamically adapts core frequencies to workload characteristics, and show examples of both reductions in power and improvement in the average performance. This improvement in energy efficiency is obtained without changes to the application. The adaptation policy embedded in the runtime uses existing core-specific power controls like software-controlled clock modulation and per-core Dynamic Voltage Frequency Scaling (DVFS) introduced in Intel Haswell. Experiments on six standard MPI benchmarks and a real world application show an overall 20% improvement in energy efficiency with less than 1% increase in execution time on 32 nodes (1024 cores) using per-core DVFS. An improvement in energy efficiency of up to 42% is obtained with the real world application ParaDis through a combination of speedup and power reduction. For one configuration, ParaDis achieves an average speedup of 11%, while the power is lowered by about 31%. The average improvement in the performance seen is a direct result of the reduction in run-to-run variation and running at turbo frequencies.

More Details

Approaches for task affinity in OpenMP

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Terboven, Christian; Hahnfeld, Jonas; Teruel, Xavier; Mateo, Sergi; Duran, Alejandro; Klemm, Michael; Olivier, Stephen L.; de Supinski, Bronis R.

OpenMP tasking supports parallelization of irregular algorithms. Recent OpenMP specifications extended tasking to increase functionality and to support optimizations, for instance with the taskloop construct. However, task scheduling remains opaque, which leads to inconsistent performance on NUMA architectures. We assess design issues for task affinity and explore several approaches to enable it. We evaluate these proposals with implementations in the Nanos++ and LLVM OpenMP runtimes that improve performance up to 40% and significantly reduce execution time variation.

More Details

Assessing task-to-data affinity in the LLVM OpenMP runtime

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Klinkenberg, Jannis; Samfass, Philipp; Terboven, Christian; Duran, Alejandro; Klemm, Michael; Teruel, Xavier; Mateo, Sergi; Olivier, Stephen L.; Müller, Matthias S.

In modern shared-memory NUMA systems which typically consist of two or more multi-core processor packages with local memory, affinity of data to computation is crucial for achieving high performance with an OpenMP program. OpenMP* 3.0 introduced support for task-parallel programs in 2008 and has continued to extend its applicability and expressiveness. However, the ability to support data affinity of tasks is missing. In this paper, we investigate several approaches for task-to-data affinity that combine locality-aware task distribution and task stealing. We introduce the task affinity clause that will be part of OpenMP 5.0 and provide the reasoning behind its design. Evaluation with our experimental implementation in the LLVM OpenMP runtime shows that task affinity improves execution performance up to 4.5x on an 8-socket NUMA machine and significantly reduces runtime variability of OpenMP tasks. Our results demonstrate that a variety of applications can benefit from task affinity and that the presented clause is closing the gap of task-to-data affinity in OpenMP 5.0.

More Details

Cache Oblivious Strategies to Exploit Multi-Level Memory on Manycore Systems

Proceedings of MCHPC 2020: Workshop on Memory Centric High Performance Computing, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Butcher, Neil A.; Olivier, Stephen L.; Kogge, Peter M.

Many-core systems are beginning to feature novel large, high-bandwidth intermediate memory as a visible part of the memory hierarchy. This paper discusses how to make use of intermediate memory when composing matrix multiply with transpose to compute $A$ * AT. We re-purpose the cache-oblivious approach developed by Frigo et al. and apply it to the composition of a bandwidth-bound kernel (transpose) with a compute-bound kernel (matrix multiply). Particular focus is on regions of matrix shapes far from square that are not usually considered. Our codes are simpler than optimized codes, but reasonably close in performance. Also, perhaps of more importance is developing a paradigm for how to construct other codes using intermediate memories.

More Details

Cactus Environment Machine: Shared Environment Call-by-Need

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Stelle, George; Stefanovic, Darko; Olivier, Stephen L.; Forrest, Stephanie

Existing machines for lazy evaluation use a flat representation of environments, storing the terms associated with free variables in an array. Combined with a heap, this structure supports the shared intermediate results required by lazy evaluation. We propose and describe an alternative approach that uses a shared environment to minimize the overhead of delayed computations. We show how a shared environment can act as both an environment and a mechanism for sharing results. To formalize this approach, we introduce a calculus that makes the shared environment explicit, as well as a machine to implement the calculus, the Cactus Environment Machine. A simple compiler implements the machine and is used to run experiments for assessing performance. The results show reasonable performance and suggest that incorporating this approach into real-world compilers could yield performance benefits in some scenarios.

More Details
Results 1–25 of 113
Results 1–25 of 113