Page 3 – Center for Computing Research (CCR)

An Adaptive Core-Specific Runtime for Energy Efficiency

Proceedings - 2017 IEEE 31st International Parallel and Distributed Processing Symposium, IPDPS 2017

Bhalachandra, Sridutt; Porterfield, Allan; Olivier, Stephen L.; Prins, Jan F.

Energy efficiency in high performance computing (HPC) will be critical to limit operating costs and carbon footprints in future supercomputing centers. Energy efficiency of a computation can be improved by reducing time to completion without a substantial increase in power drawn or by reducing power with a little increase in time to completion. We present an Adaptive Core-specific Runtime (ACR) that dynamically adapts core frequencies to workload characteristics, and show examples of both reductions in power and improvement in the average performance. This improvement in energy efficiency is obtained without changes to the application. The adaptation policy embedded in the runtime uses existing core-specific power controls like software-controlled clock modulation and per-core Dynamic Voltage Frequency Scaling (DVFS) introduced in Intel Haswell. Experiments on six standard MPI benchmarks and a real world application show an overall 20% improvement in energy efficiency with less than 1% increase in execution time on 32 nodes (1024 cores) using per-core DVFS. An improvement in energy efficiency of up to 42% is obtained with the real world application ParaDis through a combination of speedup and power reduction. For one configuration, ParaDis achieves an average speedup of 11%, while the power is lowered by about 31%. The average improvement in the performance seen is a direct result of the reduction in run-to-run variation and running at turbo frequencies.

More Details

TYPE Conference Poster YEAR 2017

Scopus OSTI DOI

Scheduling Chapel tasks with Qthreads on manycore: A tale of two schedulers

Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2017 - In conjunction with HPDC

Evans, Noah; Olivier, Stephen L.; Barrett, Richard F.; Stelle, George

This paper describes improvements in task scheduling for the Chapel parallel programming language provided in its default on-node tasking runtime, the Qthreads library. We describe a new scheduler distrib which builds on the approaches of two previous Qthreads schedulers, Sherwood and Nemesis, and combines the best aspects of both-work stealing and load balancing from Sherwood and a lock free queue access from Nemesis- to make task queuing better suited for the use of Chapel in the manycore era. We demonstrate the efficacy of this new scheduler by showing improvements in various individual benchmarks of the Chapel test suite on the Intel Knights Landing architecture.

More Details

TYPE Conference Poster YEAR 2017

Scopus OSTI DOI

Double Buffering for MCDRAM on Second Generation Intel Xeon Phi Processors with OpenMP

Olivier, Stephen L.; Hammond, Simon D.; Duran, Alejandro D.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Qthreads Node Threading Runtime and NoRMa Node Resource Manager: A HiHAT teaser

Olivier, Stephen L.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

High Performance Computing - Power Application Programming Interface Specification Version 2.0

Laros, James H.; Grant, Ryan E.; Levenhagen, Michael J.; Olivier, Stephen L.; Pedretti, Kevin P.; Ward, Harry L.; Younge, Andrew J.

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager.

More Details

TYPE SAND Report YEAR 2017

OSTI DOI

Qthreads and On-Node Run time Coordination

Olivier, Stephen L.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Presentation YEAR 2017

OSTI

Double buffering for MCDRAM on second generation intel® Xeon Phi™ processors with OpenMP

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Olivier, Stephen L.; Hammond, Simon D.; Duran, Alejandro

Emerging novel architectures for shared memory parallel computing are incorporating increasingly creative innovations to deliver higher memory performance. A notable exemplar of this phenomenon is the Multi-Channel DRAM (MCDRAM) that is included in the Intel® XeonPhi™ processors. In this paper, we examine techniques to use OpenMP to exploit the high bandwidth of MCDRAM by staging data. In particular, we implement double buffering using OpenMP sections and tasks to explicitly manage movement of data into MCDRAM. We compare our double-buffered approach to a non-buffered implementation and to Intel’s cache mode, in which the system manages the MCDRAM as a transparent cache. We also demonstrate the sensitivity of performance to parameters such as dataset size and the distribution of threads between compute and copy operations.

More Details

TYPE Conference Poster YEAR 2017

Scopus OSTI

Enhancing Qthreads for ECP Science and Energy Impact And Sandia ATDM On-Node Runtime Coordination

Brightwell, Ronald B.; Olivier, Stephen L.

Abstract not provided.

More Details

TYPE Presentation YEAR 2017

OSTI

Kokkos/Qthreads task-parallel approach to linear algebra based graph analytics

2016 IEEE High Performance Extreme Computing Conference, HPEC 2016

Wolf, Michael W.; Edwards, Harold C.; Olivier, Stephen L.

The Graph BLAS effort to standardize a set of graph algorithms building blocks in terms of linear algebra primitives promises to deliver high performing graph algorithms and greatly impact the analysis of big data. However, there are challenges with this approach, which our data analytics miniapp miniTri exposes. In this paper, we improve upon a previously proposed task-parallel approach to linear algebra-based miniTri formulation, addressing these challenges and describing a Kokkos/Qthreads task-parallel implementation that performs as well or slightly better than the highly optimized, baseline OpenMP data-parallel implementation.

More Details

TYPE Conference Poster YEAR 2016

Scopus OSTI DOI

OpenMP Tasks: New Features for TR4

Olivier, Stephen L.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Standardizing Power Monitoring and Control at Exascale

Computer

Grant, Ryan E.; Levenhagen, Michael J.; Olivier, Stephen L.; DeBonis, David D.; Pedretti, Kevin P.; Laros, James H.

Power API - the result of collaboration among national laboratories, universities, and major vendors - provides a range of standardized power management functions, from application-level control and measurement to facility-level accounting, including real-time and historical statistics gathering. Support is already available for Intel and AMD CPUs and standalone measurement devices.

More Details

TYPE Journal Article YEAR 2016

Scopus OSTI DOI

Kokkos Task API: A Use Case in Tacho

Kim, Kyungjoo K.; Rajamanickam, Sivasankaran R.; Edwards, Harold C.; Olivier, Stephen L.; Stelle, George

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

High Performance Computing - Power Application Programming Interface Specification Version 1.4

Laros, James H.; DeBonis, David D.; Grant, Ryan E.; Kelly, Suzanne M.; Levenhagen, Michael J.; Olivier, Stephen L.; Pedretti, Kevin P.

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area [13, 3, 5, 10, 4, 21, 19, 16, 7, 17, 20, 18, 11, 1, 6, 14, 12]. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager.

More Details

TYPE SAND Report YEAR 2016

OSTI DOI

Hierarchical Task-Data Parallelism using Kokkos and Qthreads

Edwards, Harold C.; Olivier, Stephen L.; Berry, Jonathan W.; Mackey, Greg; Rajamanickam, Sivasankaran R.; Wolf, Michael W.; Kim, Kyungjoo K.; Stelle, George

This report describes a new capability for hierarchical task-data parallelism using Sandia's Kokkos and Qthreads, and evaluation of this capability with sparse matrix Cholesky factor- ization and social network triangle enumeration mini-applications. Hierarchical task-data parallelism consists of a collection of tasks with executes-after dependences where each task contains data parallel operations performed on a team of hardware threads. The collection of tasks and dependences form a directed acyclic graph of tasks - a task DAG . Major chal- lenges of this research and development effort include: portability and performance across multicore CPU; manycore Intel Xeon Phi, and NVIDIA GPU architectures; scalability with respect to hardware concurrency and size of the task DAG; and usability of the application programmer interface (API).

More Details

TYPE SAND Report YEAR 2016

OSTI DOI

Kokkos/Qthreads Task Parallel Approach to Linear Algebra Based Graph Analytics

Wolf, Michael W.; Edwards, Harold C.; Olivier, Stephen L.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI DOI

Qthreads: Run Time Library Support for Task Parallel Programming

Brightwell, Ronald B.; Olivier, Stephen L.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

Cactus Environment Machine: Shared Environment Call-by-Need

Stelle, George; Stefanovic, Darko S.; Olivier, Stephen L.; Forrest, Stephanie F.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

An Overview of Sandia National Laboratory?s High Performance Computing Power Application Programming Interface (API) Specification

Laros, James H.; Pedretti, Kevin P.; Grant, Ryan E.; Olivier, Stephen L.; Levenhagen, Michael J.; DeBonis, David D.; Laros, James H.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Cactus Environment Machine: Shared Environment Call-by-Need

Stelle, George; Stefanovic, Darko S.; Olivier, Stephen L.; Forrest, Stephanie F.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

High Performance Computing - Power Application Programming Interface Specification

Laros, James H.; Kelly, Suzanne M.; Pedretti, Kevin P.; Grant, Ryan E.; Olivier, Stephen L.; Levenhagen, Michael J.; DeBonis, David D.; Laros, James H.

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area [13, 3, 5, 10, 4, 21, 19, 16, 7, 17, 20, 18, 11, 1, 6, 14, 12]. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager.

More Details

TYPE SAND Report YEAR 2016

OSTI DOI

Analysis of Application Sensitivity to System Performance Variability in a Dynamic Task Based Runtime

Shipman, Galen S.; McCormick, Patrick M.; Pedretti, Kevin P.; Olivier, Stephen L.; Ferreira, Kurt B.; Sankaran, Ramanan S.; Treichler, Sean T.; Aiken, Alex A.; Bauer, Michael B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

ACES and Cray Collaborate on Advanced Power Management for Trinity

Laros, James H.; Pedretti, Kevin P.; Grant, Ryan E.; Olivier, Stephen L.; Levenhagen, Michael J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

High Performance Computing - Power Application Programming Interface Specification

Laros, James H.; Kelly, Suzanne M.; Pedretti, Kevin P.; Grant, Ryan E.; Olivier, Stephen L.; Levenhagen, Michael J.; DeBonis, David D.; Laros, James H.

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area [131, 3, 5, 11), 4, a, B, Ili, 7, T71,, a 11 11, 1, 6, IA, ]112]. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager. KC

More Details

TYPE SAND Report YEAR 2016

OSTI DOI

Overcoming Challenges in Scalable Power Monitoring with the Power API

Grant, Ryan E.; Levenhagen, Michael J.; Olivier, Stephen L.; DeBonis, David D.; Pedretti, Kevin P.; Laros, James H.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI DOI

Task Parallel Incomplete Cholesky Factorization using 2D Partitioned-Block Layout

Kim, Kyungjoo K.; Rajamanickam, Sivasankaran R.; Stelle, George; Edwards, Harold C.; Olivier, Stephen L.

We introduce a task-parallel algorithm for sparse incomplete Cholesky factorization that utilizes a 2D sparse partitioned-block layout of a matrix. Our factorization algorithm follows the idea of algorithms-by-blocks by using the block layout. The algorithm-byblocks approach induces a task graph for the factorization. These tasks are inter-related to each other through their data dependences in the factorization algorithm. To process the tasks on various manycore architectures in a portable manner, we also present a portable tasking API that incorporates different tasking backends and device-specific features using an open-source framework for manycore platforms i.e., Kokkos. A performance evaluation is presented on both Intel Sandybridge and Xeon Phi platforms for matrices from the University of Florida sparse matrix collection to illustrate merits of the proposed task-based factorization. Experimental results demonstrate that our task-parallel implementation delivers about 26.6x speedup (geometric mean) over single-threaded incomplete Choleskyby- blocks and 19.2x speedup over serial Cholesky performance which does not carry tasking overhead using 56 threads on the Intel Xeon Phi processor for sparse matrices arising from various application problems.

More Details

TYPE Other Report YEAR 2016

OSTI DOI

Publications