Center for Computing Research (CCR)

A Machine Learning Approach to Understanding HPC Application Performance Variation

Schwaller, Benjamin S.; Aksar, Burak A.; Aaziz, Omar R.; Ates, Emre A.; Brandt, James M.; Coskun, Ayse K.; Egele, Manuel E.; Leung, Vitus J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Abstract machine models and proxy architectures for exascale computing

Proceedings of Co-HPC 2014: 1st International Workshop on Hardware-Software Co-Design for High Performance Computing - Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis

Ang, James A.; Barrett, R.F.; Benner, R.E.; Burke, D.; Chan, C.; Cook, J.; Donofrio, D.; Hammond, Simon D.; Hemmert, Karl S.; Kelly, Suzanne M.; Le, H.; Leung, Vitus J.; Resnick, D.R.; Rodrigues, Arun; Shalf, J.; Stark, Dylan S.; Unat, D.; Wright, N.J.

To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to the application developers and system software only the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion among the developers of analytic models and simulators and computer hardware architects and they allow for application performance analysis, system software development, and hardware optimization opportunities. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.

More Details

TYPE Conference Poster YEAR 2014

Scopus OSTI DOI

Abstract Machine Models and Proxy Architectures for Exascale Computing

Ang, James A.; Barrett, Richard F.; Benner, R.E.; Burke, Daniel B.; Chan, Cy P.; Cook, Jeanine C.; Daley, Christopher D.; Donofrio, Dave D.; Hammond, Simon D.; Hemmert, Karl S.; Hoekstra, Robert J.; Ibrahim, Khaled I.; Kelly, Suzanne M.; Le, Hoang L.; Leung, Vitus J.; Michelogiannakis, George M.; Resnick, David R.; Rodrigues, Arun; Shalf, John S.; Stark, Dylan S.; Unat, D.U.; Wright, Nick W.; Voskuilen, Gwendolyn R.

Machine Models and Proxy Architectures for Exascale Computing Version 2.0 Prepared by Sandia National Laboratories Albuquerque, New Mexico 87185 and Livermore, California 94550 Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000. Approved for public release; further dissemination unlimited. Issued by Sandia National Laboratories, operated for the United States Department of Energy by Sandia Corporation. NOTICE: This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government, nor any agency thereof, nor any of their employees, nor any of their contractors, subcontractors, or their employees, make any warranty, express or implied, or assume any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or rep- resent that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government, any agency thereof, or any of their contractors or subcontractors. The views and opinions expressed herein do not necessarily state or reflect those of the United States Government, any agency thereof, or any of their contractors. Printed in the United States of America. This report has been reproduced directly from the best available copy. Available to DOE and DOE contractors from U.S. Department of Energy Office of Scientific and Technical Information P.O. Box 62 Oak Ridge, TN 37831 Telephone: (865) 576-8401 Facsimile: (865) 576-5728 E-Mail: reports@adonis.osti.gov Online ordering: http://www.osti.gov/bridge Available to the public from U.S. Department of Commerce National Technical Information Service 5285 Port Royal Rd Springfield, VA 22161 Telephone: (800) 553-6847 Facsimile: (703) 605-6900 E-Mail: orders@ntis.fedworld.gov Online ordering: http://www.ntis.gov/help/ordermethods.asp?loc=7-4-0#online D E P A R T M E N T O F E N E R G Y * * U N I T E D S T A T E S O F A M E R I C A SAND2016-6049 Unlimited Release Printed Abstract Machine Models and Proxy Architectures for Exascale Computing Version 2.0 J.A. Ang 1 , R.F. Barrett 1 , R.E. Benner 1 , D. Burke 2 , C. Chan 2 , J. Cook 1 , C.S. Daley 2 , D. Donofrio 2 , S.D. Hammond 1 , K.S. Hemmert 1 , R.J. Hoekstra 1 , K. Ibrahim 2 , S.M. Kelly 1 , H. Le, V.J. Leung 1 , G. Michelogiannakis 2 , D.R. Resnick 1 , A.F. Rodrigues 1 , J. Shalf 2 , D. Stark, D. Unat, N.J. Wright 2 , G.R. Voskuilen 1 1 1 Sandia National Laboratories, P.O. Box 5800, Albuquerque, New Mexico 87185-MS 1319 2 Lawrence Berkeley National Laboratory, Berkeley, California Abstract To achieve exascale computing, fundamental hardware architectures must change. The most sig- nificant consequence of this assertion is the impact on the scientific and engineering applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. In order to adapt to exascale architectures, developers must be able to reason about new hardware and deter- mine what programming models and algorithms will provide the best blend of performance and energy efficiency into the future. While many details of the exascale architectures are undefined, an abstract machine model is designed to allow application developers to focus on the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. We use the term proxy architecture to describe a parameterized version of an abstract machine model, with the parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models are formulated to enable discussion between the developers of analytic models and simulators and computer hardware archi- tects. They allow for application performance analysis and hardware optimization opportunities. In this report our goal is to provide the application development community with a set of mod- els that can help software developers prepare for exascale. In addition, through the use of proxy architectures, we can enable a more concrete exploration of how well new and evolving applica- tion codes map onto future architectures. This second version of the document addresses system scale considerations and provides a system-level abstract machine model with proxy architecture information.

More Details

TYPE SAND Report YEAR 2016

OSTI DOI

AD for Machine Learning Approach to Understanding HPC Application Performance Variation Poster

Aksar, Burak A.; Schwaller, Benjamin S.; Aaziz, Omar R.; Ates, Emre A.; Brandt, James M.; Coskun, Ayse K.; Egele, Manuel E.; Leung, Vitus J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

OSTI

Adverse event prediction using graph-augmented temporal analysis

Brost, Randolph B.; Leung, Vitus J.; Link, Hamilton E.; Phillips, Cynthia A.; Staid, Andrea S.

Abstract not provided.

More Details

TYPE Presentation YEAR 2017

OSTI

Adverse Event Prediction Using Graph-Augmented Temporal Analysis: Final Report

Brost, Randolph B.; Carrier, Erin E.; Carroll, Michelle C.; Groth, Katrina M.; Kegelmeyer, William P.; Leung, Vitus J.; Link, Hamilton E.; Patterson, Andrew J.; Phillips, Cynthia A.; Richter, Samuel N.; Robinson, David G.; Staid, Andrea S.; Woodbridge, Diane M.-K.

This report summarizes the work performed under the Sandia LDRD project "Adverse Event Prediction Using Graph-Augmented Temporal Analysis." The goal of the project was to de- velop a method for analyzing multiple time-series data streams to identify precursors provid- ing advance warning of the potential occurrence of events of interest. The proposed approach combined temporal analysis of each data stream with reasoning about relationships between data streams using a geospatial-temporal semantic graph. This class of problems is relevant to several important topics of national interest. In the course of this work we developed new temporal analysis techniques, including temporal analysis using Markov Chain Monte Carlo techniques, temporal shift algorithms to refine forecasts, and a version of Ripley's K-function extended to support temporal precursor identification. This report summarizes the project's major accomplishments, and gathers the abstracts and references for the publication sub- missions and reports that were prepared as part of this work. We then describe work in progress that is not yet ready for publication.

More Details

TYPE SAND Report YEAR 2018

OSTI DOI

Algorithmic support for commodity-based parallel computing systems

Leung, Vitus J.; Leung, Vitus J.; Phillips, Cynthia A.

The Computational Plant or Cplant is a commodity-based distributed-memory supercomputer under development at Sandia National Laboratories. Distributed-memory supercomputers run many parallel programs simultaneously. Users submit their programs to a job queue. When a job is scheduled to run, it is assigned to a set of available processors. Job runtime depends not only on the number of processors but also on the particular set of processors assigned to it. Jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This report introduces new allocation strategies and performance metrics based on space-filling curves and one dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in Release 2.0 of the Cplant System Software that was phased into the Cplant systems at Sandia by May 2002. Experimental results then demonstrated that the average number of communication hops between the processors allocated to a job strongly correlates with the job's completion time. This report also gives processor-allocation algorithms for minimizing the average number of communication hops between the assigned processors for grid architectures. The associated clustering problem is as follows: Given n points in {Re}d, find k points that minimize their average pairwise L{sub 1} distance. Exact and approximate algorithms are given for these optimization problems. One of these algorithms has been implemented on Cplant and will be included in Cplant System Software, Version 2.1, to be released. In more preliminary work, we suggest improvements to the scheduler separate from the allocator.

More Details

TYPE Report YEAR 2003

OSTI DOI

Architecture-aware Task Placement

Deveci, Mehmet D.; Devine, Karen D.; Leung, Vitus J.; Prokopenko, Andrey V.; Rajamanickam, Sivasankaran R.

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Backfilling with guarantees granted upon job submission

Leung, Vitus J.

Abstract not provided.

More Details

TYPE Conference YEAR 2011

OSTI

Backfilling with guarantees granted upon job submission

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Lindsay, Alexander M.; Galloway-Carson, Maxwell; Johnson, Christopher R.; Bunde, David P.; Leung, Vitus J.

In this paper, we present scheduling algorithms that simultaneously support guaranteed starting times and favor jobs with system-desired traits. To achieve the first of these goals, our algorithms keep a profile with potential starting times for every unfinished job and never move these starting times later, just as in Conservative Backfilling. To achieve the second, they exploit previously unrecognized flexibility in the handling of holes opened in this profile when jobs finish early. We find that, with one choice of job selection function, our algorithms can consistently yield a lower average waiting time than Conservative Backfilling while still providing a guaranteed start time to each job as it arrives. In fact, in most cases, the algorithms give a lower average waiting time than the more aggressive EASY backfilling algorithm, which does not provide guaranteed start times. Alternately, with a different choice of job selection function, our algorithms can focus the benefit on the widest submitted jobs, the reason for the existence of parallel systems. In this case, these jobs experience significantly lower waiting time than Conservative Backfilling with minimal impact on other jobs. © 2011 Springer-Verlag.

More Details

TYPE Conference YEAR 2011

Scopus OSTI

Backfilling with guarantees made as jobs arrive

Proposed for publication in Concurrency and Computation: Practice and Experience.

Leung, Vitus J.

Abstract not provided.

More Details

TYPE Journal Article YEAR 2012

OSTI

Beyond BUS-PASS

Leung, Vitus J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2017

OSTI

Brief announcement: Subgraph Isomorphism on a MultiThreaded shared memory architecture

Annual ACM Symposium on Parallelism in Algorithms and Architectures

Ralph, Claire C.; Leung, Vitus J.; McLendon, William C.

Graph algorithms tend to suffer poor performance due to the irregularity of access patterns within general graph data structures, arising from poor data locality, which translates to high memory latency. The result is that advances in high-performance solutions for graph algorithms are most likely to come through advances in both architectures and algorithms. Specialized MMT shared memory machines offer a potentially transformative environment in which to approach the problem. Here, we explore the challenges of implementing Subgraph Isomorphism (SI) algorithms based on the Ullmann and VF2 algorithms in the Cray XMT environment, where issues of memory contention, scheduling, and compiler parallelizability must be optimized. Copyright is held by the author/owner(s).

More Details

TYPE Conference YEAR 2012

Scopus OSTI

Communication patterns and allocation strategies

Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM)

Bunde, David P.; Leung, Vitus J.; Mache, Jens

Motivated by observations about job runtimes on the CPlant system, we use a trace-driven microsimulator to begin characterizing the performance of different classes of allocation algorithms on jobs with different communication patterns in space-shared parallel systems with mesh topology. We show that relative performance varies considerably with communication pattern. The Paging strategy using the Hilbert space-filling curve and the Best Fit heuristic performed best across several communication patterns.

More Details

TYPE Conference YEAR 2004

Scopus OSTI DOI

Communication-aware processor allocation for supercomputers

Leung, Vitus J.; Phillips, Cynthia A.

We give processor-allocation algorithms for grid architectures, where the objective is to select processors from a set of available processors to minimize the average number of communication hops. The associated clustering problem is as follows: Given n points in R{sup d}, find a size-k subset with minimum average pairwise L{sub 1} distance.We present a natural approximation algorithm and show that it is a 7/4-approximation for 2D grids. In d dimensions, the approximation guarantee is 2 - 1/2d, which is tight. We also give a polynomial-time approximation scheme (PTAS) for constant dimension d and report on experimental results.

More Details

TYPE Conference YEAR 2004

OSTI

Community Detection vis Facility Location

Phycisal Review Letters

Laviolette, Randall A.; Hendrickson, Bruce A.; Leung, Vitus J.; Phillips, Cynthia A.

Abstract not provided.

More Details

TYPE Journal Article YEAR 2007

OSTI

Comparing global link arrangements for Dragonfly networks

Hastings, Emily H.; Rincon-Cruz, David R.; Spehlmann, Marc S.; Meyers, Sofia M.; Xu, Anda X.; Bunde, David P.; Leung, Vitus J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI DOI

Comparing global link arrangements for Dragonfly networks

Hastings, Emily H.; Rincon-Cruz, David R.; Spehlmann, Marc S.; Meyers, Sofia M.; Xu, Anda X.; Bunde, David P.; Leung, Vitus J.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI DOI

Comparing global link arrangements for Dragonfly networks

Hastings, Emily H.; Rincon-Cruz, David R.; Spehlmann, Marc S.; Meyers, Sofia M.; Xu, Anda X.; Bunde, David P.; Leung, Vitus J.

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI DOI

Comparing global link arrangements for dragonfly networks

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Hastings, Emily; Rincon-Cruz, David; Spehlmann, Marc; Meyers, Sofia; Xu, Anda; Bunde, David P.; Leung, Vitus J.

High-performance computing systems are shifting away from traditional interconnect topologies to exploit new technologies and to reduce interconnect power consumption. The Dragonfly topology is one promising candidate for new systems, with several variations already in production. It is hierarchical, with local links forming groups and global links joining the groups. At each level, the interconnect is a clique, with a link between each pair of switches in a group and a link between each pair of groups. This paper shows that the intergroup links can be made in meaningfully different ways. We evaluate three previously-proposed approaches for link organization (called global link arrangements) in two ways. First, we use bisection bandwidth, an important and commonly-used measure of the potential for communication bottlenecks. We show that the global link arrangements often give bisection bandwidths differing by 10s of percent, with the specific separation varying based on the relative bandwidths of local and global links. For the link bandwidths used in a current Dragonfly implementation, it is 33%. Second, we show that the choice of global link arrangement can greatly impact the regularity of task mappings for nearest neighbor stencil communication patterns, an important pattern in scientific applications.

More Details

TYPE Presentation YEAR 2015

Scopus OSTI DOI