Publications Search

MPI usage patterns are changing as applications move towards fully-multithreaded runtimes. However, the impact of these patterns on MPI message matching is not well-studied. In particular, MPI’s mechanic for receiver-side data placement, message matching, can be impacted by increased message volume and nondeterminism incurred by multithreading. While there has been significant developer interest and work to provide an efficient MPI interface for multithreaded access, there has not been a study showing how these patterns affect messaging patterns and matching behavior. In this paper, we present a framework for studying the effects of multithreading on MPI message matching. This framework allows us to explore the implications of different common communication patterns and thread-level decompositions. We present a study of these impacts on the architecture of two of the Top 10 supercomputers (NERSC’s Cori and LANL’s Trinity). This data provides a baseline to evaluate reasonable matching engine queue lengths, search depths, and queue drain times under the multithreaded model. Furthermore, the study highlights surprising results on the challenge posed by message matching for multithreaded application performance.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus

A Tale of Two Systems: Using Containers to Deploy HPC Applications on Supercomputers and Clouds

Proceedings of the International Conference on Cloud Computing Technology and Science, CloudCom

Younge, Andrew J.; Foulk, James W.; Grant, Ryan; Brightwell, Ronald B.

Containerization, or OS-level virtualization has taken root within the computing industry. However, container utilization and its impact on performance and functionality within High Performance Computing (HPC) is still relatively undefined. This paper investigates the use of containers with advanced supercomputing and HPC system software. With this, we define a model for parallel MPI application DevOps and deployment using containers to enhance development effort and provide container portability from laptop to clouds or supercomputers. In this endeavor, we extend the use of Sin- gularity containers to a Cray XC-series supercomputer. We use the HPCG and IMB benchmarks to investigate potential points of overhead and scalability with containers on a Cray XC30 testbed system. Furthermore, we also deploy the same containers with Docker on Amazon's Elastic Compute Cloud (EC2), and compare against our Cray supercomputer testbed. Our results indicate that Singularity containers operate at native performance when dynamically linking Cray's MPI libraries on a Cray supercomputer testbed, and that while Amazon EC2 may be useful for initial DevOps and testing, scaling HPC applications better fits supercomputing resources like a Cray.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning

IEEE Transactions on Parallel and Distributed Systems

Groves, Taylor L.; Grant, Ryan; Gonzales, Aaron; Arnold, Dorian

Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems enabling asynchronous data transfers, so that applications may fully utilize CPU resources while simultaneously sharing data amongst remote nodes. We examine Network-induced Memory Contention (NiMC) on Infiniband networks. We expose the interactions between RDMA, main-memory and cache, when applications and out-of-band services compete for memory resources. We then explore NiMCs resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that NiMCs impact grows with scale resulting in up to 3X performance degradation at scales as small as 8K processes even in applications that previously have been shown to be performance resilient in the presence of noise. In addition, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insights about the root cause of NiMC and facilitates dynamic selection of potential solutions. Finally, we evaluated three potential techniques to reduce NiMCs impact, namely hardware offloading, core reservation and network throttling.

More Details

TYPE Journal Article YEAR 2017

DOI OSTI

Power API User Experiences

Grant, Ryan

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Power API Overview

Grant, Ryan

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design

Ferreira, Kurt; Grant, Ryan; Levenhagen, Michael; Levy, Scott L.N.; Groves, Taylor

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI

Tail Queues - A Multi-threaded Matching Architecture

Dosanjh, Matthew G.; Grant, Ryan; Schonbein, William W.; Bridges, Patrick

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI

Characterizing MPI matching via trace-based simulation

ACM International Conference Proceeding Series

Ferreira, Kurt; Levy, Scott L.N.; Foulk, James W.; Grant, Ryan

With the increased scale expected on future leadership-class systems, detailed information about the resource usage and performance of MPI message matching provides important insights into how to maintain application performance on next-generation systems. However, obtaining MPI message matching performance data is often not possible without significant effort. A common approach is to instrument an MPI implementation to collect relevant statistics. While this approach can provide important data, collecting matching data at runtime perturbs the application’s execution, including its matching performance, and is highly dependent on the MPI library’s matchlist implementation. In this paper, we introduce a trace-based simulation approach to obtain detailed MPI message matching performance data for MPI applications without perturbing their execution. Using a number of key parallel workloads, we demonstrate that this simulator approach can rapidly and accurately characterize matching behavior. Specifically, we use our simulator to collect several important statistics about the operation of the MPI posted and unexpected queues. For example, we present data about search lengths and the duration that messages spend in the queues waiting to be matched. Data gathered using this simulation-based approach have significant potential to aid hardware designers in determining resource allocation for MPI matching functions and provide application and middleware developers with insight into the scalability issues associated with MPI message matching.

More Details

TYPE Journal Article YEAR 2017

DOI OSTI Scopus

Enabling Diverse Software Stacks on Supercomputers Using High Performance Virtual Clusters

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Younge, Andrew J.; Foulk, James W.; Grant, Ryan; Gaines, Brian; Brightwell, Ronald B.

While large-scale simulations have been the hallmark of the High Performance Computing (HPC) community for decades, Large Scale Data Analytics (LSDA) workloads are gaining attention within the scientific community not only as a processing component to large HPC simulations, but also as standalone scientific tools for knowledge discovery. With the path towards Exascale, new HPC runtime systems are also emerging in a way that differs from classical distributed computing models. However, system software for such capabilities on the latest extreme-scale DOE supercomputing needs to be enhanced to more appropriately support these types of emerging software ecosystems.In this paper, we propose the use of Virtual Clusters on advanced supercomputing resources to enable systems to support not only HPC workloads, but also emerging big data stacks. Specifically, we have deployed the KVM hypervisor within Cray's Compute Node Linux on a XC-series supercomputer testbed. We also use libvirt and QEMU to manage and provision VMs directly on compute nodes, leveraging Ethernet-over-Aries network emulation. To our knowledge, this is the first known use of KVM on a true MPP supercomputer. We investigate the overhead our solution using HPC benchmarks, both evaluating single-node performance as well as weak scaling of a 32-node virtual cluster. Overall, we find single node performance of our solution using KVM on a Cray is very efficient with near-native performance. However overhead increases by up to 20% as virtual cluster size increases, due to limitations of the Ethernet-over-Aries bridged network. Furthermore, we deploy Apache Spark with large data analysis workloads in a Virtual Cluster, effectively demonstrating how diverse software ecosystems can be supported by High Performance Virtual Clusters.

More Details

TYPE Conference Poster YEAR 2017

OSTI Scopus

Enabling Diverse Software Stacks on Supercomputers using High Performance Virtual Clusters

Younge, Andrew J.; Foulk, James W.; Grant, Ryan; Gaines, Brian; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Vanguard: Maturing the ARM Software Ecosystem for U.S. DOE Supercomputing

Foulk, James W.; Foulk, James W.; Grant, Ryan; Hammond, Simon; Hemmert, Karl S.; Martinez, David; Noe, John P.; Foulk, James W.; Ward, Harry L.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Preparing MPI for Exascale

Grant, Ryan; Dosanjh, Matthew G.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Presentation YEAR 2017

OSTI

Evaluating Energy and Power Profiling Techniques for HPC Workloads

Younge, Andrew J.; Grant, Ryan; Foulk, James W.; Levenhagen, Michael; Olivier, Stephen L.; Foulk, James W.; Ward, Harry L.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

PowerAPI: Power Measurement and Control for the Extreme Scale Computing Community

Grant, Ryan

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

The Portals 4.1 Network Programming Interface

Barrett, Brian; Brightwell, Ronald B.; Grant, Ryan; Hemmert, Karl S.; Foulk, James W.; Wheeler, Kyle; Underwood, Keith D.; Riesen, Rolf; Maccabe, Arthur B.; Hudson, Trammel

This report presents a specification for the Portals 4 networ k programming interface. Portals 4 is intended to allow scalable, high-performance network communication betwee n nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded syste ms. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platfor ms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is tar geted to the next generation of machines employing advanced network interface architectures that support enh anced offload capabilities.

More Details

TYPE SAND Report YEAR 2017

DOI OSTI

High Performance Computing - Power Application Programming Interface Specification Version 2.0

Foulk, James W.; Grant, Ryan; Levenhagen, Michael; Olivier, Stephen L.; Foulk, James W.; Ward, Harry L.; Younge, Andrew J.

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager.

More Details

TYPE SAND Report YEAR 2017

DOI OSTI

Understanding and Avoiding Performance Variability in High Performance Networks

Grant, Ryan; Groves, Taylor L.; Foulk, James W.; Gentile, Ann C.; Arnold, Dorian

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

sPIN: High-performance streaming Processing in the Network

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Hoefler, Torsten; Di Girolamo, Salvatore; Taranov, Konstantin; Grant, Ryan; Brightwell, Ronald B.

Optimizing communication performance is imperative for largescale computing because communication overheads limit the strong scalability of parallel applications. Today's network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload simple packet processing functions to the network card. To demonstrate the potential of the model, we design a cycle-accurate simulation environment by combining the network simulator LogGOPSim and the CPU simulator gem5. We implement offloaded message matching, datatype processing, and collective communications and demonstrate transparent full-application speedups. Furthermore, we show how sPIN can be used to accelerate redundant in-memory filesystems and several other use cases. Our work investigates a portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL. We show how such network acceleration enables an eco-system that can significantly speed up applications and system services.

More Details

TYPE Conference Poster YEAR 2017

OSTI Scopus

Modeling Concurrent Point-to-Point Communication Cost in MPI Performance Models

Farmer, Shane; Skjellum, Anthony; Bridges, Patrick G.; Dosanjh, Matthew G.; Grant, Ryan; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

The Portals Network API

Grant, Ryan

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

High Performance Computing - Power Application Programming Interface Specification Version 1.4

Foulk, James W.; Debonis, David; Grant, Ryan; Kelly, Suzanne M.; Levenhagen, Michael; Olivier, Stephen L.; Foulk, James W.

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area [13, 3, 5, 10, 4, 21, 19, 16, 7, 17, 20, 18, 11, 1, 6, 14, 12]. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager.

More Details

TYPE SAND Report YEAR 2016

DOI OSTI

Publications

Search results