Publications

Design of a Portable Implementation of Partitioned Point-to-Point Communication Primitives

Worley, Andrew; Schafer, Derek; Bangalore, Purushotham V.; Dosanjh, Matthew G.; Grant, Ryan; Skjellum, Anthony; Ghafoor, Sheikh

Abstract not provided.

TYPE Conference Paper YEAR 2021

DOI OSTI

MiniMod: A Modular Miniapplication Benchmarking Framework for HPC

Marts, William P.; Dosanjh, Matthew G.; Schonbein, William W.; Levy, Scott L.N.; Grant, Ryan; Bridges, Patrick

Abstract not provided.

TYPE Conference Paper YEAR 2021

DOI OSTI

MPI Partitioned Communication MPI BoF ISC 2021

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

RVMA: Remote Virtual Memory Access (long presentation)

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

RVMA: Remote Virtual Memory Access Virtual Live Presentation

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

MPI Partitioned Communication for Highly Concurrent and Heterogeneous Systems

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Co-design of System Software for Compute Accelerators and SmartNICs

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

SNL ATDM Software Ecosystem Operating Systems and On-Node Runtime

Olivier, Stephen L.; Brightwell, Ronald B.; Ferreira, Kurt; Grant, Ryan; Levy, Scott L.N.; Foulk, James W.; Younge, Andrew J.

Abstract not provided.

TYPE Presentation YEAR 2021

OSTI

Co-design of System Software for Compute Accelerators and SmartNICs

Grant, Ryan; Levy, Scott L.N.; Schonbein, William W.

Abstract not provided.

TYPE Conference Paper YEAR 2021

OSTI

ALAMO: Autonomous lightweight allocation, management, and optimization

Communications in Computer and Information Science

Brightwell, Ronald B.; Ferreira, Kurt; Grant, Ryan; Levy, Scott L.N.; Lofstead, Gerald F.; Olivier, Stephen L.; Foulk, James W.; Younge, Andrew J.; Gentile, Ann C.; Foulk, James W.

Several recent workshops conducted by the DOE Advanced Scientific Computing Research program have established the fact that the complexity of developing applications and executing them on high-performance computing (HPC) systems is rising at a rate which will make it nearly impossible to continue to achieve higher levels of performance and scalability. Absent an alternative approach to managing this ever-growing complexity, HPC systems will become increasingly difficult to use. A more holistic approach to designing and developing applications and managing system resources is required. This paper outlines a research strategy for managing the increasing complexity by providing the programming environment, software stack, and hardware capabilities needed for autonomous resource management of HPC systems. Developing portable applications for a variety of HPC systems of varying scale requires a paradigm shift from the current approach, where applications are painstakingly mapped to individual machine resources, to an approach where machine resources are automatically mapped and optimized to applications as they execute. Achieving such automated resource management for HPC systems is a daunting challenge that requires significant sustained investment in exploring new approaches and novel capabilities in software and hardware that span the spectrum from programming systems to device-level mechanisms. This paper provides an overview of the functionality needed to enable autonomous resource management and optimization and describes the components currently being explored at Sandia National Laboratories to help support this capability.

TYPE Conference Poster YEAR 2021

OSTI Scopus

RVMA: Remote Virtual Memory Access

Grant, Ryan; Levenhagen, Michael; Dosanjh, Matthew G.; Widener, Patrick

Abstract not provided.

TYPE Conference Paper YEAR 2020

DOI OSTI

Low-cost MPI Multithreaded Message Matching Benchmarking

Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020

Schonbein, William W.; Levy, Scott L.N.; Marts, William P.; Dosanjh, Matthew G.; Grant, Ryan

The Message Passing Interface (MPI) standard allows user-level threads to concurrently call into an MPI library. While this feature is currently rarely used, there is considerable interest from developers in adopting it in the near future. There is reason to believe that multithreaded communication may incur additional message processing overheads in terms of number of items searched during demultiplexing and amount of time spent searching because it has the potential to increase the number of messages exchanged and to introduce non-deterministic message ordering. Therefore, understanding the implications of adding multithreading to MPI applications is important for future application development. One strategy for advancing this understanding is through 'low-cost' benchmarks that emulate full communication patterns using fewer resources. For example, while a complete, 'real-world' multithreaded halo exchange requires 9 or 27 nodes, the low-cost alternative needs only two, making it deployable on systems where acquiring resources is difficult because of high utilization (e.g., busy capacity-computing systems), or impossible because the necessary resources do not exist (e.g., testbeds with too few nodes). While such benchmarks have been proposed, the reported results have been limited to a single architecture or derived indirectly through simulation, and no attempt has been made to confirm that a low-cost benchmark accurately captures features of full (non-emulated) exchanges. Moreover, benchmark code has not been made publicly available. The purpose of the study presented in this paper is to quantify how accurately the low-cost benchmark captures the matching behavior of the full, real-world benchmark. In the process, we also advocate for the feasibility and utility of the low-cost benchmark.
We present a 'real-world' benchmark implementing a full multithreaded halo exchange on 9 and 27 nodes, as defined by 5-point and 9-point 2D stencils, and 7-point and 27-point 3D stencils. Likewise, we present a 'low-cost' benchmark that emulates these communication patterns using only two nodes. We then confirm, across multiple architectures, that the low-cost benchmark gives accurate estimates of both number of items searched during message processing, and time spent processing those messages. Finally, we demonstrate the utility of the low-cost benchmark by using it to profile the performance impact of state-of-the-art Mellanox ConnectX-5 hardware support for offloaded MPI message demultiplexing. To facilitate further research on the effects of multithreaded MPI on message matching behavior, the source of our two benchmarks is to be included in the next release version of the Sandia MPI Micro-Benchmark Suite.

TYPE Conference Paper YEAR 2020

DOI OSTI Scopus
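
As a rough illustration of the stencil patterns named in the abstract above, the sketch below enumerates the neighbors a rank exchanges halos with for 5-point and 9-point 2D stencils and 7-point and 27-point 3D stencils, and shows why a full exchange around one rank spans a 3x3 (9-node) or 3x3x3 (27-node) grid. The function names are hypothetical and this is not the benchmark's code; it only assumes the usual definition of these stencils.

```python
from itertools import product


def stencil_neighbors(dims, full):
    """Return the neighbor offsets of one rank in a halo exchange.

    full=False keeps face neighbors only (5-point 2D, 7-point 3D).
    full=True also keeps edge and corner neighbors (9-point 2D, 27-point 3D).
    """
    offsets = []
    for offset in product((-1, 0, 1), repeat=dims):
        nonzero = sum(1 for c in offset if c != 0)
        if nonzero == 0:
            continue  # this is the rank itself, not a neighbor
        if full or nonzero == 1:
            offsets.append(offset)
    return offsets


def grid_nodes(dims):
    """Nodes in the smallest grid that surrounds one rank on every side."""
    return 3 ** dims


if __name__ == "__main__":
    for dims, full in ((2, False), (2, True), (3, False), (3, True)):
        n = len(stencil_neighbors(dims, full))
        print(f"{dims}D, {n + 1}-point stencil: {n} neighbors, "
              f"{grid_nodes(dims)} nodes in the surrounding grid")
```

The stencil's point count is the neighbor count plus the rank itself, which is why the 27-point stencil has 26 exchange partners.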

RaDD runtimes: Radical and different distributed runtimes with SmartNICs

Proceedings of IPDRM 2020: 4th Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Grant, Ryan; Schonbein, Whit; Levy, Scott L.N.

As network speeds increase, the overhead of processing incoming messages is becoming onerous enough that many manufacturers now provide network interface cards (NICs) with offload capabilities to handle these overheads. This increase in NIC capabilities creates an opportunity to enable computation on data in-situ on the NIC. These enhanced NICs can be classified into several different categories of SmartNICs. SmartNICs present an interesting opportunity for future runtime software designs. Designing runtime software to be located in the network as opposed to the host level leads to new radical distributed runtime possibilities that were not practical prior to SmartNICs. In the process of transitioning to a radically different runtime software design for SmartNICs there are intermediary steps of migrating current runtime software to be offloaded onto a SmartNIC that also present interesting possibilities. This paper will describe SmartNIC design and how SmartNICs can be leveraged to offload current generation runtime software and lead to future radically different in-network distributed runtime systems.

TYPE Conference Paper YEAR 2020

OSTI Scopus

Enabling Power Measurement and Control on Astra: The First Petascale Arm Supercomputer

Grant, Ryan; Hammond, Simon; Foulk, James W.; Levenhagen, Michael; Olivier, Stephen L.; Foulk, James W.; Ward, Harry L.; Younge, Andrew J.

Abstract not provided.

TYPE Conference Paper YEAR 2020

OSTI

Low-cost MPI Multithreaded Message Matching Benchmarking

Schonbein, William W.; Grant, Ryan; Levy, Scott L.N.; Dosanjh, Matthew G.; Marts, William P.

Abstract not provided.

TYPE Conference Presentation YEAR 2020

DOI OSTI

RaDD Runtimes: Radical and Different Distributed Runtimes with SmartNICs

Grant, Ryan; Schonbein, William W.; Levy, Scott L.N.

Abstract not provided.

TYPE Conference Presentation YEAR 2020

DOI OSTI

MPI Partitioned Communication

Grant, Ryan; Dosanjh, Matthew G.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

A Portable Implementation of Partitioned Point-to-Point Communication Primitives

Bangalore, Purushotham; Worley, Andrew; Schafer, Derek; Grant, Ryan; Skjellum, Anthony; Ghafoor, Sheikh

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

PowerAPI: A Standardized Interface to Power/Energy Monitoring and Control

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Tail queues: A multi-threaded matching architecture

Concurrency and Computation: Practice and Experience

Dosanjh, Matthew G.; Grant, Ryan; Schonbein, William W.; Bridges, Patrick G.

As we approach exascale, computational parallelism will have to drastically increase in order to meet throughput targets. Many-core architectures have exacerbated this problem by trading reduced clock speeds, core complexity, and computation throughput for increasing parallelism. This presents two major challenges for communication libraries such as MPI: the library must leverage the performance advantages of thread-level parallelism and avoid the scalability problems associated with increasing the number of processes to that scale. Hybrid programming models, such as MPI+X, have been proposed to address these challenges. MPI_THREAD_MULTIPLE is MPI's thread-safe mode. While there has been work to optimize it, it largely remains non-performant in most implementations. While current applications avoid MPI multithreading due to performance concerns, it is expected to be utilized in future applications. One of the major synchronous data structures required by MPI is the matching engine. In this paper, we present a parallel matching algorithm that can improve MPI matching for multithreaded applications. We then perform a feasibility study to demonstrate the performance benefit of the technique.

TYPE Journal Article YEAR 2020

DOI OSTI Scopus

MPI Partitioned Communication

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

Intelligent High-Performance Networks Via INCA

Grant, Ryan; Schonbein, William W.

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

A dynamic, unified design for dedicated message matching engines for collective and point-to-point communications

Parallel Computing

Ghazimirsaeed, S.M.; Grant, Ryan; Afsahi, Ahmad

The Message Passing Interface (MPI) libraries use message queues to guarantee correct message ordering between communicating processes. Message queues are in the critical path of MPI communications and thus, the performance of message queue operations can have significant impact on the performance of applications. Collective communications are widely used in MPI applications and they can have considerable impact on generating long message queues. In this paper, we propose a unified message matching mechanism that improves the message queue search time by distinguishing messages coming from point-to-point and collective communications and using a distinct message queue data structure for them. For collective operations, it dynamically profiles the impact of each collective call on message queues during the application runtime and uses this information to adapt the message queue data structure for each collective dynamically. Moreover, we use a partner/non-partner message queue data structure for the messages coming from point-to-point communications. The proposed approach can successfully reduce the queue search time while maintaining scalable memory consumption. The evaluation results show that we can obtain up to 5.5x runtime speedup for applications with long list traversals. Moreover, we can gain up to 15% and 94% queue search time improvement for all elements in applications with short and medium list traversals, respectively.

TYPE Journal Article YEAR 2019

DOI OSTI Scopus

Enabling HPC Workloads on Cloud Infrastructure Using Kubernetes Container Orchestration Mechanisms

Proceedings of CANOPIE-HPC 2019: 1st International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Beltre, Angel M.; Saha, Pankaj; Govindaraju, Madhusudhan; Grant, Ryan; Younge, Andrew J.

Containers offer a broad array of benefits, including a consistent lightweight runtime environment through OS-level virtualization, as well as low overhead to maintain and scale applications with high efficiency. Moreover, containers are known to package and deploy applications consistently across varying infrastructures. Container orchestrators manage a large number of containers for microservices based cloud applications. However, the use of such service orchestration frameworks towards HPC workloads remains relatively unexplored. In this paper we study the potential use of Kubernetes on HPC infrastructure for use by the scientific community. We directly compare both its features and performance against Docker Swarm and bare metal execution of HPC applications. Herein, we detail the configurations required for Kubernetes to operate with containerized MPI applications, specifically accounting for operations such as (1) underlying device access, (2) inter-container communication across different hosts, and (3) configuration limitations. This evaluation quantifies the performance difference between representative MPI workloads running both on bare metal and containerized orchestration frameworks with Kubernetes, operating over both Ethernet and InfiniBand interconnects. Our results show that Kubernetes and Docker Swarm can achieve near bare metal performance over RDMA communication when high performance transports are enabled. Our results also show that Kubernetes presents overheads for several HPC applications over TCP/IP protocol. However, Docker Swarm's throughput is near bare metal performance for the same applications.

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

The Case for Modular Generalizable Proxy Applications for Systems Software Research

Marts, William P.; Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

INCA: In-Network Compute Assistance

Schonbein, William W.; Grant, Ryan; Dosanjh, Matthew G.; Arnold, Dorian

Abstract not provided.

TYPE Conference Poster YEAR 2019

DOI OSTI

INCA: In-Network Compute Assistance

Schonbein, William W.; Grant, Ryan; Dosanjh, Matthew G.; Arnold, Dorian

Abstract not provided.

TYPE Conference Poster YEAR 2019

DOI OSTI

MPI Tag Matching Performance on ConnectX and ARM

Marts, William P.; Dosanjh, Matthew G.; Schonbein, William W.; Grant, Ryan; Bridges, Patrick

Abstract not provided.

TYPE Conference Poster YEAR 2019

DOI OSTI

Intelligent NICs and Threading for MPI

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

Using simulation to examine the effect of MPI message matching costs on application performance

Parallel Computing

Levy, Scott L.N.; Ferreira, Kurt; Schonbein, Whit; Grant, Ryan; Dosanjh, Matthew G.

Attaining high performance with MPI applications requires efficient message matching to minimize message processing overheads and the latency these overheads introduce into application communication. In this paper, we use a validated simulation-based approach to examine the relationship between MPI message matching performance and application time-to-solution. Specifically, we examine how the performance of several important HPC workloads is affected by the time required for matching. Our analysis yields several important contributions: (i) the performance of current workloads is unlikely to be significantly affected by MPI matching unless match queue operations get much slower or match queues get much longer; (ii) match queue designs that provide sublinear performance as a function of queue length are unlikely to yield much benefit unless match queue lengths increase dramatically; and (iii) we provide guidance on how long the mean time per match attempt may be without significantly affecting application performance. The results and analysis in this paper provide valuable guidance on the design and development of MPI message match queues.

TYPE Journal Article YEAR 2019

DOI OSTI Scopus

Fuzzy matching: Hardware accelerated MPI communication middleware

Proceedings - 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019

Dosanjh, Matthew G.; Schonbein, William W.; Grant, Ryan; Bridges, Patrick G.; Ghazimirsaeed, S.M.; Afsahi, Ahmad

Contemporary parallel scientific codes often rely on message passing for inter-process communication. However, inefficient coding practices or multithreading (e.g., via MPI_THREAD_MULTIPLE) can severely stress the underlying message processing infrastructure, resulting in potentially unacceptable impacts on application performance. In this article, we propose and evaluate a novel method for addressing this issue: 'Fuzzy Matching'. This approach has two components. First, it exploits the fact that most server-class CPUs include vector operations to parallelize message matching. Second, based on a survey of point-to-point communication patterns in representative scientific applications, the method further increases parallelization by allowing matches based on 'partial truth', i.e., by identifying probable rather than exact matches. We evaluate the impact of this approach on memory usage and performance on Knights Landing and Skylake processors. At scale (262,144 Intel Xeon Phi cores), the method shows up to 1.13 GiB of memory savings per node in the MPI library, and improvement in matching time of 95.9%; smaller-scale runs show run-time improvements of up to 31.0% for full applications, and up to 6.1% for optimized proxy applications.

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

Fuzzy Matching: Accelerating MPI Matching with Vector Comparisons Based on Partial Truth

Dosanjh, Matthew G.; Schonbein, William W.; Grant, Ryan; Bridges, Patrick; Ghazimirsaeed, Mahdieh; Afsahi, Ahmad

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Future High-Performance Networks

Abstract not provided.

TYPE Presentation YEAR 2019

OSTI

Intelligent NICs and the Future of MPI

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Hardware MPI message matching: Insights into MPI matching behavior to inform design

Concurrency and Computation: Practice and Experience

Ferreira, Kurt; Grant, Ryan; Levenhagen, Michael; Levy, Scott L.N.; Groves, Taylor

This paper explores key differences of MPI match lists for several important United States Department of Energy (DOE) applications and proxy applications. This understanding is critical in determining the most promising hardware matching design for any given high-speed network. The results of MPI match list studies for the major open-source MPI implementations, MPICH and Open MPI, are presented, and we modify an MPI simulator, LogGOPSim, to provide match list statistics. These results are discussed in the context of several different potential design approaches to MPI matching-capable hardware. The data illustrate the requirements for different hardware designs in terms of performance and memory capacity. Finally, this paper's contributions are the collection and analysis of data to help inform hardware designers of common MPI requirements and highlight the difficulties in determining these requirements by only examining a single MPI implementation.

TYPE Journal Article YEAR 2019

DOI OSTI

Finepoints: Partitioned multithreaded MPI communication

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Grant, Ryan; Dosanjh, Matthew G.; Levenhagen, Michael; Brightwell, Ronald B.; Skjellum, Anthony

The MPI multithreading model has been historically difficult to optimize; the interface that it provides for threads was designed as a process-level interface. This model has led to implementations that treat function calls as critical regions and protect them with locks to avoid race conditions. We hypothesize that an interface designed specifically for threads can provide better performance than current approaches and even outperform single-threaded MPI. In this paper, we describe a design for partitioned communication in MPI that we call finepoints. First, we assess the existing communication models for MPI two-sided communication and then introduce finepoints as a hybrid of MPI models that has the best features of each existing MPI communication model. In addition, “partitioned communication” created with finepoints leverages new network hardware features that cannot be exploited with current MPI point-to-point semantics, making this new approach innovative and useful both now and in the future. To demonstrate the validity of our hypothesis, we implement a finepoints library and show improvements against a state-of-the-art multithreaded optimized Open MPI implementation on a Cray XC40 with an Aries network. Our experiments demonstrate up to a 12× reduction in wait time for completion of send operations. This new model is shown working on a nuclear reactor physics neutron-transport proxy-application, providing up to 26.1% improvement in communication time and up to 4.8% improvement in runtime over the best performing MPI communication mode, single-threaded MPI.

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus
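
The idea of partitioned communication in the abstract above can be pictured with a toy model: one send buffer is split into partitions, each thread fills its own partition, and each partition becomes eligible for transfer as soon as its thread marks it ready, without waiting for the others. The sketch below is only an in-process simulation under that assumption. The class and method names (PartitionedSend, ready, wait) are hypothetical, it is not the finepoints library or an MPI binding, and the lock it uses is bookkeeping for the simulation rather than a statement about how a real implementation synchronizes.

```python
import threading


class PartitionedSend:
    """Toy stand-in for a partitioned send request (not an MPI binding)."""

    def __init__(self, buffer, num_partitions):
        assert len(buffer) % num_partitions == 0
        self.buffer = buffer
        self.num_partitions = num_partitions
        self.part_len = len(buffer) // num_partitions
        self.sent = {}  # partition index -> copy of the data "on the wire"
        self._lock = threading.Lock()
        self._done = threading.Event()

    def ready(self, index):
        """Mark one partition as filled; it is transferred right away."""
        start = index * self.part_len
        chunk = list(self.buffer[start:start + self.part_len])
        with self._lock:
            self.sent[index] = chunk
            if len(self.sent) == self.num_partitions:
                self._done.set()

    def wait(self):
        """Block until every partition has been transferred."""
        self._done.wait()


def worker(request, index):
    # Each thread writes only its own partition, so the writes do not conflict.
    start = index * request.part_len
    for i in range(request.part_len):
        request.buffer[start + i] = index * 100 + i
    request.ready(index)


def run(num_threads=4, part_len=3):
    buf = [0] * (num_threads * part_len)
    request = PartitionedSend(buf, num_threads)
    threads = [threading.Thread(target=worker, args=(request, t))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    request.wait()
    for t in threads:
        t.join()
    return request


if __name__ == "__main__":
    req = run()
    for index in sorted(req.sent):
        print(index, req.sent[index])
```

The point of the model is that completion is tracked per partition, so one slow thread delays only its own slice of the message.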

The upcoming storm: The implications of increasing core count on scalable system software

Advances in Parallel Computing

Dosanjh, Matthew G.; Grant, Ryan; Hjelm, Nathan; Levy, Scott L.N.; Schonbein, William W.

As clock speeds have stagnated, the number of cores in a node has been drastically increased to improve processor throughput. Most scalable system software was designed and developed for single-threaded environments. Multithreaded environments become increasingly prominent as application developers optimize their codes to leverage the full performance of the processor; however, these environments are incompatible with a number of assumptions that have driven scalable system software development. This paper will highlight a case study of this mismatch focusing on MPI message matching. MPI message matching has been designed and optimized for traditional serial execution. The reduced determinism in the order of MPI calls can significantly reduce the performance of MPI message matching, potentially overtaking time-per-iteration targets of many applications. Different proposed techniques attempt to address these issues and enable multithreaded MPI usage. These approaches highlight a number of tradeoffs that make adapting MPI message matching complex. This case study and its proposed solutions highlight a number of general concepts that need to be leveraged in the design of next generation scalable system software.

TYPE Book YEAR 2019

OSTI Scopus

SNL ATDM Software Ecosystem

Olivier, Stephen L.; Brightwell, Ronald B.; Foulk, James W.; Younge, Andrew J.; Evans, Noah; Levy, Scott L.N.; Ferreira, Kurt; Grant, Ryan

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

The Portals 4.2 Network Programming Interface

Barrett, Brian W.; Brightwell, Ronald B.; Grant, Ryan; Hemmert, Karl S.; Foulk, James W.; Wheeler, Kyle; Riesen, Rolf; Hoefler, Torsten; Maccabe, Arthur B.; Hudson, Trammell

This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

TYPE SAND Report YEAR 2018

DOI OSTI

A survey of MPI usage in the US exascale computing project

Concurrency and Computation: Practice and Experience

Bernholdt, David E.; Boehm, Swen; Bosilca, George; Venkata, Manjunath G.; Grant, Ryan; Naughton, Thomas; Pritchard, Howard P.; Schulz, Martin; Vallee, Geoffroy R.

The Exascale Computing Project (ECP) is currently the primary effort in the United States focused on developing “exascale” levels of computing capabilities, including hardware, software, and applications. In order to obtain a more thorough understanding of how the software projects under the ECP are using, and planning to use the Message Passing Interface (MPI), and help guide the work of our own project within the ECP, we created a survey. Of the 97 ECP projects active at the time the survey was distributed, we received 77 responses, 56 of which reported that their projects were using MPI. Furthermore, this paper reports the results of that survey for the benefit of the broader community of MPI developers.

TYPE Journal Article YEAR 2018

DOI OSTI

Characterizing MPI matching via trace-based simulation

Parallel Computing

Ferreira, Kurt; Levy, Scott L.N.; Foulk, James W.; Grant, Ryan

With the increased scale expected on future leadership-class systems, detailed information about the resource usage and performance of MPI message matching provides important insights into how to maintain application performance on next-generation systems. However, obtaining MPI message matching performance data is often not possible without significant effort. A common approach is to instrument an MPI implementation to collect relevant statistics. While this approach can provide important data, collecting matching data at runtime perturbs the application's execution, including its matching performance, and is highly dependent on the MPI library's matchlist implementation. In this paper, we introduce a trace-based simulation approach to obtain detailed MPI message matching performance data for MPI applications without perturbing their execution. Using a number of key parallel workloads and microbenchmarks, we demonstrate that this simulator approach can rapidly and accurately characterize matching behavior. Specifically, we use our simulator to collect several important statistics about the operation of the MPI posted and unexpected queues. For example, we present data about search lengths and the duration that messages spend in the queues waiting to be matched. Data gathered using this simulation-based approach have significant potential to aid hardware designers in determining resource allocation for MPI matching functions and provide application and middleware developers with insight into the scalability issues associated with MPI message matching.

TYPE Conference Poster YEAR 2018

OSTI Scopus
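
To make the statistics mentioned in the abstract above concrete, the sketch below models a posted receive queue and an unexpected message queue as plain lists and records how many entries each search inspects. It is a simplified illustration, not the authors' simulator: the class name MatchEngine is hypothetical, matching uses only (source, tag) and omits the communicator, and ANY stands in for the MPI_ANY_SOURCE and MPI_ANY_TAG wildcards.

```python
ANY = -1  # wildcard, standing in for MPI_ANY_SOURCE / MPI_ANY_TAG


class MatchEngine:
    """Linear-search model of the posted and unexpected queues."""

    def __init__(self):
        self.posted = []          # receives waiting for a message
        self.unexpected = []      # messages that arrived before their receive
        self.search_lengths = []  # entries inspected by each search

    @staticmethod
    def _matches(recv, msg):
        # A receive field matches if it is a wildcard or equals the message field.
        return all(r == ANY or r == m for r, m in zip(recv, msg))

    def _search(self, queue, predicate):
        # Walk the queue in order, so the oldest matching entry wins.
        for depth, item in enumerate(queue, start=1):
            if predicate(item):
                self.search_lengths.append(depth)
                return queue.pop(depth - 1)
        self.search_lengths.append(len(queue))
        return None

    def post_recv(self, source, tag):
        """Post a receive: check the unexpected queue first, else enqueue it."""
        recv = (source, tag)
        msg = self._search(self.unexpected, lambda m: self._matches(recv, m))
        if msg is None:
            self.posted.append(recv)
        return msg

    def arrive(self, source, tag):
        """Handle an incoming message: check posted receives, else enqueue it."""
        msg = (source, tag)
        recv = self._search(self.posted, lambda r: self._matches(r, msg))
        if recv is None:
            self.unexpected.append(msg)
        return recv


if __name__ == "__main__":
    engine = MatchEngine()
    engine.post_recv(0, 1)
    engine.post_recv(ANY, 2)
    engine.arrive(3, 2)       # matches the wildcard receive, second in the queue
    engine.arrive(5, 9)       # no receive posted yet, becomes unexpected
    engine.post_recv(5, ANY)  # matches the unexpected message
    print(engine.search_lengths)
```

Collecting the search depth per operation is what allows mean and maximum search lengths to be reported for each queue.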

FY18 L2 Milestone #6360 Report: Initial Capability of an Arm-based Advanced Architecture Prototype System and Software Environment

Foulk, James W.; Foulk, James W.; Hammond, Simon; Aguilar, Michael J.; Curry, Matthew L.; Grant, Ryan; Hoekstra, Robert J.; Klundt, Ruth A.; Monk, Stephen T.; Ogden, Jeffry B.; Olivier, Stephen L.; Scott, Randall D.; Ward, Harry L.; Younge, Andrew J.

The Vanguard program informally began in January 2017 with the submission of a white paper entitled "Sandia's Vision for a 2019 Arm Testbed" to NNSA headquarters. The program proceeded in earnest in May 2017 with an announcement by Doug Wade (Director, Office of Advanced Simulation and Computing and Institutional R&D at NNSA) that Sandia National Laboratories (Sandia) would host the first Advanced Architecture Prototype platform based on the Arm architecture. In August 2017, Sandia formed a Tri-lab team chartered to develop a robust HPC software stack for Astra to support the Vanguard program goal of demonstrating the viability of Arm in supporting ASC production computing workloads. This document describes the high-level Vanguard program goals, the Vanguard-Astra project acquisition plan and procurement up to contract placement, the initial software stack environment planned for the Vanguard-Astra platform (Astra), a description of how the communities of users will utilize the platform during the transition from the open network to the classified network, and initial performance results.

TYPE SAND Report YEAR 2018

DOI OSTI

Vanguard Astra and ATSE – an ARM-based Advanced Architecture Prototype System and Software Environment (FY18 L2 Milestone #8759 Report)

Foulk, James W.; Foulk, James W.; Hammond, Simon; Aguilar, Michael J.; Curry, Matthew L.; Grant, Ryan; Hoekstra, Robert J.; Klundt, Ruth A.; Monk, Stephen T.; Ogden, Jeffry B.; Olivier, Stephen L.; Scott, Randall D.; Ward, Harry L.; Younge, Andrew J.

The Vanguard program informally began in January 2017 with the submission of a white paper entitled "Sandia's Vision for a 2019 Arm Testbed" to NNSA headquarters. The program proceeded in earnest in May 2017 with an announcement by Doug Wade (Director, Office of Advanced Simulation and Computing and Institutional R&D at NNSA) that Sandia National Laboratories (Sandia) would host the first Advanced Architecture Prototype platform based on the Arm architecture. In August 2017, Sandia formed a Tri-lab team chartered to develop a robust HPC software stack for Astra to support the Vanguard program goal of demonstrating the viability of Arm in supporting ASC production computing workloads.

TYPE SAND Report YEAR 2018

DOI OSTI

The case for semi-permanent cache occupancy

ACM International Conference Proceeding Series

Dosanjh, Matthew G.; Ghazimirsaeed, S.M.; Grant, Ryan; Schonbein, William W.; Levenhagen, Michael; Bridges, Patrick G.; Afsahi, Ahmad

The performance-critical path for MPI implementations relies on fast receive-side operation, which in turn requires fast list traversal. The performance of list traversal depends on data locality: whether the data is currently contained in a close-to-core cache due to its temporal locality, or whether its spatial locality allows for predictable prefetching. In this paper, we explore the effects of data locality on the MPI matching problem by examining both forms of locality. First, we explore spatial locality: by combining multiple entries into a single linked list element, we can control and modify this form of locality. Second, we explore temporal locality by utilizing a new technique called “hot caching”, a process that creates a thread to periodically access certain data, increasing its temporal locality. In this paper, we show that by increasing data locality, we can improve MPI performance on a variety of architectures up to 4x for micro-benchmarks and up to 2x for an application.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus
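
As a rough illustration of the spatial-locality idea in the abstract above (several match entries packed into one linked-list element), the following Python sketch shows only the data-structure shape. It is not the authors' implementation: the names `UnrolledMatchList`, `Node`, `entries_per_node`, and the `ANY` wildcard are hypothetical, and Python cannot demonstrate the cache effects themselves. The sketch only shows that packing entries reduces the number of list elements a search has to visit.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

ANY = -1  # hypothetical wildcard, standing in for MPI_ANY_SOURCE / MPI_ANY_TAG


@dataclass
class Node:
    """One list element holding up to `capacity` posted receives."""
    capacity: int
    entries: List[Tuple[int, int]] = field(default_factory=list)  # (source, tag)
    next: Optional["Node"] = None


class UnrolledMatchList:
    """Posted-receive list that packs several entries into each element."""

    def __init__(self, entries_per_node: int = 4):
        self.entries_per_node = entries_per_node
        self.head: Optional[Node] = None
        self.tail: Optional[Node] = None

    def post(self, source: int, tag: int) -> None:
        """Append a posted receive, opening a new node only when the tail is full."""
        if self.tail is None or len(self.tail.entries) == self.tail.capacity:
            node = Node(self.entries_per_node)
            if self.tail is None:
                self.head = node
            else:
                self.tail.next = node
            self.tail = node
        self.tail.entries.append((source, tag))

    def match(self, source: int, tag: int) -> Tuple[bool, int]:
        """Remove the oldest matching entry; return (found, nodes_visited)."""
        visited = 0
        node = self.head
        while node is not None:
            visited += 1
            for i, (s, t) in enumerate(node.entries):
                if s in (ANY, source) and t in (ANY, tag):
                    del node.entries[i]
                    return True, visited
            node = node.next
        return False, visited
```

With eight posted receives, a search for the sixth entry visits two elements when `entries_per_node` is 4 and six elements when it is 1. For simplicity the sketch leaves emptied nodes in the list; a real matching engine would need to reclaim them.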

A comparison of power management mechanisms: P-States vs. node-level power cap control

Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018

Foulk, James W.; Grant, Ryan; Foulk, James W.; Levenhagen, Michael; Olivier, Stephen L.; Ward, Harry L.; Younge, Andrew J.

Large-scale HPC systems increasingly incorporate sophisticated power management control mechanisms. While these mechanisms are potentially useful for performing energy and/or power-aware job scheduling and resource management (EPA JSRM), greater understanding of their operation and performance impact on real-world applications is required before they can be applied effectively in practice. In this paper, we compare static p-state control to static node-level power cap control on a Cray XC system. Empirical experiments are performed to evaluate node-to-node performance and power usage variability for the two mechanisms. We find that static p-state control produces more predictable and higher performance characteristics than static node-level power cap control at a given power level. However, this performance benefit is at the cost of less predictable power usage. Static node-level power cap control produces predictable power usage but with more variable performance characteristics. Our results are not intended to show that one mechanism is better than the other. Rather, our results demonstrate that the mechanisms are complementary to one another and highlight their potential for combined use in achieving effective EPA JSRM solutions.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus

A comparison of power management mechanisms: P-States vs. node-level power cap control

Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018

Foulk, James W.; Grant, Ryan; Foulk, James W.; Levenhagen, Michael; Olivier, Stephen L.; Ward, Harry L.; Younge, Andrew J.

Large-scale HPC systems increasingly incorporate sophisticated power management control mechanisms. While these mechanisms are potentially useful for performing energy and/or power-aware job scheduling and resource management (EPA JSRM), greater understanding of their operation and performance impact on real-world applications is required before they can be applied effectively in practice. In this paper, we compare static p-state control to static node-level power cap control on a Cray XC system. Empirical experiments are performed to evaluate node-to-node performance and power usage variability for the two mechanisms. We find that static p-state control produces more predictable and higher performance characteristics than static node-level power cap control at a given power level. However, this performance benefit is at the cost of less predictable power usage. Static node-level power cap control produces predictable power usage but with more variable performance characteristics. Our results are not intended to show that one mechanism is better than the other. Rather, our results demonstrate that the mechanisms are complementary to one another and highlight their potential for combined use in achieving effective EPA JSRM solutions.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus

A Dedicated Message Matching Mechanism for Collective Communications

Ghazimirsaeed, Mahdieh; Grant, Ryan; Afsahi, Ahmad

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI

Improving MPI Multi-threaded RMA Performance

Hjelm, Nathan; Dosanjh, Matthew G.; Groves, Taylor; Grant, Ryan; Brightwell, Ronald B.; Bridges, Patrick; Arnold, Dorian

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI

The Case for Semi-Permanent Cache Occupancy

Dosanjh, Matthew G.; Ghazimirsaeed, S.M.; Grant, Ryan; Schonbein, William W.; Levenhagen, Michael; Bridges, Patrick G.; Afsahi, Ahmad

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI

Portals 4: Status of Specification and Implementation

Younge, Andrew J.; Grant, Ryan; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Measuring Multithreaded Message Matching Misery

Schonbein, William W.; Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI

Advanced Power Measurement and Control for the Trinity Supercomputer

Younge, Andrew J.; Grant, Ryan; Foulk, James W.; Levenhagen, Michael; Olivier, Stephen L.; Foulk, James W.; Ward, Harry L.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

OSTI

A Tale of Two Systems: Using Containers to Deploy HPC Applications on Supercomputers and Clouds

Younge, Andrew J.; Foulk, James W.; Grant, Ryan; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI

Measuring Multithreaded Message Matching Misery

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Schonbein, William W.; Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick G.

MPI usage patterns are changing as applications move towards fully-multithreaded runtimes. However, the impact of these patterns on MPI message matching is not well-studied. In particular, MPI’s mechanic for receiver-side data placement, message matching, can be impacted by increased message volume and nondeterminism incurred by multithreading. While there has been significant developer interest and work to provide an efficient MPI interface for multithreaded access, there has not been a study showing how these patterns affect messaging patterns and matching behavior. In this paper, we present a framework for studying the effects of multithreading on MPI message matching. This framework allows us to explore the implications of different common communication patterns and thread-level decompositions. We present a study of these impacts on the architecture of two of the Top 10 supercomputers (NERSC’s Cori and LANL’s Trinity). This data provides a baseline to evaluate reasonable matching engine queue lengths, search depths, and queue drain times under the multithreaded model. Furthermore, the study highlights surprising results on the challenge posed by message matching for multithreaded application performance.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus
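
The abstract above refers to matching-engine search depths under multithreading. A minimal sketch of what "search depth" means, assuming a single posted-receive queue searched linearly from the oldest entry, is given below. The function name `search_depths` and the `(source, tag)` tuples are illustrative assumptions, not part of the framework the paper describes.

```python
from typing import List, Tuple

Envelope = Tuple[int, int]  # (source, tag); a simplified message envelope


def search_depths(posted: List[Envelope], arrivals: List[Envelope]) -> List[int]:
    """Return how many queue entries are inspected for each arriving message.

    The posted-receive queue is searched from the oldest entry. A matched
    entry is removed; an unmatched arrival costs a full traversal.
    """
    queue = list(posted)  # copy so the caller's list is left untouched
    depths = []
    for msg in arrivals:
        for depth, recv in enumerate(queue, start=1):
            if recv == msg:
                depths.append(depth)
                del queue[depth - 1]
                break
        else:
            depths.append(len(queue))  # no match: the whole queue was searched
    return depths
```

When messages arrive in the order the receives were posted, every search stops at the first entry. When the arrival order is reversed, as a stand-in for the nondeterminism that multithreading can introduce, each search walks to the end of the remaining queue.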

A Tale of Two Systems: Using Containers to Deploy HPC Applications on Supercomputers and Clouds

Proceedings of the International Conference on Cloud Computing Technology and Science, CloudCom

Younge, Andrew J.; Foulk, James W.; Grant, Ryan; Brightwell, Ronald B.

Containerization, or OS-level virtualization, has taken root within the computing industry. However, container utilization and its impact on performance and functionality within High Performance Computing (HPC) is still relatively undefined. This paper investigates the use of containers with advanced supercomputing and HPC system software. With this, we define a model for parallel MPI application DevOps and deployment using containers to enhance development effort and provide container portability from laptop to clouds or supercomputers. In this endeavor, we extend the use of Singularity containers to a Cray XC-series supercomputer. We use the HPCG and IMB benchmarks to investigate potential points of overhead and scalability with containers on a Cray XC30 testbed system. Furthermore, we also deploy the same containers with Docker on Amazon's Elastic Compute Cloud (EC2), and compare against our Cray supercomputer testbed. Our results indicate that Singularity containers operate at native performance when dynamically linking Cray's MPI libraries on a Cray supercomputer testbed, and that while Amazon EC2 may be useful for initial DevOps and testing, scaling HPC applications better fits supercomputing resources like a Cray.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Unraveling Network-induced Memory Contention: Deeper Insights with Machine Learning

IEEE Transactions on Parallel and Distributed Systems

Groves, Taylor L.; Grant, Ryan; Gonzales, Aaron; Arnold, Dorian

Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems, enabling asynchronous data transfers so that applications may fully utilize CPU resources while simultaneously sharing data amongst remote nodes. We examine Network-induced Memory Contention (NiMC) on InfiniBand networks. We expose the interactions between RDMA, main memory, and cache when applications and out-of-band services compete for memory resources. We then explore NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantify NiMC and show that NiMC's impact grows with scale, resulting in up to 3X performance degradation at scales as small as 8K processes, even in applications that previously have been shown to be performance resilient in the presence of noise. In addition, this work examines the problem of predicting NiMC's impact on applications by leveraging machine learning and easily accessible performance counters. This approach provides additional insights about the root cause of NiMC and facilitates dynamic selection of potential solutions. Finally, we evaluated three potential techniques to reduce NiMC's impact, namely hardware offloading, core reservation, and network throttling.

More Details

TYPE Journal Article YEAR 2017

DOI OSTI

Power API User Experiences

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Power API Overview

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Hardware MPI Message Matching: Insights into MPI Matching Behavior to Inform Design

Ferreira, Kurt; Grant, Ryan; Levenhagen, Michael; Levy, Scott L.N.; Groves, Taylor

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI

Tail Queues - A Multi-threaded Matching Architecture

Dosanjh, Matthew G.; Grant, Ryan; Schonbein, William W.; Bridges, Patrick

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI

Enabling Diverse Software Stacks on Supercomputers Using High Performance Virtual Clusters

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Younge, Andrew J.; Foulk, James W.; Grant, Ryan; Gaines, Brian; Brightwell, Ronald B.

While large-scale simulations have been the hallmark of the High Performance Computing (HPC) community for decades, Large Scale Data Analytics (LSDA) workloads are gaining attention within the scientific community not only as a processing component to large HPC simulations, but also as standalone scientific tools for knowledge discovery. With the path towards Exascale, new HPC runtime systems are also emerging in a way that differs from classical distributed computing models. However, system software for such capabilities on the latest extreme-scale DOE supercomputing needs to be enhanced to more appropriately support these types of emerging software ecosystems. In this paper, we propose the use of Virtual Clusters on advanced supercomputing resources to enable systems to support not only HPC workloads, but also emerging big data stacks. Specifically, we have deployed the KVM hypervisor within Cray's Compute Node Linux on an XC-series supercomputer testbed. We also use libvirt and QEMU to manage and provision VMs directly on compute nodes, leveraging Ethernet-over-Aries network emulation. To our knowledge, this is the first known use of KVM on a true MPP supercomputer. We investigate the overhead of our solution using HPC benchmarks, evaluating both single-node performance and weak scaling of a 32-node virtual cluster. Overall, we find single-node performance of our solution using KVM on a Cray is very efficient, with near-native performance. However, overhead increases by up to 20% as virtual cluster size increases, due to limitations of the Ethernet-over-Aries bridged network. Furthermore, we deploy Apache Spark with large data analysis workloads in a Virtual Cluster, effectively demonstrating how diverse software ecosystems can be supported by High Performance Virtual Clusters.

More Details

TYPE Conference Poster YEAR 2017

OSTI Scopus

Enabling Diverse Software Stacks on Supercomputers using High Performance Virtual Clusters

Younge, Andrew J.; Foulk, James W.; Grant, Ryan; Gaines, Brian; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Vanguard: Maturing the ARM Software Ecosystem for U.S. DOE Supercomputing

Foulk, James W.; Foulk, James W.; Grant, Ryan; Hammond, Simon; Hemmert, Karl S.; Martinez, David; Noe, John P.; Foulk, James W.; Ward, Harry L.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

Preparing MPI for Exascale

Grant, Ryan; Dosanjh, Matthew G.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Presentation YEAR 2017

OSTI

Evaluating Energy and Power Profiling Techniques for HPC Workloads

Younge, Andrew J.; Grant, Ryan; Foulk, James W.; Levenhagen, Michael; Olivier, Stephen L.; Foulk, James W.; Ward, Harry L.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

PowerAPI: Power Measurement and Control for the Extreme Scale Computing Community

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

The Portals 4.1 Network Programming Interface

Barrett, Brian; Brightwell, Ronald B.; Grant, Ryan; Hemmert, Karl S.; Foulk, James W.; Wheeler, Kyle; Underwood, Keith D.; Riesen, Rolf; Maccabe, Arthur B.; Hudson, Trammel

This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

More Details

TYPE SAND Report YEAR 2017

DOI OSTI

High Performance Computing - Power Application Programming Interface Specification Version 2.0

Foulk, James W.; Grant, Ryan; Levenhagen, Michael; Olivier, Stephen L.; Foulk, James W.; Ward, Harry L.; Younge, Andrew J.

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager.

More Details

TYPE SAND Report YEAR 2017

DOI OSTI

Understanding and Avoiding Performance Variability in High Performance Networks

Grant, Ryan; Groves, Taylor L.; Foulk, James W.; Gentile, Ann C.; Arnold, Dorian

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2017

OSTI

sPIN: High-performance streaming Processing in the Network

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Hoefler, Torsten; Di Girolamo, Salvatore; Taranov, Konstantin; Grant, Ryan; Brightwell, Ronald B.

Optimizing communication performance is imperative for large-scale computing because communication overheads limit the strong scalability of parallel applications. Today's network cards contain rather powerful processors optimized for data movement. However, these devices are limited to fixed functions, such as remote direct memory access. We develop sPIN, a portable programming model to offload simple packet processing functions to the network card. To demonstrate the potential of the model, we design a cycle-accurate simulation environment by combining the network simulator LogGOPSim and the CPU simulator gem5. We implement offloaded message matching, datatype processing, and collective communications and demonstrate transparent full-application speedups. Furthermore, we show how sPIN can be used to accelerate redundant in-memory filesystems and several other use cases. Our work investigates a portable packet-processing network acceleration model similar to compute acceleration with CUDA or OpenCL. We show how such network acceleration enables an ecosystem that can significantly speed up applications and system services.

More Details

TYPE Conference Poster YEAR 2017

OSTI Scopus

Modeling Concurrent Point-to-Point Communication Cost in MPI Performance Models

Farmer, Shane; Skjellum, Anthony; Bridges, Patrick G.; Dosanjh, Matthew G.; Grant, Ryan; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

The Portals Network API

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

High Performance Computing - Power Application Programming Interface Specification Version 1.4

Foulk, James W.; Debonis, David; Grant, Ryan; Kelly, Suzanne M.; Levenhagen, Michael; Olivier, Stephen L.; Foulk, James W.

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area [13, 3, 5, 10, 4, 21, 19, 16, 7, 17, 20, 18, 11, 1, 6, 14, 12]. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager.

More Details

TYPE SAND Report YEAR 2016

DOI OSTI

Standardizing Power Monitoring and Control at Exascale

Computer

Grant, Ryan; Levenhagen, Michael; Olivier, Stephen L.; Debonis, David; Foulk, James W.; Foulk, James W.

Power API - the result of collaboration among national laboratories, universities, and major vendors - provides a range of standardized power management functions, from application-level control and measurement to facility-level accounting, including real-time and historical statistics gathering. Support is already available for Intel and AMD CPUs and standalone measurement devices.

More Details

TYPE Journal Article YEAR 2016

DOI OSTI Scopus

MPI Sessions: Leveraging runtime infrastructure to increase scalability of applications at exascale

ACM International Conference Proceeding Series

Holmes, Daniel; Mohror, Kathryn; Grant, Ryan; Skjellum, Anthony; Schulz, Martin; Bland, Wesley; Squyres, Jeffrey M.

MPI includes all processes in MPI_COMM_WORLD; this is untenable for reasons of scale, resiliency, and overhead. This paper offers a new approach, extending MPI with a new concept called Sessions, which makes two key contributions: a tighter integration with the underlying runtime system; and a scalable route to communication groups. This is a fundamental change in how we organise and address MPI processes that removes well-known scalability barriers by no longer requiring the global communicator MPI_COMM_WORLD.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

Stalled Active and Idle (SAI): Characterizing Large-scale Dragonfly Networks

Groves, Taylor L.; Hammond, Simon; Hemmert, Karl S.; Grant, Ryan; Levenhagen, Michael; Arnold, Dorian

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

NiMC: Characterizing and Eliminating Network-Induced Memory Contention

Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Groves, Taylor L.; Grant, Ryan; Arnold, Dorian

Remote Direct Memory Access (RDMA) is expected to be an integral communication mechanism for future exascale systems - enabling asynchronous data transfers, so that applications may fully utilize all CPU resources while simultaneously sharing data amongst remote nodes. We examined this network-induced memory contention (NiMC), the interactions between RDMA and the memory subsystem when applications and out-of-band services compete for memory resources, and NiMC's resulting impact on application-level performance. For a range of hardware technologies and HPC workloads, we quantified NiMC and show that NiMC's impact grows with scale resulting in up to 3X performance degradation at scales as small as 8K processes even in applications that previously have been shown to be performance resilient in the presence of noise. We also evaluated three potential techniques to reduce NiMC's performance impact, namely hardware offloading, core reservation and software-based network throttling. While all three of these solutions show promise, we provide guidelines that help select the best solution for a given environment.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

(SAI) Stalled Active and Idle: Characterizing Power and Performance of Large-Scale Dragonfly Networks

Groves, Taylor L.; Grant, Ryan; Hemmert, Karl S.; Hammond, Simon; Levenhagen, Michael; Arnold, Dorian

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI

Extreme Computing: Pushing the Frontiers of Science

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

OSTI

High Performance Computing: Power Application Programming Interface Specification (V.1.3)

Foulk, James W.; Kelly, Suzanne M.; Foulk, James W.; Grant, Ryan; Olivier, Stephen L.; Levenhagen, Michael; Debonis, David

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area [13, 3, 5, 10, 4, 21, 19, 16, 7, 17, 20, 18, 11, 1, 6, 14, 12]. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager.

More Details

TYPE SAND Report YEAR 2016

DOI OSTI

An Overview of Sandia National Laboratory's High Performance Computing Power Application Programming Interface (API) Specification

Foulk, James W.; Foulk, James W.; Grant, Ryan; Olivier, Stephen L.; Levenhagen, Michael; Debonis, David

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Overcoming Challenges in Scalable Power Monitoring with the Power API

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI

ACES and Cray Collaborate on Advanced Power Management for Trinity

Foulk, James W.; Foulk, James W.; Grant, Ryan; Olivier, Stephen L.; Levenhagen, Michael

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Data Movement with MPI in a Multi-Threaded World

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Overcoming Challenges in Scalable Power Monitoring with the Power API

Grant, Ryan; Levenhagen, Michael; Olivier, Stephen L.; Debonis, David; Foulk, James W.; Foulk, James W.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI

RMA-MT: A Benchmark Suite for Assessing MPI Multi-threaded RMA Performance

Dosanjh, Matthew G.; Groves, Taylor L.; Grant, Ryan; Brightwell, Ronald B.; Bridges, Patrick G.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI

Simplifying MPI Threading Levels

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

SHMEM-MT: A benchmark suite for assessing multi-threaded SHMEM performance

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Weeks, Hans; Dosanjh, Matthew G.; Bridges, Patrick G.; Grant, Ryan

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

NiMC: Characterizing and Eliminating Network-Induced Memory Contention

Groves, Taylor L.; Grant, Ryan; Arnold, Dorian

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

DOI OSTI

An Overview of Sandia National Laboratory's High Performance Computing Power Application Programming Interface (API) Specification

Foulk, James W.; Foulk, James W.; Grant, Ryan; Olivier, Stephen L.; Levenhagen, Michael; Debonis, David

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Overtime: A Tool for Analyzing Performance Variation due to Network Interference

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

DOI OSTI

Lightweight Threading with MPI Using Persistent Communications Semantics

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Power API for HPC: Standardizing Power Measurement and Control

Foulk, James W.; Foulk, James W.; Kelly, Suzanne M.; Levenhagen, Michael; Debonis, David; Olivier, Stephen L.; Grant, Ryan

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Lightweight threading with MPI using Persistent Communications Semantics

Grant, Ryan; Skjellum, Anthony; Bangalore, Purushotham V.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Preparing for Exascale: Modeling MPI for Many-Core Systems using Fine-Grain Queues

Bridges, Patrick G.; Dosanjh, Matthew G.; Grant, Ryan; Farmer, Shane; Skjellum, Anthony; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

DOI OSTI

Power Aware Dynamic Provisioning of HPC Networks

Groves, Taylor L.; Grant, Ryan

Future exascale systems are under increased pressure to find power savings. The network, while it consumes a considerable amount of power, is often left out of the picture when discussing total system power. Even when network power is being considered, the references are frequently a decade or older and rely on models that lack validation on modern interconnects. In this work we explore how dynamic mechanisms of an InfiniBand network save power and at what granularity we can engage these features. We explore this within the context of the host controller adapter (HCA) on the node and for the fabric, i.e. switches, using three different mechanisms of dynamic link width, frequency and disabling of links for QLogic and Mellanox systems. Our results show that while there is some potential for modest power savings, real-world systems need improved responsiveness to adjustments in order to fully leverage these savings.

More Details

TYPE SAND Report YEAR 2015

DOI OSTI

Re-evaluating Network Onload vs. Offload for the Many-Core Era

Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick G.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

DOI OSTI

ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms

Baker, Gavin M.; Bettencourt, Matthew T.; Bova, Steven W.; Franko, Ken; Gamell, Marc; Grant, Ryan; Hammond, Simon; Hollman, David S.; Knight, Samuel; Kolla, Hemanth; Lin, Paul T.; Olivier, Stephen L.; Sjaardema, Gregory D.; Slattengren, Nicole L.; Teranishi, Keita; Wilke, Jeremiah; Bennett, Janine C.; Clay, Robert L.; Kale, Laxkimant; Jain, Nikhil; Mikida, Eric; Aiken, Alex; Bauer, Michael; Lee, Wonchan; Slaughter, Elliott; Treichler, Sean; Berzins, Martin; Harman, Todd; Humphreys, Alan; Schmidt, John; Sunderland, Dan; Mccormick, Pat; Gutierrez, Samuel; Shulz, Martin; Gamblin, Todd; Bremer, Peer-Timo

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms

Baker, Gavin M.; Bettencourt, Matthew T.; Bova, Steven W.; Franko, Ken; Gamell, Marc; Grant, Ryan; Hammond, Simon; Hollman, David S.; Knight, Samuel; Kolla, Hemanth; Lin, Paul T.; Olivier, Stephen L.; Sjaardema, Gregory D.; Slattengren, Nicole L.; Teranishi, Keita; Wilke, Jeremiah; Bennett, Janine C.; Clay, Robert L.; Kale, Laxkimant; Jain, Nikhil; Mikida, Eric; Aiken, Alex; Bauer, Michael; Lee, Wonchan; Slaughter, Elliott; Treichler, Sean; Berzins, Martin; Harman, Todd; Humphreys, Alan; Schmidt, John; Sunderland, Dan; Mccormick, Pat; Gutierrez, Samuel; Shulz, Martin; Gamblin, Todd; Bremer, Peer T.

This report provides in-depth information and analysis to help create a technical road map for developing next-generation programming models and runtime systems that support Advanced Simulation and Computing (ASC) workload requirements. The focus herein is on asynchronous many-task (AMT) model and runtime systems, which are of great interest in the context of "exascale" computing, as they hold the promise to address key issues associated with future extreme-scale computer architectures. This report includes a thorough qualitative and quantitative examination of three best-of-class AMT runtime systems: Charm++, Legion, and Uintah, all of which are in use as part of the ASC Predictive Science Academic Alliance Program II (PSAAP-II) Centers. The studies focus on each of the runtimes' programmability, performance, and mutability. Through the experiments and analysis presented, several overarching findings emerge. From a performance perspective, AMT runtimes show tremendous potential for addressing extreme-scale challenges. Empirical studies show an AMT runtime can mitigate performance heterogeneity inherent to the machine itself and that Message Passing Interface (MPI) and AMT runtimes perform comparably under balanced conditions. From a programmability and mutability perspective, however, none of the runtimes in this study are currently ready for use in developing production-ready Sandia ASC applications. The report concludes by recommending a co-design path forward, wherein application, programming model, and runtime system developers work together to define requirements and solutions. Such a requirements-driven co-design approach benefits the community as a whole, with widespread community engagement mitigating risk for both application developers and high-performance computing runtime system developers.

More Details

TYPE SAND Report YEAR 2015

DOI OSTI

Overtime: A Tool for Analyzing Performance Variation due to Network Interference

Grant, Ryan; Foulk, James W.; Gentile, Ann C.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

DOI OSTI

High Performance Computing - Power Application Programming Interface Specification

Foulk, James W.; Kelly, Suzanne M.; Foulk, James W.; Grant, Ryan; Olivier, Stephen L.; Levenhagen, Michael; Debonis, David

Achieving practical exascale supercomputing will require massive increases in energy efficiency. The bulk of this improvement will likely be derived from hardware advances such as improved semiconductor device technologies and tighter integration, hopefully resulting in more energy efficient computer architectures. Still, software will have an important role to play. With every generation of new hardware, more power measurement and control capabilities are exposed. Many of these features require software involvement to maximize feature benefits. This trend will allow algorithm designers to add power and energy efficiency to their optimization criteria. Similarly, at the system level, opportunities now exist for energy-aware scheduling to meet external utility constraints such as time of day cost charging and power ramp rate limitations. Finally, future architectures might not be able to operate all components at full capability for a range of reasons including temperature considerations or power delivery limitations. Software will need to make appropriate choices about how to allocate the available power budget given many, sometimes conflicting considerations.

More Details

TYPE SAND Report YEAR 2015

DOI OSTI

Asynchronous Many-Task Programming Models for Next Generation Platforms

Wilke, Jeremiah; Bettencourt, Matthew T.; Bova, Steven W.; Franko, Ken; Gamell, Marc; Grant, Ryan; Hammond, Simon; Hollman, David S.; Knight, Samuel; Kolla, Hemanth; Lin, Paul T.; Olivier, Stephen L.; Sjaardema, Gregory D.; Slattengren, Nicole L.; Teranishi, Keita; Bennett, Janine C.; Clay, Robert L.

Abstract not provided.

More Details

TYPE Presentation YEAR 2015

OSTI

Re-evaluating Network Onload vs. Offload for the Many-Core Era

Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

DOI OSTI

Optimizing Explicit Hydrodynamics for Power, Energy, and Performance

Leon, Edgar A.; Karlin, Ian; Grant, Ryan

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Re-evaluating Network Onload vs. Offload for the Many-Core Era

Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick G.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

A Power Application Programming Interface (API) Specification for High Performance Computers (HPC)

Foulk, James W.; Foulk, James W.; Grant, Ryan; Levenhagen, Michael; Debonis, David; Olivier, Stephen L.; Kelly, Suzanne M.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2015

OSTI

Toward an evolutionary task parallel integrated MPI + X Programming Model

Proceedings of the 6th International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM 2015

Barrett, Richard F.; Stark, Dylan T.; Vaughan, Courtenay T.; Grant, Ryan; Olivier, Stephen L.; Foulk, James W.

The Bulk Synchronous Parallel programming model is showing performance limitations at high processor counts. We propose over-decomposition of the domain, operated on as tasks, to smooth out utilization of the computing resource, in particular the node interconnect and processing cores, and hide intra- and inter-node data movement. Our approach maintains the existing coding style commonly employed in computational science and engineering applications. Although we show improved performance on existing computers, up to 131,072 processor cores, the effectiveness of this approach on expected future architectures will require the continued evolution of capabilities throughout the codesign stack. Success then will not only result in decreased time to solution, but would also make better use of the hardware capabilities and reduce power and energy requirements, while fundamentally maintaining the current code configuration strategy.

More Details

TYPE Conference Poster YEAR 2015

DOI OSTI Scopus

Re-evaluating Network Onload vs. Offload for the Many-Core Era

Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick G.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2014

OSTI

Toward an Evolutionary Task Parallel Integrated MPI + X Programming Model

Barrett, Richard F.; Stark, Dylan T.; Vaughan, Courtenay T.; Grant, Ryan; Olivier, Stephen L.; Foulk, James W.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2014

DOI OSTI

Metrics for evaluating energy saving techniques for resilient HPC systems

Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014

Grant, Ryan; Olivier, Stephen L.; Laros, James H.; Brightwell, Ronald B.; Porterfield, Allan K.

The metrics used for evaluating energy saving techniques for future HPC systems are critical to the correct assessment of proposed methods. Current predictions forecast that overcoming reduced system reliability, increased power requirements and energy consumption will be a major design challenge for future systems. Modern runtime energy-saving research efforts do not take into account the energy spent providing reliability. They also do not account for the increase in the probability of failure during application execution due to runtime overhead from energy saving methods. While this is very reasonable for current systems, it is insufficient for future generation systems. By taking into account the energy consumption ramifications of increased runtimes on system reliability, better energy saving techniques can be developed. This paper demonstrates how to determine the impact of runtime energy conservation methods within the context of failure-prone large scale systems. In addition, a survey of several energy savings methodologies is conducted and an analysis is performed with respect to their effectiveness in an environment in which failures occur.

More Details

TYPE Conference YEAR 2014

Scopus OSTI DOI

An evaluation of MPI message rate on hybrid-core processors

International Journal of High Performance Computing Applications

Brightwell, Ronald B.; Barrett, Brian W.; Grant, Ryan; Hammond, Simon; Hemmert, Karl S.

Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that may combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way in which compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insight into how MPI implementations for future hybrid-core processors should be designed.

More Details

TYPE Journal Article YEAR 2014

DOI OSTI Scopus

Power API for HPC: Standardizing Power Management and Control

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2014

OSTI

Enabling communication concurrency through flexible MPI endpoints

International Journal of High Performance Computing Applications

MPI defines a one-to-one relationship between MPI processes and ranks. This model captures many use cases effectively; however, it also limits communication concurrency and interoperability between MPI and programming models that utilize threads. Our paper describes the MPI endpoints extension, which relaxes the longstanding one-to-one relationship between MPI processes and ranks. Using endpoints, an MPI implementation can map separate communication contexts to threads, allowing them to drive communication independently. Also, endpoints enable threads to be addressable in MPI operations, enhancing interoperability between MPI and other programming models. Furthermore, these characteristics are illustrated through several examples and an empirical study that contrasts current multithreaded communication performance with the need for high degrees of communication concurrency to achieve peak communication performance.

More Details

TYPE Journal Article YEAR 2014

OSTI DOI

Re-evaluating Network Onload vs. Offload for the Many-Core Era

Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2014

DOI OSTI

Re-evaluating Network Onload vs. Offload for the Many-Core Era

Dosanjh, Matthew G.; Grant, Ryan; Bridges, Patrick; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2014

DOI OSTI

A Power API for the HPC Community

Debonis, David; Grant, Ryan; Olivier, Stephen L.; Levenhagen, Michael; Kelly, Suzanne M.; Foulk, James W.; Foulk, James W.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

High Performance Computing - Power Application Programming Interface Specification (V.1.0)

Foulk, James W.; Kelly, Suzanne M.; Foulk, James W.; Grant, Ryan; Olivier, Stephen L.; Levenhagen, Michael; Debonis, David

Measuring and controlling the power and energy consumption of high performance computing systems by various components in the software stack is an active research area [13, 3, 5, 10, 4, 21, 19, 16, 7, 17, 20, 18, 11, 1, 6, 14, 12]. Implementations in lower level software layers are beginning to emerge in some production systems, which is very welcome. To be most effective, a portable interface to measurement and control features would significantly facilitate participation by all levels of the software stack. We present a proposal for a standard power Application Programming Interface (API) that endeavors to cover the entire software space, from generic hardware interfaces to the input from the computer facility manager.

More Details

TYPE SAND Report YEAR 2014

DOI OSTI

PowerAPI: A Comprehensive Interface for Power/Energy Measurement and Control for Extreme Scale Computing

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

Overtime: A Benchmark for Analyzing Performance Variation due to Network Interference

Grant, Ryan; Pedretti, Kevin; Gentile, Ann C.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

DOI OSTI

Metrics for Evaluating Energy Saving Techniques for Resilient HPC Systems

Grant, Ryan; Olivier, Stephen L.; Laros, James H.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

Addressing Power/Energy Challenges for Extreme Scale HPC

Laros, James H.; Kelly, Suzanne M.; Pedretti, Kevin P.; Grant, Ryan; Levenhagen, Michael; Olivier, Stephen L.

Abstract not provided.

More Details

TYPE Presentation YEAR 2014

OSTI

Optimizing Explicit Hydrodynamics for Power, Energy, and Performance

Abstract not provided.

More Details

TYPE Conference YEAR 2014

OSTI

Early Experiences Co-Scheduling Work and Communication Tasks for Hybrid MPI+X Applications

Proceedings of ExaMPI 2014: Exascale MPI 2014 - held in conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis

Stark, Dylan T.; Barrett, Richard F.; Grant, Ryan; Olivier, Stephen L.; Foulk, James W.; Vaughan, Courtenay T.

Advances in node-level architecture and interconnect technology needed to reach extreme scale necessitate a reevaluation of long-standing models of computation, in particular bulk synchronous processing. The end of Dennard scaling and the subsequent increases in CPU core counts with each successive generation of general-purpose processors have made the ability to leverage parallelism for communication an increasingly critical aspect of future extreme-scale application performance. But the use of massive multithreading in combination with MPI is an open research area, with many proposed approaches requiring code changes that can be infeasible for important large legacy applications already written in MPI. This paper covers the design and initial evaluation of an extension of a massive multithreading runtime system supporting dynamic parallelism to interface with MPI to handle fine-grain parallel communication and communication-computation overlap. Our initial evaluation of the approach uses the ubiquitous stencil computation, in three dimensions, with the halo exchange as the driving example that has a demonstrated tie to real code bases. The preliminary results suggest that even for a very well-studied and balanced workload and message exchange pattern, co-scheduling work and communication tasks is effective at significant levels of decomposition using up to 131,072 cores. Furthermore, we demonstrate useful communication-computation overlap when handling blocking send and receive calls, and show evidence suggesting that we can decrease the burstiness of network traffic, with a corresponding decrease in the rate of stalls (congestion) seen on the host link and network.

More Details

TYPE Conference Poster YEAR 2014

DOI OSTI Scopus

Evaluating Energy Savings for Checkpoint/Restart

Grant, Ryan; Ferreira, Kurt

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Scalable Network Communication using Unreliable RDMA

Abstract not provided.

More Details

TYPE Presentation YEAR 2013

OSTI DOI

Energy Consumption of Resilience Mechanisms in Large Scale Systems

Mills, Bryan M.; Ferreira, Kurt; Grant, Ryan

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Metrics for Evaluating Energy Saving Techniques for Resilient HPC Systems

Grant, Ryan; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

High Performance Computing-The Challenge at Exascale

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Protocols for Fully Offloaded Collective Operations on Accelerated Network Adapters

Grant, Ryan; Barrett, Brian; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI

Protocols for Fully Offloaded Collective Operations on Accelerated Network Adapters

Grant, Ryan; Barrett, Brian; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference YEAR 2013

OSTI
