Publications

Results 1–50 of 130

Search results

Jump to search filters

ALAMO: Autonomous lightweight allocation, management, and optimization

Communications in Computer and Information Science

Brightwell, Ronald B.; Ferreira, Kurt; Grant, Ryan; Levy, Scott L.N.; Lofstead, Gerald F.; Olivier, Stephen L.; Foulk, James W.; Younge, Andrew J.; Gentile, Ann C.; Foulk, James W.

Several recent workshops conducted by the DOE Advanced Scientific Computing Research program have established the fact that the complexity of developing applications and executing them on high-performance computing (HPC) systems is rising at a rate which will make it nearly impossible to continue to achieve higher levels of performance and scalability. Absent an alternative approach to managing this ever-growing complexity, HPC systems will become increasingly difficult to use. A more holistic approach to designing and developing applications and managing system resources is required. This paper outlines a research strategy for managing the increasing the complexity by providing the programming environment, software stack, and hardware capabilities needed for autonomous resource management of HPC systems. Developing portable applications for a variety of HPC systems of varying scale requires a paradigm shift from the current approach, where applications are painstakingly mapped to individual machine resources, to an approach where machine resources are automatically mapped and optimized to applications as they execute. Achieving such automated resource management for HPC systems is a daunting challenge that requires significant sustained investment in exploring new approaches and novel capabilities in software and hardware that span the spectrum from programming systems to device-level mechanisms. This paper provides an overview of the functionality needed to enable autonomous resource management and optimization and describes the components currently being explored at Sandia National Laboratories to help support this capability.

More Details

Low-cost MPI Multithreaded Message Matching Benchmarking

Proceedings - 2020 IEEE 22nd International Conference on High Performance Computing and Communications, IEEE 18th International Conference on Smart City and IEEE 6th International Conference on Data Science and Systems, HPCC-SmartCity-DSS 2020

Schonbein, William W.; Levy, Scott L.N.; Marts, William P.; Dosanjh, Matthew G.; Grant, Ryan

The Message Passing Interface (MPI) standard allows user-level threads to concurrently call into an MPI library. While this feature is currently rarely used, there is considerable interest from developers in adopting it in the near future. There is reason to believe that multithreaded communication may incur additional message processing overheads in terms of number of items searched during demultiplexing and amount of time spent searching because it has the potential to increase the number of messages exchanged and to introduce non-deterministic message ordering. Therefore, understanding the implications of adding multithreading to MPI applications is important for future application development.One strategy for advancing this understanding is through 'low-cost' benchmarks that emulate full communication patterns using fewer resources. For example, while a complete, 'real-world' multithreaded halo exchange requires 9 or 27 nodes, the low-cost alternative needs only two, making it deployable on systems where acquiring resources is difficult because of high utilization (e.g., busy capacity-computing systems), or impossible because the necessary resources do not exist (e.g., testbeds with too few nodes). While such benchmarks have been proposed, the reported results have been limited to a single architecture or derived indirectly through simulation, and no attempt has been made to confirm that a low-cost benchmark accurately captures features of full (non-emulated) exchanges. Moreover, benchmark code has not been made publicly available.The purpose of the study presented in this paper is to quantify how accurately the low-cost benchmark captures the matching behavior of the full, real-world benchmark. In the process, we also advocate for the feasibility and utility of the low-cost benchmark. We present a 'real-world' benchmark implementing a full multithreaded halo exchange on 9 and 27 nodes, as defined by 5-point and 9-point 2D stencils, and 7-point and 27-point 3D stencils. Likewise, we present a 'low-cost' benchmark that emulates these communication patterns using only two nodes. We then confirm, across multiple architectures, that the low-cost benchmark gives accurate estimates of both number of items searched during message processing, and time spent processing those messages. Finally, we demonstrate the utility of the low-cost benchmark by using it to profile the performance impact of state-of-The-Art Mellanox ConnectX-5 hardware support for offloaded MPI message demultiplexing. To facilitate further research on the effects of multithreaded MPI on message matching behavior, the source of our two benchmarks is to be included in the next release version of the Sandia MPI Micro-Benchmark Suite.

More Details

Radd runtimes: Radical and different distributed runtimes with smartnics

Proceedings of IPDRM 2020: 4th Annual Workshop on Emerging Parallel and Distributed Runtime Systems and Middleware, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Grant, Ryan; Schonbein, Whit; Levy, Scott L.N.

As network speeds increase, the overhead of processing incoming messages is becoming onerous enough that many manufacturers now provide network interface cards (NICs) with offload capabilities to handle these overheads. This increase in NIC capabilities creates an opportunity to enable computation on data in-situ on the NIC. These enhanced NICs can be classified into several different categories of SmartNICs. SmartNICs present an interesting opportunity for future runtime software designs. Designing runtime software to be located in the network as opposed to the host level leads to new radical distributed runtime possibilities that were not practical prior to SmartNICs. In the process of transitioning to a radically different runtime software design for SmartNICs there are intermediary steps of migrating current runtime software to be offloaded onto a SmartNIC that also present interesting possibilities. This paper will describe SmartNIC design and how SmartNICs can be leveraged to offload current generation runtime software and lead to future radically different in-network distributed runtime systems.

More Details

Tail queues: A multi-threaded matching architecture

Concurrency and Computation: Practice and Experience

Dosanjh, Matthew G.; Grant, Ryan; Schonbein, William W.; Bridges, Patrick G.

As we approach exascale, computational parallelism will have to drastically increase in order to meet throughput targets. Many-core architectures have exacerbated this problem by trading reduced clock speeds, core complexity, and computation throughput for increasing parallelism. This presents two major challenges for communication libraries such as MPI: the library must leverage the performance advantages of thread level parallelism and avoid the scalability problems associated with increasing the number of processes to that scale. Hybrid programming models, such as MPI+X, have been proposed to address these challenges. MPI THREAD MULTIPLE is MPI's thread safe mode. While there has been work to optimize it, it largely remains non-performant in most implementations. While current applications avoid MPI multithreading due to performance concerns, it is expected to be utilized in future applications. One of the major synchronous data structures required by MPI is the matching engine. In this paper, we present a parallel matching algorithm that can improve MPI matching for multithreaded applications. We then perform a feasibility study to demonstrate the performance benefit of the technique.

More Details

Enabling HPC Workloads on Cloud Infrastructure Using Kubernetes Container Orchestration Mechanisms

Proceedings of CANOPIE-HPC 2019: 1st International Workshop on Containers and New Orchestration Paradigms for Isolated Environments in HPC - Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Beltre, Angel M.; Saha, Pankaj; Govindaraju, Madhusudhan; Grant, Ryan; Younge, Andrew J.

Containers offer a broad array of benefits, including a consistent lightweight runtime environment through OS-level virtualization, as well as low overhead to maintain and scale applications with high efficiency. Moreover, containers are known to package and deploy applications consistently across varying infrastructures. Container orchestrators manage a large number of containers for microservices based cloud applications. However, the use of such service orchestration frameworks towards HPC workloads remains relatively unexplored. In this paper we study the potential use of Kubernetes on HPC infrastructure for use by the scientific community. We directly compare both its features and performance against Docker Swarm and bare metal execution of HPC applications. Herein, we detail the configurations required for Kubernetes to operate with containerized MPI applications, specifically accounting for operations such as (1) underlying device access, (2) inter-container communication across different hosts, and (3) configuration limitations. This evaluation quantifies the performance difference between representative MPI workloads running both on bare metal and containerized orchestration frameworks with Kubernetes, operating over both Ethernet and InfiniBand interconnects. Our results show that Kubernetes and Docker Swarm can achieve near bare metal performance over RDMA communication when high performance transports are enabled. Our results also show that Kubernetes presents overheads for several HPC applications over TCP/IP protocol. However, Docker Swarm's throughput is near bare metal performance for the same applications.

More Details

A dynamic, unified design for dedicated message matching engines for collective and point-to-point communications

Parallel Computing

Ghazimirsaeed, S.M.; Grant, Ryan; Afsahi, Ahmad

The Message Passing Interface (MPI) libraries use message queues to guarantee correct message ordering between communicating processes. Message queues are in the critical path of MPI communications and thus, the performance of message queue operations can have significant impact on the performance of applications. Collective communications are widely used in MPI applications and they can have considerable impact on generating long message queues. In this paper, we propose a unified message matching mechanism that improves the message queue search time by distinguishing messages coming from point-to-point and collective communications and using a distinct message queue data structure for them. For collective operations, it dynamically profiles the impact of each collective call on message queues during the application runtime and uses this information to adapt the message queue data structure for each collective dynamically. Moreover, we use a partner/non-partner message queue data structure for the messages coming from point-to-point communications. The proposed approach can successfully reduce the queue search time while maintaining scalable memory consumption. The evaluation results show that we can obtain up to 5.5x runtime speedup for applications with long list traversals. Moreover, we can gain up to 15% and 94% queue search time improvement for all elements in applications with short and medium list traversals, respectively.

More Details

Fuzzy matching: Hardware accelerated MPI communication middleware

Proceedings - 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2019

Dosanjh, Matthew G.; Schonbein, William W.; Grant, Ryan; Bridges, Patrick G.; Gazimirsaeed, S.M.; Afsahi, Ahmad

Contemporary parallel scientific codes often rely on message passing for inter-process communication. However, inefficient coding practices or multithreading (e.g., via MPI-THREAD-MULTIPLE) can severely stress the underlying message processing infrastructure, resulting in potentially un-acceptable impacts on application performance. In this article, we propose and evaluate a novel method for addressing this issue: 'Fuzzy Matching'. This approach has two components. First, it exploits the fact most server-class CPUs include vector operations to parallelize message matching. Second, based on a survey of point-to-point communication patterns in representative scientific applications, the method further increases parallelization by allowing matches based on 'partial truth', i.e., by identifying probable rather than exact matches. We evaluate the impact of this approach on memory usage and performance on Knight's Landing and Skylake processors. At scale (262,144 Intel Xeon Phi cores), the method shows up to 1.13 GiB of memory savings per node in the MPI library, and improvement in matching time of 95.9%; smaller-scale runs show run-time improvements of up to 31.0% for full applications, and up to 6.1% for optimized proxy applications.

More Details

Using simulation to examine the effect of MPI message matching costs on application performance

Parallel Computing

Levy, Scott L.N.; Ferreira, Kurt; Schonbein, Whit; Grant, Ryan; Dosanjh, Matthew G.

Attaining high performance with MPI applications requires efficient message matching to minimize message processing overheads and the latency these overheads introduce into application communication. In this paper, we use a validated simulation-based approach to examine the relationship between MPI message matching performance and application time-to-solution. Specifically, we examine how the performance of several important HPC workloads is affected by the time required for matching. Our analysis yields several important contributions: (i) the performance of current workloads is unlikely to be significantly affected by MPI matching unless match queue operations get much slower or match queues get much longer; (ii) match queue designs that provide sublinear performance as a function of queue length are unlikely to yield much benefit unless match queue lengths increase dramatically; and (iii) we provide guidance on how long the mean time per match attempt may be without significantly affecting application performance. The results and analysis in this paper provide valuable guidance on the design and development of MPI message match queues.

More Details

Hardware MPI message matching: Insights into MPI matching behavior to inform design: Hardware MPI message matching

Concurrency and Computation. Practice and Experience

Ferreira, Kurt; Grant, Ryan; Levenhagen, Michael; Levy, Scott L.N.; Groves, Taylor

Here, this paper explores key differences of MPI match lists for several important United States Department of Energy (DOE) applications and proxy applications. This understanding is critical in determining the most promising hardware matching design for any given high-speed network. The results of MPI match list studies for the major open-source MPI implementations, MPICH and Open MPI, are presented, and we modify an MPI simulator, LogGOPSim, to provide match list statistics. These results are discussed in the context of several different potential design approaches to MPI matching–capable hardware. The data illustrate the requirements for different hardware designs in terms of performance and memory capacity. Finally, this paper's contributions are the collection and analysis of data to help inform hardware designers of common MPI requirements and highlight the difficulties in determining these requirements by only examining a single MPI implementation.

More Details

The upcoming storm: The implications of increasing core count on scalable system software

Advances in Parallel Computing

Dosanjh, Matthew G.; Grant, Ryan; Hjelm, Nathan; Levy, Scott L.N.; Schonbein, William W.

As clock speeds have stagnated, the number of cores in a node has been drastically increased to improve processor throughput. Most scalable system software was designed and developed for single-threaded environments. Multithreaded environments become increasingly prominent as application developers optimize their codes to leverage the full performance of the processor; however, these environments are incompatible with a number of assumptions that have driven scalable system software development. This paper will highlight a case study of this mismatch focusing on MPI message matching. MPI message matching has been designed and optimized for traditional serial execution. The reduced determinism in the order of MPI calls can significantly reduce the performance of MPI message matching, potentially overtaking time-per-iteration targets of many applications. Different proposed techniques attempt to address these issues and enable multithreaded MPI usage. These approaches highlight a number of tradeoffs that make adapting MPI message matching complex. This case study and its proposed solutions highlight a number of general concepts that need to be leveraged in the design of next generation scaleable system software.

More Details

Finepoints: Partitioned multithreaded MPI communication

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Grant, Ryan; Dosanjh, Matthew G.; Levenhagen, Michael; Brightwell, Ronald B.; Skjellum, Anthony

The MPI multithreading model has been historically difficult to optimize; the interface that it provides for threads was designed as a process-level interface. This model has led to implementations that treat function calls as critical regions and protect them with locks to avoid race conditions. We hypothesize that an interface designed specifically for threads can provide superior performance than current approaches and even outperform single-threaded MPI. In this paper, we describe a design for partitioned communication in MPI that we call finepoints. First, we assess the existing communication models for MPI two-sided communication and then introduce finepoints as a hybrid of MPI models that has the best features of each existing MPI communication model. In addition, “partitioned communication” created with finepoints leverages new network hardware features that cannot be exploited with current MPI point-to-point semantics, making this new approach both innovative and useful both now and in the future. To demonstrate the validity of our hypothesis, we implement a finepoints library and show improvements against a state-of-the-art multithreaded optimized Open MPI implementation on a Cray XC40 with an Aries network. Our experiments demonstrate up to a 12 × reduction in wait time for completion of send operations. This new model is shown working on a nuclear reactor physics neutron-transport proxy-application, providing up to 26.1% improvement in communication time and up to 4.8% improvement in runtime over the best performing MPI communication mode, single-threaded MPI.

More Details

The Portals 4.2 Network Programming Interface

Barrett, Brian W.; Brightwell, Ronald B.; Grant, Ryan; Hemmert, Karl S.; Foulk, James W.; Wheeler, Kyle; Riesen, Rolf; Hoefler, Torsten; Maccabe, Arthur B.; Hudson, Trammell

This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

More Details

A survey of MPI usage in the US exascale computing project

Concurrency and Computation. Practice and Experience

Bernholdt, David E.; Boehm, Swen; Bosilca, George; Venkata, Manjunath G.; Grant, Ryan; Naughton, Thomas; Pritchard, Howard P.; Schulz, Martin; Vallee, Geoffroy R.

The Exascale Computing Project (ECP) is currently the primary effort in the United States focused on developing “exascale” levels of computing capabilities, including hardware, software, and applications. In order to obtain a more thorough understanding of how the software projects under the ECP are using, and planning to use the Message Passing Interface (MPI), and help guide the work of our own project within the ECP, we created a survey. Of the 97 ECP projects active at the time the survey was distributed, we received 77 responses, 56 of which reported that their projects were using MPI. Furthermore, this paper reports the results of that survey for the benefit of the broader community of MPI developers.

More Details

Vanguard Astra and ATSE – an ARM-based Advanced Architecture Prototype System and Software Environment (FY18 L2 Milestone #8759 Report)

Foulk, James W.; Foulk, James W.; Hammond, Simon; Aguilar, Michael J.; Curry, Matthew L.; Grant, Ryan; Hoekstra, Robert J.; Klundt, Ruth A.; Monk, Stephen T.; Ogden, Jeffry B.; Olivier, Stephen L.; Scott, Randall D.; Ward, Harry L.; Younge, Andrew J.

The Vanguard program informally began in January 2017 with the submission of a white paper entitled "Sandia's Vision for a 2019 Arm Testbed" to NNSA headquarters. The program proceeded in earnest in May 2017 with an announcement by Doug Wade (Director, Office of Advanced Simulation and Computing and Institutional R&D at NNSA) that Sandia National Laboratories (Sandia) would host the first Advanced Architecture Prototype platform based on the Arm architecture. In August 2017, Sandia formed a Tri-lab team chartered to develop a robust HPC software stack for Astra to support the Vanguard program goal of demonstrating the viability of Arm in supporting ASC production computing workloads.

More Details

Characterizing MPI matching via trace-based simulation

Parallel Computing

Ferreira, Kurt; Levy, Scott L.N.; Foulk, James W.; Grant, Ryan

With the increased scale expected on future leadership-class systems, detailed information about the resource usage and performance of MPI message matching provides important insights into how to maintain application performance on next-generation systems. However, obtaining MPI message matching performance data is often not possible without significant effort. A common approach is to instrument an MPI implementation to collect relevant statistics. While this approach can provide important data, collecting matching data at runtime perturbs the application's execution, including its matching performance, and is highly dependent on the MPI library's matchlist implementation. In this paper, we introduce a trace-based simulation approach to obtain detailed MPI message matching performance data for MPI applications without perturbing their execution. Using a number of key parallel workloads and microbenchmarks, we demonstrate that this simulator approach can rapidly and accurately characterize matching behavior. Specifically, we use our simulator to collect several important statistics about the operation of the MPI posted and unexpected queues. For example, we present data about search lengths and the duration that messages spend in the queues waiting to be matched. Data gathered using this simulation-based approach have significant potential to aid hardware designers in determining resource allocation for MPI matching functions and provide application and middleware developers with insight into the scalability issues associated with MPI message matching.

More Details

FY18 L2 Milestone #6360 Report: Initial Capability of an Arm-based Advanced Architecture Prototype System and Software Environment

Foulk, James W.; Foulk, James W.; Hammond, Simon; Aguilar, Michael J.; Curry, Matthew L.; Grant, Ryan; Hoekstra, Robert J.; Klundt, Ruth A.; Monk, Stephen T.; Ogden, Jeffry B.; Olivier, Stephen L.; Scott, Randall D.; Ward, Harry L.; Younge, Andrew J.

The Vanguard program informally began in January 2017 with the submission of a white paper entitled "Sandia's Vision for a 2019 Arm Testbed" to NNSA headquarters. The program proceeded in earnest in May 2017 with an announcement by Doug Wade (Director, Office of Advanced Simulation and Computing and Institutional R&D at NNSA) that Sandia National Laboratories (Sandia) would host the first Advanced Architecture Prototype platform based on the Arm architecture. In August 2017, Sandia formed a Tri-lab team chartered to develop a robust HPC software stack for Astra to support the Vanguard program goal of demonstrating the viability of Arm in supporting ASC production computing workloads. This document describes the high-level Vanguard program goals, the Vanguard-Astra project acquisition plan and procurement up to contract placement, the initial software stack environment planned for the Vanguard-Astra platform (Astra), a description of how the communities of users will utilize the platform during the transition from the open network to the classified network, and initial performance results.

More Details

The case for semi-permanent cache occupancy

ACM International Conference Proceeding Series

Dosanjh, Matthew G.; Ghazimirsaeed, S.M.; Grant, Ryan; Schonbein, William W.; Levenhagen, Michael; Bridges, Patrick G.; Afsahi, Ahmad

The performance critical path for MPI implementations relies on fast receive side operation, which in turn requires fast list traversal. The performance of list traversal is dependent on data-locality; whether the data is currently contained in a close-to-core cache due to its temporal locality or if its spacial locality allows for predictable pre-fetching. In this paper, we explore the effects of data locality on the MPI matching problem by examining both forms of locality. First, we explore spacial locality, by combining multiple entries into a single linked list element, we can control and modify this form of locality. Secondly, we explore temporal locality by utilizing a new technique called “hot caching”, a process that creates a thread to periodically access certain data, increasing its temporal locality. In this paper, we show that by increasing data locality, we can improve MPI performance on a variety of architectures up to 4x for micro-benchmarks and up to 2x for an application.

More Details

A comparison of power management mechanisms: P-States vs. node-level power cap control

Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018

Foulk, James W.; Grant, Ryan; Foulk, James W.; Levenhagen, Michael; Olivier, Stephen L.; Ward, Harry L.; Younge, Andrew J.

Large-scale HPC systems increasingly incorporate sophisticated power management control mechanisms. While these mechanisms are potentially useful for performing energy and/or power-aware job scheduling and resource management (EPA JSRM), greater understanding of their operation and performance impact on real-world applications is required before they can be applied effectively in practice. In this paper, we compare static p-state control to static node-level power cap control on a Cray XC system. Empirical experiments are performed to evaluate node-to-node performance and power usage variability for the two mechanisms. We find that static p-state control produces more predictable and higher performance characteristics than static node-level power cap control at a given power level. However, this performance benefit is at the cost of less predictable power usage. Static node-level power cap control produces predictable power usage but with more variable performance characteristics. Our results are not intended to show that one mechanism is better than the other. Rather, our results demonstrate that the mechanisms are complementary to one another and highlight their potential for combined use in achieving effective EPA JSRM solutions.

More Details

A comparison of power management mechanisms: P-States vs. node-level power cap control

Proceedings 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops Ipdpsw 2018

Foulk, James W.; Grant, Ryan; Foulk, James W.; Levenhagen, Michael; Olivier, Stephen L.; Ward, Harry L.; Younge, Andrew J.

Large-scale HPC systems increasingly incorporate sophisticated power management control mechanisms. While these mechanisms are potentially useful for performing energy and/or power-aware job scheduling and resource management (EPA JSRM), greater understanding of their operation and performance impact on real-world applications is required before they can be applied effectively in practice. In this paper, we compare static p-state control to static node-level power cap control on a Cray XC system. Empirical experiments are performed to evaluate node-to-node performance and power usage variability for the two mechanisms. We find that static p-state control produces more predictable and higher performance characteristics than static node-level power cap control at a given power level. However, this performance benefit is at the cost of less predictable power usage. Static node-level power cap control produces predictable power usage but with more variable performance characteristics. Our results are not intended to show that one mechanism is better than the other. Rather, our results demonstrate that the mechanisms are complementary to one another and highlight their potential for combined use in achieving effective EPA JSRM solutions.

More Details
Results 1–50 of 130
Results 1–50 of 130