Publications

Machine Learning for CUDA+MPI Design Rules

Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022

Pearson, Carl W.; Javeed, Aurya; Devine, Karen

We present a new strategy for automatically exploring the design space of key CUDA+MPI programs and providing design rules that discriminate slow from fast implementations. In such programs, the order of operations (e.g., GPU kernels, MPI communication) and the assignment of operations to resources (e.g., GPU streams) make the space of possible designs enormous. Systems experts have the task of redesigning and reoptimizing these programs to effectively utilize each new platform. This work provides a prototype tool to reduce that burden. In our approach, a directed acyclic graph of CUDA and MPI operations defines the design space for the program. Monte-Carlo tree search discovers regions of the design space that have a large impact on the program's performance. A sequence-to-vector transformation defines features for each explored implementation, and each implementation is assigned a class label according to its relative performance. A decision tree is trained on the features and labels to produce design rules for each class; these rules can be used by systems experts to guide their implementations. We demonstrate our strategy using a key kernel from scientific computing, sparse matrix-vector multiplication, on a platform with multiple MPI ranks and GPU streams.
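
As a rough illustration of the design space this strategy searches, the sketch below samples random valid orderings of a small DAG of CUDA and MPI operations; each sampled ordering corresponds to one candidate implementation that a tree search could evaluate. The operation names and dependency edges here are hypothetical, not taken from the paper.

```cpp
// Minimal sketch (not the paper's tool): sampling valid operation
// orderings of a small CUDA+MPI op DAG via random topological sorts.
#include <cstdio>
#include <random>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Hypothetical operations in a halo-exchange-style SpMV step.
    std::vector<std::string> ops = {"pack_kernel", "MPI_Isend", "local_spmv",
                                    "MPI_Wait", "remote_spmv"};
    // Dependency edges (from -> to): an op may run only after its predecessors.
    std::vector<std::pair<int, int>> edges = {{0, 1}, {1, 3}, {3, 4}};

    std::mt19937 rng(0);
    for (int sample = 0; sample < 3; ++sample) {
        std::vector<int> indeg(ops.size(), 0);
        std::vector<std::vector<int>> succ(ops.size());
        for (auto [u, v] : edges) { succ[u].push_back(v); ++indeg[v]; }

        // Random topological sort: each sample is one candidate ordering,
        // i.e., one point in the design space an MCTS-style search explores.
        std::vector<int> ready, order;
        for (int i = 0; i < (int)ops.size(); ++i)
            if (indeg[i] == 0) ready.push_back(i);
        while (!ready.empty()) {
            std::uniform_int_distribution<size_t> pick(0, ready.size() - 1);
            size_t k = pick(rng);
            int u = ready[k];
            ready.erase(ready.begin() + k);
            order.push_back(u);
            for (int v : succ[u])
                if (--indeg[v] == 0) ready.push_back(v);
        }
        for (int u : order) std::printf("%s ", ops[u].c_str());
        std::printf("\n");
    }
}
```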

TEMPI: An Interposed MPI Library with Canonical Representation of MPI Datatypes [Slides]

Pearson, Carl W.; Wu, Kun; Chung, I-Hsin; Xiong, Jinjun; Hwu, Wen-Mei

This presentation covers: Distributed GPU stencils and non-contiguous data; Equivalence of strided datatypes and minimal representation; GPU communication methods; Deploying on managed systems; Large messages and MPI datatypes; Translation and canonicalization; Automatic model-driven transfer method selection; and Interposed library implementation.
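
To make the "automatic model-driven transfer method selection" point concrete, here is a toy sketch: a layout-derived heuristic chooses between a GPU packing kernel and a strided staged copy. The features, threshold values, and method names are invented for illustration and are not TEMPI's actual model.

```cpp
// Toy sketch (hypothetical model, not TEMPI's): picking a transfer
// method for strided device data from simple layout-derived features.
#include <cstddef>
#include <cstdio>

enum class Method { KernelPack, StagedMemcpy2D };

// Assume a layout of `count` blocks of `block_bytes` each.
// A real model would be fit to measurements; this threshold is made up.
Method choose_method(std::size_t count, std::size_t block_bytes) {
    // Many tiny blocks favor a GPU packing kernel; few large blocks
    // favor a strided copy such as cudaMemcpy2D.
    if (block_bytes < 512 && count > 64) return Method::KernelPack;
    return Method::StagedMemcpy2D;
}

int main() {
    std::printf("%d\n", (int)choose_method(1024, 64));   // 0: KernelPack
    std::printf("%d\n", (int)choose_method(8, 1 << 20)); // 1: StagedMemcpy2D
}
```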

TEMPI: An Interposed MPI Library with Canonical Representation of MPI Datatypes [Poster]

Pearson, Carl W.; Wu, Kun; Chung, I-Hsin; Xiong, Jinjun; Hwu, Wen-Mei

TEMPI provides a transparent non-contiguous data-handling layer compatible with various MPI implementations. MPI datatypes are a powerful abstraction that allows an MPI implementation to operate on non-contiguous data. CUDA-aware MPI implementations must also manage transfers of such data between the host system and the GPU. The non-unique and recursive nature of MPI datatypes means that providing fast GPU handling is a challenge: the same non-contiguous pattern may be described in a variety of ways, all of which should be treated equivalently by an implementation. This work introduces a novel technique to do this for strided datatypes. The best method for transferring non-contiguous data between the CPU and GPU depends on the properties of the data layout; this work shows that a simple performance model can accurately select the fastest method. Because the combination of MPI software and system hardware available on a platform may not provide sufficient performance, the contributions of this work are deployed on OLCF Summit through an interposer library that does not require privileged access to the system to use.
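
The non-uniqueness described above can be seen directly in the MPI datatype API. The sketch below (with arbitrary example counts and strides) builds the same strided layout two different ways; a canonicalizing layer like the one this work describes would map both descriptions to a single internal strided form.

```cpp
// Minimal sketch: two equivalent MPI datatype descriptions of the same
// strided layout, illustrating why canonicalization is needed.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Layout: 4 blocks of 2 doubles, each block starting 8 doubles apart.
    // Description A: a single vector type.
    MPI_Datatype vec;
    MPI_Type_vector(4, 2, 8, MPI_DOUBLE, &vec);
    MPI_Type_commit(&vec);

    // Description B: a contiguous block of 2 doubles, repeated with an
    // explicit byte stride via hvector. Same bytes on the wire as A.
    MPI_Datatype blk, hvec;
    MPI_Type_contiguous(2, MPI_DOUBLE, &blk);
    MPI_Type_create_hvector(4, 1, 8 * sizeof(double), blk, &hvec);
    MPI_Type_commit(&hvec);

    // Both types select identical data, but a naive implementation may
    // handle them differently; canonicalization maps equivalent
    // descriptions to one internal strided representation.
    int size_a, size_b;
    MPI_Type_size(vec, &size_a);
    MPI_Type_size(hvec, &size_b);
    std::printf("sizes: %d %d\n", size_a, size_b); // both 64 bytes

    MPI_Type_free(&vec);
    MPI_Type_free(&blk);
    MPI_Type_free(&hvec);
    MPI_Finalize();
}
```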
