Publications

Results 1–25 of 66

Search results

Jump to search filters

ECP Report: Update on Proxy Applications and Vendor Interactions

Ang, Jim; Sweeney, Christine; Wolf, Michael W.; Ellis, John E.; Ghosh, Sayan; Kagawa, Ai; Huang, Yunzhi; Rajamanickam, Sivasankaran R.; Ramakrishnaiah, Vinay; Schram, Malachi; Yoo, Shinjae

The ExaLearn miniGAN team (Ellis and Rajamanickam) have released miniGAN, a generative adversarial network(GAN) proxy application, through the ECP proxy application suite. miniGAN is the first machine learning proxy application in the suite (note: the ECP CANDLE project did previously release some benchmarks) and models the performance for training generator and discriminator networks. The GAN's generator and discriminator generate plausible 2D/3D maps and identify fake maps, respectively. miniGAN aims to be a proxy application for related applications in cosmology (CosmoFlow, ExaGAN) and wind energy (ExaWind). miniGAN has been developed so that optimized mathematical kernels (e.g., kernels provided by Kokkos Kernels) can be plugged into to the proxy application to explore potential performance improvements. miniGAN has been released as open source software and is available through the ECP proxy application website (https://proxyapps.exascaleproject.ordecp-proxy-appssuite/) and on GitHub (https://github.com/SandiaMLMiniApps/miniGAN). As part of this release, a generator is provided to generate a data set (series of images) that are inputs to the proxy application.

More Details

Scalable triangle counting on distributed-memory systems

2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Acer, Seher A.; Yasar, Abdurrahman; Rajamanickam, Sivasankaran R.; Wolf, Michael W.; Catalyurek, Umit V.

Triangle counting is a foundational graph-analysis kernel in network science. It has also been one of the challenge problems for the 'Static Graph Challenge'. In this work, we propose a novel, hybrid, parallel triangle counting algorithm based on its linear algebra formulation. Our framework uses MPI and Cilk to exploit the benefits of distributed-memory and shared-memory parallelism, respectively. The problem is partitioned among MPI processes using a two-dimensional (2D) Cartesian block partitioning. One-dimensional (1D) rowwise partitioning is used within the Cartesian blocks for shared-memory parallelism using the Cilk programming model. Besides exhibiting very good strong scaling behavior in almost all tested graphs, our algorithm achieves the fastest time on the 1.4B edge real-world twitter graph, which is 3.217 seconds, on 1,092 cores. In comparison to past distributed-memory parallel winners of the graph challenge, we demonstrate a speed up of 2.7× on this twitter graph. This is also the fastest time reported for parallel triangle counting on the twitter graph when the graph is not replicated.

More Details

Linear algebra-based triangle counting via fine-grained tasking on heterogeneous environments : ((Update on Static Graph Challenge)

2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Yasar, Abdurrahman Y.; Rajamanickam, Sivasankaran R.; Berry, Jonathan W.; Acer, Seher A.; Wolf, Michael W.; Young, Jeffrey G.; Catalyurek, Umit V.

Triangle counting is a representative graph problem that shows the challenges of improving graph algorithm performance using algorithmic techniques and adopting graph algorithms to new architectures. In this paper, we describe an update to the linear-algebraic formulation of the triangle counting problem. Our new approach relies on fine-grained tasking based on a tile layout. We adopt this task based algorithm to heterogeneous architectures (CPUs and GPUs) for up to 10.8x speed up over past year's graph challenge submission. This implementation also results in the fastest kernel time known at time of publication for real-world graphs like twitter (3.7 second) and friendster (1.8 seconds) on GPU accelerators when the graph is GPU resident. This is a 1.7 and 1.2 time improvement over previous state-of-the-art triangle counting on GPUs. We also improved end-to-end execution time by overlapping computation and communication of the graph to the GPUs. In terms of end-to-end execution time, our implementation also achieves the fastest end-to-end times due to very low overhead costs.

More Details

Sparse Matrix-Matrix Multiplication on Multilevel Memory Architectures: Algorithms and Experiments

Deveci, Mehmet D.; Hammond, Simon D.; Wolf, Michael W.; Rajamanickam, Sivasankaran R.

Architectures with multiple classes of memory media are becoming a common part of mainstream supercomputer deployments. So called multi-level memories offer differing characteristics for each memory component including variation in bandwidth, latency and capacity. This paper investigates the performance of sparse matrix multiplication kernels on two leading highperformance computing architectures — Intel's Knights Landing processor and NVIDIA's Pascal GPU. We describe a data placement method and a chunking-based algorithm for our kernels that exploits the existence of the multiple memory spaces in each hardware platform. We evaluate the performance of these methods w.r.t. standard algorithms using the auto-caching mechanisms Our results show that standard algorithms that exploit cache reuse performed as well as multi-memory-aware algorithms for architectures such as Ki\iLs where the memory subsystems have similar latencies. However, for architectures such as GPUS where memory subsystems differ significantly in both bandwidth and latency, multi-memory-aware methods are crucial for good performance. In addition, our new approaches permit the user to run problems that require larger capacities than the fastest memory of each compute node without depending on the software-managed cache mechanisms.

More Details
Results 1–25 of 66
Results 1–25 of 66