Publications

Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems

Parallel Computing

Acer, Seher A.; Boman, Erik G.; Glusa, Christian A.; Rajamanickam, Sivasankaran R.

Graph partitioning has long been an important tool for dividing work among processors so as to minimize communication cost and balance the workload. As accelerator-based supercomputers emerge as the standard and applications rapidly move to these architectures, graph partitioning becomes even more important. However, no distributed-memory-parallel, multi-GPU graph partitioner has been available for applications. We developed a spectral graph partitioner, Sphynx, using the portable, accelerator-friendly stack of the Trilinos framework. Sphynx supports different preconditioners and exploits their distinct advantages. We use Sphynx to systematically evaluate the various algorithmic choices in spectral partitioning, with a focus on GPU performance. We perform those evaluations on two distinct classes of graphs, regular (such as meshes and matrices from finite element methods) and irregular (such as social networks and web graphs), and show that the two classes require different settings and preconditioners. Experimental results on the Summit supercomputer show that Sphynx is the fastest alternative on irregular graphs in an application-friendly setting and obtains partitioning quality close to ParMETIS on regular graphs. Compared to nvGRAPH on a single GPU, Sphynx is faster and obtains better balance and better-quality partitions. Sphynx provides a good and robust partitioning method across a wide range of graphs for applications looking for a GPU-based partitioner.
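
As an illustration of the spectral approach the abstract describes, the sketch below computes a two-way partition from the Fiedler vector of the graph Laplacian. This is a minimal single-node Python/SciPy rendering, assuming an unweighted, symmetric adjacency matrix; Sphynx itself is a distributed C++/Trilinos implementation whose GPU-enabled LOBPCG solvers and preconditioners are not reflected here.

```python
# Minimal spectral bisection sketch (illustrative; not Sphynx itself).
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lobpcg

def spectral_bisect(A):
    """Two-way partition of the graph with symmetric adjacency matrix A,
    split at the median of the Fiedler vector of its Laplacian."""
    n = A.shape[0]
    degrees = np.asarray(A.sum(axis=1)).ravel()
    L = (sp.diags(degrees) - A).tocsr()          # combinatorial Laplacian
    # Two smallest eigenpairs via LOBPCG; Sphynx would precondition this
    # solve, here a plain random initial block stands in.
    X = np.random.default_rng(0).standard_normal((n, 2))
    w, V = lobpcg(L, X, largest=False, tol=1e-6, maxiter=500)
    fiedler = V[:, np.argmax(w)]                 # skip the constant eigenvector
    # Median split yields two balanced parts.
    return (fiedler > np.median(fiedler)).astype(int)

# Example: a 20-vertex path graph splits into its two halves.
n = 20
A = sp.diags([np.ones(n - 1), np.ones(n - 1)], [-1, 1], format="csr")
print(spectral_bisect(A))
```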

True Load Balancing for Matricized Tensor Times Khatri-Rao Product

IEEE Transactions on Parallel and Distributed Systems

Abubaker, Nabil; Acer, Seher A.; Aykanat, Cevdet

MTTKRP is the bottleneck operation in algorithms that compute the CP tensor decomposition. For sparse tensors, using the compressed sparse fiber (CSF) storage format and CSF-oriented MTTKRP algorithms is important for both memory and computational efficiency on distributed-memory architectures. Existing intelligent tensor partitioning models assume the computational cost of MTTKRP to be proportional to the total number of nonzeros in the tensor. However, this is not the case for CSF-oriented MTTKRP on distributed-memory architectures. We outline two deficiencies of nonzero-based intelligent partitioning models when CSF-oriented MTTKRP operations are performed locally: failure to encode the processors' computational loads, and an increase in total computation due to fiber fragmentation. We focus on the existing fine-grain hypergraph model and propose a novel vertex weighting scheme that enables this model to encode the correct computational loads of processors. We also propose augmenting the fine-grain model with fiber nets to reduce the increase in total computational load by minimizing fiber fragmentation. In this way, the proposed model encodes the minimization of the bottleneck processor's load. Parallel experiments with real-world sparse tensors on up to 1024 processors confirm the outlined deficiencies and demonstrate the merit of our proposed improvements in terms of parallel runtimes.
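
To make the cost argument concrete, the sketch below performs a mode-1 MTTKRP on a sparse tensor in coordinate form, grouping nonzeros by fiber the way a CSF-oriented kernel traverses them. It is a hypothetical single-node illustration, not the paper's code: the point is that one Hadamard product of factor rows is computed per fiber and reused across that fiber's nonzeros, so a processor's load depends on its fiber count as well as its nonzero count, which is exactly what nonzero-based partitioning models miss.

```python
# Illustrative mode-1 MTTKRP over a sparse 3-way tensor in COO form,
# grouped by (j, k) fibers as a CSF-oriented kernel would group them.
# A hypothetical single-node sketch, not the paper's implementation.
import numpy as np
from collections import defaultdict

def mttkrp_mode1(coords, vals, B, C, I):
    """Compute M[i, r] = sum over nonzeros of X[i, j, k] * B[j, r] * C[k, r].
    coords: (i, j, k) index triples; vals: matching nonzero values;
    B: (J, R) and C: (K, R) factor matrices; I: size of mode 1."""
    M = np.zeros((I, B.shape[1]))
    fibers = defaultdict(list)          # nonzeros grouped by fiber (j, k)
    for (i, j, k), v in zip(coords, vals):
        fibers[(j, k)].append((i, v))
    for (j, k), nnzs in fibers.items():
        w = B[j] * C[k]                 # one Hadamard product per fiber...
        for i, v in nnzs:
            M[i] += v * w               # ...reused for the fiber's nonzeros
    # Local cost: one length-R product per fiber plus one axpy per nonzero.
    # Splitting a fiber across processors duplicates the per-fiber work;
    # that duplication is the fiber-fragmentation overhead described above.
    return M

# Example: a 2 x 2 x 2 tensor with three nonzeros and rank R = 2.
coords = [(0, 0, 0), (0, 1, 0), (1, 1, 1)]
vals = [1.0, 2.0, 3.0]
B = np.arange(4.0).reshape(2, 2)        # factor for mode 2
C = np.ones((2, 2))                     # factor for mode 3
print(mttkrp_mode1(coords, vals, B, C, I=2))
```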

SPHYNX: Spectral partitioning for HYbrid and aXelerator-enabled systems

Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020

Acer, Seher A.; Boman, Erik G.; Rajamanickam, Sivasankaran R.

Graph partitioning has long been an important tool for dividing work among processors so as to minimize communication cost and balance the workload. As accelerator-based supercomputers emerge as the standard and applications rapidly move to these architectures, graph partitioning becomes even more important. However, no scalable, distributed-memory, multi-GPU graph partitioner has been available for applications. We developed a spectral graph partitioner, Sphynx, using the portable, accelerator-friendly stack of the Trilinos framework. We use Sphynx to systematically evaluate the various algorithmic choices in spectral partitioning, with a focus on GPU performance. We perform those evaluations on irregular graphs, because state-of-the-art partitioners have the most difficulty with them. We demonstrate that Sphynx is up to 17x faster on GPUs than on CPUs, and up to 580x faster than a state-of-the-art multilevel partitioner. Sphynx provides a robust alternative for applications looking for a GPU-based partitioner.
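
This paper concerns the same bisection kernel sketched under the Parallel Computing entry above. One common route from such a kernel to a k-way partition is recursive bisection, sketched below as a hypothetical wrapper around that earlier spectral_bisect function; Sphynx's actual k-way mapping from multiple eigenvectors differs, and the sketch assumes the part count is a power of two.

```python
# Hypothetical k-way wrapper around the spectral_bisect sketch above;
# recursively bisects induced subgraphs until k parts remain.
import numpy as np
import scipy.sparse as sp

def recursive_bisect(A, k):
    """Partition the graph with CSR adjacency matrix A into k parts
    (k a power of two) by recursive spectral bisection."""
    parts = np.zeros(A.shape[0], dtype=int)
    def rec(vertices, k0, offset):
        if k0 == 1:
            parts[vertices] = offset
            return
        sub = A[vertices][:, vertices]           # induced subgraph
        side = spectral_bisect(sub)              # from the earlier sketch
        rec(vertices[side == 0], k0 // 2, offset)
        rec(vertices[side == 1], k0 // 2, offset + k0 // 2)
    rec(np.arange(A.shape[0]), k, 0)
    return parts

# Example: 4-way partition of a 32-vertex path graph
# (labels come out as contiguous blocks of 8 vertices).
n = 32
A = sp.diags([np.ones(n - 1), np.ones(n - 1)], [-1, 1], format="csr")
print(recursive_bisect(A, 4))
```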

Scalable triangle counting on distributed-memory systems

2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Acer, Seher A.; Yasar, Abdurrahman; Rajamanickam, Sivasankaran R.; Wolf, Michael W.; Catalyurek, Umit V.

Triangle counting is a foundational graph-analysis kernel in network science. It has also been one of the challenge problems in the 'Static Graph Challenge'. In this work, we propose a novel, hybrid, parallel triangle counting algorithm based on its linear algebra formulation. Our framework uses MPI and Cilk to exploit the benefits of distributed-memory and shared-memory parallelism, respectively. The problem is partitioned among MPI processes using a two-dimensional (2D) Cartesian block partitioning. Within the Cartesian blocks, one-dimensional (1D) rowwise partitioning provides shared-memory parallelism via the Cilk programming model. Besides exhibiting very good strong-scaling behavior on almost all tested graphs, our algorithm achieves the fastest time on the 1.4B-edge real-world Twitter graph: 3.217 seconds on 1,092 cores. Compared to past distributed-memory parallel winners of the graph challenge, this represents a speedup of 2.7× on the Twitter graph. It is also the fastest time reported for parallel triangle counting on the Twitter graph when the graph is not replicated.
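
On a single node, the linear algebra formulation mentioned above reduces to masking the square of the strictly lower-triangular adjacency matrix by itself. The SciPy sketch below shows that kernel only; it is illustrative and omits the paper's actual contribution, the MPI+Cilk distribution over 2D Cartesian blocks.

```python
# Single-node sketch of linear-algebra triangle counting; the paper's
# contribution is the MPI+Cilk distribution of this kernel, not shown here.
import numpy as np
import scipy.sparse as sp

def count_triangles(A):
    """A: symmetric 0/1 adjacency matrix without self loops.
    Each triangle {i < j < k} is counted exactly once."""
    L = sp.tril(A, k=-1).tocsr()           # strictly lower-triangular part
    # (L @ L)[k, i] counts 2-paths i < j < k; masking by L keeps only
    # the paths closed by an (i, k) edge, i.e., triangles.
    return int((L @ L).multiply(L).sum())

# Example: the complete graph K4 contains 4 triangles.
A = sp.csr_matrix(np.ones((4, 4)) - np.eye(4))
print(count_triangles(A))                  # -> 4
```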
