Publications

9 Results

ADELUS: A Performance-Portable Dense LU Solver for Distributed-Memory Hardware-Accelerated Systems

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Dang, Vinh Q.; Kotulski, J.D.; Rajamanickam, Sivasankaran R.

Solving dense systems of linear equations is essential in applications encountered in physics, mathematics, and engineering. This paper describes our current efforts toward the development of the ADELUS package for current and next-generation distributed, accelerator-based, high-performance computing platforms. The package solves dense linear systems using LU factorization with partial pivoting on distributed-memory systems with CPUs/GPUs. The matrix is block-mapped onto distributed memory on CPUs/GPUs and is solved as if it were torus-wrapped for an optimal balance of computation and communication. A permutation operation is performed to restore the results so that the torus-wrap distribution is transparent to the user. This package targets performance portability by leveraging the abstractions provided in the Kokkos and Kokkos Kernels libraries. A comparison of the performance gains versus the state-of-the-art SLATE and DPLASMA GESV functionalities on the Summit supercomputer is provided. Preliminary performance results from large-scale electromagnetic simulations using ADELUS are also presented. The solver achieves 7.7 Petaflops on 7600 GPUs of the Sierra supercomputer, translating to 16.9% efficiency.
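
The two ideas named in the abstract, LU factorization with partial pivoting and a torus-wrap data layout, can be illustrated with a small serial sketch. This is not ADELUS code and uses none of its API: the function names `lu_solve` and `torus_wrap_owner` are hypothetical, and the mapping assumes the commonly described form of torus-wrapping in which element (i, j) is assigned to process (i mod P_r, j mod P_c) on a P_r x P_c process grid.

```python
def lu_solve(a, b):
    """Solve a x = b for a dense square matrix using LU with partial pivoting.

    Serial illustration only; `a` is a list of rows, `b` a list of floats.
    """
    n = len(a)
    lu = [row[:] for row in a]  # work on copies so the inputs are untouched
    x = list(b)
    for k in range(n):
        # Partial pivoting: choose the row with the largest |entry| in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            x[k], x[p] = x[p], x[k]
        # Eliminate below the pivot; store multipliers in place and
        # apply the same row operations to the right-hand side.
        for i in range(k + 1, n):
            m = lu[i][k] / lu[k][k]
            lu[i][k] = m
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]
            x[i] -= m * x[k]
    # Back substitution with the upper-triangular factor.
    for i in range(n - 1, -1, -1):
        s = sum(lu[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (x[i] - s) / lu[i][i]
    return x


def torus_wrap_owner(i, j, proc_rows, proc_cols):
    """Assumed torus-wrap mapping: element (i, j) -> process grid coordinates."""
    return (i % proc_rows, j % proc_cols)


if __name__ == "__main__":
    # The zero in the (0, 0) position forces a row swap.
    print(lu_solve([[0.0, 2.0], [1.0, 1.0]], [4.0, 3.0]))
    print(torus_wrap_owner(5, 7, 2, 3))
```

Because the cyclic assignment spreads every row and column across the whole process grid, the shrinking active submatrix stays distributed over all processes as elimination proceeds, which is the load-balance property the abstract refers to.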


Performance portable sparse approximate inverse preconditioner for EFIE equations

Proceedings of the 2017 19th International Conference on Electromagnetics in Advanced Applications, ICEAA 2017

Bettencourt, Matthew T.; Zinser, Brian; Jorgenson, Roy E.; Kotulski, J.D.

A block-based sparse approximate inverse preconditioner for the electric field integral equations is documented and tested. It utilizes the Kokkos library for performance portability and shows superior performance when compared to a direct method: it is 36x faster for a 112K DOF problem. Furthermore, the abstractions available in the Kokkos library make migrating from CPU to GPU trivial.
