Publications Search

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

SIAM Journal on Scientific Computing

Ballard, Grey B.; Azad, Ariful; Buluc, Aydin; Demmel, James; Grigori, Laura; Schwartz, Oded; Toledo, Sivan; Williams, Samuel

Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdős-Rényi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first implementation of the 3D SpGEMM formulation that exploits multiple (intranode and internode) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.
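As a point of reference for the SpGEMM primitive itself, here is a minimal sequential sketch of the row-by-row (Gustavson-style) formulation. It is not the 3D parallel algorithm from the paper; the dict-of-dicts storage and the function name `spgemm` are assumptions chosen only for illustration.

```python
from collections import defaultdict


def spgemm(A, B):
    """Multiply two sparse matrices stored as {row: {col: value}} dicts.

    Hypothetical helper for illustration: row i of C is built by scaling
    and accumulating the rows of B selected by the nonzeros in row i of A.
    """
    C = {}
    for i, row_a in A.items():
        acc = defaultdict(float)  # sparse accumulator for row i of C
        for k, a_ik in row_a.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] += a_ik * b_kj
        if acc:
            C[i] = dict(acc)
    return C


# Small example: A is 2x3, B is 3x2.
A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
C = spgemm(A, B)
print(C)
```

Only nonzero entries are visited, so the work depends on the nonzero structure of the inputs rather than on the matrix dimensions.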

TYPE Journal Article YEAR 2016

DOI OSTI

Reducing Computation and Communication in Scientific Computing: Connecting Theory to Practice

Ballard, Grey B.

This report summarizes the work produced as part of a Truman Fellowship appointment and its associated LDRD project. The overall goal of the project was to develop better algorithms and implementations for key computational kernels within the field of scientific computing by designing them to be communication efficient, moving as little data as possible. The primary problem of interest was dense matrix multiplication; other computations that were addressed include sparse matrix-matrix multiplication, QR factorization, solving symmetric linear systems, and the symmetric eigendecomposition. The project also involved the study of computations at the intersection of scientific computing and data analysis, including nonnegative matrix factorization for discovering latent factors, Tucker tensor decomposition for data compression, and sampling methods for similarity search.

TYPE SAND Report YEAR 2016

DOI OSTI

Reducing Communication and Computation in Scientific Computing

Ballard, Grey B.

Abstract not provided.

TYPE Presentation YEAR 2016

OSTI

Improving the numerical stability of fast matrix multiplication

SIAM Journal on Matrix Analysis and Applications

Ballard, Grey B.; Benson, Austin R.; Druinsky, Alex; Lipshitz, Benjamin; Schwartz, Oded

Fast algorithms for matrix multiplication, namely those that perform asymptotically fewer scalar operations than the classical algorithm, have been considered primarily of theoretical interest. Apart from Strassen's original algorithm, few fast algorithms have been efficiently implemented or used in practical applications. However, there exist many practical alternatives to Strassen's algorithm with varying performance and numerical properties. Fast algorithms are known to be numerically stable, but because their error bounds are slightly weaker than those of the classical algorithm, they are not used even in cases where they provide a performance benefit. We argue in this paper that the numerical sacrifice of fast algorithms, particularly for the typical use cases of practical algorithms, is not prohibitive, and we explore ways to improve the accuracy both theoretically and empirically. The numerical accuracy of fast matrix multiplication depends on properties of the algorithm and of the input matrices, and we consider both contributions independently. We generalize and tighten previous error analyses of fast algorithms and compare their properties. We discuss algorithmic techniques for improving the error guarantees from two perspectives: manipulating the algorithms, and reducing input anomalies by various forms of diagonal scaling. Finally, we benchmark performance and demonstrate our improved numerical accuracy.
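To make "fewer scalar operations" concrete, the sketch below applies one level of Strassen's algorithm to a 2x2 product, using 7 multiplications where the classical algorithm uses 8. It is an illustration only and not code from the paper; the function names are hypothetical.

```python
def strassen_2x2(A, B):
    """One level of Strassen's algorithm on 2x2 inputs (7 multiplications)."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]


def classical_2x2(A, B):
    """Classical 2x2 product (8 multiplications), used as the reference."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
fast = strassen_2x2(A, B)
ref = classical_2x2(A, B)
print(fast, ref)
```

With integer inputs the two results agree exactly. In floating-point arithmetic the extra additions and subtractions in the fast formulas can change the rounding error, which is the kind of effect the error analyses in the abstract quantify.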

TYPE Journal Article YEAR 2016

DOI OSTI Scopus

Parallel Tucker Compression for Large-Scale Scientific Data

Kolda, Tamara G.; Ballard, Grey B.; Austin, Woody N.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Parallel Tensor Compression for Large-Scale Scientific Data

Kolda, Tamara G.; Ballard, Grey B.; Austin, Woody N.

As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8 TB of data. By viewing the data as a dense five-way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 10000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments.
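The 8 TB figure can be checked with a few lines of arithmetic. The sketch below assumes each value is stored as an 8-byte double-precision number; the abstract states only the total, so that storage size is an assumption.

```python
# Assumption: 8 bytes per value (double precision); not stated in the abstract.
BYTES_PER_VALUE = 8

grid_points = 512 ** 3   # three spatial dimensions, 512 points each
variables = 64           # variables tracked per grid point
time_steps = 128

# The five-way tensor has shape (512, 512, 512, 64, 128).
num_values = grid_points * variables * time_steps
total_bytes = num_values * BYTES_PER_VALUE
tib = total_bytes / 2 ** 40

# Size after compression at the best ratio quoted in the abstract.
compressed_bytes = total_bytes / 10_000

print(f"{num_values} values, {tib} TiB uncompressed")
print(f"about {compressed_bytes / 1e9:.2f} GB at a 10000x compression ratio")
```

Under this assumption the tensor holds 2^40 values and occupies 2^43 bytes, which matches the 8 TB quoted when a terabyte is taken as 2^40 bytes.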

TYPE Other Report YEAR 2015

DOI OSTI

A High-Performance Parallel Algorithm for Nonnegative Matrix Factorization

Ballard, Grey B.; Kannan, Ramakrishnan; Park, Haesun

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Algorithmic Improvements for Dense Symmetric Tridiagonalization

Ballard, Grey B.

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid

Ballard, Grey B.; Hu, Jonathan J.; Siefert, Christopher

We consider the sequence of sparse matrix-matrix multiplications performed during the setup phase of algebraic multigrid. In particular, we show that the most commonly used parallel algorithm is often not the most communication-efficient one for all of the matrix-matrix multiplications involved. By using an alternative algorithm, we show that the communication costs are reduced (in theory and practice), and we demonstrate the performance benefit for both model (structured) and more realistic unstructured problems on large-scale distributed-memory parallel systems. Our theoretical analysis shows that we can reduce communication by a factor of up to 5.4 for a model problem, and we observe in our empirical evaluation communication reductions of factors up to 4.7 for structured problems and 3.7 for unstructured problems. These reductions in communication translate to run-time speedups of up to factors of 2.3 and 2.5, respectively.

TYPE SAND Report YEAR 2015

DOI OSTI

Reconstructing Householder vectors from Tall-Skinny QR

Journal of Parallel and Distributed Computing

Ballard, Grey B.; Demmel, James; Grigori, Laura; Jacquelin, Mathias; Knight, Nicholas; Nguyen, Hong D.; Solomonik, Edgar

The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with the Communication-Avoiding QR (CAQR) algorithm. Experiments on supercomputers demonstrate the benefits of the communication cost improvements: in particular, our experiments show substantial improvements over tuned library implementations for tall-and-skinny matrices. Furthermore, we also provide algorithmic improvements to the Householder QR and CAQR algorithms, and we investigate several alternatives to the Householder reconstruction algorithm that sacrifice guarantees on numerical stability in some cases in order to obtain higher performance.
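For readers unfamiliar with TSQR, the following is a minimal single-process sketch of the reduction idea, assuming NumPy is available. It computes only the R factor with one flat reduction step and does not implement the Householder reconstruction described in the paper; the function name `tsqr_r` and the block count are hypothetical.

```python
import numpy as np


def tsqr_r(A, num_blocks=4):
    """Return an R factor of a tall-skinny matrix via a one-level TSQR.

    Each row block stands in for the data owned by one processor.
    """
    blocks = np.array_split(A, num_blocks, axis=0)
    # Step 1: independent local QR on each row block; keep only the R factors.
    local_rs = [np.linalg.qr(block, mode="r") for block in blocks]
    # Step 2: stack the small R factors and factor the stack once more.
    return np.linalg.qr(np.vstack(local_rs), mode="r")


rng = np.random.default_rng(0)
A = rng.standard_normal((64, 4))  # many more rows than columns
R = tsqr_r(A)

# R may differ from a direct QR by row signs, so compare via the Gram matrix.
print(np.allclose(R.T @ R, A.T @ A))
```

In a parallel setting only the small R factors move between processors, which is the source of the communication savings; the orthogonal factor is left implicit, and that implicit representation is what the reconstruction algorithm in the abstract addresses.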

TYPE Journal Article YEAR 2015

DOI OSTI

Diamond Sampling for Approximate Maximum All-pairs Dot-product (MAD) Search

Ballard, Grey B.; Pinar, Ali P.; Kolda, Tamara G.; Seshadri, C.

Abstract not provided.

TYPE Conference Poster YEAR 2015

DOI OSTI

Reducing Communication Costs for Sparse Matrix Multiplication within Algebraic Multigrid

Ballard, Grey B.; Hu, Jonathan J.; Siefert, Christopher

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Computing the Largest Entries in a Matrix Product via Sampling

Kolda, Tamara G.; Ballard, Grey B.; Pinar, Ali P.; Comandur, Seshadhri

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Hypergraph Partitioning for Parallel Sparse Matrix-Matrix Multiplication

Ballard, Grey B.; Druinsky, Alex; Knight, Nicholas; Schwartz, Oded

Abstract not provided.

TYPE Conference Poster YEAR 2015

OSTI

Contention Bounds for Combinations of Computation Graphs and Network Topologies

Ballard, Grey B.; Demmel, James; Gearhart, Andrew; Lipshitz, Benjamin; Schwartz, Oded; Toledo, Sivan

Abstract not provided.

TYPE Conference Poster YEAR 2015

DOI OSTI

Brief announcement: Hypergraph partitioning for parallel sparse matrix-matrix multiplication

ACM Transactions on Parallel Computing

Ballard, Grey B.; Druinsky, Alex; Knight, Nicholas; Schwartz, Oded

The performance of parallel algorithms for sparse matrix-matrix multiplication is typically determined by the amount of interprocessor communication performed, which in turn depends on the nonzero structure of the input matrices. In this paper, we characterize the communication cost of a sparse matrix-matrix multiplication algorithm in terms of the size of a cut of an associated hypergraph that encodes the computation for a given input nonzero structure. Obtaining an optimal algorithm corresponds to solving a hypergraph partitioning problem. Furthermore, our hypergraph model generalizes several existing models for sparse matrix-vector multiplication, and we can leverage hypergraph partitioners developed for that computation to improve application-specific algorithms for multiplying sparse matrices.

TYPE Journal Article YEAR 2014

DOI OSTI

A Framework for Practical Parallel Fast Matrix Multiplication

Ballard, Grey B.; Benson, Austin R.

Abstract not provided.

TYPE Conference Poster YEAR 2014

OSTI

Communication-Avoiding Algorithms and Fast Matrix Multiplication

Ballard, Grey B.

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

Algorithmic Improvements for QR Decomposition

Ballard, Grey B.

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

Avoiding Communication in Linear Algebra

Ballard, Grey B.

Abstract not provided.

TYPE Presentation YEAR 2014

OSTI

Communication-avoiding symmetric-indefinite factorization

SIAM Journal on Matrix Analysis and Applications

Ballard, Grey B.; Becker, Dulceneia; Demmel, James; Dongarra, Jack; Druinsky, Alex; Peled, Inon; Schwartz, Oded; Toledo, Sivan; Yamazaki, Ichitaro

We describe and analyze a novel symmetric triangular factorization algorithm. The algorithm is essentially a block version of Aasen's triangular tridiagonalization. It factors a dense symmetric matrix $A$ as the product $A = P L T L^{T} P^{T}$, where $P$ is a permutation matrix, $L$ is lower triangular, and $T$ is block tridiagonal and banded. The algorithm is the first symmetric-indefinite communication-avoiding factorization: it performs an asymptotically optimal amount of communication in a two-level memory hierarchy for almost any cache-line size. Adaptations of the algorithm to parallel computers are likely to be communication efficient as well; one such adaptation has been recently published. The current paper describes the algorithm, proves that it is numerically stable, and proves that it is communication optimal.

TYPE Journal Article YEAR 2014

DOI OSTI Scopus

Reconstructing Householder vectors from tall-skinny QR

Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS

Ballard, Grey B.; Demmel, James; Grigori, Laura; Jacquelin, Mathias; Nguyen, Hong D.; Solomonik, Edgar

The Tall-Skinny QR (TSQR) algorithm is more communication efficient than the standard Householder algorithm for QR decomposition of matrices with many more rows than columns. However, TSQR produces a different representation of the orthogonal factor and therefore requires more software development to support the new representation. Further, implicitly applying the orthogonal factor to the trailing matrix in the context of factoring a square matrix is more complicated and costly than with the Householder representation. We show how to perform TSQR and then reconstruct the Householder vector representation with the same asymptotic communication efficiency and little extra computational cost. We demonstrate the high performance and numerical stability of this algorithm both theoretically and empirically. The new Householder reconstruction algorithm allows us to design more efficient parallel QR algorithms, with significantly lower latency cost compared to Householder QR and lower bandwidth and latency costs compared with the Communication-Avoiding QR (CAQR) algorithm. As a result, our final parallel QR algorithm outperforms ScaLAPACK and Elemental implementations of Householder QR and our implementation of CAQR on the Hopper Cray XE6 NERSC system. We also provide algorithmic improvements to the ScaLAPACK and CAQR algorithms. © 2014 IEEE.

TYPE Conference YEAR 2014

Scopus OSTI