Results 1–25 of 144
Skip to search filters

Streaming Generalized Canonical Polyadic Tensor Decompositions

Phipps, Eric T.; Johnson, Nick J.; Kolda, Tamara K.

In this paper, we develop a method which we call OnlineGCP for computing the Generalized Canonical Polyadic (GCP) tensor decomposition of streaming data. GCP differs from traditional canonical polyadic (CP) tensor decompositions as it allows for arbitrary objective functions which the CP model attempts to minimize. This approach can provide better fits and more interpretable models when the observed tensor data is strongly non-Gaussian. In the streaming case, tensor data is gradually observed over time and the algorithm must incrementally update a GCP factorization with limited access to prior data. In this work, we extend the GCP formalism to the streaming context by deriving a GCP optimization problem to be solved as new tensor data is observed, formulate a tunable history term to balance reconstruction of recently observed data with data observed in the past, develop a scalable solution strategy based on segregated solves using stochastic gradient descent methods, describe a software implementation that provides performance and portability to contemporary CPU and GPU architectures and integrates with Matlab for enhanced usability, and demonstrate the utility and performance of the approach and software on several synthetic and real tensor data sets.

More Details

Low-Communication Asynchronous Distributed Generalized Canonical Polyadic Tensor Decomposition

2021 IEEE High Performance Extreme Computing Conference, HPEC 2021

Lewis, Cannada L.; Phipps, Eric T.

In this work, we show that reduced communication algorithms for distributed stochastic gradient descent improve the time per epoch and strong scaling for the Generalized Canonical Polyadic (GCP) tensor decomposition, but with a cost, achieving convergence becomes more difficult. The implementation, based on MPI, shows that while one-sided algorithms offer a path to asynchronous execution, the performance benefits of optimized allreduce are difficult to best.

More Details

A Portable SIMD Primitive Using Kokkos for Heterogeneous Architectures

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Sahasrabudhe, Damodar; Phipps, Eric T.; Rajamanickam, Sivasankaran R.; Berzins, Martin

As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a gpu back-end for these primitives makes such code non-portable. A unified, portable, Single Instruction Multiple Data (simd) primitive proposed in this work, allows intrinsics-based vectorization on cpus and many-core architectures such as Intel Knights Landing (knl), and also facilitates Single Instruction Multiple Threads (simt) based execution on gpus. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. The new simd primitive is used on different architectures to test the performance boost against hard-to-auto-vectorize baseline, to measure the overhead against efficiently vectroized baseline, and to evaluate the new feature called the “logical vector length” (lvl). The simd primitive provides portability across cpus and gpus without any performance degradation being observed experimentally.

More Details

Surrogate-based ensemble grouping strategies for embedded sampling-based uncertainty quantification

Lecture Notes in Computational Science and Engineering

D'Elia, Marta D.; Phipps, Eric T.; Rushdi, A.; Ebeida, Mohamed S.

The embedded ensemble propagation approach introduced in Phipps et al. (SIAM J. Sci. Comput. 39(2):C162, 2017) has been demonstrated to be a powerful means of reducing the computational cost of sampling-based uncertainty quantification methods, particularly on emerging computational architectures. A substantial challenge with this method however is ensemble-divergence, whereby different samples within an ensemble choose different code paths. This can reduce the effectiveness of the method and increase computational cost. Therefore grouping samples together to minimize this divergence is paramount in making the method effective for challenging computational simulations. In this work, a new grouping approach based on a surrogate for computational cost built up during the uncertainty propagation is developed and applied to model advection-diffusion problems where computational cost is driven by the number of (preconditioned) linear solver iterations. The approach is developed within the context of locally adaptive stochastic collocation methods, where a surrogate for the number of linear solver iterations, generated from previous levels of the adaptive grid generation, is used to predict iterations for subsequent samples, and group them based on similar numbers of iterations. The effectiveness of the method is demonstrated by applying it to highly anisotropic advection-dominated diffusion problems with a wide variation in solver iterations from sample to sample. It extends the parameter-based grouping approach developed in D’Elia et al. (SIAM/ASA J. Uncertain. Quantif. 6:87, 2017) to more general problems without requiring detailed knowledge of how the uncertain parameters affect the simulation’s cost, and is also less intrusive to the simulation code.

More Details

Rigorous Data Fusion for Computationally Expensive Simulations

Winovich, Nickolas W.; Rushdi, Ahmad R.; Phipps, Eric T.; Ray, Jaideep R.; Lin, Guang L.; Ebeida, Mohamed S.

This manuscript comprises the final report for the 1-year, FY19 LDRD project "Rigorous Data Fusion for Computationally Expensive Simulations," wherein an alternative approach to Bayesian calibration was developed based a new sampling technique called VoroSpokes. Vorospokes is a novel quadrature and sampling framework defined with respect to Voronoi tessellations of bounded domains in R d developed within this project. In this work, we first establish local quadrature and sampling results on convex polytopes using randomly directed rays, or spokes, to approximate the quantities of interest for a specified target function. A theoretical justification for both procedures is provided along with empirical results demonstrating the unbiased convergence in the resulting estimates/samples. The local quadrature and sampling procedures are then extended to global procedures defined on more general domains by applying the local results to the cells of a Voronoi tessellation covering the domain in consideration. We then demonstrate how the proposed global sampling procedure can be used to define a natural framework for adaptively constructing Voronoi Piecewise Surrogate (VPS) approximations based on local error estimates. Finally, we show that the adaptive VPS procedure can be used to form a surrogate model approximation to a specified, potentially unnormalized, density function, and that the global sampling procedure can be used to efficiently draw independent samples from the surrogate density in parallel. The performance of the resulting VoroSpokes sampling framework is assessed on a collection of Bayesian inference problems and is shown to provide highly accurate posterior predictions which align with the results obtained using traditional methods such as Gibbs sampling and random-walk Markov Chain Monte Carlo (MCMC). Importantly, the proposed framework provides a foundation for performing Bayesian inference tasks which is entirely independent from the theory of Markov chains.

More Details

Software for sparse tensor decomposition on emerging computing architectures

SIAM Journal on Scientific Computing

Phipps, Eric T.; Kolda, Tamara G.

In this paper, we develop software for decomposing sparse tensors that is portable to and performant on a variety of multicore, manycore, and GPU computing architectures. The result is a single code whose performance matches optimized architecture-specific implementations. The key to a portable approach is to determine multiple levels of parallelism that can be mapped in different ways to different architectures, and we explain how to do this for the matricized tensor times Khatri-Rao product (MTTKRP), which is the key kernel in canonical polyadic tensor decomposition. Our implementation leverages the Kokkos framework, which enables a single code to achieve high performance across multiple architectures that differ in how they approach fine-grained parallelism. We also introduce a new construct for portable thread-local arrays, which we call compile-time polymorphic arrays. Not only are the specifics of our approaches and implementation interesting for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes. As a last step in optimizing performance, we modify the MTTKRP algorithm itself to do a permuted traversal of tensor nonzeros to reduce atomic-write contention. We test the performance of our implementation on 16- and 68-core Intel CPUs and the K80 and P100 NVIDIA GPUs, showing that we are competitive with state-of-the-art architecture-specific codes while having the advantage of being able to run on a variety of architectures.

More Details
Results 1–25 of 144
Results 1–25 of 144