AMD Meeting ? Trilinos / Kokkos Kernels requirements
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
We present a numerical modeling workflow based on machine learning (ML) which reproduces the the total energies produced by Kohn-Sham density functional theory (DFT) at finite electronic temperature to within chemical accuracy at negligible computational cost. Based on deep neural networks, our workflow yields the local density of states (LDOS) for a given atomic configuration. From the LDOS, spatially-resolved, energy-resolved, and integrated quantities can be calculated, including the DFT total free energy, which serves as the Born-Oppenheimer potential energy surface for the atoms. We demonstrate the efficacy of this approach for both solid and liquid metals and compare results between independent and unified machine-learning models for solid and liquid aluminum. Our machine-learning density functional theory framework opens up the path towards multiscale materials modeling for matter under ambient and extreme conditions at a computational scale and cost that is unattainable with current algorithms.
Triangle counting is a fundamental building block in graph algorithms. In this paper, we propose a block-based triangle counting algorithm to reduce data movement during both sequential and parallel execution. Our block-based formulation makes the algorithm naturally suitable for heterogeneous architectures. The problem of partitioning the adjacency matrix of a graph is well-studied. Our task decomposition goes one step further: it partitions the set of triangles in the graph. By streaming these small tasks to compute resources, we can solve problems that do not fit on a device. We demonstrate the effectiveness of our approach by providing an implementation on a compute node with multiple sockets, cores and GPUs. The current state-of-the-art in triangle enumeration processes the Friendster graph in 2.1 seconds, not including data copy time between CPU and GPU. Using that metric, our approach is 20 percent faster. When copy times are included, our algorithm takes 3.2 seconds. This is 5.6 times faster than the fastest published CPU-only time.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
ACM International Conference Proceeding Series
Sparse triangular solver is an important kernel in many computational applications. However, a fast, parallel, sparse triangular solver on a manycore architecture such as GPU has been an open issue in the field for several years. In this paper, we develop a sparse triangular solver that takes advantage of the supernodal structures of the triangular matrices that come from the direct factorization of a sparse matrix. We implemented our solver using Kokkos and Kokkos Kernels such that our solver is portable to different manycore architectures. This has the additional benefit of allowing our triangular solver to use the team-level kernels and take advantage of the hierarchical parallelism available on the GPU. We compare the effects of different scheduling schemes on the performance and also investigate an algorithmic variant called the partitioned inverse. Our performance results on an NVIDIA V100 or P100 GPU demonstrate that our implementation can be 12.4 × or 19.5 × faster than the vendor optimized implementation in NVIDIA's CuSPARSE library.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
The goal of the ExaWind project is to enable predictive simulations of wind farms comprised of many megawatt-scale turbines situated in complex terrain. Predictive simulations will require computational fluid dynamics (CFD) simulations for which the mesh resolves the geometry of the turbines and captures the rotation and large deflections of blades. Whereas such simulations for a single turbine are arguably petascale class, multi-turbine wind farm simulations will require exascale-class resources. The primary physics codes in the ExaWind project are Nalu-Wind, which is an unstructured-grid solver for the acoustically incompressible Navier-Stokes equations, and OpenFAST, which is a whole-turbine simulation code. The Nalu-Wind model consists of the mass-continuity Poisson-type equation for pressure and a momentum equation for the velocity. For such modeling approaches, simulation times are dominated by linear-system setup and solution for the continuity and momentum systems. For the ExaWind challenge problem, the moving meshes greatly affect overall solver costs as reinitialization of matrices and recomputation of preconditioners is required at every time step. In this report we evaluated GPU-performance baselines for the linear solvers in the Trilinos and hypre solver stacks using two representative Nalu-Wind simulations: an atmospheric boundary layer precursor simulation on a structured mesh, and a fixed-wing simulation using unstructured overset meshes. Both strong-scaling and weak-scaling experiments were conducted on the OLCF supercomputer Summit and similar proxy clusters. We focused on the performance of multi-threaded Gauss-Seidel and two-stage Gauss-Seidel that are extensions of classical Gauss-Seidel; of one-reduce GMRES, a communication-reducing variant of the Krylov GMRES; and algebraic multigrid methods that incorporate the afore-mentioned methods. The team has established that AMG methods are capable of solving linear systems arising from the fixed-wing overset meshes on CPU, a critical intermediate result for ExaWind FY20 Q3 and Q4 milestones. For the fixed-wing strong-scaling study (model with 3M grid-points), the team identified that Nalu-Wind simulations with the new Trilinos and hypre solvers scale to modest GPU counts, maintaining above 70% efficiency up to 6 GPUs. However, there still remain significant bottlenecks to performance: matrix assembly (hypre), AMG setup (hypre and Trilinos) In the weak-scaling experiments (going from 0.4M to 211M gridpoints), it's shown that the solver apply phases are faster on GPUs, but that Nalu-Wind simulation times grow, primarily due to the multigrid-setup process. Finally, based on the report outcomes, we propose a linear solver path-forward for the remainder of the ExaWind project. Near term, the NREL team will continue their work on GPU-based linear-system assembly. They will also investigate how the use of alternatives to the NVIDIA UVM (unified virtual memory) paradigm affects performance. Longer term, the NREL team will evaluate algorithmic performance on other types of accelerators and merge their improvements back to the main hypre repository branch. Near term, the Trilinos team will address performance bottlenecks identified in this milestone, such as implementing a GPU-based segregated momentum solve and reusing matrix graphs across linear-system assembly phases. Longer term, the Trilinos team will do detailed analysis and optimization of multigrid setup.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020
Graph partitioning has been an important tool to partition the work among several processors to minimize the communication cost and balance the workload. While accelerator-based supercomputers are emerging to be the standard, the use of graph partitioning becomes even more important as applications are rapidly moving to these architectures. However, there is no scalable, distributed-memory, multi-GPU graph partitioner available for applications. We developed a spectral graph partitioner, Sphynx, using the portable, accelerator-friendly stack of the Trilinos framework. We use Sphnyx to systematically evaluate the various algorithmic choices in spectral partitioning with a focus on GPU performance. We perform those evaluations on irregular graphs, because state-of-the-art partitioners have the most difficulty on them. We demonstrate that Sphynx is up to 17x faster on GPUs compared to the case on CPUs, and up to 580x faster compared to a state-of-the-art multilevel partitioner. Sphynx provides a robust alternative for applications looking for a GPU-based partitioner.
The ExaLearn miniGAN team (Ellis and Rajamanickam) have released miniGAN, a generative adversarial network(GAN) proxy application, through the ECP proxy application suite. miniGAN is the first machine learning proxy application in the suite (note: the ECP CANDLE project did previously release some benchmarks) and models the performance for training generator and discriminator networks. The GAN's generator and discriminator generate plausible 2D/3D maps and identify fake maps, respectively. miniGAN aims to be a proxy application for related applications in cosmology (CosmoFlow, ExaGAN) and wind energy (ExaWind). miniGAN has been developed so that optimized mathematical kernels (e.g., kernels provided by Kokkos Kernels) can be plugged into to the proxy application to explore potential performance improvements. miniGAN has been released as open source software and is available through the ECP proxy application website (https://proxyapps.exascaleproject.ordecp-proxy-appssuite/) and on GitHub (https://github.com/SandiaMLMiniApps/miniGAN). As part of this release, a generator is provided to generate a data set (series of images) that are inputs to the proxy application.
Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current success and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Graph neural networks (GNNs) are an emerging model for learning graph embeddings and making predictions on graph structured data. However, robustness of graph neural networks is not yet well-understood. In this work, we focus on node structural identity predictions, where a representative GNN model is able to achieve near-perfect accuracy. We also show that the same GNN model is not robust to addition of structural noise, through a controlled dataset and set of experiments. Finally, we show that under the right conditions, graph-augmented training is capable of significantly improving robustness to structural noise.
SIAM Journal on Matrix Analysis and Applications
We propose a new algorithm for the fast solution of large, sparse, symmetric positive-definite linear systems, spaND (sparsified Nested Dissection). It is based on nested dissection, sparsification, and low-rank compression. After eliminating all interiors at a given level of the elimination tree, the algorithm sparsifies all separators corresponding to the interiors. This operation reduces the size of the separators by eliminating some degrees of freedom but without introducing any fill-in. This is done at the expense of a small and controllable approximation error. The result is an approximate factorization that can be used as an efficient preconditioner. We then perform several numerical experiments to evaluate this algorithm. We demonstrate that a version using orthogonal factorization and block-diagonal scaling takes fewer CG iterations to converge than previous similar algorithms on various kinds of problems. Furthermore, this algorithm is provably guaranteed to never break down and the matrix stays symmetric positive-definite throughout the process. We evaluate the algorithm on some large problems show it exhibits near-linear scaling. The factorization time is roughly \scrO (N), and the number of iterations grows slowly with N.
Lecture Notes in Computational Science and Engineering
This article describes a parallel implementation of a two-level overlapping Schwarz preconditioner with the GDSW (Generalized Dryja–Smith–Widlund) coarse space described in previous work [12, 10, 15] into the Trilinos framework; cf. [16]. The software is a significant improvement of a previous implementation [12]; see Sec. 4 for results on the improved performance.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
As computer architectures are rapidly evolving (e.g. those designed for exascale), multiple portability frameworks have been developed to avoid new architecture-specific development and tuning. However, portability frameworks depend on compilers for auto-vectorization and may lack support for explicit vectorization on heterogeneous platforms. Alternatively, programmers can use intrinsics-based primitives to achieve more efficient vectorization, but the lack of a gpu back-end for these primitives makes such code non-portable. A unified, portable, Single Instruction Multiple Data (simd) primitive proposed in this work, allows intrinsics-based vectorization on cpus and many-core architectures such as Intel Knights Landing (knl), and also facilitates Single Instruction Multiple Threads (simt) based execution on gpus. This unified primitive, coupled with the Kokkos portability ecosystem, makes it possible to develop explicitly vectorized code, which is portable across heterogeneous platforms. The new simd primitive is used on different architectures to test the performance boost against hard-to-auto-vectorize baseline, to measure the overhead against efficiently vectroized baseline, and to evaluate the new feature called the “logical vector length” (lvl). The simd primitive provides portability across cpus and gpus without any performance degradation being observed experimentally.
Lecture Notes in Computational Science and Engineering
This article describes a parallel implementation of a two-level overlapping Schwarz preconditioner with the GDSW (Generalized Dryja–Smith–Widlund) coarse space described in previous work [12, 10, 15] into the Trilinos framework; cf. [16]. The software is a significant improvement of a previous implementation [12]; see Sec. 4 for results on the improved performance.
Abstract not provided.
Abstract not provided.
Abstract not provided.