MueLu's Algorithmic Performance on GPU
SIAM Journal on Scientific Computing
This paper presents a comparison of the parallel strong-scaling performance of classical and aggregation-based algebraic multigrid (AMG) preconditioners in the context of wind turbine simulations. Fluid motion is governed by the incompressible Navier--Stokes equations, discretized in space with control-volume finite elements and in time with an inexact projection scheme using an implicit integrator. A discontinuous-Galerkin sliding-mesh algorithm captures rotor motion. The momentum equations are solved with iterative Krylov methods, preconditioned by symmetric Gauss--Seidel (SGS) in Trilinos and $\ell_1$ SGS in hypre. The mass-continuity equation is solved with GMRES preconditioned by AMG, and this solve can account for the majority of simulation time; reducing the continuity solve time is therefore crucial. Wind turbine simulations present two unique challenges for AMG-preconditioned solvers: the computational meshes include strongly anisotropic elements, and mesh motion requires matrix reinitialization and recomputation of preconditioners at each time step. Detailed timing profiles are presented and analyzed, and best practices are discussed for both classical and aggregation-based AMG. Results are presented for simulations of two different wind turbines with up to 6 billion grid points on two different computer architectures. For moving-mesh problems that require linear-system reinitialization, the well-established strategy of amortizing preconditioner setup costs over a large number of time steps to reduce the solve time is no longer valid. Instead, results show that faster time to solution is achieved by reducing preconditioner setup costs at the expense of linear-system solve costs. Standard smoothed aggregation with Chebyshev relaxation was found to perform poorly compared with classical AMG in terms of solve time and robustness, whereas plain aggregation was comparable to classical AMG.
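The conclusion above (cheap setup via plain aggregation with SGS relaxation, rather than smoothed aggregation with Chebyshev relaxation, for moving-mesh problems) can be expressed as a solver configuration. The C++ sketch below assumes the Tpetra/MueLu stack and its documented parameter-list interface; the function name and parameter values are illustrative placeholders, not the settings used in the study.

#include <Teuchos_ParameterList.hpp>
#include <Teuchos_RCP.hpp>
#include <Tpetra_Map.hpp>
#include <Tpetra_Operator.hpp>
#include <MueLu_CreateTpetraPreconditioner.hpp>

// Default Tpetra ordinal/node types; the scalar type is double.
using map_type = Tpetra::Map<>;
using SC = double;
using LO = map_type::local_ordinal_type;
using GO = map_type::global_ordinal_type;
using NO = map_type::node_type;
using op_type = Tpetra::Operator<SC, LO, GO, NO>;

// Build a low-setup-cost AMG preconditioner for the continuity matrix A:
// plain (unsmoothed) aggregation with one sweep of symmetric Gauss-Seidel.
Teuchos::RCP<MueLu::TpetraOperator<SC, LO, GO, NO>>
buildContinuityPreconditioner(const Teuchos::RCP<op_type>& A)
{
  Teuchos::ParameterList p;
  p.set("verbosity", "low");
  p.set("multigrid algorithm", "unsmoothed");   // plain aggregation: cheaper setup than "sa"
  p.set("smoother: type", "RELAXATION");
  Teuchos::ParameterList& sp = p.sublist("smoother: params");
  sp.set("relaxation: type", "Symmetric Gauss-Seidel");
  sp.set("relaxation: sweeps", 1);
  p.set("coarse: max size", 1000);              // illustrative coarse-grid size cutoff
  return MueLu::CreateTpetraPreconditioner(A, p);
}

In a moving-mesh setting such a preconditioner would be rebuilt for the GMRES continuity solve at every time step, which is why trading some per-iteration solve efficiency for a cheaper setup pays off.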
The goal of the ExaWind project is to enable predictive simulations of wind farms composed of many megawatt-scale turbines situated in complex terrain. Predictive simulations will require computational fluid dynamics (CFD) simulations for which the mesh resolves the geometry of the turbines and captures the rotation and large deflections of blades. Whereas such simulations for a single turbine are arguably petascale class, multi-turbine wind farm simulations will require exascale-class resources. The primary physics codes in the ExaWind project are Nalu-Wind, an unstructured-grid solver for the acoustically incompressible Navier-Stokes equations, and OpenFAST, a whole-turbine simulation code. The Nalu-Wind model consists of the mass-continuity Poisson-type equation for pressure and a momentum equation for the velocity. For such modeling approaches, simulation times are dominated by linear-system setup and solution for the continuity and momentum systems. For the ExaWind challenge problem, the moving meshes greatly affect overall solver costs, as reinitialization of matrices and recomputation of preconditioners is required at every time step. This milestone represents the culmination of several parallel development activities towards the goal of establishing a full-physics simulation capability for modeling wind turbines operating in turbulent atmospheric inflow conditions. The demonstration simulation performed in this milestone is the first step towards the "ground truth" simulation and includes the following components: neutral atmospheric boundary layer inflow conditions generated using a precursor simulation; a hybrid RANS/LES simulation of the wall-resolved turbine geometry; hybridization of the turbulence equations using a blending-function approach to transition from the atmospheric scales to the blade boundary-layer scales near the turbine; and fluid-structure interaction (FSI) that accounts for the complete set of blade deformations (bending, twisting and pitch motion, yaw and tower displacements) by coupling to a comprehensive turbine dynamics code (OpenFAST). The use of an overset mesh methodology for the simulations in this milestone is a significant departure from previous efforts, in which a sliding-mesh approach was employed to model the rotation of the turbine blades. The choice of overset meshes was motivated by the need to handle arbitrarily large blade deformations, the need to allow blade pitching in the presence of a controller, and the ease of mesh generation compared to the sliding-mesh approach. FSI and the new time-step algorithm used in the simulations were developed in partnership with the A2e High-Fidelity Modeling project. The individual physics components were verified and validated (V&V) through extensive code-to-code comparisons and against experiments where possible. These detailed V&V efforts provide confidence in the final simulation, in which the physics models are combined, even though no detailed experimental data are available to validate the final configuration. Taken together, this milestone demonstrates the most advanced simulation performed to date with Nalu-Wind.
This is the official user guide for the MueLu multigrid library in Trilinos version 12.13 (Dev). This guide provides an overview of MueLu and its capabilities, along with instructions for new users who want to start using MueLu with a minimum of effort. Detailed information is given on how to drive MueLu through its XML interface, and links to more advanced use cases are provided. The guide also explains how to achieve good parallel performance and how to introduce new algorithms. Finally, readers will find a comprehensive listing of available MueLu options. Any options not documented in this manual should be considered strictly experimental.
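As an illustration of the XML-driven workflow described above, the C++ sketch below builds a Teuchos parameter list from an inline XML string. The option values are placeholders chosen for illustration rather than recommendations; consult the guide's option listing for authoritative settings.

#include <string>
#include <Teuchos_ParameterList.hpp>
#include <Teuchos_RCP.hpp>
#include <Teuchos_XMLParameterListCoreHelpers.hpp>

// Parse a small MueLu option set written in the Teuchos XML format.
Teuchos::RCP<Teuchos::ParameterList> loadExampleMueLuOptions()
{
  const std::string mueluXml =
      "<ParameterList name=\"MueLu\">\n"
      "  <Parameter name=\"verbosity\"        type=\"string\" value=\"low\"/>\n"
      "  <Parameter name=\"max levels\"       type=\"int\"    value=\"5\"/>\n"
      "  <Parameter name=\"coarse: max size\" type=\"int\"    value=\"1000\"/>\n"
      "  <Parameter name=\"smoother: type\"   type=\"string\" value=\"CHEBYSHEV\"/>\n"
      "</ParameterList>\n";
  // The same content could live in a file on disk and be read with
  // Teuchos::getParametersFromXmlFile("muelu.xml") (hypothetical file name).
  return Teuchos::getParametersFromXmlString(mueluXml);
}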
Journal of Computational and Applied Mathematics
This work explores the current performance and scaling of a fully-implicit stabilized unstructured finite element (FE) variational multiscale (VMS) capability for large-scale simulations of 3D incompressible resistive magnetohydrodynamics (MHD). The large-scale linear systems that are generated by a Newton nonlinear solver approach are iteratively solved by preconditioned Krylov subspace methods. The efficiency of this approach is critically dependent on the scalability and performance of the algebraic multigrid preconditioner. This study considers the performance of the numerical methods as recently implemented in the second-generation Trilinos implementation that is 64-bit compliant and is not limited by the 32-bit global identifiers of the original Epetra-based Trilinos. The study presents representative results for a Poisson problem on 1.6 million cores of an IBM Blue Gene/Q platform to demonstrate very large-scale parallel execution. Additionally, results for a more challenging steady-state MHD generator and a transient solution of a benchmark MHD turbulence calculation for the full resistive MHD system are also presented. These results are obtained on up to 131,000 cores of a Cray XC40 and one million cores of a BG/Q system.
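One concrete consequence of the 64-bit-compliant, Tpetra-based stack mentioned above is that global problem sizes beyond the roughly 2.1 billion-unknown limit of 32-bit identifiers can be indexed directly. The sketch below is an illustration under stated assumptions (the ordinal choices and problem size are hypothetical, and the corresponding Tpetra instantiation must be enabled in the build); it only constructs the distributed row map, after which matrix assembly and the AMG-preconditioned Krylov solve proceed as in the 32-bit case.

#include <Teuchos_RCP.hpp>
#include <Tpetra_Core.hpp>
#include <Tpetra_Map.hpp>

int main(int argc, char* argv[])
{
  Tpetra::ScopeGuard tpetraScope(&argc, &argv);  // initializes MPI and Kokkos
  {
    using GO = long long;                        // 64-bit global ordinal
    using map_type = Tpetra::Map<int, GO>;

    auto comm = Tpetra::getDefaultComm();

    // A contiguous, uniformly distributed row map with ~5e9 global rows,
    // which 32-bit global IDs could not represent.
    const Tpetra::global_size_t numGlobalRows = 5000000000ULL;
    const GO indexBase = 0;
    auto rowMap = Teuchos::rcp(new map_type(numGlobalRows, indexBase, comm));

    // CrsMatrix assembly and the preconditioned Krylov solve would follow here;
    // only the ordinal types differ from a 32-bit configuration.
  }
  return 0;
}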
The goal of the ExaWind project is to enable predictive simulations of wind farms composed of many MW-scale turbines situated in complex terrain. Predictive simulations will require computational fluid dynamics (CFD) simulations for which the mesh resolves the geometry of the turbines and captures the rotation and large deflections of blades. Whereas such simulations for a single turbine are arguably petascale class, multi-turbine wind farm simulations will require exascale-class resources. The primary code in the ExaWind project is Nalu, an unstructured-grid solver for the acoustically incompressible Navier-Stokes equations in which mass continuity is maintained through pressure projection. The model consists of the mass-continuity Poisson-type equation for pressure and a momentum equation for the velocity. For such modeling approaches, simulation times are dominated by linear-system setup and solution for the continuity and momentum systems. For the ExaWind challenge problem, the moving meshes greatly affect overall solver costs, as re-initialization of matrices and re-computation of preconditioners is required at every time step. In this milestone, we examine the effect of threading on solver-stack performance against flat-MPI results obtained from previous milestones using Haswell performance data from full-turbine simulations. Whereas the momentum equations are solved only with the Trilinos solvers, we investigate two algebraic-multigrid preconditioners for the continuity equations: Trilinos/MueLu and HYPRE/BoomerAMG. These two packages embody smoothed-aggregation and classical Ruge-Stüben AMG methods, respectively. In our FY18 Q2 report, we described our efforts to improve setup and solve of the continuity equations under flat-MPI parallelism. While significant improvement was demonstrated in the solve phase, setup times remained larger than expected. Starting with the optimized settings described in the Q2 report, we explore here simulation performance where OpenMP threading is employed in the solver stack. For Trilinos, threading is achieved through the Kokkos abstraction, whereas HYPRE/BoomerAMG employs straight OpenMP. We examined results for our mid-resolution baseline turbine simulation configuration (229M DOF). Simulations on 2048 Haswell cores explored the effect of decreasing the number of MPI ranks while increasing the number of threads. Both HYPRE and Trilinos exhibited similar overall solution times, and both showed dramatic increases in simulation time in the shift from MPI ranks to OpenMP threads. This increase is attributed to the large amount of work per MPI rank starting at the single-thread configuration. Decreasing MPI ranks while increasing threads may increase simulation time because thread synchronization and start-up overhead contribute to the latency and serial time in the model. These results showed that an MPI+OpenMP parallel decomposition will be more effective as the amount of computation per MPI rank decreases and the communication latency increases. This idea was demonstrated in a strong-scaling study of our low-resolution baseline model (29M DOF) with the Trilinos-HYPRE configuration. While MPI-only results showed scaling improvement out to about 1536 cores, engaging threading carried scaling improvements out to 4128 cores (roughly 7000 DOF per core). This is an important result, as improved strong scaling is needed for simulations to be executed over sufficiently long simulated durations (i.e., for many time steps).
In addition to the threading work described above, the team examined solver-performance improvements by exploring communication overhead in the HYPRE GMRES implementation through a communication-optimal GMRES algorithm (CO-GMRES), and by offloading compute-intensive solver actions to GPUs. To those ends, a HYPRE mini-app was developed to allow us to easily test different solver approaches and HYPRE parameter settings without running the entire Nalu code. With GPU acceleration on the Summitdev supercomputer, a 20x speedup was achieved for the overall preconditioner and solver execution time for the mini-app. A study on Haswell processors showed that CO-GMRES provides benefits as one increases MPI ranks.
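The Kokkos abstraction referenced above is what lets a single Trilinos kernel run either on OpenMP threads within an MPI rank or on a GPU, depending on the execution space chosen when Kokkos (and Trilinos) is configured. The sketch below is a generic, illustrative vector-update kernel, not code taken from Nalu or the solver stack.

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[])
{
  Kokkos::initialize(argc, argv);   // backend (OpenMP, CUDA, ...) fixed at build time
  {
    const int n = 1 << 20;
    Kokkos::View<double*> x("x", n), y("y", n);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);

    const double alpha = 0.5;
    // y = y + alpha * x, executed in parallel by the default execution space.
    Kokkos::parallel_for("axpy", n, KOKKOS_LAMBDA(const int i) {
      y(i) += alpha * x(i);
    });
    Kokkos::fence();
  }
  Kokkos::finalize();
  return 0;
}

In an MPI+Kokkos configuration, each MPI rank launches such kernels over its locally owned rows, which is the source of the rank-versus-thread trade-off studied in the milestone.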