Publications

Results 26–36 of 36
Skip to search filters

ExaWind: Exascale Predictive Wind Plant Flow Physics Modeling

Sprague, M.S.; Ananthan, S.A.; Brazell, M.x.; Glaws, A.G.; De Frahan, M.D.; King, R.K.; Natarajan, M.N.; Rood, J.R.; Sharma, A.L.; Sirydowicz, K.S.; S., Thomas S.; Vijaykumar, G.V.; Yellapantula, S.Y.; Crozier, Paul C.; Berger-Vergiat, Luc B.; Cheung, Lawrence C.; Glaze, D.J.; Hu, Jonathan J.; Knaus, Robert C.; Lee, Dong H.; Okusanya, Tolulope O.; Overfelt, James R.; Rajamanickam, Sivasankaran R.; Sakievich, Philip S.; Smith, Timothy A.; Vo, Johnathan V.; Williams, Alan B.; Yamazaki, Ichitaro Y.; Turner, J.H.; Prokopenko, A.P.; Wilson, R.W.; Moser, R.M.; Melvin, J.M.; Sitaraman, J S.

Abstract not provided.

Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization

Proceedings of PAW-ATM 2019: Parallel Applications Workshop, Alternatives to MPI+X, Held in conjunction with SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis

Pei, Yu; Bosilca, George; Yamazaki, Ichitaro Y.; Ida, Akihiro; Dongarra, Jack

To minimize data movement, many parallel ap-plications statically distribute computational tasks among the processes. However, modern simulations often encounters ir-regular computational tasks whose computational loads change dynamically at runtime or are data dependent. As a result, load imbalance among the processes at each step of simulation is a natural situation that must be dealt with at the programming level. The de facto parallel programming approach, flat MPI (one process per core), is hardly suitable to manage the lack of balance, imposing significant idle time on the simulation as processes have to wait for the slowest process at each step of simulation. One critical application for many domains is the LU factor-ization of a large dense matrix stored in the Block Low-Rank (BLR) format. Using the low-rank format can significantly reduce the cost of factorization in many scientific applications, including the boundary element analysis of electrostatic field. However, the partitioning of the matrix based on underlying geometry leads to different sizes of the matrix blocks whose numerical ranks change at each step of factorization, leading to the load imbalance among the processes at each step of factorization. We use BLR LU factorization as a test case to study the programmability and performance of five different programming approaches: (1) flat MPI, (2) Adaptive MPI (Charm++), (3) MPI + OpenMP, (4) parameterized task graph (PTG), and (5) dynamic task discovery (DTD). The last two versions use a task-based paradigm to express the algorithm; we rely on the PaRSEC run-time system to execute the tasks. We first point out programming features needed to efficiently solve this category of problems, hinting at possible alternatives to the MPI+X programming paradigm. We then evaluate the programmability of the different approaches, detailing our experience implementing the algorithm using each of the models. Finally, we show the performance result on the Intel Haswell-based Bridges system at the Pittsburgh Supercomputing Center (PSC) and analyze the effectiveness of the implementations to address the load imbalance.

More Details

Distributed-memory lattice H-matrix factorization

International Journal of High Performance Computing Applications

Yamazaki, Ichitaro Y.; Ida, Akihiro; Yokota, Rio; Dongarra, Jack

We parallelize the LU factorization of a hierarchical low-rank matrix (H-matrix) on a distributed-memory computer. This is much more difficult than the H-matrix-vector multiplication due to the dataflow of the factorization, and it is much harder than the parallelization of a dense matrix factorization due to the irregular hierarchical block structure of the matrix. Block low-rank (BLR) format gets rid of the hierarchy and simplifies the parallelization, often increasing concurrency. However, this comes at a price of losing the near-linear complexity of the H-matrix factorization. In this work, we propose to factorize the matrix using a “lattice H-matrix” format that generalizes the BLR format by storing each of the blocks (both diagonals and off-diagonals) in the H-matrix format. These blocks stored in the linear complexity of the-matrix format are referred to as lattices. Thus, this lattice format aims to combine the parallel scalability of BLR factorization with the near-linear complexity of linear complexity of the-matrix factorization. We first compare factorization performances using the L-matrix, BLR, and lattice H-matrix formats under various conditions on a shared-memory computer. Our performance results show that the lattice format has storage and computational complexities similar to those of the H-matrix format, and hence a much lower cost of factorization than BLR. We then compare the BLR and lattice (H-matrix factorization on distributed-memory computers. Our performance results demonstrate that compared with BLR, the lattice format with the lower cost of factorization may lead to faster factorization on the distributed-memory computer.

More Details

Increasing accuracy of iterative refinement in limited floating-point arithmetic on half-precision accelerators

2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Luszczek, Piotr; Yamazaki, Ichitaro Y.; Dongarra, Jack

The emergence of deep learning as a leading computational workload for machine learning tasks on large-scale cloud infrastructure installations has led to plethora of accelerator hardware releases. However, the reduced precision and range of the floating-point numbers on these new platforms makes it a non-trivial task to leverage these unprecedented advances in computational power for numerical linear algebra operations that come with a guarantee of robust error bounds. In order to address these concerns, we present a number of strategies that can be used to increase the accuracy of limited-precision iterative refinement. By limited precision, we mean 16-bit floating-point formats implemented in modern hardware accelerators and are not necessarily compliant with the IEEE half-precision specification. We include the explanation of a broader context and connections to established IEEE floating-point standards and existing high-performance computing (HPC) benchmarks. We also present a new formulation of LU factorization that we call signed square root LU which produces more numerically balanced L and U factors which directly address the problems of limited range of the low-precision storage formats. The experimental results indicate that it is possible to recover substantial amounts of the accuracy in the system solution that would otherwise be lost. Previously, this could only be achieved by using iterative refinement based on single-precision floating-point arithmetic. The discussion will also explore the numerical stability issues that are important for robust linear solvers on these new hardware platforms.

More Details
Results 26–36 of 36
Results 26–36 of 36