3.14. Teko Heuristic Multiphysics Preconditioner

Insert some useful notes here about challenge of multiphysics preconditioning.

3.14.1. Convergence of Block Gauss-Seidel

We start our investigation of general multiphysics preconditioning by considering the use of block Gauss-Seidel, as done by Cyr and coworkers [33]. With the blocked linear operator defined as

(3.193) $\Aop = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n_b} \\ A_{21} & A_{22} & \cdots & A_{2n_b} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n_b1} & A_{n_b2} & \cdots & A_{n_b n_b} \end{bmatrix},$

the corresponding block Gauss-Seidel preconditioner is denoted as

(3.194) $\Mop^{-1} = \begin{bmatrix} A_{11} & A_{12} & \cdots & A_{1n_b} \\ & A_{22} & \cdots & A_{2n_b} \\ & & \ddots & \vdots \\ & & & A_{n_b n_b} \end{bmatrix}^{-1}.$

To analyze the convergence rate of the block Gauss-Seidel preconditioner, let us consider the stationary iterative method

(3.195) $\underline{x}^{k+1} = \underline{x}^{k} + \Mop^{-1}\left(\underline{b} - \Aop\underline{x}^{k}\right).$

The error at each step of this scheme is

(3.196) $\underline{e}^{k+1} = \left(I-\Mop^{-1}\Aop\right)\underline{e}^{k}.$

Applying $\|\cdot\|_{\mtm}^2$ to both sides of (3.196), we have that

(3.197) $\|\underline{e}^{k+1}\|_{\mtm}^2 &= \|\left(I-\Mop^{-1}\Aop\right)\underline{e}^{k}\|_{\mtm}^2 \nonumber \\ &= \left(\underline{e}^{k}\right)^T \left(I-\Mop^{-1}\Aop\right)^T\mtm\left(I-\Mop^{-1}\Aop\right)\underline{e}^{k}\nonumber \\ &= \left(\underline{e}^{k}\right)^T \left(\Mop-\Aop\right)^T\left(\Mop-\Aop\right) \underline{e}^{k}\nonumber \\ &\leq \|\Mop-\Aop\|_{F}^2 \cdot \|\underline{e}^{k}\|_{2}^2.$

While the bound in (3.197) does not allow for direct comparison between $\underline{e}^{k}$ and $\underline{e}^{k+1}$ given the difference in norms, it nevertheless demonstrates that minimizing $\|\Mop-\Aop\|_{F}^2$ improves the convergence of the block Gauss-Seidel approach.

3.14.2. Linear Ordering Problem for Block Gauss-Seidel

We now consider the linear ordering $\mathcal R\left(\cdot\right)\mathcal C$ , a simultaneous permutation of rows and columns, from which to construct $\permMop$ that reduces the relative error reduction bound in (3.197) when replacing $\Aop$ with $\permAop$ and $\Mop$ with $\permMop$ . Note that the cost of applying $\permAop$ and $\permMop$ does not depend on the ordering. As such, the convergence of the block Gauss-Seidel preconditioner can be improved at no additional expense by optimizing the error bound in (3.197). Substituting the $\permAop$ and $\permMop$ into the right hand side of (3.197) leads to an optimization problem of the form: find block numbering permutation $\mathcal R^*\left(\cdot\right)\mathcal C^*$ such that

(3.198) $\left(\mathcal R^*,\ \mathcal C^*\right) = \operatorname{argmin}_{\forall \mathcal R,\ \mathcal C}\|\permMop-\permAop\|_F^2.$

By constructing a $(n_b,\ n_b)$ matrix $H$ with $H_{ij} = \|\Aop{}_{ij}\|_F^2$ , the discrete optimization problem in (3.198) is equivalent to the matrix triangulation problem: Determine a simultaneous permutation of rows and columns of a matrix that maximizes the sum of the superdiagonal entries [34]. The matrix triangulation problem is further equivalent to the linear ordering problem, which is NP-hard with worst-case $\Omega\left(n_b!\right)$ complexity.

While the linear ordering problem is NP-hard, this problem is well studied with good exact and heuristic methods (see [34] and references therein). We employ the exact method as formulated by Charon and Hudry [35], which uses a branch-and-bound approach with Lagrangian relaxation. We denote solving the linear ordering problem in (3.198) using this scheme as $\texttt{lop}$ .

3.14.3. Block Merging Problem for Block Gauss-Seidel

While the optimization problem in (3.198) can be used to determine the optimal ordering, the convergence of the block Gauss-Seidel scheme may be hampered when two physics are strongly coupled in both directions. In this way, it helps to treat such blocks monolithically, rather than relying on the one-way coupling encapsulated in the block Gauss-Seidel preconditioner. For example,

(3.199) $\underbrace{\begin{bmatrix} \begin{bmatrix} A & B \\ D & E \end{bmatrix} & \begin{bmatrix} C \\ F \end{bmatrix} \\ & J \end{bmatrix}^{-1}}_{\Mprimeop^{-1}} \underbrace{\begin{bmatrix} \begin{bmatrix} A & B \\ D & E \end{bmatrix} & \begin{bmatrix} C \\ F \end{bmatrix} \\ \begin{bmatrix} G & H \end{bmatrix} & J \end{bmatrix}}_{\Aop^{\prime}}$

may be much better conditioned than

(3.200) $\underbrace{\begin{bmatrix} A & B & C \\ & E & F \\ & & J \end{bmatrix}^{-1}}_{\Mop^{-1}} \underbrace{\begin{bmatrix} A & B & C \\ D & E & F \\ G & H & J \end{bmatrix}}_{\Aop},$

assuming that the off-diagonal physics coupling $B$ and $E$ are both large. We denote that a block merging operation is one that treats a two by two sub-block system as a single monolithic sub-block. Similar to the block reordering problem, an optimization problem based on the error bound in (3.197) may be posed: find the minimum number of block merging operations such that

(3.201) $\|\Mprimeop-\Aop^{\prime}\|_F \leq \tau,$

where $\tau$ represents a user-defined target error reduction in (3.197). While the optimization problem posed in (3.201) does not enjoy the same parallel to the linear ordering problem as does the optimization problem in (3.198), we may nevertheless employ a greedy block merging approach that utilizes the linear ordering problem solver. This is demonstrated in Alg. 3.7, which relies on the linear ordering problem solver $\texttt{lop}$ as described in Linear Ordering Problem for Block Gauss-Seidel.

Alg. 3.7 Greedy block merging algorithm

$\begin{algorithmic}[1] \REQUIRE Matrix $\mathcal A$ with $n_b$ blocks, user-provided threshold $\tau$. \ENSURE Re-grouped matrix $\mathcal A^{\prime}$ with $n_b^{\prime} < n_b$ blocks. \STATE $\mathcal A^{(0)} \gets \texttt{lop}(\mathcal A)$ \STATE $k\gets 0$ \WHILE{$\|\tilde{\mathcal M}\left(\mathcal A^{(k)}\right)-\mathcal A^{(k)}\|_F \geq \tau$} \STATE $k\gets k+1$ \STATE $(i^*, j^*) \gets \operatorname{argmax} \limits_{i,j, i > j} \|\mathcal A^{(k-1)}_{ij}\|_F$ \STATE $\mathcal A^{(k)} \gets \texttt{lop}(\texttt{combineBlocks}(\mathcal A^{(k-1)}, i^*, j^*))$ \ENDWHILE \RETURN $\mathcal A^{(k)}$ \end{algorithmic}$

3.14.4. Choosing Sub-block Solvers for Block Gauss-Seidel

The block Gauss-Seidel preconditioner defined in (3.194) assumes that the sub-block inverses required along the diagonal are exact. While the sub-block solvers required are now restricted to relatively easy problems involving one or a few multi-physics interactions, the performance of the block preconditioning strategy is highly dependent on being able to choose good sub-block solvers. We consider the impact of replacing the exact sub-block solvers in the block Gauss-Seidel operator defined in (3.194) with approximate sub-block solvers,

(3.202) $\bgsop^{-1} = \begin{bmatrix} M_{11} & A_{12} & \cdots & A_{1n_b} \\ & M_{22} & \cdots & A_{2n_b} \\ & & \ddots & \vdots \\ & & & M_{n_b n_b} \end{bmatrix}^{-1}$

where each of the $A_{ii}$ entries from (3.194) are replaced with $M_{ii}$ . The effect in (3.202) is that applying $\bgsop^{-1}$ now requires applying $M_{ii}^{-1}$ for each diagonal sub-block, where $M_{ii}^{-1}$ represents a preconditioner or solver approximating the action of $A_{ii}^{-1}$ . A natural question follows: how does the convergence of $\bgsop^{-1}$ compare to $\Mop^{-1}$ ? This question is addressed in (3.203).

(3.203) $\kappa{\left(\mathcal M^{-1}\left(\mathcal A\right)\mathcal A\right)} &\leq \kappa{\left(\mathcal M^{-1}\left(\mathcal A\right) \mathcal{\tilde{M}}\left(\mathcal A\right)\right)} \cdot \kappa{\left(\mathcal{\tilde{M}}^{-1}\left(\mathcal A\right)\mathcal A\right)} \nonumber \\ &\geq \underbrace{ \max_{\forall M_{ii}^{-1}A_{ii}}\left(\kappa{\left(M_{ii}^{-1}A_{ii}\right)}\right) }_{ \text{Sub-block Solver Conditioning} } \cdot \underbrace{ \kappa{\left(\mathcal{\tilde{M}}^{-1}\left(\mathcal A\right)\mathcal A\right)} }_{ \text{Multi-physics Coupling} }$

As a consequence of (3.203), a single resistant sub-block solver can derail the convergence progress of the entire solver. For both robustness and user ease-of-use, we propose using an adaptive sub-block solver approach that monitors the relative residual reduction at each preconditioner application to inform the choice of sub-block preconditioner or solver. The adaptive sub-block preconditioner or solver transitions to a more robust preconditioner or solver if the relative residual reduction becomes larger than some user-specified tolerance $\varepsilon$ ,

(3.204) $\dfrac{\|\underline{b}_{i} - A_{ii}M_{ii}^{-1}\underline{b}_{i}\|_2}{\|\underline{b}_{i}\|_2} > \varepsilon.$

The condition in (3.204) is checked with every sub-block preconditioner/solver evaluation to check that a given sub-block is solved to within the user-specified tolerance. If the condition is true, then the sub-block solver switches to the next solver in a user-specified schedule of sub-block solvers to evaluate. This process continues until a convergent sub-block solver is found, with each successive sub-block solver being more robust, albeit more expensive, than the last. For example, the we employ the following schedule for adaptively solving sub-blocks on CPU architectures:

GMRES(50) preconditioned with Jacobi
GMRES(30) preconditioned with DD(0)-ILU(0)
GMRES(30) preconditioned with DD(1)-ILU(1)
GMRES(30) preconditioned with DD(2)-ILU(2)
Sparse direct solver

On GPU architectures where the relative expense of constructing the DD-ILU based solvers is greater, we allow for additional GMRES iterations:

GMRES(100) preconditioned with Jacobi
GMRES(50) preconditioned with DD(0)-ILU(0)
GMRES(50) preconditioned with DD(1)-ILU(1)
GMRES(50) preconditioned with DD(2)-ILU(2)
Sparse direct solver

This ensures that sub-blocks that are relatively easy to precondition may be done so with an inexpensive preconditioned GMRES solve using Jacobi or DD(0)-ILU(0). At the same time, sub-blocks requiring a more robust solver can utilize a more robust preconditioned GMRES iteration as a high-quality approximate to the sub-block inverse. These two schedules work well for a wide range of sub-block solves encountered. However, they are sub-optimal for sub-blocks associated with the continuity equation. For these cases, we employ the following solver schedule on CPU architectures:

GMRES(30) preconditioned with MueLu
GMRES(30) preconditioned with DD(0)-ILU(0)
GMRES(30) preconditioned with DD(1)-ILU(1)
GMRES(30) preconditioned with DD(2)-ILU(2)

and on GPU architectures:

GMRES(50) preconditioned with MueLu
GMRES(50) preconditioned with DD(0)-ILU(0)
GMRES(50) preconditioned with DD(1)-ILU(1)
GMRES(50) preconditioned with DD(2)-ILU(2)

We note that the specific adaptive sub-block solver settings are subject to change.