
Openmp threaded linsolver #7194

Draft
hnil wants to merge 7 commits into OPM:master from hnil:openmp-threaded-linsolver


@hnil hnil commented Jun 22, 2026

Member

Add the infrastructure needed to make OpenMP as fast as MPI. The missing piece is AMG, but with the patch to AMGCL it is close. What then dominates the remaining difference is the pre/post step.

hnil and others added 7 commits June 19, 2026 09:34
Replace the sequential A.mv / A.usmv (and the hand-rolled interior-row
loops in GhostLastMatrixAdapter and WellModelGhostLastMatrixAdapter) with
an index-based loop over output rows carrying `#pragma omp parallel for`.

Each output row y[i] is written by exactly one thread and the matrix is
read-only, so the per-row reduction order is unchanged and the result is
bit-identical to the sequential apply. Falls back to the serial loop when
_OPENMP is not defined. This is the highest value-per-line change for the
single-node (pure-OpenMP) target and leaves the MPI (GhostLast*) path
functionally intact.
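
A minimal sketch of the row-parallel apply described above, assuming a plain
scalar CSR layout rather than the Dune block matrix used in the branch. The
names `CsrMatrix` and `threadedApply` are hypothetical and exist only for this
illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR matrix used only for this illustration (not an OPM or Dune type).
struct CsrMatrix {
    std::vector<std::size_t> rowPtr;  // size rows + 1
    std::vector<std::size_t> col;     // column index of each nonzero
    std::vector<double> val;          // value of each nonzero
    std::size_t rows() const { return rowPtr.size() - 1; }
};

// y = A * x, parallelised over output rows.
void threadedApply(const CsrMatrix& A, const std::vector<double>& x,
                   std::vector<double>& y)
{
    assert(y.size() == A.rows());
    const long n = static_cast<long>(A.rows());
    // Each iteration writes only y[i], so no two threads touch the same entry.
    // When _OPENMP is not defined the pragma is compiled out and the loop is serial.
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (long i = 0; i < n; ++i) {
        double sum = 0.0;
        // The per-row accumulation order is the same as in the serial loop.
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
            sum += A.val[k] * x[A.col[k]];
        y[i] = sum;
    }
}
```

Because the inner loop over a row is never split across threads, the sum for
each row is formed in the same order at any thread count.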

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add ThreadedScalarProduct.hpp (ThreadedSeqScalarProduct), an OpenMP dot/
norm with a reduction, gated behind a block-count threshold (50k) so it is
non-harmful on small systems where fork/join would dominate. Wire it into
FlexibleSolver as the sequential scalar product in place of
Dune::SeqScalarProduct.
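
A minimal sketch of a threshold-gated OpenMP dot product, assuming flat
`double` vectors instead of Dune block vectors. `thresholdedDot` and
`kParallelThreshold` are hypothetical names; only the 50k figure comes from
this commit, and here it counts scalar entries rather than blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative cut-off; the commit gates on a block count of 50k.
constexpr std::size_t kParallelThreshold = 50000;

// Dot product that only forks threads when the vectors are large enough.
double thresholdedDot(const std::vector<double>& x, const std::vector<double>& y,
                      std::size_t threshold = kParallelThreshold)
{
    assert(x.size() == y.size());
    (void)threshold;  // unused in builds without OpenMP
    const long n = static_cast<long>(x.size());
    double sum = 0.0;
    // The if clause keeps the loop on the calling thread below the threshold,
    // which avoids paying fork/join cost on small systems.
#ifdef _OPENMP
#pragma omp parallel for reduction(+ : sum) schedule(static) if (x.size() >= threshold)
#endif
    for (long i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}
```

Unlike the row-parallel apply, an OpenMP reduction combines per-thread partial
sums, so the floating-point result can differ in the last bits from the serial
sum and between thread counts.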

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add DILU2.hpp (MultithreadDILU2): a multicolor variant of the OpenMP DILU
preconditioner. The stock MultithreadDILU uses a level-set (wavefront)
schedule whose many thin levels produce one barrier per level (~120 per
apply on a typical grid), which does not scale and regresses past a few
threads. DILU2 instead reorders the unknowns by a graph coloring of the
sparsity pattern: rows of one color are independent, so the triangular
solves need only `#colors` parallel sweeps (a handful for grid graphs).
Apply scaling reaches ~3.7x at 8 threads where wavefront DILU regresses.

Convergence per iteration differs slightly (the factorization is on the
color-permuted matrix), but the coloring -- and therefore the iteration
count -- is fixed regardless of thread count, so results are reproducible.
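
A minimal sketch of the multicolor idea, assuming a simple greedy coloring of
an adjacency list. The coloring algorithm actually used by DILU2 is not stated
in this commit, and `greedyColoring` and `rowsByColor` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Greedy coloring of an undirected graph given as an adjacency list.
// Returns one color per row; neighbouring rows never share a color.
std::vector<int> greedyColoring(const std::vector<std::vector<std::size_t>>& adj)
{
    const std::size_t n = adj.size();
    std::vector<int> color(n, -1);
    std::vector<char> used;  // used[c] != 0 if color c is taken by a neighbour
    for (std::size_t i = 0; i < n; ++i) {
        // A row with d neighbours always finds a free color in [0, d].
        used.assign(adj[i].size() + 1, 0);
        for (std::size_t j : adj[i]) {
            assert(j < n);
            if (color[j] >= 0 && static_cast<std::size_t>(color[j]) < used.size())
                used[color[j]] = 1;
        }
        int c = 0;
        while (used[c]) ++c;
        color[i] = c;
    }
    return color;
}

// Group row indices by color. A sweep would visit the groups one after
// another and process the rows inside each group in parallel.
std::vector<std::vector<std::size_t>> rowsByColor(const std::vector<int>& color)
{
    int numColors = 0;
    for (int c : color) numColors = std::max(numColors, c + 1);
    std::vector<std::vector<std::size_t>> groups(numColors);
    for (std::size_t i = 0; i < color.size(); ++i)
        groups[color[i]].push_back(i);
    return groups;
}
```

The coloring in this sketch is computed serially, so it does not depend on the
number of threads; a 1D chain of cells, for example, needs two colors.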

Register the "dilu2" creator in the serial and MPI preconditioner
factories and accept --linear-solver=dilu2 in setupPropertyTree.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add AmgclPreconditioner.hpp, a Dune PreconditionerWithUpdate wrapping
AMGCL's smoothed-aggregation AMG on the builtin OpenMP backend, for use as
the scalar (1x1) pressure-stage solver in CPR. AMGCL's shared-memory
backend threads the V-cycle far better than Hypre's OpenMP path on a single
node, which makes a fully threaded OpenMP CPR possible (Dune-AMG's
MPI-specific code blocks the OpenMP route).

update() uses a fast numeric-only re-setup (cached Galerkin, reusing the
transfer operators) and reports hasPerfectUpdate()==false so the CPR reuse
interval periodically refreshes the aggregation.
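
A minimal sketch of what a numeric-only re-setup can look like, assuming plain
(unsmoothed) aggregation with a fixed aggregate map. This is simpler than
AMGCL's smoothed aggregation and is not the patched rebuild() code; `CsrMatrix`
and `galerkinCoarse` are hypothetical names. The point is only that the
aggregate map (the transfer operator) is kept while the coarse values are
recomputed from the new fine matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal CSR matrix used only for this illustration (not an AMGCL type).
struct CsrMatrix {
    std::vector<std::size_t> rowPtr;  // size rows + 1
    std::vector<std::size_t> col;     // column index of each nonzero
    std::vector<double> val;          // value of each nonzero
    std::size_t rows() const { return rowPtr.size() - 1; }
};

// Coarse operator A_c = P^T A P for piecewise-constant P, where agg[i] is the
// aggregate that fine row i belongs to. Returned dense, row-major, for brevity.
std::vector<double> galerkinCoarse(const CsrMatrix& A,
                                   const std::vector<std::size_t>& agg,
                                   std::size_t nCoarse)
{
    assert(agg.size() == A.rows());
    std::vector<double> Ac(nCoarse * nCoarse, 0.0);
    for (std::size_t i = 0; i < A.rows(); ++i) {
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            assert(agg[i] < nCoarse && agg[A.col[k]] < nCoarse);
            // Every fine entry is added to the entry of its aggregate pair.
            Ac[agg[i] * nCoarse + agg[A.col[k]]] += A.val[k];
        }
    }
    return Ac;
}
```

Calling `galerkinCoarse` again with the same `agg` after the values of `A`
change refreshes the coarse operator without redoing the aggregation. Since
the aggregates were chosen for an older matrix, such an update is not
equivalent to a full setup.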

Registered as the "amgcl" creator for 1x1 systems in both the serial
factory and the MPI factory (per-rank, wrapped as a restricted-additive-
Schwarz block preconditioner).

AMGCL is header-only and optional: CMake enables it (HAVE_AMGCL) only when
-DAMGCL_ROOT points at an AMGCL clone, so builds without it are unaffected.

Note: the fast numeric re-setup relies on a rebuild() entry point added in
a small patch to AMGCL's amg.hpp; that patch lives in the AMGCL clone
(AMGCL_ROOT) and should be upstreamed or carried as a tracked patch
alongside this branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a developer note covering the threaded SpMV / scalar product, the
dilu vs dilu2 smoother trade-off, building with AMGCL (-DAMGCL_ROOT,
HAVE_AMGCL) for the "amgcl" CPR pressure stage, and single-node
performance caveats.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The AMGCL CPR pressure stage (AmgclPreconditioner) relies on a fast
numeric-only Galerkin re-setup hooked into AMGCL's existing rebuild()
path. Carry that change as a tracked patch against upstream AMGCL
(amgcl/amg.hpp) plus a README describing what it does and how to apply it,
rather than vendoring a forked AMGCL. Building without AMGCL is unaffected;
this is only needed to build the "amgcl" preconditioner.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…DILU

AmgclPreconditioner now forwards strong_threshold to coarsening.eps_strong
for the ruge_stuben (classical AMG) coarsening, matching how it already
forwards to aggregation coarsenings. This lets the classical-AMG strength
threshold be tuned toward Hypre BoomerAMG's typical 0.5 (AMGCL default is 0.25).

Also refresh patches/amgcl-numeric-galerkin-rebuild.patch to additionally
carry an experimental multicolor DILU relaxation (relaxation/dilu.hpp +
runtime wiring). NOTE: DILU is experimental — it is unstable on AMGCL's
smoothed_aggregation/ruge_stuben coarse operators (use spai0); kept for
completeness and possible future well-conditioned aggregations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@hnil hnil marked this pull request as draft June 22, 2026 12:44
@hnil hnil added the manual:irrelevant This PR is a minor fix and should not appear in the manual label Jun 23, 2026