OpenMP threaded linsolver #7194
Draft
hnil wants to merge 7 commits into
Replace the sequential A.mv / A.usmv (and the hand-rolled interior-row loops in GhostLastMatrixAdapter and WellModelGhostLastMatrixAdapter) with an index-based loop over output rows carrying `#pragma omp parallel for`. Each output row y[i] is written by exactly one thread and the matrix is read-only, so the per-row reduction order is unchanged and the result is bit-identical to the sequential apply. Falls back to the serial loop when _OPENMP is not defined. This is the highest value/line change for the single-node (pure-OpenMP) target and leaves the MPI (GhostLast*) path functionally intact. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
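A minimal sketch of the row-parallel apply described above, not the code from this PR. It assumes a plain scalar CSR matrix (the hypothetical `CsrMatrix` struct and `threadedMv` function below) in place of the Dune block matrix and the GhostLast adapters. Each `y[i]` is accumulated by one thread in the same column order as the serial loop, which is why the result matches the sequential apply.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal CSR matrix; a stand-in for the Dune matrix used in the PR.
struct CsrMatrix {
    std::vector<std::size_t> rowPtr; // size rows() + 1
    std::vector<std::size_t> col;    // column index of each stored entry
    std::vector<double> val;         // value of each stored entry
    std::size_t rows() const { return rowPtr.size() - 1; }
};

// y = A * x, parallel over output rows.
// The matrix and x are read-only; each y[i] is written by exactly one thread.
inline void threadedMv(const CsrMatrix& A, const std::vector<double>& x, std::vector<double>& y)
{
    assert(!A.rowPtr.empty());
    const long n = static_cast<long>(A.rows());
    y.assign(static_cast<std::size_t>(n), 0.0);
#ifdef _OPENMP
#pragma omp parallel for schedule(static)
#endif
    for (long i = 0; i < n; ++i) {
        // Per-row accumulation in stored-entry order, identical to the serial loop.
        double sum = 0.0;
        for (std::size_t k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            sum += A.val[k] * x[A.col[k]];
        }
        y[i] = sum;
    }
}
```

Without `_OPENMP` the pragma is compiled out and the same loop runs serially, mirroring the fallback mentioned in the commit message.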
Add ThreadedScalarProduct.hpp (ThreadedSeqScalarProduct), an OpenMP dot/norm with a reduction, gated behind a block-count threshold (50k) so it is non-harmful on small systems where fork/join would dominate. Wire it into FlexibleSolver as the sequential scalar product in place of Dune::SeqScalarProduct. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
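A rough sketch of a threshold-gated OpenMP dot product along the lines described above, not the ThreadedSeqScalarProduct implementation. It assumes plain `std::vector<double>` inputs; the names `kThreadingThreshold` and `thresholdedDot` are hypothetical, and the threshold here counts scalar entries, whereas the commit message counts blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed threshold; 50k is the value quoted in the commit message (there, in blocks).
constexpr long kThreadingThreshold = 50000;

// Dot product with an OpenMP reduction that only goes parallel for large vectors.
inline double thresholdedDot(const std::vector<double>& a, const std::vector<double>& b)
{
    assert(a.size() == b.size());
    const long n = static_cast<long>(a.size());
    double sum = 0.0;
#ifdef _OPENMP
// The if clause keeps small systems on a single thread to avoid fork/join overhead.
#pragma omp parallel for reduction(+ : sum) schedule(static) if (n >= kThreadingThreshold)
#endif
    for (long i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Unlike the row-parallel matrix apply, a reduction combines per-thread partial sums, so for general floating-point data the rounding can differ slightly from a purely sequential sum.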
Add DILU2.hpp (MultithreadDILU2): a multicolor variant of the OpenMP DILU preconditioner. The stock MultithreadDILU uses a level-set (wavefront) schedule whose many thin levels produce one barrier per level (~120 per apply on a typical grid), which does not scale and regresses past a few threads. DILU2 instead reorders the unknowns by a graph coloring of the sparsity pattern: rows of one color are independent, so the triangular solves need only `#colors` parallel sweeps (a handful for grid graphs). Apply scaling reaches ~3.7x at 8 threads where wavefront DILU regresses. Convergence per iteration differs slightly (the factorization is on the color-permuted matrix), but the coloring -- and therefore the iteration count -- is fixed regardless of thread count, so results are reproducible. Register the "dilu2" creator in the serial and MPI preconditioner factories and accept --linear-solver=dilu2 in setupPropertyTree. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
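To illustrate the multicolor idea, here is a minimal greedy coloring of a sparsity graph, assuming the pattern is given as symmetric adjacency lists. This is a sketch only: `Graph`, `greedyColoring` and `rowsByColor` are hypothetical names, and the coloring heuristic actually used in DILU2.hpp may differ. Rows that share a color have no couplings to each other, so a triangular sweep can process one color group at a time with a single parallel loop per color.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adjacency lists of the (symmetric) sparsity pattern, without self-loops.
using Graph = std::vector<std::vector<std::size_t>>;

// Greedy coloring: each row takes the smallest color not used by an already colored neighbor.
// The visiting order is fixed, so the result does not depend on the thread count.
inline std::vector<int> greedyColoring(const Graph& adj)
{
    const std::size_t n = adj.size();
    std::vector<int> color(n, -1);
    std::vector<char> used;
    for (std::size_t i = 0; i < n; ++i) {
        // A row with d neighbors never needs more than d + 1 colors.
        used.assign(adj[i].size() + 1, 0);
        for (std::size_t j : adj[i]) {
            assert(j < n);
            if (color[j] >= 0 && static_cast<std::size_t>(color[j]) < used.size()) {
                used[static_cast<std::size_t>(color[j])] = 1;
            }
        }
        int c = 0;
        while (used[static_cast<std::size_t>(c)]) {
            ++c;
        }
        color[i] = c;
    }
    return color;
}

// Group row indices by color; each group can be swept in parallel.
inline std::vector<std::vector<std::size_t>> rowsByColor(const std::vector<int>& color)
{
    int ncolors = 0;
    for (int c : color) {
        if (c + 1 > ncolors) {
            ncolors = c + 1;
        }
    }
    std::vector<std::vector<std::size_t>> groups(static_cast<std::size_t>(ncolors));
    for (std::size_t i = 0; i < color.size(); ++i) {
        groups[static_cast<std::size_t>(color[i])].push_back(i);
    }
    return groups;
}
```

For a 1D chain of four unknowns this yields two colors, so a solve needs two parallel sweeps instead of one barrier per wavefront level.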
Add AmgclPreconditioner.hpp, a Dune PreconditionerWithUpdate wrapping AMGCL's smoothed-aggregation AMG on the builtin OpenMP backend, for use as the scalar (1x1) pressure-stage solver in CPR. AMGCL's shared-memory backend threads the V-cycle far better than Hypre's OpenMP path on a single node, which makes a fully threaded OpenMP CPR possible (Dune-AMG's MPI-specific code blocks the OpenMP route). update() uses a fast numeric-only re-setup (cached Galerkin, reusing the transfer operators) and reports hasPerfectUpdate()==false so the CPR reuse interval periodically refreshes the aggregation. Registered as the "amgcl" creator for 1x1 systems in both the serial factory and the MPI factory (per-rank, wrapped as a restricted-additive-Schwarz block preconditioner). AMGCL is header-only and optional: CMake enables it (HAVE_AMGCL) only when -DAMGCL_ROOT points at an AMGCL clone, so builds without it are unaffected. Note: the fast numeric re-setup relies on a rebuild() entry point added in a small patch to AMGCL's amg.hpp; that patch lives in the AMGCL clone (AMGCL_ROOT) and should be upstreamed or carried as a tracked patch alongside this branch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add a developer note covering the threaded SpMV / scalar product, the dilu vs dilu2 smoother trade-off, building with AMGCL (-DAMGCL_ROOT, HAVE_AMGCL) for the "amgcl" CPR pressure stage, and single-node performance caveats. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The AMGCL CPR pressure stage (AmgclPreconditioner) relies on a fast numeric-only Galerkin re-setup hooked into AMGCL's existing rebuild() path. Carry that change as a tracked patch against upstream AMGCL (amgcl/amg.hpp) plus a README describing what it does and how to apply it, rather than vendoring a forked AMGCL. Building without AMGCL is unaffected; this is only needed to build the "amgcl" preconditioner. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…DILU AmgclPreconditioner now forwards strong_threshold to coarsening.eps_strong for the ruge_stuben (classical AMG) coarsening, matching how it already forwards to aggregation coarsenings. Lets the classical-AMG strength be tuned toward Hypre BoomerAMG's typical 0.5 (AMGCL default is 0.25). Also refresh patches/amgcl-numeric-galerkin-rebuild.patch to additionally carry an experimental multicolor DILU relaxation (relaxation/dilu.hpp + runtime wiring). NOTE: DILU is experimental — it is unstable on AMGCL's smoothed_aggregation/ruge_stuben coarse operators (use spai0); kept for completeness and possible future well-conditioned aggregations. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add all the infrastructure needed to make OpenMP as fast as MPI. The missing piece is AMG, but with the patch to AMGCL it is close. After that, it is the post/pre part that dominates the difference.