@hzhou hzhou commented Nov 12, 2025

Pull Request Description

We will have two csel selection interfaces. Using MPI_Bcast as an example:

  • MPIR_Bcast_impl - called from the binding, i.e. MPI_Bcast. It selects from the main JSON tree.

  • MPIR_Bcast_auto(..., MPIR_CSEL_TYPE__AUTO) - called from compositional algorithms. The last csel_type parameter can further specify which json subtree to use.

  • There is also MPIR_Bcast_fallback - this is used by MPICH internally for stability. _fallback algorithms are not subject to json tuning.

In principle, MPIR_CSEL_TYPE__AUTO subtrees should not select another compositional algorithm, which ensures we won't get into recursive dead-loops. But we can allow some exceptions. For example, it is generally okay to select an nb algorithm: it transitions from "blocking" to "nonblocking" and thus won't become recursive.
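The call structure described above can be sketched with a toy model. This is only an illustration under stated assumptions: every `toy_` name, the argument lists, and the hard-coded selection rules are hypothetical and are not the MPICH definitions.

```c
#include <stddef.h>

/* Hypothetical stand-in for the csel_type parameter. */
enum toy_csel_type { TOY_CSEL_TYPE_MAIN, TOY_CSEL_TYPE_AUTO };

static int toy_basic_calls = 0; /* how often a basic algorithm ran */
static int toy_comp_calls = 0;  /* how often a compositional algorithm ran */

static int toy_bcast_auto(int *buf, size_t count, enum toy_csel_type type);

/* A "basic" algorithm: does the work itself, calls no other collective. */
static int toy_bcast_basic(int *buf, size_t count)
{
    (void) buf;
    (void) count;
    toy_basic_calls++;
    return 0;
}

/* A "compositional" algorithm: re-enters selection, but only via the
 * AUTO subtree, which (in this toy) never selects a composition. */
static int toy_bcast_composition(int *buf, size_t count)
{
    toy_comp_calls++;
    return toy_bcast_auto(buf, count, TOY_CSEL_TYPE_AUTO);
}

/* Toy selection: the main tree picks the composition, the AUTO
 * subtree picks a basic algorithm. Real selection walks a JSON tree. */
static int toy_bcast_auto(int *buf, size_t count, enum toy_csel_type type)
{
    if (type == TOY_CSEL_TYPE_MAIN)
        return toy_bcast_composition(buf, count);
    return toy_bcast_basic(buf, count);
}

/* Entry point used by the binding layer in this toy. */
static int toy_bcast_impl(int *buf, size_t count)
{
    return toy_bcast_auto(buf, count, TOY_CSEL_TYPE_MAIN);
}

/* Internal-use entry point: bypasses selection entirely. */
static int toy_bcast_fallback(int *buf, size_t count)
{
    return toy_bcast_basic(buf, count);
}
```

Because the composition only ever asks for the AUTO subtree, the chain impl, composition, basic terminates after one level of nesting.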

[skip warnings]

Author Checklist

  • Provide Description
    Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
  • Commits Follow Good Practice
    Commits are self-contained and do not do two things at once.
    Commit message is of the form: module: short description
    Commit message explains what's in the commit.
  • Passes All Tests
    Whitespace checker. Warnings test. Additional tests via comments.
  • Contribution Agreement
    For non-Argonne authors, check contribution agreement.
    If necessary, request an explicit comment from your company's PR approval manager.

hzhou added 30 commits November 13, 2025 11:13
We will use a single-level JSON for algorithm selection including
device-specific algorithms. Remove the collective ADI for now. We'll add
the mechanism of selecting device-level algorithms later.

gen_coll.py is updated to skip calling MPID_ collectives.

Device collective CVARs are removed.
Do not hide the script. Move it to maint/ like the rest of the autogen
scripts.
We will add the mechanism of selecting device-layer algorithms later.
Temporarily comment out the composition code that calls netmod/shm
collectives since we will remove these apis next.

Some NULL composition functions are removed.
We will replace the device-algorithm selection later at the MPIR layer.
ipc_p2p.h references MPIDI_POSIX_am_eager_limit, which is defined in
shm_am.h.

Not sure how it worked before.
The fallback collectives (e.g. MPIR_Bcast_fallback) are manual "auto"
functions that may not be the best algorithms for the system, but
are sufficient for internal use during init and object
construction.
Use MPIR_Barrier_fallback instead of MPIR_Bcast_allcomm_auto (doesn't
exist now).
The auto selection should take care of restrictions. Error rather than
fallback.

If the user uses a CVAR to select a specific algorithm, we should check
restrictions before jumping to the algorithm. We will design a common
fallback handling there.
Eventually we will make nonblocking compositional algorithms work by
having the JSON tree check the sched framework types. For now, remove
the json search and just use fallbacks.
For consistency, call MPIR collectives (e.g. MPIR_Bcast) in
compositional algorithms.

TODO: rewrite compositional algorithms using coll_sig and container.
Rename coll_info to coll_sig for type MPIR_Csel_coll_sig_s. This is
the convention other than in csel.c. Let's make it consistent.
Pass coll_sig by pointer rather than by copy. The structure can be big
and passing by pointer avoids the extra copy. It also allows coll_sig to
serve as a persistent state throughout the collective selection chain.
MPIR_Csel_coll_sig_s contains input arguments to an MPI collective.
MPII_Csel_container_s contains extra parameters to an MPI algorithm.
We define a unified collective algo function interface using both
MPIR_Csel_coll_sig_s and MPII_Csel_container_s. Both structures will
need MPID extensions to support MPID-specific algorithms. Both are defined
in the same header for easier management.
Two main JSON tuning files. "coll_composition.json" selects
compositional algorithms, and "coll_selection.json" selects
basic algorithms. A basic algorithm does not call another collective.
Add the two auto functions that execute the CSEL search.
A universal nb algorithm for blocking collectives.
They are replaced by MPIR_Coll_nb.
Define an abstract collective algorithm function interface that uses
MPII_Csel_container_s and MPIR_Csel_coll_sig_s. Both structures will have
a mechanism for the device layer to extend them with its own fields.

All collective algorithms will be populated in a global
MPIR_Coll_algo_table. The device layer can fill in its device-specific
entries in MPID_Init.

MPIR_Coll_auto and MPIR_Coll_composition_auto serve as auto collective
functions that run the Csel search and then call the selected algorithm by
looking up its entry in MPIR_Coll_algo_table.
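A minimal sketch of the table-driven dispatch described here, assuming simplified stand-ins for the two structures. All `toy_` names, the struct fields, and the size threshold used as the "search" are hypothetical; the actual MPIR_Csel_coll_sig_s, MPII_Csel_container_s, and MPIR_Coll_algo_table layouts are not shown in this PR text.

```c
#include <stddef.h>

/* Toy stand-in for the coll_sig: the collective's input arguments. */
struct toy_coll_sig {
    int coll_type;
    void *buf;
    size_t count;
    int root;
    int last_algo; /* written by the toy algorithms so callers can inspect the choice */
};

/* Toy stand-in for the container: algorithm id plus extra parameters. */
struct toy_container {
    int id;
    int k; /* e.g. a tree radix; only meaningful for some ids */
};

/* The unified algorithm interface: every algorithm takes (cont, coll_sig). */
typedef int (*toy_algo_fn)(struct toy_container *cont, struct toy_coll_sig *coll_sig);

enum { TOY_ALGO_BCAST_BINOMIAL, TOY_ALGO_BCAST_KARY_TREE, TOY_ALGO_COUNT };

static int toy_bcast_binomial(struct toy_container *cont, struct toy_coll_sig *sig)
{
    sig->last_algo = cont->id;
    return 0;
}

static int toy_bcast_kary_tree(struct toy_container *cont, struct toy_coll_sig *sig)
{
    if (cont->k < 2)
        return -1; /* the container must carry a usable radix */
    sig->last_algo = cont->id;
    return 0;
}

/* Global table indexed by algorithm id; a device layer could add entries. */
static toy_algo_fn toy_algo_table[TOY_ALGO_COUNT];

static void toy_algo_table_init(void)
{
    toy_algo_table[TOY_ALGO_BCAST_BINOMIAL] = toy_bcast_binomial;
    toy_algo_table[TOY_ALGO_BCAST_KARY_TREE] = toy_bcast_kary_tree;
}

/* Toy "search": an arbitrary size threshold instead of a JSON tree walk. */
static void toy_csel_search(const struct toy_coll_sig *sig, struct toy_container *cont)
{
    if (sig->count < 1024) {
        cont->id = TOY_ALGO_BCAST_BINOMIAL;
        cont->k = 0;
    } else {
        cont->id = TOY_ALGO_BCAST_KARY_TREE;
        cont->k = 4;
    }
}

/* Auto function: search, then dispatch through the table. */
static int toy_coll_auto(struct toy_coll_sig *sig)
{
    struct toy_container cont;
    toy_csel_search(sig, &cont);
    if (cont.id < 0 || cont.id >= TOY_ALGO_COUNT || toy_algo_table[cont.id] == NULL)
        return -1;
    return toy_algo_table[cont.id](&cont, sig);
}
```

Passing `coll_sig` by pointer lets an algorithm leave state behind for the caller, which the toy uses to record which algorithm ran.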
Dump a wrapper function for each algorithm that takes (cont, coll_sig).

Separately declare algorithm prototypes.

Separately declare sched_auto prototypes.
Generate collective implementation functions that assemble coll_sig and call
MPIR_Coll_composition_auto.
Current compositional algorithms call MPIR collectives. We will refactor
them later. But for now, generate wrapper MPIR functions that call
_impl functions.
Move MPIR_Csel_coll_sig_s and MPII_Csel_container_s definitions to
mpir_coll.h since they are now a common interface to all collective
algorithms.

Move the rest of the csel header to coll_csel.h and only include it
where needed.
We can add a separate caching mechanism to expedite search later.
For now, simplify by directly using csel_node_s.
Generate those IDs, table entries, and json parsing from
coll_algorithms.txt.
hzhou added 18 commits November 13, 2025 11:15
We can easily create alias algorithms by defining a separate algorithm
function that calls the generic routines. Thus, simplify the design by
removing the alias feature in coll_algorithms.txt. This ensures
a one-to-one entry for each collective algorithm with a matching
algorithm function.

Add iallreduce tsp_recexch algorithm since the function is used in
multiple places. Similarly, add ibcast tsp_scatterv_allgatherv algorithm
since it is used elsewhere internally.

Remove enums such as IREDUCE_RECEXCH_TYPE_DISTANCE_DOUBLING/HALVING. The
actual parameter is more like a boolean.
Replaced by MPIR_Csel_composition and MPIR_Csel_selection.
Add MPIR_init_coll_sig and MPID_init_coll_sig so we can add arbitrary
attr bits or additional fields without hacking maint/gen_coll.py.
Generate tables based on coll_algorithms.txt and use the tables to
facilitate csel parsing and error reporting.

If the user sets an algorithm CVAR, directly construct a container for the
cvar-specified algorithm and call it if all restrictions are met.

All restrictive checkers are represented by either a bit in
coll_sig->attr or a boolean checker function. All restrictions and their
checkers are configured in coll_algorithms.txt.
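The two kinds of restriction checkers (an attr bit or a boolean function) could be modeled roughly as follows. This is a sketch under assumptions: the bit names, the `toy_restriction` struct, and the example restriction list are hypothetical and are not taken from coll_algorithms.txt.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical attribute bits carried in the signature. */
#define TOY_ATTR_INTRACOMM   (1u << 0)
#define TOY_ATTR_COMMUTATIVE (1u << 1)

struct toy_coll_sig {
    uint32_t attr;
    int comm_size;
};

typedef bool (*toy_checker_fn)(const struct toy_coll_sig *sig);

/* One restriction: either a required attr bit or a checker function. */
struct toy_restriction {
    uint32_t attr_bit;      /* nonzero: this bit must be set in sig->attr */
    toy_checker_fn checker; /* non-NULL: checker(sig) must return true */
};

static bool toy_comm_size_is_pof2(const struct toy_coll_sig *sig)
{
    int n = sig->comm_size;
    return n > 0 && (n & (n - 1)) == 0;
}

/* An algorithm may run only if every listed restriction holds. */
static bool toy_restrictions_met(const struct toy_restriction *r, size_t n,
                                 const struct toy_coll_sig *sig)
{
    for (size_t i = 0; i < n; i++) {
        if (r[i].attr_bit && !(sig->attr & r[i].attr_bit))
            return false;
        if (r[i].checker && !r[i].checker(sig))
            return false;
    }
    return true;
}

/* Example restriction list for an imaginary algorithm. */
static const struct toy_restriction toy_example_restrictions[] = {
    {TOY_ATTR_INTRACOMM, NULL},
    {TOY_ATTR_COMMUTATIVE, NULL},
    {0, toy_comm_size_is_pof2},
};
```

A CVAR-selected algorithm would be called only when `toy_restrictions_met` returns true for its list; what happens otherwise is a separate fallback policy.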
In coll_algorithms.txt, add an "inline" attribute to skip adding a prototype for
the corresponding algorithm function since it is inlined in the headers.

Add "func_name" to directly specify algorithm function name.

Add "macro_guard" to specify a preprocessor condition for the algorithm
function. For example, the ch4 posix algorithm function needs to be
protected by "#if defined(MPIDI_CH4_SHM_POSIX)" (to be defined).
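As a rough illustration of what a macro guard buys, the sketch below guards both an algorithm function and its table entry with the same preprocessor condition. The macro `TOY_HAVE_POSIX_SHM`, the function names, and the table layout are all hypothetical; the generated MPICH code may look quite different.

```c
#include <stddef.h>

typedef int (*toy_algo_fn)(void);

struct toy_algo_entry {
    const char *name;
    toy_algo_fn fn;
};

/* Always available. */
static int toy_bcast_binomial(void)
{
    return 0;
}

/* Only compiled when the (hypothetical) shm layer is configured in. */
#if defined(TOY_HAVE_POSIX_SHM)
static int toy_posix_bcast_release_gather(void)
{
    return 0;
}
#endif

/* The table entry must carry the same guard as the function it names,
 * otherwise the build breaks when the layer is absent. */
static const struct toy_algo_entry toy_algo_table[] = {
    {"bcast_binomial", toy_bcast_binomial},
#if defined(TOY_HAVE_POSIX_SHM)
    {"posix_bcast_release_gather", toy_posix_bcast_release_gather},
#endif
};

static size_t toy_num_algos(void)
{
    return sizeof toy_algo_table / sizeof toy_algo_table[0];
}
```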
Add conditional conditions: the condition function can only be called
inside a preprocessor macro guard.

We need to generate another header file, coll_autogen.h, that is loaded
after mpidpost.h. "coll_algos.h" goes into mpir_coll.h, which is included
in between mpidpre.h and mpidpost.h.

Refactor a bit so all the condition-parsing logic is wrapped in
functions such as get_condition_name, get_condition_func, etc., and they
are defined together.
Sometimes we may want the restriction check and the condition check to
behave differently. For example, an algorithm like release_gather normally
gets selected only after the user calls the collective a certain number of
times. But if the user selects the algorithm by CVAR, it won't make sense to
do this repeat check in the restriction check.
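One way to picture the distinction is a single checker that takes the kind of check as a parameter. This is a sketch only: the `toy_` names, the struct fields, and the threshold value are assumptions, not the MPICH implementation.

```c
#include <stdbool.h>

/* Which path is asking: JSON condition matching or CVAR restriction check. */
enum toy_check_mode { TOY_CHECK_CONDITION, TOY_CHECK_RESTRICTION };

/* Hypothetical per-communicator state. */
struct toy_comm_state {
    bool is_intranode;
    int num_calls; /* how many times this collective has been called */
};

/* Arbitrary value chosen for the example. */
#define TOY_REPEAT_THRESHOLD 3

static bool toy_release_gather_ok(const struct toy_comm_state *comm,
                                  enum toy_check_mode mode)
{
    /* Hard requirement on both paths. */
    if (!comm->is_intranode)
        return false;

    /* The user asked for this algorithm explicitly: skip the repeat check. */
    if (mode == TOY_CHECK_RESTRICTION)
        return true;

    /* Automatic selection: only worthwhile after repeated use. */
    return comm->num_calls >= TOY_REPEAT_THRESHOLD;
}
```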
Rather than add individual boolean flags, use bit mask "flags" instead.
It is easier to make sure we zero-initialize all the flags that way.
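A small sketch of the bit-mask approach, with hypothetical flag names and a hypothetical struct. The point it illustrates is that one assignment (or one memset of the struct) clears every flag, including ones added later.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag bits; the real flag names are not listed in this PR text. */
#define TOY_FLAG_FROM_CVAR         (1u << 0)
#define TOY_FLAG_SKIP_REPEAT_CHECK (1u << 1)

struct toy_coll_sig {
    int coll_type;
    uint32_t flags; /* all boolean options live in this one word */
};

/* Zeroing the struct clears every flag at once. */
static void toy_sig_init(struct toy_coll_sig *sig, int coll_type)
{
    memset(sig, 0, sizeof *sig);
    sig->coll_type = coll_type;
}

static void toy_sig_set(struct toy_coll_sig *sig, uint32_t flag)
{
    sig->flags |= flag;
}

static bool toy_sig_has(const struct toy_coll_sig *sig, uint32_t flag)
{
    return (sig->flags & flag) != 0;
}
```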
Provide a simple mechanism for a rank to dump collective algorithm
counters.

Set MPIR_CVAR_DUMP_COLL_ALGO_COUNTERS to the global rank of the process
that should dump. It is undesirable for every process to dump, yet it
does not always make sense for rank 0 to dump, especially when we don't
always use comm world.

Algorithms are counted in the CSEL framework, so internal collectives that
use _fallback algorithms are not counted.
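The counting and per-rank dump could look roughly like the following. This is a sketch with hypothetical names, and the output format is invented for the example; `dump_rank` plays the role of the CVAR value.

```c
#include <stdio.h>

enum { TOY_ALGO_BCAST_BINOMIAL, TOY_ALGO_BCAST_SCATTER_ALLGATHER, TOY_ALGO_COUNT };

static const char *const toy_algo_names[TOY_ALGO_COUNT] = {
    "bcast_binomial",
    "bcast_scatter_allgather",
};

static unsigned long toy_algo_counters[TOY_ALGO_COUNT];

/* Called from the selection layer each time an algorithm is chosen.
 * Out-of-range ids are ignored. */
static void toy_count_algo(int id)
{
    if (id >= 0 && id < TOY_ALGO_COUNT)
        toy_algo_counters[id]++;
}

/* Only the process whose global rank equals dump_rank prints.
 * Returns the number of lines written. */
static int toy_dump_counters(int my_rank, int dump_rank, FILE *out)
{
    int lines = 0;
    if (my_rank != dump_rank)
        return 0;
    for (int i = 0; i < TOY_ALGO_COUNT; i++) {
        if (toy_algo_counters[i] == 0)
            continue; /* skip algorithms that were never selected */
        fprintf(out, "[%d] %s: %lu\n", my_rank, toy_algo_names[i], toy_algo_counters[i]);
        lines++;
    }
    return lines;
}
```

Since only the selection layer calls `toy_count_algo`, anything that bypasses selection (as the _fallback routines do) never shows up in the dump.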
Enable CVARs and JSONs to select ch4-posix layer release_gather
algorithms.

Select MPIDI_POSIX_mpi_bcast_release_gather if it passes
MPIDI_CH4_release_gather condition check, which only passes if comm is
a posix intranode comm.
Extend the previous commit to activate release_gather algorithm for
reduce, allreduce, and barrier.
@hzhou hzhou force-pushed the 2510_csel_comp branch 4 times, most recently from dbdcefd to 76e096d Compare November 13, 2025 19:31
Almost all internal usage of collectives should use fallback collectives,
preferring stability over potential performance tuning from CSEL.
When the user uses a CVAR to select an algorithm, and the algorithm is a
compositional algorithm, we may run into a recursive dead-loop. The
following measures break the cycle:

* A proper compositional algorithm should specify a complete restriction
check that fails on recursion. The default MPIR_CVAR_COLLECTIVE_FALLBACK
is silent, which makes selecting a compositional algorithm by CVAR work.

* Non-compositional algorithms are safe to select by CVAR since they
never create a recursive situation. Some "compositional" algorithms
that do not recurse into the same coll_type also fit into this
category.

* For benchmark tests, users may want to set MPIR_CVAR_COLLECTIVE_FALLBACK=error
so they can confirm they are testing the intended algorithms.

* It is possible for a developer to write an algorithm without a
good restriction check that ends up in a recursive situation. For
now, this is a bug that is the developer's fault. TODO: add a mechanism to
detect infinite recursion loops.
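One reading of the first and third bullets can be sketched as a toy model: the compositional algorithm marks the signature before recursing, its own restriction check then fails on re-entry, and the fallback mode decides what happens next. Everything here is hypothetical (the `toy_` names, the `inside_composition` field, and the exact behavior of the error mode on the nested call); it is not the MPICH implementation.

```c
#include <stdbool.h>

/* Stand-ins for the silent and error settings of the fallback CVAR. */
enum toy_fallback_mode { TOY_FALLBACK_SILENT, TOY_FALLBACK_ERROR };

struct toy_sig {
    bool inside_composition; /* set once a composition has started */
    int comp_runs;
    int basic_runs;
};

static int toy_bcast_select(struct toy_sig *sig, enum toy_fallback_mode mode);

static int toy_bcast_basic(struct toy_sig *sig)
{
    sig->basic_runs++;
    return 0;
}

static int toy_bcast_composition(struct toy_sig *sig, enum toy_fallback_mode mode)
{
    sig->comp_runs++;
    sig->inside_composition = true; /* makes the restriction fail on re-entry */
    return toy_bcast_select(sig, mode);
}

/* Models "the user forced the compositional algorithm by CVAR". */
static int toy_bcast_select(struct toy_sig *sig, enum toy_fallback_mode mode)
{
    bool restriction_ok = !sig->inside_composition;
    if (restriction_ok)
        return toy_bcast_composition(sig, mode);
    if (mode == TOY_FALLBACK_ERROR)
        return -1; /* report that the requested algorithm could not run */
    return toy_bcast_basic(sig); /* silent: quietly use another algorithm */
}
```

In the silent mode the nested call quietly lands on a basic algorithm and the recursion stops after one level; in the error mode the same nested call surfaces as a failure instead.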
MPIR_Xxx_impl will call MPIR_Xxx_auto with MPIR_CSEL_ENTRY__MAIN, which
selects algorithm using the main JSON that includes compositional
algorithms. Internal code, especially compositional algorithms, will
call MPIR_Xxx_auto with MPIR_CSEL_ENTRY__AUTO. This prevents a recursive
dead-loop situation.
Call auto functions, e.g. MPIR_Bcast_auto, instead of calling SHM and NM
interfaces. We assume the JSON selections will select appropriate
algorithms including shm-specific algorithms.
Compositional algorithms should call MPIR_Xxx_auto routines to prevent
recursive dead-loop. The _auto CSEL selection should not select back into
compositional algorithms (with some exceptions). One exception is the
nb algorithm. It is generally okay to select the nb algorithm in _auto
because nb transitions from blocking to nonblocking and thus won't create
a recursive situation.
It is not safe in MPIX_Allreduce_enqueue to call MPIR_Allreduce_auto
in a callback since we can't ensure it does not collide with other
ongoing collectives.
Recast the collective buffer swap as a composition algorithm.

TODO: support all collective types.