Skip to content

[mmu probing] pr03.probe: Add probing executors and executor registry#22541

Open
XuChen-MSFT wants to merge 8 commits intosonic-net:masterfrom
XuChen-MSFT:xuchen3/mmu_probe/pr03-executors
Open

[mmu probing] pr03.probe: Add probing executors and executor registry#22541
XuChen-MSFT wants to merge 8 commits intosonic-net:masterfrom
XuChen-MSFT:xuchen3/mmu_probe/pr03-executors

Conversation

@XuChen-MSFT
Copy link
Copy Markdown
Contributor

@XuChen-MSFT XuChen-MSFT commented Feb 23, 2026

Description of PR

Summary:

Implement platform-specific executors and factory pattern for executor management:

Executor Registry:

  • ExecutorRegistry: Factory for environment-specific executor instantiation Supports 'physical' and 'sim' environments with lazy loading

Physical Executors:

  • PfcXoffProbingExecutor: PFC Xoff detection on physical hardware via SAI
  • IngressDropProbingExecutor: Ingress drop detection on physical hardware

Simulation Executors:

  • SimPfcXoffProbingExecutor: Mock PFC Xoff detection for unit/integration tests
  • SimIngressDropProbingExecutor: Mock ingress drop detection for testing

All executors implement the ProbingExecutorProtocol interface, enabling algorithm-executor separation and testability through simulation.

Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

qos refactoring

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

relevant PRs:
[mmu probing] pr01.docs: Add MMU threshold probing framework design
[mmu probing] pr02.probe: Add core probing algorithms with essential data structures
[mmu probing] pr03.probe: Add probing executors and executor registry
[mmu probing] pr04.probe: Add observer pattern for metrics tracking
[mmu probing] pr05.probe: Add stream manager and buffer occupancy controller
[mmu probing] pr06.probe: Add base framework and all probing implementations
[mmu probing] pr07.test: Add comprehensive unit tests for probe framework
[mmu probing] pr08.test: Add integration tests for end-to-end probing workflows
[mmu probing] pr09.test: Add production probe test and infrastructure updates

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

yxieca
yxieca previously approved these changes Feb 23, 2026
Copy link
Copy Markdown
Collaborator

@yxieca yxieca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deep review done. Executor registry and executors look correct; no issues found.

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

yxieca
yxieca previously approved these changes Feb 24, 2026
Copy link
Copy Markdown
Collaborator

@yxieca yxieca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed updates; no new issues found.

wsycqyz
wsycqyz previously approved these changes Feb 25, 2026
Copy link
Copy Markdown
Contributor

@wsycqyz wsycqyz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lğtm

Copy link
Copy Markdown
Collaborator

@StormLiangMS StormLiangMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — PR #22541 (MMU Probing: Executors)

🔴 Bug

Missing self.observer None guard in ingress_drop_probing_executor.py
One verbose trace block has:

if self.verbose:
    self.observer.trace(...)

But other verbose blocks correctly use self.verbose and self.observer. If observer is None, this crashes. Should be if self.verbose and self.observer:.

⚠️ Design

  • Class-level mutable state in executor_registry.py_registry: Dict = {} and _loaded_modules: set = set() are shared across tests. clear_registry() must be called reliably; if a test fails before cleanup, subsequent tests see stale state.

  • _ensure_loaded uses bare module nameimportlib.import_module("pfc_xoff_probing_executor") relies on module being on sys.path. If probe directory isn't on path, this silently fails. Consider explicit package prefix.

  • Inconsistent result patterns between PfcXoff and IngressDrop executors — PfcXoff uses explicit all_true/all_false checks. IngressDrop uses (True, results[0]) then checks len(set(results)) > 1. Both should use the same pattern.

Minor

  • Sim executor's noisy executor applies noise to value not threshold — semantically, hardware noise shifts the threshold, not packet count
  • Sim intermittent executor raises exceptions instead of returning (False, False) like physical executors
  • Docstring parameter name mismatch: create() docstring says env but parameter is executor_env

@XuChen-MSFT XuChen-MSFT dismissed stale reviews from wsycqyz and yxieca via dcc1c40 March 18, 2026 08:40
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@XuChen-MSFT
Copy link
Copy Markdown
Contributor Author

Added BadSpot sim executors for anti-oscillation testing (dcc1c40):

  • SimPfcXoffProbingExecutorBadSpot: registered as (pfc_xoff, sim, bad_spot)
  • SimIngressDropProbingExecutorBadSpot: registered as (ingress_drop, sim, bad_spot)

These executors deterministically fail verification at specific packet count values, used to reproduce and validate the range algorithm anti-oscillation backtrack fix in PR #22540 (036f27c).

Related:

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

XuChen-MSFT added a commit to XuChen-MSFT/sonic-mgmt that referenced this pull request Mar 21, 2026
2 new tests (order 8817-8818) covering the observer None guard fix:
- test_check_verbose_observer_none_guard_result_path: verifies L281 guard
- test_check_verbose_observer_none_guard_exception_path: verifies L290 guard

Both tests bypass the L162 assert by nullifying observer mid-execution
via counter read side_effect, then verify no AttributeError crash.

Related: PR sonic-net#22541 fix (observer None guard)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

XuChen-MSFT added a commit to XuChen-MSFT/sonic-mgmt that referenced this pull request Mar 22, 2026
6 UTs for IngressDropProbingExecutor (order 8819-8824):
  single detected/not-detected, multi all/none/inconsistent-TF/FT

6 UTs for PfcXoffProbingExecutor (order 8719-8724):
  same 6 input combinations

Covers all result analysis code paths for the consistency pattern
unification in PR sonic-net#22541. All 6 scenarios produce identical outcomes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@XuChen-MSFT
Copy link
Copy Markdown
Contributor Author

@StormLiangMS Re: Inconsistent result patterns between PfcXoff and IngressDrop

Unified both executors to the concise set-dedup pattern (dd592b4):

return_result = (True, results[0])
# Multiple attempts: check consistency (set dedup detects mixed True/False)
if len(results) > 1 and len(set(results)) > 1:
    return_result = (False, False)

PfcXoff previously used an 8-line explicit all_true/all_false/all_equal pattern. Both produce identical outcomes for all 6 input combinations (verified by 12 new UTs in PR #22545 e2fca74).

@XuChen-MSFT
Copy link
Copy Markdown
Contributor Author

@StormLiangMS Re: Sim noise applies to value not threshold

Mathematically equivalent — (value + noise) >= threshold produces the same result as value >= (threshold - noise). The noise offset is on opposite sides of the comparison but the detection outcome is identical for all inputs. No change needed.

XuChen-MSFT added a commit to XuChen-MSFT/sonic-mgmt that referenced this pull request Mar 22, 2026
2 bug reproduction tests (failure_rate=1.0):
- test_pfc_xoff_intermittent_returns_tuple_not_exception (order 12)
- test_ingress_drop_intermittent_returns_tuple_not_exception (order 14)
  Verify check() returns (False, False) on failure, not raise Exception.

2 success path tests (failure_rate=0.0):
- test_pfc_xoff_intermittent_success_path (order 13)
- test_ingress_drop_intermittent_success_path (order 15)
  Verify normal detection works when no failure occurs.

Related: PR sonic-net#22541 fix (intermittent raise → return)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@XuChen-MSFT
Copy link
Copy Markdown
Contributor Author

@StormLiangMS Re: Sim intermittent raises exceptions instead of returning (False, False)

Fixed (2210ae6): raise Exception(error_msg)return (False, False) in both sim intermittent executors.

Physical executors catch hardware errors and return (False, False). Algorithm callers expect a tuple and check if not success: without try/except — raising crashes the algorithm instead of triggering graceful backtrack/retry.

UT coverage: PR #22545 (74b3efb) — 4 new tests (failure_rate=1.0 verifies return contract, failure_rate=0.0 verifies success path).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@XuChen-MSFT
Copy link
Copy Markdown
Contributor Author

@StormLiangMS Re: Docstring parameter name mismatch

Fixed (4838e0a): create() docstring now uses executor_env matching the actual parameter name (was env).

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

XuChen-MSFT and others added 7 commits March 24, 2026 12:59
Implement platform-specific executors and factory pattern for executor management:

Executor Registry:
- ExecutorRegistry: Factory for environment-specific executor instantiation
  Supports 'physical' and 'sim' environments with lazy loading

Physical Executors:
- PfcXoffProbingExecutor: PFC Xoff detection on physical hardware via SAI
- IngressDropProbingExecutor: Ingress drop detection on physical hardware

Simulation Executors:
- SimPfcXoffProbingExecutor: Mock PFC Xoff detection for unit/integration tests
- SimIngressDropProbingExecutor: Mock ingress drop detection for testing

All executors implement the ProbingExecutorProtocol interface, enabling
algorithm-executor separation and testability through simulation.

Signed-off-by: Xu Chen <xuchen3@microsoft.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
New scenario executors that deterministically fail verification at
specific packet count values. Used to reproduce and test the range
algorithm anti-oscillation backtrack behavior.

- SimPfcXoffProbingExecutorBadSpot: registered as (pfc_xoff, sim, bad_spot)
- SimIngressDropProbingExecutorBadSpot: registered as (ingress_drop, sim, bad_spot)

Both track bad_hit_count for test assertions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
…xception paths

Lines 281 and 290 used 'if self.verbose:' without 'and self.observer',
causing AttributeError when observer is None. All other 7 verbose blocks
in the same file correctly use 'if self.verbose and self.observer:'.

Addresses @StormLiangMS review: missing self.observer None guard.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
…p executors

PfcXoff used explicit all_true/all_false/all_equal pattern (8 lines).
IngressDrop used concise set-dedup pattern (3 lines). Both produce
identical outcomes for all input combinations.

Unified to the concise pattern with descriptive comment:
  return_result = (True, results[0])
  # Multiple attempts: check consistency (set dedup detects mixed True/False)
  if len(results) > 1 and len(set(results)) > 1:
      return_result = (False, False)

UT coverage: PR sonic-net#22545 (e2fca74) — 12 new tests covering all 6 input
combinations for both executors.

Addresses @StormLiangMS review: inconsistent result patterns.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
…ing Exception

Physical executors catch hardware errors and return (False, False).
Algorithm callers expect a tuple and check 'if not success:' without
try/except. Raising Exception crashes the algorithm instead of
triggering graceful retry/backtrack.

Changed: raise Exception(error_msg) → return (False, False)
Files: sim_pfc_xoff_probing_executor.py, sim_ingress_drop_probing_executor.py

UT coverage: PR sonic-net#22545 (74b3efb) — 4 new tests verifying the return contract.

Addresses @StormLiangMS review: sim intermittent raises exceptions
instead of returning (False, False) like physical executors.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
Addresses @StormLiangMS review: docstring parameter name mismatch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
@XuChen-MSFT XuChen-MSFT force-pushed the xuchen3/mmu_probe/pr03-executors branch from 4838e0a to 08f5c9c Compare March 24, 2026 04:59
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

XuChen-MSFT added a commit to XuChen-MSFT/sonic-mgmt that referenced this pull request Mar 24, 2026
2 new tests (order 8817-8818) covering the observer None guard fix:
- test_check_verbose_observer_none_guard_result_path: verifies L281 guard
- test_check_verbose_observer_none_guard_exception_path: verifies L290 guard

Both tests bypass the L162 assert by nullifying observer mid-execution
via counter read side_effect, then verify no AttributeError crash.

Related: PR sonic-net#22541 fix (observer None guard)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
XuChen-MSFT added a commit to XuChen-MSFT/sonic-mgmt that referenced this pull request Mar 24, 2026
6 UTs for IngressDropProbingExecutor (order 8819-8824):
  single detected/not-detected, multi all/none/inconsistent-TF/FT

6 UTs for PfcXoffProbingExecutor (order 8719-8724):
  same 6 input combinations

Covers all result analysis code paths for the consistency pattern
unification in PR sonic-net#22541. All 6 scenarios produce identical outcomes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
XuChen-MSFT added a commit to XuChen-MSFT/sonic-mgmt that referenced this pull request Mar 24, 2026
2 bug reproduction tests (failure_rate=1.0):
- test_pfc_xoff_intermittent_returns_tuple_not_exception (order 12)
- test_ingress_drop_intermittent_returns_tuple_not_exception (order 14)
  Verify check() returns (False, False) on failure, not raise Exception.

2 success path tests (failure_rate=0.0):
- test_pfc_xoff_intermittent_success_path (order 13)
- test_ingress_drop_intermittent_success_path (order 15)
  Verify normal detection works when no failure occurs.

Related: PR sonic-net#22541 fix (intermittent raise → return)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Copy Markdown
Collaborator

@StormLiangMS StormLiangMS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Approve with 2 findings


[Critical] importlib.import_module without package context will fail
executor_registry.py:146

importlib.import_module(module_path)

Uses bare module names like "pfc_xoff_probing_executor" without specifying the package. This works in unit tests because conftest.py does sys.path.insert(0, probe_dir), but may fail in production PTF runs if the probe directory is not on sys.path.

Suggested fix:

importlib.import_module(f".{module_path}", package=__package__)

or construct absolute imports:

importlib.import_module(f"tests.saitests.probe.{module_path}")

[Low] Incomplete docstring in SimPfcXoffProbingExecutor.check()
sim_pfc_xoff_probing_executor.py:1115-1129

Missing method description line, "Args:" header, and src_port parameter documentation. The docstring jumps straight to dst_port.

Added method description, Args header, and src_port parameter documentation.
Docstring was jumping straight to dst_port without method description.

Addresses @StormLiangMS review (2026-03-25): incomplete docstring.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@XuChen-MSFT
Copy link
Copy Markdown
Contributor Author

@StormLiangMS Re: [Critical] importlib.import_module without package context

This was addressed in a previous reply — bare module import is intentional for PTF compatibility:

  1. PTF runner adds test directories to sys.path before execution. Using relative imports (f".{module_path}") or absolute paths (f"tests.saitests.probe.{module_path}") would break in PTF containers where the directory structure differs from the repo layout.
  2. ImportError is caught and re-raised with the module name (L142-144), so failures are not silent.
  3. This follows the same pattern used by all existing SAI test files.

Re: [Low] Incomplete docstring

Fixed (f4b831b): Added method description, Args: header, and src_port parameter to SimPfcXoffProbingExecutor.check() docstring.

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants