[mmu probing] pr06.probe: Add base framework and all probing implementations by XuChen-MSFT · Pull Request #22544 · sonic-net/sonic-mgmt

XuChen-MSFT · 2026-02-23T13:54:38Z

Description of PR

Summary:

Implement complete probing framework using template method pattern and three concrete probing test types:

Base Framework:

__init__.py: Module documentation and architecture overview
ProbingBase: Abstract base class implementing template method pattern
- setUp(): PTF initialization and parameter parsing
- runTest(): Template method orchestrating probe workflow
- Abstract methods: setup_traffic(), probe(), get_probe_config()
- Integrates algorithms, executors, observers, and buffer control
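A minimal sketch of the template method structure described above, for illustration only. The method names setUp(), runTest(), setup_traffic(), probe() and get_probe_config() come from the description; the class names, the call order inside runTest(), and the return values are assumptions, not the PR's actual code.

```python
from abc import ABC, abstractmethod


class ProbingBaseSketch(ABC):
    """Hypothetical stand-in for ProbingBase; not the PR's actual class."""

    def setUp(self):
        # In the PR this step performs PTF initialization and parameter parsing.
        self.log = ["setUp"]

    def runTest(self):
        # Template method: the step order is fixed here, the steps vary per subclass.
        config = self.get_probe_config()
        self.log.append(f"config:{config['name']}")
        self.setup_traffic()
        return self.probe()

    @abstractmethod
    def setup_traffic(self):
        """Prepare the traffic pattern (1->N or N->1)."""

    @abstractmethod
    def probe(self):
        """Run the probing algorithms and return the result."""

    @abstractmethod
    def get_probe_config(self):
        """Return probe-specific configuration."""


class DemoProbing(ProbingBaseSketch):
    """Hypothetical concrete probe used only to exercise the skeleton."""

    def setup_traffic(self):
        self.log.append("setup_traffic")

    def probe(self):
        self.log.append("probe")
        return {"threshold": 42}

    def get_probe_config(self):
        return {"name": "demo"}


def run_demo():
    test = DemoProbing()
    test.setUp()
    result = test.runTest()
    return test.log, result
```

The base class cannot be instantiated directly because of the abstract methods, so every concrete probe has to supply all three hooks.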

Probing Implementations:

PfcXoffProbing (1->N pattern):
- Detects PFC Xoff threshold by observing PFC frame generation
- Single source sending to multiple destinations
- Uses UpperBound -> LowerBound -> ThresholdRange algorithms
IngressDropProbing (1->N pattern):
- Detects ingress drop threshold by packet loss observation
- Single source sending to multiple destinations
- Uses UpperBound -> LowerBound -> ThresholdPoint algorithms
HeadroomPoolProbing (N->1 pattern):
- Detects headroom pool size via multi-PG iteration
- Multiple sources sending to single destination
- Iterates across priority groups for comprehensive testing

All implementations leverage the executor registry for environment-specific behavior (physical hardware vs simulation) and observer pattern for metrics.
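The UpperBound -> LowerBound -> ThresholdRange sequence can be pictured roughly as follows. This is a sketch under stated assumptions, not the algorithms from pr02: it assumes a monotone predicate triggers(n) that returns True once n packets reach the threshold, doubles to find an upper bound, halves to find a lower bound, and then bisects. All function names and the doubling/halving strategy are hypothetical.

```python
def find_upper_bound(triggers, start=1, limit=1 << 20):
    """Double the packet count until the event is observed (assumed strategy)."""
    n = start
    while n <= limit:
        if triggers(n):
            return n
        n *= 2
    return None  # no trigger observed within the limit


def find_lower_bound(triggers, upper):
    """Halve from the upper bound until the event is no longer observed."""
    n = upper // 2
    while n > 0:
        if not triggers(n):
            return n
        n //= 2
    return 0


def narrow_range(triggers, lower, upper, precision=1):
    """Bisect between a non-triggering lower and a triggering upper value."""
    while upper - lower > precision:
        mid = (lower + upper) // 2
        if triggers(mid):
            upper = mid
        else:
            lower = mid
    return lower, upper


def probe_threshold(triggers):
    upper = find_upper_bound(triggers)
    if upper is None:
        return None
    lower = find_lower_bound(triggers, upper)
    return narrow_range(triggers, lower, upper)
```

With a simulated threshold of 1000 packets, probe_threshold(lambda n: n >= 1000) narrows to the pair (999, 1000).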

Fixes # (issue)

Type of change

Back port request

Approach

What is the motivation for this PR?

QoS refactoring

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

relevant PRs:
[mmu probing] pr01.docs: Add MMU threshold probing framework design
[mmu probing] pr02.probe: Add core probing algorithms with essential data structures
[mmu probing] pr03.probe: Add probing executors and executor registry
[mmu probing] pr04.probe: Add observer pattern for metrics tracking
[mmu probing] pr05.probe: Add stream manager and buffer occupancy controller
[mmu probing] pr06.probe: Add base framework and all probing implementations
[mmu probing] pr07.test: Add comprehensive unit tests for probe framework
[mmu probing] pr08.test: Add integration tests for end-to-end probing workflows
[mmu probing] pr09.test: Add production probe test and infrastructure updates

Deferred Enhancements (tracked as issues, no functional impact)

#23190 - Refactor: extract _create_algorithms() and _run_algorithms() to ProbingBase (reduce code duplication)
#23192 - Cleanup: remove debug env vars (pgnumlmt, etc.) after pipeline integration complete

mssonicbld · 2026-02-23T13:54:46Z

/azp run

azure-pipelines · 2026-02-23T13:55:02Z

Azure Pipelines successfully started running 1 pipeline(s).

yxieca

Deep review done. Base framework + probing implementations look correct; no issues found.

mssonicbld · 2026-02-24T15:28:24Z

/azp run

azure-pipelines · 2026-02-24T15:28:42Z

Azure Pipelines successfully started running 1 pipeline(s).

yxieca

Re-reviewed updates; no new issues found.

wsycqyz

LGTM

StormLiangMS

Review — PR #22544 (MMU Probing: Base Framework)

🔴 Bug

continue on PG failure skips buffer cleanup
When a phase fails (e.g., PFC upper bound), the code hits continue and moves on to the next PG. However, the buffer may be left in an inconsistent state (held, partially filled), so the next PG's probing starts from a corrupted buffer state. The code should call buffer_ctrl.drain_buffer() before continuing.

⚠️ Design

Massive code duplication — PfcXoffProbing._create_algorithms() and IngressDropProbing._create_algorithms() are nearly identical (~120 lines each). The _run_algorithms() methods are completely identical. Should be factored into ProbingBase with a template or builder pattern. Currently, any algorithm workflow change must be duplicated across 3+ files.
os.environ.get('pgnumlmt') for PG limit is fragile — using env vars for debug control in a test framework is unconventional. This should be a test parameter, class attribute, or pytest fixture/marker — not an env var parsed in a hot loop.
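One way the PG limit could be resolved once, outside the loop, is sketched below. Only the pgnumlmt environment variable comes from the PR; the helper names and the pg_num_limit test parameter are hypothetical, and the fallback order is an assumption.

```python
import os


def resolve_pg_limit(test_params, environ=None):
    """Hypothetical helper: read the PG limit once, before the probe loop.

    Prefers an explicit test parameter (assumed name: pg_num_limit) and falls
    back to the pgnumlmt debug environment variable mentioned in the review.
    """
    environ = os.environ if environ is None else environ
    raw = test_params.get("pg_num_limit")
    if raw is None:
        raw = environ.get("pgnumlmt")
    if raw is None:
        return None  # no limit configured
    limit = int(raw)
    if limit <= 0:
        raise ValueError(f"PG limit must be positive, got {limit}")
    return limit


def limit_pgs(pgs, limit):
    """Apply the resolved limit to the list of priority groups."""
    return list(pgs) if limit is None else list(pgs)[:limit]
```

Passing the environment mapping as an argument keeps the helper testable without touching the real process environment.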

❓ Question

setUp() calls switch_init() which is a heavy SAI thrift operation. Is this idempotent? If a test case's setUp fails halfway, does tearDown properly clean up the thrift connection?

…bing

Tests verify drain_buffer() is called before continue when a PG phase fails in the multi-PG loop (order 4250-4256):
- PFC upper/lower/range failure (3 tests)
- Ingress Drop upper/lower/range failure (3 tests)
- Different dst_ports isolation (1 test)

Uses per-type mock side_effects to correctly handle 2-tuple (upper/lower) vs 3-tuple (range/point) algorithm return values.

Related: PR sonic-net#22544 fix (while-True unified cleanup)
Co-authored-by: Copilot <[email protected]>

- test_headroom_pool_buffer_cleanup_on_pg_failure: 2 PGs, verify probe completes without crash when PG fails
- test_headroom_pool_multi_pg_isolation: 3 PGs, verify all PGs produce independent results

Related: PR sonic-net#22544 fix (while-True unified cleanup)
Co-authored-by: Copilot <[email protected]>

mssonicbld · 2026-03-23T01:41:35Z

/azp run

azure-pipelines · 2026-03-23T01:41:47Z

Azure Pipelines successfully started running 1 pipeline(s).

XuChen-MSFT · 2026-03-23T01:41:49Z

@StormLiangMS Re: continue on PG failure skips buffer cleanup

Fixed (7c6b4fa): Refactored multi-PG probe loop from 6 scattered continue statements to a while True single-pass block with unified cleanup:

pg_success = False
fail_reason = None
while True:  # single-pass block — break = goto cleanup
    if pfc_upper is None:
        fail_reason = "PFC XOFF upper bound failure"
        break
    # ... all phases ...
    pg_success = True
    break

if not pg_success:
    ProbingObserver.console(f"  Skipping PG #{i+1} due to {fail_reason}")
    self.buffer_ctrl.drain_buffer([dst_port_id])
    continue

Single drain_buffer call in the unified cleanup block — no scattered cleanup, future-proof for new phases.
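The same single-pass pattern can be exercised in isolation with a fake buffer controller. This is a sketch, not the code from 7c6b4fa: FakeBufferCtrl, probe_all_pgs, the dictionary of phase results, and the lower-bound failure message are hypothetical; only drain_buffer([dst_port_id]), pg_success, fail_reason and the while True / break structure follow the comment above.

```python
class FakeBufferCtrl:
    """Hypothetical stand-in that records drain_buffer() calls."""

    def __init__(self):
        self.drained = []

    def drain_buffer(self, port_ids):
        self.drained.append(list(port_ids))


def probe_all_pgs(pg_phase_results, buffer_ctrl, dst_port_id):
    """pg_phase_results holds one dict per PG, mapping phase name to a result or None."""
    results = []
    for i, phases in enumerate(pg_phase_results):
        pg_success = False
        fail_reason = None
        while True:  # single-pass block: break jumps to the cleanup below
            if phases.get("pfc_upper") is None:
                fail_reason = "PFC XOFF upper bound failure"
                break
            if phases.get("pfc_lower") is None:
                fail_reason = "PFC XOFF lower bound failure"  # assumed message
                break
            pg_success = True
            break

        if not pg_success:
            # Unified cleanup: every failure path drains before the next PG.
            buffer_ctrl.drain_buffer([dst_port_id])
            results.append((i, fail_reason))
            continue
        results.append((i, "ok"))
    return results
```

Because every failure path leaves the block through break, a new phase only needs its own fail_reason and cannot skip the drain.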

Test coverage:

UT: PR [mmu probing] pr07.test: Add comprehensive unit tests for probe framework #22545 (3d75029) — 7 new tests (6 phase failures + different dst_ports)
IT: PR [mmu probing] pr08.test: Add integration tests for end-to-end probing workflows #22546 (14a29c2) — 2 new tests (buffer cleanup + multi-PG isolation)

XuChen-MSFT · 2026-03-23T01:55:36Z

@StormLiangMS Re: Code duplication in _create_algorithms() and _run_algorithms()

Agreed — _run_algorithms() is 100% identical across PfcXoff and IngressDrop (~55 lines), and _create_algorithms() differs by only 1 context_template string (~124 lines each).

Deferring to a follow-up PR because:

Involves 3 files (ProbingBase + 2 subclasses), ~180 lines of refactoring
headroom_pool_probing.py just underwent a structural refactor (while-True cleanup in 7c6b4fa)
Stacking refactors increases regression risk

Tracked in issue #23190 with a phased approach:

Extract identical _run_algorithms() to ProbingBase (low risk)
Parameterize _create_algorithms() via class attribute (medium risk)
Align HeadroomPoolProbing inline pattern (higher risk)
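The first two phases might look roughly like the sketch below, assuming the only per-subclass difference is the context_template string as stated above. The class names, template values, and phase names are hypothetical placeholders; the real methods build and run algorithm objects rather than strings.

```python
class ProbingBaseSketch:
    """Hypothetical base class holding the shared algorithm workflow."""

    # Subclasses override this single attribute (phase 2 of the plan).
    context_template = None

    def _create_algorithms(self):
        if self.context_template is None:
            raise NotImplementedError("subclass must set context_template")
        phases = ("upper_bound", "lower_bound", "threshold")
        # Strings stand in for the algorithm objects.
        return [self.context_template.format(phase=p) for p in phases]

    def _run_algorithms(self, algorithms):
        # Identical for every subclass, so it lives in the base class (phase 1).
        return [f"ran {name}" for name in algorithms]


class PfcXoffSketch(ProbingBaseSketch):
    context_template = "pfc_xoff:{phase}"  # assumed value


class IngressDropSketch(ProbingBaseSketch):
    context_template = "ingress_drop:{phase}"  # assumed value
```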

XuChen-MSFT · 2026-03-23T02:02:18Z

@StormLiangMS Re: os.environ.get("pgnumlmt") for PG limit is fragile

Agreed this should be cleaned up, but deferring until after pipeline integration is complete.

These debug env vars (pgnumlmt, INGRESS_DROP_USE_PG_COUNTER) are currently used for manual debugging during per-SKU lightning validation. Removing them now would require re-adding them when debugging issues across different ASICs. Once all SKU validations are done and the pipeline is stable, they can be replaced with proper test parameters or removed.

Tracked in issue #23192.

XuChen-MSFT · 2026-03-23T02:27:02Z

@StormLiangMS Re: setUp() idempotency and tearDown safety

Thrift connection: Safe. tearDown() calls ThriftInterfaceDataPlane.tearDown() which closes src_transport and dst_transport. PTF guarantees tearDown runs even if setUp fails halfway, so no connection leak.

switch_init() idempotency: Yes — uses a global switch_inited flag:

def switch_init(clients):
    global switch_inited
    if switch_inited:
        return  # Already initialized, skip

Designed to run once per session (~10s initialization). No switch_cleanup() exists because switch config is intentionally shared across tests in the same session.

This is the standard SAI test pattern — pfc_asym.py, sai_qos_tests.py, and all other SAI tests follow the same lifecycle: setUp → switch_init → parse_param, tearDown closes thrift only. Not introduced by this PR.
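A self-contained illustration of the guard-flag idea follows. The switch_inited flag and the early return mirror the snippet above; the call counter, the accessor function, and the point at which the flag is set are assumptions added so the behavior can be checked.

```python
switch_inited = False
_init_calls = 0  # hypothetical counter standing in for the heavy SAI setup


def switch_init(clients):
    global switch_inited, _init_calls
    if switch_inited:
        return  # already initialized, skip
    _init_calls += 1  # the real function would configure the switch here
    switch_inited = True


def init_call_count():
    """Hypothetical accessor used only to observe the sketch."""
    return _init_calls
```

Calling switch_init() from several setUp() invocations then performs the expensive work only on the first call.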

Signed-off-by: Xu Chen <[email protected]>

@StormLiangMS

Refactored multi-PG probe loop from 6 scattered 'continue' statements to while-True single-pass block with unified cleanup:
- break + fail_reason on any phase failure
- pg_success flag tracks completion
- Single drain_buffer([dst_port_id]) call in cleanup block

This ensures buffer state is always drained before moving to the next PG, preventing corrupted buffer from affecting subsequent PG probing.

UT coverage: PR sonic-net#22545 (3d75029), 7 new tests
IT coverage: PR sonic-net#22546 (14a29c2), 2 new tests

Addresses @StormLiangMS review: continue on PG failure skips buffer cleanup.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>

mssonicbld · 2026-03-24T04:59:47Z

/azp run

azure-pipelines · 2026-03-24T05:00:01Z

Azure Pipelines successfully started running 1 pipeline(s).

StormLiangMS

✅ LGTM — Solid framework implementation

Template method pattern correctly implemented:

ProbingBase.runTest() defines the workflow skeleton; subclasses (PfcXoffProbing, IngressDropProbing, HeadroomPoolProbing) serve as orchestrators
Traffic patterns (1→N port-based vs N→1 PG-based) properly abstracted
Algorithm orchestration follows correct sequence: UpperBound → LowerBound → ThresholdRange → optional ThresholdPoint
Resource management with setup/teardown and error recovery looks sound

Minor edge case: In HeadroomPoolProbing, if len(self.dscps) < len(self.pgs), dscp = self.dscps[pg_idx] would raise IndexError. Appears to be a documented constraint (validation warnings at lines 189-192), just flagging for awareness.
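If the constraint were enforced up front instead of surfacing as an IndexError mid-probe, a check along these lines could work. This is a sketch; the helper name and error message are hypothetical, and the PR itself only emits validation warnings.

```python
def pair_pgs_with_dscps(pgs, dscps):
    """Hypothetical helper: fail fast when there are fewer DSCPs than PGs."""
    if len(dscps) < len(pgs):
        raise ValueError(
            f"need at least {len(pgs)} DSCP values, got {len(dscps)}"
        )
    # zip stops at the shorter sequence, so extra DSCP values are ignored.
    return list(zip(pgs, dscps))
```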

XuChen-MSFT requested review from StormLiangMS, bingwang-ms, kperumalbfn, wsycqyz and yxieca February 23, 2026 13:55

yxieca previously approved these changes Feb 23, 2026

XuChen-MSFT dismissed yxieca’s stale review via bd02acf February 24, 2026 15:28

yxieca previously approved these changes Feb 24, 2026

wsycqyz previously approved these changes Feb 25, 2026

StormLiangMS reviewed Mar 17, 2026

XuChen-MSFT dismissed stale reviews from wsycqyz and yxieca via 7c6b4fa March 23, 2026 01:41

XuChen-MSFT mentioned this pull request Mar 23, 2026

[mmu probing] Refactor: extract _create_algorithms() and _run_algorithms() to ProbingBase #23190

Open

XuChen-MSFT mentioned this pull request Mar 23, 2026

[mmu probing] Cleanup: remove debug env vars after pipeline integration complete #23192

Open

XuChen-MSFT and others added 3 commits March 24, 2026 12:59

fix pre-commit errors

d5eda97

Signed-off-by: Xu Chen <[email protected]>

XuChen-MSFT force-pushed the xuchen3/mmu_probe/pr06-framework branch from 7c6b4fa to 7b8d48b Compare March 24, 2026 04:59

StormLiangMS approved these changes Mar 25, 2026

StormLiangMS merged commit bdcbf8e into sonic-net:master Mar 25, 2026
15 checks passed
