[mmu probing] pr06.probe: Add base framework and all probing implementations #22544

Merged
StormLiangMS merged 3 commits into sonic-net:master from XuChen-MSFT:xuchen3/mmu_probe/pr06-framework
Mar 25, 2026

@XuChen-MSFT XuChen-MSFT commented Feb 23, 2026

Description of PR

Summary:

Implement complete probing framework using template method pattern and three concrete probing test types:

Base Framework:

  • __init__.py: Module documentation and architecture overview
  • ProbingBase: Abstract base class implementing template method pattern
    • setUp(): PTF initialization and parameter parsing
    • runTest(): Template method orchestrating probe workflow
    • Abstract methods: setup_traffic(), probe(), get_probe_config()
    • Integrates algorithms, executors, observers, and buffer control

Probing Implementations:

  1. PfcXoffProbing (1->N pattern):

    • Detects PFC Xoff threshold by observing PFC frame generation
    • Single source sending to multiple destinations
    • Uses UpperBound -> LowerBound -> ThresholdRange algorithms
  2. IngressDropProbing (1->N pattern):

    • Detects ingress drop threshold by packet loss observation
    • Single source sending to multiple destinations
    • Uses UpperBound -> LowerBound -> ThresholdPoint algorithms
  3. HeadroomPoolProbing (N->1 pattern):

    • Detects headroom pool size via multi-PG iteration
    • Multiple sources sending to single destination
    • Iterates across priority groups for comprehensive testing

All implementations leverage the executor registry for environment-specific behavior (physical hardware vs simulation) and observer pattern for metrics.
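
A minimal sketch of the template method structure described above. The class names ending in `Sketch` and `Fake...` are hypothetical stand-ins, and the order in which `runTest()` calls the three abstract methods is an assumption; the PR text only says that `runTest()` orchestrates the probe workflow.

```python
from abc import ABC, abstractmethod


class ProbingBaseSketch(ABC):
    """Illustrative stand-in for ProbingBase; not the PR's actual code."""

    def runTest(self):
        # Template method: the workflow is fixed here, and each step is
        # supplied by a subclass. The step order is an assumption.
        config = self.get_probe_config()
        self.setup_traffic()
        return self.probe(config)

    @abstractmethod
    def setup_traffic(self):
        ...

    @abstractmethod
    def probe(self, config):
        ...

    @abstractmethod
    def get_probe_config(self):
        ...


class FakePfcXoffProbing(ProbingBaseSketch):
    """Hypothetical subclass that only records which hooks were called."""

    def __init__(self):
        self.calls = []

    def get_probe_config(self):
        self.calls.append("get_probe_config")
        return {"pattern": "1->N"}

    def setup_traffic(self):
        self.calls.append("setup_traffic")

    def probe(self, config):
        self.calls.append("probe")
        return {"pattern": config["pattern"], "threshold": None}
```

Because the three hooks are abstract, instantiating the base class directly raises `TypeError`, so a new probing type cannot forget to implement one of them.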

Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

qos refactoring

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

relevant PRs:
[mmu probing] pr01.docs: Add MMU threshold probing framework design
[mmu probing] pr02.probe: Add core probing algorithms with essential data structures
[mmu probing] pr03.probe: Add probing executors and executor registry
[mmu probing] pr04.probe: Add observer pattern for metrics tracking
[mmu probing] pr05.probe: Add stream manager and buffer occupancy controller
[mmu probing] pr06.probe: Add base framework and all probing implementations
[mmu probing] pr07.test: Add comprehensive unit tests for probe framework
[mmu probing] pr08.test: Add integration tests for end-to-end probing workflows
[mmu probing] pr09.test: Add production probe test and infrastructure updates

Deferred Enhancements (tracked as issues, no functional impact)

  • #23190 - Refactor: extract _create_algorithms() and _run_algorithms() to ProbingBase (reduce code duplication)
  • #23192 - Cleanup: remove debug env vars (pgnumlmt, etc.) after pipeline integration complete

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

yxieca
yxieca previously approved these changes Feb 23, 2026
Collaborator

@yxieca yxieca left a comment

Deep review done. Base framework + probing implementations look correct; no issues found.

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

yxieca
yxieca previously approved these changes Feb 24, 2026
Collaborator

@yxieca yxieca left a comment

Re-reviewed updates; no new issues found.

wsycqyz
wsycqyz previously approved these changes Feb 25, 2026
Contributor

@wsycqyz wsycqyz left a comment

lgtm

Collaborator

@StormLiangMS StormLiangMS left a comment

Review — PR #22544 (MMU Probing: Base Framework)

🔴 Bug

continue on PG failure skips buffer cleanup
When a phase fails (e.g., PFC upper bound), the code does continue to the next PG. However, the buffer may be in an inconsistent state (held, partially filled). The next PG's probing starts with corrupted buffer state. Should call buffer_ctrl.drain_buffer() before continuing.

⚠️ Design

  • Massive code duplication: PfcXoffProbing._create_algorithms() and IngressDropProbing._create_algorithms() are nearly identical (~120 lines each). The _run_algorithms() methods are completely identical. Should be factored into ProbingBase with a template or builder pattern. Currently, any algorithm workflow change must be duplicated across 3+ files.

  • os.environ.get('pgnumlmt') for PG limit is fragile — using env vars for debug control in a test framework is unconventional. This should be a test parameter, class attribute, or pytest fixture/marker — not an env var parsed in a hot loop.
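
One way to read the suggestion above is to resolve the limit once, before the loop, and to prefer an explicit test parameter over the environment. The sketch below assumes a hypothetical parameter name `pg_num_limit` and a hypothetical helper `resolve_pg_limit`; only the `pgnumlmt` variable name comes from the review.

```python
import os


def resolve_pg_limit(test_params, environ=None, default=None):
    """Return the PG limit, preferring an explicit test parameter.

    'pg_num_limit' is a hypothetical parameter name; 'pgnumlmt' is the
    environment variable named in the review comment.
    """
    environ = os.environ if environ is None else environ
    raw = test_params.get("pg_num_limit", environ.get("pgnumlmt"))
    if raw is None:
        return default  # no limit configured anywhere
    limit = int(raw)
    if limit < 1:
        raise ValueError(f"PG limit must be >= 1, got {limit}")
    return limit
```

Calling this once in setup and storing the result keeps the environment lookup and string parsing out of the per-PG loop, and the env var remains usable as a debugging fallback.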

❓ Question

  • setUp() calls switch_init() which is a heavy SAI thrift operation. Is this idempotent? If a test case's setUp fails halfway, does tearDown properly clean up the thrift connection?

XuChen-MSFT added a commit to XuChen-MSFT/sonic-mgmt that referenced this pull request Mar 23, 2026
…bing

Tests verify drain_buffer() is called before continue when a PG phase
fails in the multi-PG loop (order 4250-4256):
- PFC upper/lower/range failure (3 tests)
- Ingress Drop upper/lower/range failure (3 tests)
- Different dst_ports isolation (1 test)

Uses per-type mock side_effects to correctly handle 2-tuple (upper/lower)
vs 3-tuple (range/point) algorithm return values.

Related: PR sonic-net#22544 fix (while-True unified cleanup)

Co-authored-by: Copilot <[email protected]>
XuChen-MSFT added a commit to XuChen-MSFT/sonic-mgmt that referenced this pull request Mar 23, 2026
- test_headroom_pool_buffer_cleanup_on_pg_failure: 2 PGs, verify probe
  completes without crash when PG fails
- test_headroom_pool_multi_pg_isolation: 3 PGs, verify all PGs produce
  independent results

Related: PR sonic-net#22544 fix (while-True unified cleanup)

Co-authored-by: Copilot <[email protected]>
@XuChen-MSFT XuChen-MSFT dismissed stale reviews from wsycqyz and yxieca via 7c6b4fa March 23, 2026 01:41
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@XuChen-MSFT
Contributor Author

@StormLiangMS Re: continue on PG failure skips buffer cleanup

Fixed (7c6b4fa): Refactored multi-PG probe loop from 6 scattered continue statements to a while True single-pass block with unified cleanup:

pg_success = False
fail_reason = None
while True:  # single-pass block — break = goto cleanup
    if pfc_upper is None:
        fail_reason = "PFC XOFF upper bound failure"
        break
    # ... all phases ...
    pg_success = True
    break

if not pg_success:
    ProbingObserver.console(f"  Skipping PG #{i+1} due to {fail_reason}")
    self.buffer_ctrl.drain_buffer([dst_port_id])
    continue

Single drain_buffer call in the unified cleanup block — no scattered cleanup, future-proof for new phases.
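
The excerpt above sits inside the per-PG loop. A self-contained sketch of the same pattern, runnable outside the framework, might look like the following. `FakeBufferCtrl`, `probe_all_pgs`, and the shape of `phase_results` are hypothetical; only the `pg_success` / `fail_reason` / `drain_buffer([dst_port_id])` structure is taken from the comment.

```python
class FakeBufferCtrl:
    """Hypothetical stand-in that records every drain request."""

    def __init__(self):
        self.drained = []

    def drain_buffer(self, port_ids):
        self.drained.append(list(port_ids))


def probe_all_pgs(phase_results, buffer_ctrl, dst_port_id):
    """phase_results: one dict per PG, mapping phase name -> result or None."""
    completed = []
    for i, phases in enumerate(phase_results):
        pg_success = False
        fail_reason = None
        while True:  # single-pass block: every break lands on the cleanup below
            if phases.get("pfc_upper") is None:
                fail_reason = "PFC XOFF upper bound failure"
                break
            if phases.get("pfc_lower") is None:
                fail_reason = "PFC XOFF lower bound failure"
                break
            pg_success = True
            break

        if not pg_success:
            print(f"  Skipping PG #{i+1} due to {fail_reason}")
            buffer_ctrl.drain_buffer([dst_port_id])  # the one cleanup call
            continue
        completed.append(i)
    return completed
```

With three PGs where the second and third each fail a phase, the function returns `[0]` and the controller records two drains of the destination port, one per failed PG.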

Test coverage:

@XuChen-MSFT
Contributor Author

@StormLiangMS Re: Code duplication in _create_algorithms() and _run_algorithms()

Agreed — _run_algorithms() is 100% identical across PfcXoff and IngressDrop (~55 lines), and _create_algorithms() differs by only 1 context_template string (~124 lines each).

Deferring to a follow-up PR because:

  • Involves 3 files (ProbingBase + 2 subclasses), ~180 lines of refactoring
  • headroom_pool_probing.py just underwent a structural refactor (while-True cleanup in 7c6b4fa)
  • Stacking refactors increases regression risk

Tracked in issue #23190 with a phased approach:

  1. Extract identical _run_algorithms() to ProbingBase (low risk)
  2. Parameterize _create_algorithms() via class attribute (medium risk)
  3. Align HeadroomPoolProbing inline pattern (higher risk)
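
Steps 1 and 2 of the plan could take roughly the following shape. This is a sketch under assumptions: the real methods build algorithm objects, while the version here only formats strings to show where a per-class `context_template` attribute would be consumed. The class names and template values are hypothetical.

```python
class ProbingBaseSketch:
    # Step 2: subclasses override this single class attribute.
    context_template = None

    PHASES = ("upper_bound", "lower_bound", "threshold")

    def _create_algorithms(self):
        # Hypothetical body: real code would construct algorithm objects.
        return {
            name: self.context_template.format(phase=name)
            for name in self.PHASES
        }

    def _run_algorithms(self, algorithms):
        # Step 1: one shared implementation instead of identical copies.
        return [algorithms[name] for name in self.PHASES]


class PfcXoffSketch(ProbingBaseSketch):
    context_template = "pfc_xoff:{phase}"


class IngressDropSketch(ProbingBaseSketch):
    context_template = "ingress_drop:{phase}"
```

With this layout each subclass shrinks to a one-line attribute override, so a change to the algorithm workflow is made once in the base class.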

@XuChen-MSFT
Contributor Author

@StormLiangMS Re: os.environ.get("pgnumlmt") for PG limit is fragile

Agreed this should be cleaned up, but deferring until after pipeline integration is complete.

These debug env vars (pgnumlmt, INGRESS_DROP_USE_PG_COUNTER) are currently used for manual debugging during per-SKU lightning validation. Removing them now would require re-adding them when debugging issues across different ASICs. Once all SKU validations are done and the pipeline is stable, they can be replaced with proper test parameters or removed.

Tracked in issue #23192.

@XuChen-MSFT
Contributor Author

@StormLiangMS Re: setUp() idempotency and tearDown safety

Thrift connection: Safe. tearDown() calls ThriftInterfaceDataPlane.tearDown() which closes src_transport and dst_transport. PTF guarantees tearDown runs even if setUp fails halfway, so no connection leak.

switch_init() idempotency: Yes — uses a global switch_inited flag:

def switch_init(clients):
    global switch_inited
    if switch_inited:
        return  # Already initialized, skip

Designed to run once per session (~10s initialization). No switch_cleanup() exists because switch config is intentionally shared across tests in the same session.

This is the standard SAI test pattern. pfc_asym.py, sai_qos_tests.py, and all other SAI tests follow the same lifecycle: setUp → switch_init → parse_param; tearDown closes thrift only. Not introduced by this PR.
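
The guard quoted above can be exercised in isolation. In the sketch below, `init_calls` is a hypothetical counter standing in for the real SAI thrift setup, and setting the flag after the work is an assumption; the excerpt in the thread does not show where the real function sets it.

```python
switch_inited = False
init_calls = 0  # demo-only counter; stands in for the expensive setup


def switch_init(clients):
    """Run the expensive initialization once; later calls return early."""
    global switch_inited, init_calls
    if switch_inited:
        return  # already initialized, skip
    init_calls += 1  # placeholder for the real initialization work
    switch_inited = True  # set last, so a failed attempt would be retried


# Two setUp() calls in the same session trigger only one initialization.
switch_init(clients=None)
switch_init(clients=None)
```

A module-level flag like this is per process, which matches the "once per session" behaviour described in the reply.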

XuChen-MSFT and others added 3 commits March 24, 2026 12:59
Implement complete probing framework using template method pattern and
three concrete probing test types:

Base Framework:
- __init__.py: Module documentation and architecture overview
- ProbingBase: Abstract base class implementing template method pattern
  - setUp(): PTF initialization and parameter parsing
  - runTest(): Template method orchestrating probe workflow
  - Abstract methods: setup_traffic(), probe(), get_probe_config()
  - Integrates algorithms, executors, observers, and buffer control

Probing Implementations:

1. PfcXoffProbing (1→N pattern):
   - Detects PFC Xoff threshold by observing PFC frame generation
   - Single source sending to multiple destinations
   - Uses UpperBound → LowerBound → ThresholdRange algorithms

2. IngressDropProbing (1→N pattern):
   - Detects ingress drop threshold by packet loss observation
   - Single source sending to multiple destinations
   - Uses UpperBound → LowerBound → ThresholdPoint algorithms

3. HeadroomPoolProbing (N→1 pattern):
   - Detects headroom pool size via multi-PG iteration
   - Multiple sources sending to single destination
   - Iterates across priority groups for comprehensive testing

All implementations leverage the executor registry for environment-specific
behavior (physical hardware vs simulation) and observer pattern for metrics.

Signed-off-by: Xu Chen <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
Refactored multi-PG probe loop from 6 scattered 'continue' statements
to while-True single-pass block with unified cleanup:
- break + fail_reason on any phase failure
- pg_success flag tracks completion
- Single drain_buffer([dst_port_id]) call in cleanup block

This ensures buffer state is always drained before moving to the next PG,
preventing corrupted buffer from affecting subsequent PG probing.

UT coverage: PR sonic-net#22545 (3d75029) — 7 new tests
IT coverage: PR sonic-net#22546 (14a29c2) — 2 new tests

Addresses @StormLiangMS review: continue on PG failure skips buffer cleanup.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
@XuChen-MSFT XuChen-MSFT force-pushed the xuchen3/mmu_probe/pr06-framework branch from 7c6b4fa to 7b8d48b Compare March 24, 2026 04:59
XuChen-MSFT added a commit to XuChen-MSFT/sonic-mgmt that referenced this pull request Mar 24, 2026
…bing

Tests verify drain_buffer() is called before continue when a PG phase
fails in the multi-PG loop (order 4250-4256):
- PFC upper/lower/range failure (3 tests)
- Ingress Drop upper/lower/range failure (3 tests)
- Different dst_ports isolation (1 test)

Uses per-type mock side_effects to correctly handle 2-tuple (upper/lower)
vs 3-tuple (range/point) algorithm return values.

Related: PR sonic-net#22544 fix (while-True unified cleanup)

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
XuChen-MSFT added a commit to XuChen-MSFT/sonic-mgmt that referenced this pull request Mar 24, 2026
- test_headroom_pool_buffer_cleanup_on_pg_failure: 2 PGs, verify probe
  completes without crash when PG fails
- test_headroom_pool_multi_pg_isolation: 3 PGs, verify all PGs produce
  independent results

Related: PR sonic-net#22544 fix (while-True unified cleanup)

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Collaborator

@StormLiangMS StormLiangMS left a comment

✅ LGTM — Solid framework implementation

Template method pattern correctly implemented:

  • ProbingBase.runTest() defines the workflow skeleton; subclasses (PfcXoffProbing, IngressDropProbing, HeadroomPoolProbing) serve as orchestrators
  • Traffic patterns (1→N port-based vs N→1 PG-based) properly abstracted
  • Algorithm orchestration follows correct sequence: UpperBound → LowerBound → ThresholdRange → optional ThresholdPoint
  • Resource management with setup/teardown and error recovery looks sound
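
The thread names the algorithm sequence but does not show the algorithms themselves. As a generic illustration only, the sketch below assumes a monotonic condition `triggers(n)` (for example, "sending n packets produces a PFC frame") and uses doubling, halving, and bisection. The function names, the doubling and halving strategy, and the `precision` parameter are all assumptions rather than the framework's implementation.

```python
def find_upper_bound(triggers, start=1, limit=1 << 20):
    """Double the packet count until the observed condition triggers."""
    n = start
    while not triggers(n):
        n *= 2
        if n > limit:
            return None  # gave up: no trigger within the search limit
    return n


def find_lower_bound(triggers, upper):
    """Halve from the upper bound until the condition stops triggering."""
    n = upper // 2
    while n > 0 and triggers(n):
        n //= 2
    return n


def find_threshold_range(triggers, lower, upper, precision=1):
    """Bisect [lower, upper] until the window is at most `precision` wide."""
    while upper - lower > precision:
        mid = (lower + upper) // 2
        if triggers(mid):
            upper = mid  # threshold is at or below mid
        else:
            lower = mid  # threshold is above mid
    return lower, upper
```

For a condition that first triggers at 1000 packets, this yields an upper bound of 1024, a lower bound of 512, and a final range of (999, 1000).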

Minor edge case: In HeadroomPoolProbing, if len(self.dscps) < len(self.pgs), dscp = self.dscps[pg_idx] would raise IndexError. Appears to be a documented constraint (validation warnings at lines 189-192), just flagging for awareness.
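
If the constraint were enforced rather than only warned about, a fail-fast check could look like the sketch below. `pair_pgs_with_dscps` is a hypothetical helper, not part of the PR; it only illustrates turning the potential `IndexError` into an early, descriptive error.

```python
def pair_pgs_with_dscps(pgs, dscps):
    """Pair each PG with a DSCP, failing early if there are too few DSCPs."""
    if len(dscps) < len(pgs):
        raise ValueError(
            f"need at least {len(pgs)} DSCP values for {len(pgs)} PGs, "
            f"got {len(dscps)}"
        )
    # zip stops at the shorter input, so extra DSCP values are ignored.
    return list(zip(pgs, dscps))
```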

@StormLiangMS StormLiangMS merged commit bdcbf8e into sonic-net:master Mar 25, 2026
15 checks passed
ravaliyel pushed a commit to ravaliyel/sonic-mgmt that referenced this pull request Mar 27, 2026
…tations (sonic-net#22544)

Signed-off-by: Xu Chen <[email protected]>
Co-authored-by: Copilot <[email protected]>