[mmu probing] pr06.probe: Add base framework and all probing implementations#22544
Conversation
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
yxieca
left a comment
Deep review done. Base framework + probing implementations look correct; no issues found.
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
yxieca
left a comment
Re-reviewed updates; no new issues found.
StormLiangMS
left a comment
Review — PR #22544 (MMU Probing: Base Framework)
🔴 Bug
`continue` on PG failure skips buffer cleanup
When a phase fails (e.g., PFC upper bound), the code does `continue` to the next PG. However, the buffer may be in an inconsistent state (held, partially filled), so the next PG's probing starts with corrupted buffer state. `buffer_ctrl.drain_buffer()` should be called before continuing.
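The failure mode can be sketched as follows; `buffer_ctrl` and `drain_buffer` follow the PR's API, while `probe_all_pgs` and `run_phase` are hypothetical stand-ins for the multi-PG loop:

```python
# Illustrative sketch of the reviewed bug and its fix; simplified, not the
# PR's actual code.
def probe_all_pgs(pgs, buffer_ctrl, run_phase):
    results = []
    for pg in pgs:
        if not run_phase(pg):  # e.g. the PFC upper-bound phase fails
            # Without this drain, the buffer stays held/partially filled
            # and the next PG probes against corrupted state.
            buffer_ctrl.drain_buffer([pg])
            continue
        results.append(pg)
    return results
```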
⚠️ Design
- Massive code duplication — `PfcXoffProbing._create_algorithms()` and `IngressDropProbing._create_algorithms()` are nearly identical (~120 lines each). The `_run_algorithms()` methods are completely identical. They should be factored into `ProbingBase` with a template or builder pattern. Currently, any algorithm workflow change must be duplicated across 3+ files.
- `os.environ.get('pgnumlmt')` for the PG limit is fragile — using env vars for debug control in a test framework is unconventional. This should be a test parameter, class attribute, or pytest fixture/marker, not an env var parsed in a hot loop.
❓ Question
`setUp()` calls `switch_init()`, which is a heavy SAI thrift operation. Is this idempotent? If a test case's `setUp` fails halfway, does `tearDown` properly clean up the thrift connection?
…bing

Tests verify drain_buffer() is called before continue when a PG phase fails in the multi-PG loop (order 4250-4256):
- PFC upper/lower/range failure (3 tests)
- Ingress Drop upper/lower/range failure (3 tests)
- Different dst_ports isolation (1 test)

Uses per-type mock side_effects to correctly handle 2-tuple (upper/lower) vs 3-tuple (range/point) algorithm return values.

Related: PR sonic-net#22544 fix (while-True unified cleanup)
Co-authored-by: Copilot <[email protected]>

- test_headroom_pool_buffer_cleanup_on_pg_failure: 2 PGs, verify probe completes without crash when PG fails
- test_headroom_pool_multi_pg_isolation: 3 PGs, verify all PGs produce independent results

Related: PR sonic-net#22544 fix (while-True unified cleanup)
Co-authored-by: Copilot <[email protected]>
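The 2-tuple vs 3-tuple handling mentioned above can be sketched with `unittest.mock` side effects; the algorithm names and return shapes here are illustrative assumptions, not the actual test code:

```python
from unittest.mock import MagicMock

# Hypothetical shapes inferred from the test description: upper/lower-bound
# algorithms return 2-tuples, range/point algorithms return 3-tuples.
upper_bound = MagicMock(side_effect=[(100, True), (90, True)])
threshold_range = MagicMock(side_effect=[(80, 120, True)])

val, ok = upper_bound()            # 2-tuple unpacking (first PG)
lo, hi, ok2 = threshold_range()    # 3-tuple unpacking
```

Queuing per-type `side_effect` lists lets each mocked algorithm return the right tuple arity on every call across the multi-PG loop.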
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@StormLiangMS Re: `continue` on PG failure skips buffer cleanup

Fixed:

```python
pg_success = False
fail_reason = None
while True:  # single-pass block — break = goto cleanup
    if pfc_upper is None:
        fail_reason = "PFC XOFF upper bound failure"
        break
    # ... all phases ...
    pg_success = True
    break
if not pg_success:
    ProbingObserver.console(f"  Skipping PG #{i+1} due to {fail_reason}")
    self.buffer_ctrl.drain_buffer([dst_port_id])
    continue
```

Single unified cleanup point. Test coverage:
@StormLiangMS Re: Code duplication in `_create_algorithms()` and `_run_algorithms()`

Agreed — deferring to a follow-up PR. Tracked in issue #23190 with a phased approach.
@StormLiangMS Re: debug env vars

Agreed this should be cleaned up, but deferring until after pipeline integration is complete. These debug env vars (…). Tracked in issue #23192.
@StormLiangMS Re: `setUp()` idempotency and `tearDown` safety

Thrift connection: Safe.

`switch_init()` idempotency: Yes — uses a global flag:

```python
def switch_init(clients):
    global switch_inited
    if switch_inited:
        return  # Already initialized, skip
```

Designed to run once per session (~10s initialization). This is the standard SAI test pattern.
Implement complete probing framework using template method pattern and three concrete probing test types:

Base Framework:
- __init__.py: Module documentation and architecture overview
- ProbingBase: Abstract base class implementing template method pattern
- setUp(): PTF initialization and parameter parsing
- runTest(): Template method orchestrating probe workflow
- Abstract methods: setup_traffic(), probe(), get_probe_config()
- Integrates algorithms, executors, observers, and buffer control

Probing Implementations:
1. PfcXoffProbing (1→N pattern):
   - Detects PFC Xoff threshold by observing PFC frame generation
   - Single source sending to multiple destinations
   - Uses UpperBound → LowerBound → ThresholdRange algorithms
2. IngressDropProbing (1→N pattern):
   - Detects ingress drop threshold by packet loss observation
   - Single source sending to multiple destinations
   - Uses UpperBound → LowerBound → ThresholdPoint algorithms
3. HeadroomPoolProbing (N→1 pattern):
   - Detects headroom pool size via multi-PG iteration
   - Multiple sources sending to single destination
   - Iterates across priority groups for comprehensive testing

All implementations leverage the executor registry for environment-specific behavior (physical hardware vs simulation) and observer pattern for metrics.

Signed-off-by: Xu Chen <[email protected]>
Refactored multi-PG probe loop from 6 scattered 'continue' statements to while-True single-pass block with unified cleanup:
- break + fail_reason on any phase failure
- pg_success flag tracks completion
- Single drain_buffer([dst_port_id]) call in cleanup block

This ensures buffer state is always drained before moving to the next PG, preventing corrupted buffer from affecting subsequent PG probing.

UT coverage: PR sonic-net#22545 (3d75029) — 7 new tests
IT coverage: PR sonic-net#22546 (14a29c2) — 2 new tests

Addresses @StormLiangMS review: continue on PG failure skips buffer cleanup.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
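For comparison, the same single-cleanup guarantee could be expressed with an exception instead of the while-True block. This is a hedged sketch with hypothetical names (`probe_one_pg`, `PhaseFailure`), not the PR's code:

```python
class PhaseFailure(Exception):
    """Carries the fail_reason for the current PG."""

def probe_one_pg(phases, drain_buffer, dst_port_id):
    # 'phases' is a list of (name, callable) pairs returning None on failure;
    # 'drain_buffer' stands in for buffer_ctrl.drain_buffer. Illustrative.
    try:
        for name, run in phases:
            if run() is None:
                raise PhaseFailure(f"{name} failure")
        return True
    except PhaseFailure as reason:
        print(f"  Skipping PG due to {reason}")
        drain_buffer([dst_port_id])  # single cleanup point, as in the PR
        return False
```

Either structure funnels all failure paths through one drain call, which is the property the refactor is after.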
Force-pushed from 7c6b4fa to 7b8d48b
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
StormLiangMS
left a comment
✅ LGTM — Solid framework implementation
Template method pattern correctly implemented:
- `ProbingBase.runTest()` defines the workflow skeleton; subclasses (`PfcXoffProbing`, `IngressDropProbing`, `HeadroomPoolProbing`) serve as orchestrators
- Traffic patterns (1→N port-based vs N→1 PG-based) properly abstracted
- Algorithm orchestration follows the correct sequence: UpperBound → LowerBound → ThresholdRange → optional ThresholdPoint
- Resource management with setup/teardown and error recovery looks sound

Minor edge case: In `HeadroomPoolProbing`, if `len(self.dscps) < len(self.pgs)`, `dscp = self.dscps[pg_idx]` would raise `IndexError`. This appears to be a documented constraint (validation warnings at lines 189-192); just flagging for awareness.
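The flagged `IndexError` could be turned into an explicit validation error; a sketch with a hypothetical helper name, not the PR's actual validation code:

```python
def validate_pg_dscp_mapping(dscps, pgs):
    # Guards the dscp = dscps[pg_idx] lookup flagged in the review:
    # every PG index must have a corresponding DSCP value.
    if len(dscps) < len(pgs):
        raise ValueError(
            f"need at least {len(pgs)} DSCP values for {len(pgs)} PGs, "
            f"got {len(dscps)}"
        )
    return [(pg_idx, dscps[pg_idx]) for pg_idx in range(len(pgs))]
```

Failing fast with a descriptive message is easier to diagnose in a PTF run than a bare `IndexError` deep in the multi-PG loop.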
Description of PR
Summary:
Implement complete probing framework using template method pattern and three concrete probing test types:
Base Framework:
Probing Implementations:
PfcXoffProbing (1->N pattern):
IngressDropProbing (1->N pattern):
HeadroomPoolProbing (N->1 pattern):
All implementations leverage the executor registry for environment-specific behavior (physical hardware vs simulation) and observer pattern for metrics.
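A minimal sketch of the template method shape described in the summary; the method names come from the description above, while the bodies and exact call order are placeholder assumptions:

```python
from abc import ABC, abstractmethod

class ProbingBase(ABC):
    # runTest() is the template method: it fixes the workflow order, while
    # subclasses supply traffic setup and probe logic. The call order shown
    # here is an assumption for illustration.
    def runTest(self):
        config = self.get_probe_config()
        self.setup_traffic()
        return self.probe(config)

    @abstractmethod
    def get_probe_config(self): ...

    @abstractmethod
    def setup_traffic(self): ...

    @abstractmethod
    def probe(self, config): ...

class HeadroomPoolProbing(ProbingBase):
    # N->1 pattern: multiple sources, one destination (placeholder behavior).
    def get_probe_config(self):
        return {"pattern": "N->1"}

    def setup_traffic(self):
        self.traffic_ready = True

    def probe(self, config):
        return config["pattern"]
```

Each concrete probing type only overrides the hooks, so the workflow skeleton stays in one place.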
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
qos refactoring
How did you do it?
How did you verify/test it?
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation
relevant PRs:
[mmu probing] pr01.docs: Add MMU threshold probing framework design
[mmu probing] pr02.probe: Add core probing algorithms with essential data structures
[mmu probing] pr03.probe: Add probing executors and executor registry
[mmu probing] pr04.probe: Add observer pattern for metrics tracking
[mmu probing] pr05.probe: Add stream manager and buffer occupancy controller
[mmu probing] pr06.probe: Add base framework and all probing implementations
[mmu probing] pr07.test: Add comprehensive unit tests for probe framework
[mmu probing] pr08.test: Add integration tests for end-to-end probing workflows
[mmu probing] pr09.test: Add production probe test and infrastructure updates
Deferred Enhancements (tracked as issues, no functional impact)