[mmu probing] pr09.test: Add production probe test and infrastructure updates#22547
[mmu probing] pr09.test: Add production probe test and infrastructure updates#22547XuChen-MSFT wants to merge 12 commits intosonic-net:masterfrom
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Found a couple blocking issues:
These likely explain the static analysis failure. Please fix and re-run checks. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Regarding to PR test failure: |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Added conditional mark for
Each issue tracks the validation work to ensure MMU threshold probing tests run successfully without corner cases. Once validation completes for a platform, the corresponding conditional skip will be removed. Conditional mark location: |
@yxieca Thanks for the review. |
…ic-net#22547) <!-- Please make sure you've read and understood our contributing guidelines: https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md failure_prs.log skip_prs.log Make sure all your commits include a signature generated with `git commit -s` ** If this is a bug fix, make sure your description includes "fixes #xxxx", or "closes #xxxx" or "resolves #xxxx" Please provide the following information: --> #### Why I did it In the case of ASIC detection failures on Broadcom (or if the ASIC couldn't be detected in time), the `/dev/shm` partition in the syncd container will be only 64MB, which might cause issues if syncd/Broadcom SAI library needs more space than that. ##### Work item tracking - Microsoft ADO **(number only)**: #### How I did it Since using a larger `/dev/shm` on its own doesn't cause any issues, bump up the default to 512MB. This should be enough for most platforms. #### How to verify it <!-- If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012. --> #### Which release branch to backport (provide reason below if selected) <!-- - Note we only backport fixes to a release branch, *not* features! - Please also provide a reason for the backporting below. - e.g. - [x] 202006 --> - [ ] 201811 - [ ] 201911 - [ ] 202006 - [ ] 202012 - [ ] 202106 - [ ] 202111 - [ ] 202205 - [ ] 202211 - [ ] 202305 #### Tested branch (Please provide the tested image version) <!-- - Please provide tested image version - e.g. - [x] 20201231.100 --> - [ ] <!-- image version 1 --> - [ ] <!-- image version 2 --> #### Description for the changelog <!-- Write a short (one line) summary that describes the changes in this pull request for inclusion in the changelog: --> <!-- Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU. --> #### Link to config_db schema for YANG module changes <!-- Provide a link to config_db schema for the table for which YANG model is defined Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md --> #### A picture of a cute animal (not mandatory but encouraged)
StormLiangMS
left a comment
There was a problem hiding this comment.
Review — PR #22547 (MMU Probing: Production Tests + Infrastructure)
This is the top of the 9-PR MMU probing stack. I reviewed the full series (#22539–#22547). Overall the framework is well-architected — binary search threshold probing with pluggable algorithms/executors/observers is a solid design. However, I found 2 ship-blockers in this PR and several issues across the stack.
🔴 Ship-Blockers
1. Typo in config key — " breakout" with leading space
In test_qos_probe.py (~line 449):
qosConfig = dutQosConfig["param"][portSpeedCableLength][" breakout"]Note the leading space in " breakout". This will KeyError at runtime for any breakout SKU running testQosIngressDropProbe. Compare with the correct ["breakout"] (no space) used in testQosPfcXoffProbe.
2. set() passed where list is required
In qos_sai_base.py (~line 174):
pytest_assert(self.replaceNonExistentPortId(testPortIds, set(portIds)), ...)replaceNonExistentPortId does portIds[idx] = freePorts.pop(0) — index assignment on a set raises TypeError. Sets don't support indexing.
⚠️ Design Issues
-
find_cell_size()duplicated in 3 test methods — identical recursive search helper intestQosPfcXoffProbe,testQosIngressDropProbe, andtestQosHeadroomPoolProbe. Should be a class method or standalone utility. -
src_port_vlansmay be unbound — intestQosHeadroomPoolProbe, it's set only inside theif platform_asic == "broadcom-dnx"block but appears to be referenced later unconditionally for DNX platforms. -
in_py3variable name misleading inptf_runner.py— now True for bothpy3andprobedirectories. Name should reflect the broader meaning.
❓ Question
tests_mark_conditions.yaml— some skip conditions include GitHub issue URLs in the condition string (e.g.,"asic_type in ['vs'] and https://github.com/..."). Does the conditional_mark plugin actually parse these? The URL portion would be aNameErrorin Pythoneval().
Cross-Stack Issues (from #22540, #22541, #22544)
See individual PR reviews for details, but the most important ones:
- #22540: Infinite loop in lower-bound algorithm when
currentreaches 1 (max(1//2, 1) == 1forever) - #22540: Point algorithm continues sending incremental traffic after a failure (corrupted buffer state)
- #22541: Missing
self.observerNone guard iningress_drop_probing_executor.pyverbose trace block - #22544:
continueon PG failure skips buffer cleanup — next PG probes with corrupted buffer state - #22544: Massive code duplication in
_create_algorithms()between PfcXoff and IngressDrop classes (~120 identical lines each)
|
/azp run |
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
|
@StormLiangMS Thanks for the thorough review across the entire 9-PR stack. Addressed all items for this PR below. ✅ Fixed (4 commits pushed)1. 2. 3. 4. ℹ️ Acknowledged — Deferring to Validation Phase5. I'd prefer not to refactor this path preemptively at this stage for two reasons:
Will revisit once the cross-platform validation rounds are complete. ℹ️ By Design — No Change Needed6. URL in Cross-Stack Issues (#22540, #22541, #22544)Will address in the respective PRs separately — thanks for flagging them. |
Enable probe framework for production testbed usage with infrastructure updates and physical hardware test cases. Infrastructure Updates: 1. tests/conftest.py: - Add --enable_qos_ptf_pdb option for PTF debugging with pdb breakpoint - Add --ingress_drop_probing option to switch between PFC/Drop probing modes 2. tests/ptf_runner.py: - Add 'probe' subdirectory support alongside 'py3' - Add test_subdir parameter for flexible PTF test location - Enable probe tests to run via PTF runner infrastructure 3. tests/qos/qos_sai_base.py (QosSaiBase refactoring): - Move replaceNonExistentPortId() from TestQosSai to base class - Move updateTestPortIdIp() from TestQosSai to base class - Add bufferConfig to dut_qos_maps fixture for all devices - Enable probe tests to access buffer configuration - Shared utility methods for port ID/IP management 4. tests/qos/test_qos_sai.py: - Remove replaceNonExistentPortId() (moved to base) - Remove updateTestPortIdIp() (moved to base) - Reduce code duplication Production Test Cases: 5. tests/qos/test_qos_probe.py (NEW - 544 lines): - TestQosProbe class for physical testbed probing - test_pfc_xoff_probing: PFC Xoff threshold detection on hardware - test_ingress_drop_probing: Ingress drop threshold detection - test_headroom_pool_probing: Headroom pool size probing - Integrates with existing QoS test infrastructure - Uses physical executors for real hardware validation - Validates probe framework on production testbeds This PR completes the probe framework integration, enabling threshold probing tests to run on physical SONiC testbeds alongside existing QoS tests. Signed-off-by: Xu Chen <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
Fix KeyError in testQosIngressDropProbe for breakout SKUs. Line 206 had [" breakout"] (with leading space) instead of ["breakout"]. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
Deduplicate find_cell_size() which was identically defined 3 times as nested functions inside testQosPfcXoffProbe, testQosIngressDropProbe, and testQosHeadroomPoolProbe. Now a single @staticmethod on TestQosProbe. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
set() does not support index assignment (portIds[idx] = ...) which replaceNonExistentPortId() uses internally. This is a pre-existing bug from PR sonic-net#8149 (test_qos_sai.py L345), moved here during refactoring. The set() was likely intended for deduplication but breaks item assignment. In practice it rarely triggered because most testbeds have all valid ports, so the index assignment path was never reached. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
The boolean originally indicated whether a test file was in the py3/ subdirectory. With the addition of the probe/ subdirectory for the MMU threshold probing framework, the variable now indicates whether the test is in any subdirectory (py3 or probe). Rename for clarity. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
6ce1d4c to
98db457
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
StormLiangMS
left a comment
There was a problem hiding this comment.
[High] Broadcom-specific bcmcmd runs on all platforms
test_qos_probe.py:632-647
testQosHeadroomPoolProbe runs bcmcmd "knetctrl netif show" without first checking sonic_asic_type == "broadcom". The ASIC type check at line 653 only guards TD2/TD3 filtering logic, not the initial bcmcmd invocations. This will crash on Mellanox/Cisco with "command not found".
Suggested fix: Wrap all bcmcmd-related code in if dutTestParams["basicParams"]["sonic_asic_type"] == "broadcom":, and provide a skip or alternative path for non-Broadcom platforms.
[Medium] max() on potentially empty dict
test_qos_probe.py:707
max(xpe_to_testports.keys(), ...)If xpe_to_testports is empty (bcmcmd fails, unexpected output, all ports filtered), max() raises ValueError.
Suggested fix:
if not xpe_to_testports:
pytest.skip("No available test ports found for probing")[Low] set → list fix in replaceNonExistentPortId
qos_sai_base.py:189
Changed from set(portIds) to list(portIds). This is actually a bug fix — the method uses portIds[idx] = ... which requires list indexing. Just verify no other callers still pass sets.
[Low] Unused --ingress_drop_probing CLI option
conftest.py:46
The option is defined with parser.addoption("--ingress_drop_probing", ...) but never consumed via getoption. Both test methods run as separate parametrized tests regardless. Consider implementing the gating logic or removing the unused option.
bcmcmd commands (knetctrl netif show, show pmap) are Broadcom-specific and crash on Mellanox/Cisco with 'command not found'. Wrapped entire XPE port mapping logic in 'if sonic_asic_type == broadcom:' guard. Non-Broadcom platforms use all available test ports directly. Addresses @StormLiangMS review (2026-03-25): bcmcmd runs on all platforms. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
max() on empty dict raises ValueError. Skip test gracefully when no available test ports are found after XPE mapping. Addresses @StormLiangMS review (2026-03-25): max() on potentially empty dict. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
After bcmcmd guard, variables like bcmport_to_sonicport, xpe_to_bcmports are only defined inside the broadcom branch. Logger.info must only reference variables available in both branches. Co-authored-by: Copilot <[email protected]> Signed-off-by: Xu Chen <[email protected]>
|
/azp run |
|
@StormLiangMS Re: 4 findings [High] bcmcmd on non-Broadcom — Fixed ( [Medium] max() on empty dict — Fixed ( [Low] set→list in replaceNonExistentPortId — Already fixed in earlier commit ( [Low] Unused --ingress_drop_probing CLI option — This is a pre-reserved parameter for future gating logic (switching between PFC/IngressDrop probing modes via CLI). Currently both tests run as separate methods. Will implement the gating or remove when the probing mode selection is finalized. |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Description of PR
Summary:
Enable probe framework for production testbed usage with infrastructure updates and physical hardware test cases.
Infrastructure Updates:
tests/conftest.py:
tests/ptf_runner.py:
tests/qos/qos_sai_base.py (QosSaiBase refactoring):
tests/qos/test_qos_sai.py:
Production Test Cases:
This PR completes the probe framework integration, enabling threshold probing tests to run on physical SONiC testbeds alongside existing QoS tests.
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
qos refactoring
How did you do it?
How did you verify/test it?
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation
relevant PRs:
[mmu probing] pr01.docs: Add MMU threshold probing framework design
[mmu probing] pr02.probe: Add core probing algorithms with essential data structures
[mmu probing] pr03.probe: Add probing executors and executor registry
[mmu probing] pr04.probe: Add observer pattern for metrics tracking
[mmu probing] pr05.probe: Add stream manager and buffer occupancy controller
[mmu probing] pr06.probe: Add base framework and all probing implementations
[mmu probing] pr07.test: Add comprehensive unit tests for probe framework
[mmu probing] pr08.test: Add integration tests for end-to-end probing workflows
[mmu probing] pr09.test: Add production probe test and infrastructure updates