[mmu probing] pr09.test: Add production probe test and infrastructure updates #22547

Open
XuChen-MSFT wants to merge 12 commits into sonic-net:master from XuChen-MSFT:xuchen3/mmu_probe/pr09-production

Conversation

@XuChen-MSFT
Contributor

@XuChen-MSFT XuChen-MSFT commented Feb 23, 2026

Description of PR

Summary:

Enable probe framework for production testbed usage with infrastructure updates and physical hardware test cases.

Infrastructure Updates:

  1. tests/conftest.py:

    • Add --enable_qos_ptf_pdb option for PTF debugging with pdb breakpoint
    • Add --ingress_drop_probing option to switch between PFC/Drop probing modes
  2. tests/ptf_runner.py:

    • Add 'probe' subdirectory support alongside 'py3'
    • Add test_subdir parameter for flexible PTF test location
    • Enable probe tests to run via PTF runner infrastructure
  3. tests/qos/qos_sai_base.py (QosSaiBase refactoring):

    • Move replaceNonExistentPortId() from TestQosSai to base class
    • Move updateTestPortIdIp() from TestQosSai to base class
    • Add bufferConfig to dut_qos_maps fixture for all devices
    • Enable probe tests to access buffer configuration
    • Shared utility methods for port ID/IP management
  4. tests/qos/test_qos_sai.py:

    • Remove replaceNonExistentPortId() (moved to base)
    • Remove updateTestPortIdIp() (moved to base)
    • Reduce code duplication

Production Test Cases:

  1. tests/qos/test_qos_probe.py (NEW - 544 lines):
    • TestQosProbe class for physical testbed probing
    • test_pfc_xoff_probing: PFC Xoff threshold detection on hardware
    • test_ingress_drop_probing: Ingress drop threshold detection
    • test_headroom_pool_probing: Headroom pool size probing
    • Integrates with existing QoS test infrastructure
    • Uses physical executors for real hardware validation
    • Validates probe framework on production testbeds

This PR completes the probe framework integration, enabling threshold probing tests to run on physical SONiC testbeds alongside existing QoS tests.

Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

qos refactoring

How did you do it?

How did you verify/test it?

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

relevant PRs:
[mmu probing] pr01.docs: Add MMU threshold probing framework design
[mmu probing] pr02.probe: Add core probing algorithms with essential data structures
[mmu probing] pr03.probe: Add probing executors and executor registry
[mmu probing] pr04.probe: Add observer pattern for metrics tracking
[mmu probing] pr05.probe: Add stream manager and buffer occupancy controller
[mmu probing] pr06.probe: Add base framework and all probing implementations
[mmu probing] pr07.test: Add comprehensive unit tests for probe framework
[mmu probing] pr08.test: Add integration tests for end-to-end probing workflows
[mmu probing] pr09.test: Add production probe test and infrastructure updates

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yxieca
Collaborator

yxieca commented Feb 23, 2026

Found a couple blocking issues:

  1. Typo in key:
    [" breakout"] has a leading space. Likely should be ["breakout"].

  2. Type error in updateTestPortIdIp():
    set(portIds) passes a set; the helper mutates/indexes the list. Use a list instead (e.g., list(portIds), or keep as list).

These likely explain the static analysis failure. Please fix and re-run checks.

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@XuChen-MSFT
Contributor Author

Regarding the PR test failure:
As the test log below shows, these 9 PRs have been successfully validated on physical hardware platforms.
After merging, platform-specific integration testing will begin across various ASICs/SKUs/platforms, including the KVM test. To prevent KVM test failures in CI during this validation phase, conditional marks will be added temporarily. Progress will be tracked via GitHub issues, and platforms will be enabled incrementally as validation completes.

$ ./run_tests.sh -c qos/test_qos_probe.py::TestQosProbe::testQosPfcXoffProbe -t t0,any -n testbed-bjw2-can-t0-7260-2 -i ../ansible/bjw2,../ansible/veos -r -u -m individual -l info -k debug  -e "--skip_sanity --disable_loganalyzer --py_saithrift_url=${saithrift_bjw_brcm_202511}"
... omitted ...
--------------------------------------------- generated xml file: /var/src/sonic-mgmt-int/tests/logs/qos/test_qos_probe.py::TestQosProbe::testQosPfcXoffProbe.xml ---------------------------------------------
------------------------------------------------------------------------------------------- live log sessionfinish --------------------------------------------------------------------------------------------
25/02/2026 02:30:19 __init__.pytest_terminal_summary         L0067 INFO   | Can not get Allure report URL. Please check logs
=========================================================================================== short test summary info ===========================================================================================
SKIPPED [2] qos/test_qos_probe.py:81: Additional DSCPs are not supported on non-dual ToR ports
SKIPPED [4] qos/test_qos_probe.py:51: single_dut_multi_asic is not supported on T0 topologies
SKIPPED [12] qos/test_qos_probe.py:51: multi-dut is not supported on T0 topologies
=========================================================================== 2 passed, 18 skipped, 3 warnings in 1801.66s (0:30:01) ============================================================================
DEBUG:tests.conftest:[log_custom_msg] item: <Function testQosPfcXoffProbe[multi_dut_shortlink_to_longlink-xoff_4]>
INFO:root:Can not get Allure report URL. Please check logs
xuchen3@xuchen3-env-bj5:/var/src/sonic-mgmt-int/tests$ git log --oneline -n 10
ea5a3f102f (HEAD -> xuchen3/internal/mmu-probing.2-25.r2, origin/xuchen3/internal/mmu-probing.2-25.r2) sonic-mgmt__pr-22547__commit-4__-mmu-probing--pr09.test--Add-production-probe-test-and-infrastructure-updates.diff
84c350241f sonic-mgmt__pr-22546__commit-2__-mmu-probing--pr08.test--Add-integration-tests-for-end-to-end-probing-workflows.diff
7dd190f742 sonic-mgmt__pr-22545__commit-2__-mmu-probing--pr07.test--Add-comprehensive-unit-tests-for-probe-framework.diff
5cd33c2aa8 sonic-mgmt__pr-22544__commit-2__-mmu-probing--pr06.probe--Add-base-framework-and-all-probing-implementations.diff
83cf44f8b7 sonic-mgmt__pr-22543__commit-2__-mmu-probing--pr05.probe--Add-stream-manager-and-buffer-occupancy-controller.diff
ba639a8d7c sonic-mgmt__pr-22542__commit-2__-mmu-probing--pr04.probe--Add-observer-pattern-for-metrics-tracking.diff
bea8e30963 sonic-mgmt__pr-22541__commit-2__-mmu-probing--pr03.probe--Add-probing-executors-and-executor-registry.diff
9828290935 sonic-mgmt__pr-22540__commit-2__-mmu-probing--pr02.probe--Add-core-probing-algorithms-with-essential-data-structures.diff
3954ce3d97 sonic-mgmt__pr-22539__commit-1__-mmu-probing--pr01.docs--Add-MMU-threshold-probing-framework-design.diff
0b98d36716 (origin/internal, origin/HEAD) Merged PR 19235: Set dpu-pattern arg for smartswitch nightly runs
xuchen3@xuchen3-env-bj5:/var/src/sonic-mgmt-int/tests$

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@XuChen-MSFT
Contributor Author

Added conditional mark for qos/test_qos_probe.py to skip MMU threshold probing tests on platforms pending validation. Created 9 GitHub issues to track joint debugging progress for each platform/ASIC type:

  1. VS platform (asic_type: vs) - Issue Enhancement: Enable MMU threshold probing test on VS platform #22599
  2. Cisco-8000 GB (Cisco-8102-C64, Cisco-8101-O8C48, Cisco-8102-28FH-DPU-O) - Issue Enhancement: Enable MMU threshold probing test on Cisco-8000 GB platforms #22601
  3. Cisco-8000 GR/GR2 (Cisco-8101-O32, Cisco-8101-O8V48) - Issue Enhancement: Enable MMU threshold probing test on Cisco-8000 GR/GR2 platforms #22602
  4. Broadcom TD3 (Arista-7050CX3-32C-C32, Arista-7050CX3-32S-C32) - Issue Enhancement: Enable MMU threshold probing test on Broadcom Trident 3 platforms #22603
  5. Broadcom TH (Arista-7060CX-32S-C32, Arista-7060CX-32S-Q32) - Issue Enhancement: Enable MMU threshold probing test on Broadcom Tomahawk platforms #22604
  6. Broadcom TH2 (Arista-7260CX3-C64, Arista-7260CX3-D108C8, Arista-7260CX3-D108C10) - Issue Enhancement: Enable MMU threshold probing test on Broadcom Tomahawk 2 platforms #22605
  7. Broadcom TH5 (Arista-7060X6-16PE-384C-B-O128S2, Arista-7060X6-64PE-B-O128) - Issue Enhancement: Enable MMU threshold probing test on Broadcom Tomahawk 5 platforms #22606
  8. Mellanox SPC1 (Mellanox-SN2700) - Issue Enhancement: Enable MMU threshold probing test on Mellanox Spectrum 1 platforms #22607
  9. Mellanox SPC3 (Mellanox-SN4600C-C64) - Issue Enhancement: Enable MMU threshold probing test on Mellanox Spectrum 3 platforms #22608

Each issue tracks the validation work to ensure MMU threshold probing tests run successfully without corner cases. Once validation completes for a platform, the corresponding conditional skip will be removed.

Conditional mark location: tests/common/plugins/conditional_mark/tests_mark_conditions.yaml line 3930

@XuChen-MSFT
Contributor Author

Found a couple blocking issues:

  1. Typo in key:
    [" breakout"] has a leading space. Likely should be ["breakout"].
  2. Type error in updateTestPortIdIp():
    set(portIds) passes a set; the helper mutates/indexes the list. Use a list instead (e.g., list(portIds), or keep as list).

These likely explain the static analysis failure. Please fix and re-run checks.

@yxieca Thanks for the review.
The static analysis failures were primarily due to flake8 formatting issues (e.g., line length, unused imports) and test configuration, which have all been resolved.
Regarding the tests, I've added a conditional mark to temporarily skip the newly added Python code on the VS platform pending further integration. I've opened tracking issues to manage the debugging and validation.

kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
…ic-net#22547)


#### Why I did it

In the case of ASIC detection failures on Broadcom (or if the ASIC couldn't be detected in time), the `/dev/shm` partition in the syncd container will be only 64MB, which might cause issues if syncd/Broadcom SAI library needs more space than that.

##### Work item tracking
- Microsoft ADO **(number only)**:

#### How I did it

Since using a larger `/dev/shm` on its own doesn't cause any issues, bump up the default to 512MB. This should be enough for most platforms.

#### How to verify it


#### Which release branch to backport (provide reason below if selected)


- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

#### Tested branch (Please provide the tested image version)


- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->

#### Description for the changelog

#### Link to config_db schema for YANG module changes

#### A picture of a cute animal (not mandatory but encouraged)
Collaborator

@StormLiangMS StormLiangMS left a comment


Review — PR #22547 (MMU Probing: Production Tests + Infrastructure)

This is the top of the 9-PR MMU probing stack. I reviewed the full series (#22539 to #22547). Overall the framework is well-architected: binary search threshold probing with pluggable algorithms/executors/observers is a solid design. However, I found 2 ship-blockers in this PR and several issues across the stack.
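
For readers who have not followed the earlier PRs, the binary-search idea mentioned above can be sketched roughly as follows. This is an illustration only, not the framework's code: `find_threshold` and the `triggers` callback are hypothetical names, and the sketch assumes the probed condition is monotonic (once a packet count triggers PFC or a drop, every larger count does too).

```python
from typing import Callable


def find_threshold(triggers: Callable[[int], bool], low: int, high: int) -> int:
    """Return the smallest value in (low, high] for which triggers(value) is True.

    Assumes triggers(low) is False, triggers(high) is True, and that
    triggers is monotonic between the two bounds.
    """
    while high - low > 1:
        mid = (low + high) // 2
        if triggers(mid):
            high = mid  # threshold is at or below mid
        else:
            low = mid   # threshold is above mid
    return high


def fake_xoff_probe(pkts: int) -> bool:
    """Stand-in for a hardware probe: pretend Xoff fires at 1337 packets."""
    return pkts >= 1337
```

Each iteration halves the search window, so a range of 4096 packets needs about 12 probes rather than thousands of incremental sends, which matters when every probe is a round trip to real hardware.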

🔴 Ship-Blockers

1. Typo in config key — " breakout" with leading space
In test_qos_probe.py (~line 449):

qosConfig = dutQosConfig["param"][portSpeedCableLength][" breakout"]

Note the leading space in " breakout". This will KeyError at runtime for any breakout SKU running testQosIngressDropProbe. Compare with the correct ["breakout"] (no space) used in testQosPfcXoffProbe.
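
A tiny reproduction of the failure mode, assuming a trimmed-down stand-in for the QoS config (the dictionary contents, the `"100000_5m"` key and the `lookup` helper are made up for illustration):

```python
# Hypothetical, heavily trimmed config; the real dutQosConfig has many more keys.
dut_qos_config = {"param": {"100000_5m": {"breakout": {"example_param": 4}}}}
port_speed_cable_length = "100000_5m"


def lookup(key: str) -> dict:
    """Fetch a sub-config the same way the test indexes dutQosConfig."""
    return dut_qos_config["param"][port_speed_cable_length][key]


def lookup_fails(key: str) -> bool:
    """Return True if the lookup raises KeyError."""
    try:
        lookup(key)
    except KeyError:
        return True
    return False
```

Dictionary keys are compared exactly, so `" breakout"` and `"breakout"` are different keys and the first one fails only when that code path is actually reached.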

2. set() passed where list is required
In qos_sai_base.py (~line 174):

pytest_assert(self.replaceNonExistentPortId(testPortIds, set(portIds)), ...)

replaceNonExistentPortId does portIds[idx] = freePorts.pop(0) — index assignment on a set raises TypeError. Sets don't support indexing.
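
The difference can be seen in a small stand-alone sketch. The function below is a simplified, hypothetical stand-in for `replaceNonExistentPortId` (the real helper's signature and logic may differ); it only keeps the index-assignment step that the review points at.

```python
def replace_nonexistent_port_ids(available_ports, port_ids):
    """Replace entries of port_ids that are not in available_ports, in place.

    Returns False if there are not enough free ports to substitute.
    """
    free_ports = [p for p in available_ports if p not in port_ids]
    for idx, port in enumerate(port_ids):
        if port not in available_ports:
            if not free_ports:
                return False
            # Index assignment: works on a list, raises TypeError on a set.
            port_ids[idx] = free_ports.pop(0)
    return True
```

With a list such as `[1, 9]` the invalid port is replaced in place. With a set the call still succeeds as long as every port is valid, because the assignment line is never reached; it raises `TypeError` only when a replacement is actually needed.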

⚠️ Design Issues

  • find_cell_size() duplicated in 3 test methods — identical recursive search helper in testQosPfcXoffProbe, testQosIngressDropProbe, and testQosHeadroomPoolProbe. Should be a class method or standalone utility.

  • src_port_vlans may be unbound — in testQosHeadroomPoolProbe, it's set only inside the if platform_asic == "broadcom-dnx" block but appears to be referenced later unconditionally for DNX platforms.

  • in_py3 variable name misleading in ptf_runner.py — now True for both py3 and probe directories. Name should reflect the broader meaning.

❓ Question

  • tests_mark_conditions.yaml — some skip conditions include GitHub issue URLs in the condition string (e.g., "asic_type in ['vs'] and https://github.com/..."). Does the conditional_mark plugin actually parse these? The URL portion would be a NameError in Python eval().

Cross-Stack Issues (from #22540, #22541, #22544)

See individual PR reviews for details, but the most important ones:

  • #22540: Infinite loop in lower-bound algorithm when current reaches 1 (max(1//2, 1) == 1 forever)
  • #22540: Point algorithm continues sending incremental traffic after a failure (corrupted buffer state)
  • #22541: Missing self.observer None guard in ingress_drop_probing_executor.py verbose trace block
  • #22544: continue on PG failure skips buffer cleanup — next PG probes with corrupted buffer state
  • #22544: Massive code duplication in _create_algorithms() between PfcXoff and IngressDrop classes (~120 identical lines each)
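
The first #22540 item is easy to reproduce in isolation. The sketch below is not the algorithm from that PR; `find_lower_bound` and `triggers` are hypothetical names, and the loop shape is assumed from the `max(1//2, 1)` expression quoted above. It shows one way to stop once halving no longer makes progress.

```python
def find_lower_bound(triggers, start, max_steps=64):
    """Halve `current` until triggers(current) is False.

    Returns the first non-triggering value, or None if the search gets
    stuck at 1 or runs out of steps.
    """
    current = start
    for _ in range(max_steps):
        if not triggers(current):
            return current
        nxt = max(current // 2, 1)
        if nxt == current:
            # current is already 1 and max(1 // 2, 1) == 1, so without
            # this check the loop would repeat the same probe forever.
            return None
        current = nxt
    return None
```

Either guard (the progress check or the step cap) is enough on its own to bound the loop; the sketch uses both so that an unexpected step function cannot hang a testbed run.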

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@XuChen-MSFT
Contributor Author

@StormLiangMS Thanks for the thorough review across the entire 9-PR stack. Addressed all items for this PR below.

✅ Fixed (4 commits pushed)

1. " breakout" typo (Ship-Blocker) — Fixed. Removed the leading space in the config key at line 206 of test_qos_probe.py. (cbdfaab)

2. set(portIds) type error (Ship-Blocker) — Changed to list(portIds) in qos_sai_base.py. Note: this is a pre-existing bug inherited from test_qos_sai.py (introduced in PR #8149 by @vmittal-msft). The set() was intended for deduplication but set doesn't support item assignment which replaceNonExistentPortId() requires. It rarely triggered in practice because the index-assignment path only executes when a port is invalid and needs replacement — uncommon on most testbeds. (d74eeb7)

3. find_cell_size() duplication — Extracted the 3 identical nested definitions into a single @staticmethod on TestQosProbe. (9a97b3c)

4. in_py3 variable name — Renamed to in_subdir in ptf_runner.py with updated docstring. The boolean originally only indicated py3/; now it also covers the newly added probe/ subdirectory for the MMU threshold probing framework. (6ce1d4c)
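
As an illustration of item 3, a shared static helper might look like the sketch below. The body is an assumption: the thread only says the helper is a recursive search, so this version guesses that it walks nested dictionaries looking for a `cell_size` key. The class name `QosProbeSketch` is hypothetical.

```python
class QosProbeSketch:
    """Stand-in for TestQosProbe, holding only the shared helper."""

    @staticmethod
    def find_cell_size(config):
        """Depth-first search for the first 'cell_size' value in nested dicts."""
        if isinstance(config, dict):
            if "cell_size" in config:
                return config["cell_size"]
            for value in config.values():
                found = QosProbeSketch.find_cell_size(value)
                if found is not None:
                    return found
        return None
```

Because the helper needs no instance state, a `@staticmethod` lets all three test methods call it without keeping three nested copies in sync.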

ℹ️ Acknowledged — Deferring to Validation Phase

5. src_port_vlans potentially unbound — Good catch. Both the assignment (L330-359) and usage (L544-546) are guarded by the same platform_asic == "broadcom-dnx" condition, so it won't cause a NameError at runtime. That said, the logic here is intentionally kept consistent with the original testQosSaiHeadroomPoolSize in test_qos_sai.py — the code was ported to preserve the existing behavior.

I'd prefer not to refactor this path preemptively at this stage for two reasons:

  1. The current hardware validation (Broadcom TD3/TH2, Cisco Q201L, Mellanox SPC1/SPC3) has not surfaced any issue with this pattern.
  2. There are 9 platform-specific tracking issues (Enhancement: Enable MMU threshold probing test on VS platform #22599 through Enhancement: Enable MMU threshold probing test on Mellanox Spectrum 3 platforms #22608) for joint debugging across all ASIC types. As each SKU goes through validation, any edge case in this area will be caught and fixed with a concrete reproduction, which leads to more accurate fixes than speculative changes.

Will revisit once the cross-platform validation rounds are complete.

ℹ️ By Design — No Change Needed

6. URL in tests_mark_conditions.yaml — This is by design. The conditional_mark plugin's update_issue_status() function (in __init__.py) extracts URLs from condition strings via regex, queries GitHub issue status (open/closed), and replaces them with True/False before eval(). So "asic_type in ['vs'] and https://...#22599" becomes "asic_type in ['vs'] and True" when the issue is open. This is a well-established pattern used in 50+ entries across the YAML file.
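
The substitution step described in item 6 can be sketched as follows. This is not the plugin's code: `resolve_issue_urls`, `evaluate`, the regular expression and the `is_open` callback are all hypothetical, and the GitHub query is replaced by a plain function so the sketch runs offline.

```python
import re

# Assumed URL shape; the plugin's actual pattern may be broader.
ISSUE_URL = re.compile(r"https?://github\.com/\S+/issues/\d+")


def resolve_issue_urls(condition, is_open):
    """Replace each issue URL with 'True' (issue open) or 'False' (closed)."""
    return ISSUE_URL.sub(lambda m: str(bool(is_open(m.group(0)))), condition)


def evaluate(condition, facts, is_open):
    """Resolve URLs first, then evaluate the condition against the facts."""
    resolved = resolve_issue_urls(condition, is_open)
    # eval() mirrors the description above; only use it on trusted strings.
    return bool(eval(resolved, {"__builtins__": {}}, dict(facts)))
```

With a condition such as `"asic_type in ['vs'] and <issue URL>"`, an open issue leaves the skip active on VS, and closing the issue turns the same condition off without editing the YAML.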

Cross-Stack Issues (#22540, #22541, #22544)

Will address in the respective PRs separately — thanks for flagging them.

XuChen-MSFT and others added 9 commits March 23, 2026 12:37
Enable probe framework for production testbed usage with infrastructure
updates and physical hardware test cases.

Infrastructure Updates:

1. tests/conftest.py:
   - Add --enable_qos_ptf_pdb option for PTF debugging with pdb breakpoint
   - Add --ingress_drop_probing option to switch between PFC/Drop probing modes

2. tests/ptf_runner.py:
   - Add 'probe' subdirectory support alongside 'py3'
   - Add test_subdir parameter for flexible PTF test location
   - Enable probe tests to run via PTF runner infrastructure

3. tests/qos/qos_sai_base.py (QosSaiBase refactoring):
   - Move replaceNonExistentPortId() from TestQosSai to base class
   - Move updateTestPortIdIp() from TestQosSai to base class
   - Add bufferConfig to dut_qos_maps fixture for all devices
   - Enable probe tests to access buffer configuration
   - Shared utility methods for port ID/IP management

4. tests/qos/test_qos_sai.py:
   - Remove replaceNonExistentPortId() (moved to base)
   - Remove updateTestPortIdIp() (moved to base)
   - Reduce code duplication

Production Test Cases:

5. tests/qos/test_qos_probe.py (NEW - 544 lines):
   - TestQosProbe class for physical testbed probing
   - test_pfc_xoff_probing: PFC Xoff threshold detection on hardware
   - test_ingress_drop_probing: Ingress drop threshold detection
   - test_headroom_pool_probing: Headroom pool size probing
   - Integrates with existing QoS test infrastructure
   - Uses physical executors for real hardware validation
   - Validates probe framework on production testbeds

This PR completes the probe framework integration, enabling threshold
probing tests to run on physical SONiC testbeds alongside existing QoS tests.

Signed-off-by: Xu Chen <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
Fix KeyError in testQosIngressDropProbe for breakout SKUs.
Line 206 had [" breakout"] (with leading space) instead of ["breakout"].

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
Deduplicate find_cell_size() which was identically defined 3 times
as nested functions inside testQosPfcXoffProbe, testQosIngressDropProbe,
and testQosHeadroomPoolProbe. Now a single @staticmethod on TestQosProbe.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
set() does not support index assignment (portIds[idx] = ...) which
replaceNonExistentPortId() uses internally. This is a pre-existing bug
from PR sonic-net#8149 (test_qos_sai.py L345), moved here during refactoring.
The set() was likely intended for deduplication but breaks item
assignment. In practice it rarely triggered because most testbeds
have all valid ports, so the index assignment path was never reached.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
The boolean originally indicated whether a test file was in the py3/
subdirectory. With the addition of the probe/ subdirectory for the MMU
threshold probing framework, the variable now indicates whether the
test is in any subdirectory (py3 or probe). Rename for clarity.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
@XuChen-MSFT XuChen-MSFT force-pushed the xuchen3/mmu_probe/pr09-production branch from 6ce1d4c to 98db457 Compare March 23, 2026 04:38
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Collaborator

@StormLiangMS StormLiangMS left a comment


⚠️ Approve with 4 findings


[High] Broadcom-specific bcmcmd runs on all platforms
test_qos_probe.py:632-647

testQosHeadroomPoolProbe runs bcmcmd "knetctrl netif show" without first checking sonic_asic_type == "broadcom". The ASIC type check at line 653 only guards TD2/TD3 filtering logic, not the initial bcmcmd invocations. This will crash on Mellanox/Cisco with "command not found".

Suggested fix: Wrap all bcmcmd-related code in if dutTestParams["basicParams"]["sonic_asic_type"] == "broadcom":, and provide a skip or alternative path for non-Broadcom platforms.
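
A rough shape for such a guard, under stated assumptions: `select_probe_ports`, `run_bcmcmd` and `filter_by_xpe` are hypothetical names, and the XPE filtering itself is left to a callback because its details are not shown in this thread.

```python
def select_probe_ports(asic_type, src_test_port_ids, run_bcmcmd, filter_by_xpe):
    """Pick source ports, invoking bcmcmd only on Broadcom ASICs."""
    if asic_type != "broadcom":
        # No bcmcmd on other vendors: use the candidate ports as they are.
        return list(src_test_port_ids)
    netif_output = run_bcmcmd("knetctrl netif show")
    return filter_by_xpe(netif_output, src_test_port_ids)
```

Keeping the vendor check at the top of one function makes it harder for a later edit to add another `bcmcmd` call outside the guarded branch.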


[Medium] max() on potentially empty dict
test_qos_probe.py:707

max(xpe_to_testports.keys(), ...)

If xpe_to_testports is empty (bcmcmd fails, unexpected output, all ports filtered), max() raises ValueError.

Suggested fix:

if not xpe_to_testports:
    pytest.skip("No available test ports found for probing")
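
The guard can also be exercised outside pytest. In the sketch below, `pick_largest_xpe` is a hypothetical helper, and the `key` function (most ports per XPE) is an assumption, since the real `max()` arguments are elided above. It returns `None` instead of skipping so that the caller decides what to do.

```python
def pick_largest_xpe(xpe_to_testports):
    """Return the XPE with the most test ports, or None if the dict is empty."""
    if not xpe_to_testports:
        return None  # caller can turn this into pytest.skip(...)
    return max(xpe_to_testports, key=lambda xpe: len(xpe_to_testports[xpe]))


def max_raises_on_empty():
    """Show the unguarded behaviour: max() over no keys raises ValueError."""
    try:
        max({}.keys())
    except ValueError:
        return True
    return False
```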

[Low] set→list fix in replaceNonExistentPortId
qos_sai_base.py:189

Changed from set(portIds) to list(portIds). This is actually a bug fix — the method uses portIds[idx] = ... which requires list indexing. Just verify no other callers still pass sets.


[Low] Unused --ingress_drop_probing CLI option
conftest.py:46

The option is defined with parser.addoption("--ingress_drop_probing", ...) but never consumed via getoption. Both test methods run as separate parametrized tests regardless. Consider implementing the gating logic or removing the unused option.
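
If the gating is implemented, it could follow the usual pytest pattern of pairing `parser.addoption` with `config.getoption`. The sketch below is one possible shape, not the PR's code: `action="store_true"`, the help text and the `probing_mode` helper are assumptions.

```python
def pytest_addoption(parser):
    """Register the flag (conftest.py hook)."""
    parser.addoption(
        "--ingress_drop_probing",
        action="store_true",
        default=False,
        help="Probe ingress-drop thresholds instead of PFC Xoff",
    )


def probing_mode(config):
    """Map the flag to a mode name; 'pfc_xoff' when the flag is not given."""
    if config.getoption("--ingress_drop_probing"):
        return "ingress_drop"
    return "pfc_xoff"
```

A test or fixture would call `probing_mode(request.config)` and skip the method that does not match the selected mode.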

XuChen-MSFT and others added 3 commits March 26, 2026 21:52
bcmcmd commands (knetctrl netif show, show pmap) are Broadcom-specific
and crash on Mellanox/Cisco with 'command not found'. Wrapped entire
XPE port mapping logic in 'if sonic_asic_type == broadcom:' guard.
Non-Broadcom platforms use all available test ports directly.

Addresses @StormLiangMS review (2026-03-25): bcmcmd runs on all platforms.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
max() on empty dict raises ValueError. Skip test gracefully when
no available test ports are found after XPE mapping.

Addresses @StormLiangMS review (2026-03-25): max() on potentially empty dict.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
After bcmcmd guard, variables like bcmport_to_sonicport, xpe_to_bcmports
are only defined inside the broadcom branch. Logger.info must only
reference variables available in both branches.

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Xu Chen <[email protected]>
@mssonicbld
Collaborator

/azp run

@XuChen-MSFT
Contributor Author

@StormLiangMS Re: 4 findings

[High] bcmcmd on non-Broadcom — Fixed (d001e4f): Wrapped entire XPE port mapping logic in if sonic_asic_type == "broadcom":. Non-Broadcom platforms use src_testPortIds directly.

[Medium] max() on empty dict — Fixed (3d8746f): Added if not xpe_to_testports: pytest.skip(...) guard before max().

[Low] set→list in replaceNonExistentPortId — Already fixed in earlier commit (d74eeb7 / c2ff13f). No other callers pass sets — verified by searching all replaceNonExistentPortId call sites.

[Low] Unused --ingress_drop_probing CLI option — This is a pre-reserved parameter for future gating logic (switching between PFC/IngressDrop probing modes via CLI). Currently both tests run as separate methods. Will implement the gating or remove when the probing mode selection is finalized.

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

4 participants