[mmu probing] pr08.test: Add integration tests for end-to-end probing workflows #22546
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Blocking issues:
These likely explain the Pre_test Static Analysis failure. Please fix and re-run checks. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Below is a sample test log from running these integration tests for probing workflows. |
@yxieca Thanks for the review. |
|
Deep review done; overall looks good. Two minor nits:
Also DCO is failing — please add sign-off and update commits. |
…ic-net#22546)
#### Why I did it
If ASIC detection fails on Broadcom (or the ASIC cannot be detected in time), the `/dev/shm` partition in the syncd container is only 64MB, which might cause issues if syncd or the Broadcom SAI library needs more space than that.
#### How I did it
Since using a larger `/dev/shm` on its own doesn't cause any issues, bump up the default to 512MB. This should be enough for most platforms.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Added 3 integration tests + Python 3.12 compatibility fix ( New IT tests:
Python 3.12 fix in probe_test_helper.py:
IT total: 62 → 65. See PR #22540 for the corresponding source code fixes. |
|
/azp run |
|
Added 3 more IT tests for ingress drop probing (
IT total: 65 → 68. Same patterns as PFC XOFF boundary tests. |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Added 2 integration tests for anti-oscillation validation (
IT total: 68 -> 70. Validates algorithm fix in PR #22540 ( |
When candidate_threshold is small (e.g. 10), the precision target candidate * 0.05 = 0.5 < 1. With bad_spot at the threshold value, range_size stays at 1 but 1 <= 0.5 is never satisfied, burning all 50 max_iterations.

Use max(1, ...) to ensure the precision check can terminate when the range narrows to 1-packet granularity.

Validated by UT (PR sonic-net#22545) and IT (PR sonic-net#22546): both FAIL without this fix (50 iterations) and PASS with the fix (~18 iterations).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
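The termination problem described in this commit message can be reproduced with a minimal sketch. The function names `precision_target` and `precision_reached` and the `clamp` switch are hypothetical and exist only for illustration; the 5% ratio and the `max(1, ...)` clamp are taken from the commit message. This is not the framework's actual code.

```python
def precision_target(candidate: int, ratio: float = 0.05, clamp: bool = True) -> float:
    """Return the range size at which probing may stop (hypothetical helper)."""
    raw = candidate * ratio
    # With the fix, the target never drops below one packet.
    return max(1, raw) if clamp else raw


def precision_reached(range_size: int, candidate: int, clamp: bool = True) -> bool:
    """True when the remaining search range is narrow enough to stop."""
    return range_size <= precision_target(candidate, clamp=clamp)


# candidate = 10: the unclamped target is below 1, so a range of 1 never qualifies.
print(precision_reached(1, 10, clamp=False))
# With the clamp the same range satisfies the check and the loop can exit.
print(precision_reached(1, 10, clamp=True))
```

Because range_size is counted in whole packets, it cannot shrink below 1, which is why an unclamped target below 1 makes the exit condition unreachable.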
|
/azp run |
|
Added 2 ITs for precision check with small threshold + bad_spot (
Without fix: Phase 3 burns 50 iterations (max_iterations). With fix: ~18 iterations (exits via precision_reached). |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Refactored multi-PG probe loop from 6 scattered 'continue' statements to a while-True single-pass block with unified cleanup:
- break + fail_reason on any phase failure
- pg_success flag tracks completion
- Single drain_buffer([dst_port_id]) call in cleanup block

This ensures buffer state is always drained before moving to the next PG, preventing a corrupted buffer from affecting subsequent PG probing.

UT coverage: PR sonic-net#22545 (3d75029), 7 new tests
IT coverage: PR sonic-net#22546 (14a29c2), 2 new tests

Addresses @StormLiangMS review: continue on PG failure skips buffer cleanup.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
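The control-flow pattern described in this commit message can be sketched as follows. Only `pg_success`, `fail_reason`, and the `drain_buffer([dst_port_id])` call come from the commit message; `probe_all_pgs`, the `phases` list, and the phase names are hypothetical stand-ins, so treat this as an illustration of the pattern and not the framework's code.

```python
from typing import Callable, Dict, List, Optional, Tuple

Phase = Tuple[str, Callable[[int], bool]]


def probe_all_pgs(
    pgs: List[int],
    phases: List[Phase],
    drain_buffer: Callable[[List[int]], None],
    dst_port_id: int,
) -> Dict[int, str]:
    """Probe each PG; every exit path passes through one cleanup block."""
    results: Dict[int, str] = {}
    for pg in pgs:
        pg_success = False
        fail_reason: Optional[str] = None
        while True:  # single-pass block: 'break' replaces scattered 'continue'
            for name, phase in phases:
                if not phase(pg):
                    fail_reason = f"{name} failed"
                    break
            if fail_reason is not None:
                break
            pg_success = True
            break
        # Unified cleanup: runs whether the PG succeeded or failed.
        drain_buffer([dst_port_id])
        results[pg] = "ok" if pg_success else str(fail_reason)
    return results


drained: List[List[int]] = []
demo_phases: List[Phase] = [
    ("upper_bound", lambda pg: True),
    ("lower_bound", lambda pg: pg != 4),  # simulate a failure on PG 4
]
outcome = probe_all_pgs([3, 4, 5], demo_phases, drained.append, dst_port_id=7)
print(outcome, len(drained))
```

A `try`/`finally` block would give the same guarantee; the single-pass `while True` form is shown here because it is the one the commit message names.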
|
Added 2 ITs for multi-PG buffer isolation in headroom pool (
Validates fix in PR #22544 ( IT headroom total: 15 → 17. |
|
@yxieca Re: ast.literal_eval type guard This has been addressed — |
|
@yxieca Re: sys.modules patching isolation Currently isolated by 3 mechanisms:
This will be further validated during lightning pipeline integration, where the actual test execution flow (PTF runner → SAI tests) will confirm that sys.modules patching in IT does not affect physical test execution. |
|
@XuChen-MSFT can you address the pre check failure and DCO? |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Implement comprehensive integration tests for complete probing workflows using simulation executors for reproducible end-to-end testing.

Test Infrastructure:
- __init__.py: Integration test module initialization
- conftest.py: Shared pytest fixtures for integration testing
- pytest.ini: Pytest configuration for integration test suite
- probe_test_helper.py: Helper utilities and test orchestration
  - Simulation environment setup
  - PTF mock integration
  - Test scenario builders
  - Assertion helpers for threshold validation

Integration Test Suites:
1. test_pfc_xoff_probing.py (883 lines):
   - End-to-end PFC Xoff threshold detection workflows
   - Tests all three algorithm phases (UpperBound → LowerBound → ThresholdRange)
   - Validates observer metrics collection
   - Tests buffer state management
   - Multi-port probing scenarios
2. test_ingress_drop_probing.py (575 lines):
   - End-to-end ingress drop threshold detection workflows
   - Tests algorithm sequence (UpperBound → LowerBound → ThresholdPoint)
   - Validates drop detection accuracy
   - Tests traffic pattern variations
3. test_headroom_pool_probing.py (632 lines):
   - End-to-end headroom pool size probing workflows (N→1 pattern)
   - Multi-priority-group iteration testing
   - Tests PG-level threshold detection
   - Validates pool size calculation

All integration tests use simulation executors to ensure deterministic, reproducible results without requiring physical hardware, enabling CI/CD pipeline integration.

Signed-off-by: Xu Chen <xuchen3@microsoft.com>
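To illustrate why a simulation executor makes the three-phase sequence deterministic, here is a rough sketch. Everything in it is an assumption made for illustration: the `SimulatedExecutor` class, the doubling and halving strategies, and the function names are not taken from the PR. Only the phase order (UpperBound → LowerBound → ThresholdRange) and the clamped 5% precision target come from the commit messages on this page.

```python
from typing import Tuple


class SimulatedExecutor:
    """Hypothetical stand-in for hardware: triggers at a fixed packet count."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.calls = 0

    def check(self, packets: int) -> bool:
        self.calls += 1
        return packets >= self.threshold


def find_upper_bound(ex: SimulatedExecutor, start: int = 1, limit: int = 1 << 20) -> int:
    """Phase 1: grow the candidate until the threshold is triggered."""
    value = start
    while not ex.check(value):
        value *= 2
        if value > limit:
            raise RuntimeError("no upper bound found below limit")
    return value


def find_lower_bound(ex: SimulatedExecutor, upper: int) -> int:
    """Phase 2: shrink from the upper bound until the trigger disappears."""
    value = upper // 2
    while value > 0 and ex.check(value):
        value //= 2
    return value  # 0 means every tested value triggered


def find_threshold_range(
    ex: SimulatedExecutor, lower: int, upper: int, ratio: float = 0.05
) -> Tuple[int, int]:
    """Phase 3: bisect until the range meets the clamped precision target."""
    while upper - lower > max(1, upper * ratio):
        mid = (lower + upper) // 2
        if ex.check(mid):
            upper = mid
        else:
            lower = mid
    return lower, upper


def probe(threshold: int) -> Tuple[int, int]:
    ex = SimulatedExecutor(threshold)
    upper = find_upper_bound(ex)
    lower = find_lower_bound(ex, upper)
    return find_threshold_range(ex, lower, upper)


print(probe(1000))
```

Since the simulated trigger is a pure function of the packet count, repeated runs return the same range, which is the property that lets such tests run in CI without hardware.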
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
New IT test cases (3):
- test_pfc_xoff_threshold_at_one: boundary value 1 (lower-bound break)
- test_pfc_xoff_threshold_at_two: boundary value 2 (binary search min)
- test_pfc_xoff_point_probing_with_intermittent_failures: drain recovery

Python 3.12 compatibility fix in probe_test_helper.py:
- Add __path__ attribute to scapy mock (required by Python 3.12+ import system to recognize MagicMock as a package)
- Register scapy.layers and scapy.layers.inet6 submodule mocks
- Backward compatible with Python 3.8

IT total: 62 -> 65

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
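The scapy mocking approach described in this commit message could look roughly like the sketch below. The helper names `build_scapy_mocks` and `import_with_mocked_scapy` are hypothetical, and scoping the patch with `patch.dict` is an assumption made here for isolation; the PR's probe_test_helper.py may register the mocks differently.

```python
import sys
from unittest.mock import MagicMock, patch


def build_scapy_mocks() -> dict:
    """Build package-like mocks for scapy and the submodules named in the commit."""
    scapy = MagicMock(name="scapy")
    scapy.__path__ = []  # a module with __path__ is treated as a package
    layers = MagicMock(name="scapy.layers")
    layers.__path__ = []
    inet6 = MagicMock(name="scapy.layers.inet6")
    # Keep attribute access consistent with the sys.modules entries.
    scapy.layers = layers
    layers.inet6 = inet6
    return {
        "scapy": scapy,
        "scapy.layers": layers,
        "scapy.layers.inet6": inet6,
    }


def import_with_mocked_scapy():
    """Import scapy submodules while the mocks are installed, then restore."""
    mocks = build_scapy_mocks()
    with patch.dict(sys.modules, mocks):
        import scapy.layers.inet6 as inet6_mod
        from scapy.layers.inet6 import IPv6
    # On exit, patch.dict restores sys.modules to its previous contents.
    return inet6_mod, IPv6, mocks


inet6_mod, ipv6_cls, mocks = import_with_mocked_scapy()
print(inet6_mod is mocks["scapy.layers.inet6"])
```

`patch.dict` keeps the replacement local to the `with` block, so code that imports scapy afterwards sees whatever was in `sys.modules` before.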
New IT test cases (3):
- test_ingress_drop_threshold_at_one: boundary value 1
- test_ingress_drop_threshold_at_two: boundary value 2
- test_ingress_drop_point_probing_with_intermittent_failures: drain recovery

IT total: 65 -> 68

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
New IT tests (+2):
- PFC XOFF: test_pfc_xoff_range_oscillation_high_failure_rate
- Ingress Drop: test_ingress_drop_range_oscillation_bad_spot

Both use the bad_spot scenario to verify Phase 3 anti-oscillation: capture observer markdown output, parse the candidate column, and assert no candidate is tested more than 3 times.

IT total: 68 -> 70

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
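The anti-oscillation assertion described in this commit message can be sketched as below. The table layout, the column name `candidate`, and the helper `candidate_counts` are assumptions for illustration; the observer's real markdown format is not shown on this page.

```python
from collections import Counter


def candidate_counts(markdown: str, column: str = "candidate") -> Counter:
    """Count how often each candidate value appears in a markdown table."""
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in markdown.splitlines()
        if line.strip().startswith("|")
    ]
    if not rows:
        return Counter()
    header, body = rows[0], rows[1:]
    idx = header.index(column)
    counts: Counter = Counter()
    for row in body:
        # Skip the |----|----| separator row under the header.
        if all(set(cell) <= set("-: ") for cell in row):
            continue
        counts[int(row[idx])] += 1
    return counts


# Hypothetical observer output for a Phase 3 run.
sample = """
| iter | candidate | result |
|------|-----------|--------|
| 1    | 12        | fail   |
| 2    | 10        | pass   |
| 3    | 11        | fail   |
| 4    | 10        | pass   |
"""
counts = candidate_counts(sample)
# Mirrors the limit stated in the commit message: no candidate more than 3 times.
assert max(counts.values()) <= 3
print(counts.most_common(1))
```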
New IT cases (+2):
- test_pfc_xoff_small_threshold_precision: threshold=10, bad_spot=[10]
- test_ingress_drop_small_threshold_precision: same pattern

Both capture the Phase 3 iteration count. Without the fix: 50 (max_iterations); with the fix: ~18 (exits via precision_reached).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
- test_headroom_pool_buffer_cleanup_on_pg_failure: 2 PGs, verify probe completes without crash when a PG fails
- test_headroom_pool_multi_pg_isolation: 3 PGs, verify all PGs produce independent results

Related: PR sonic-net#22544 fix (while-True unified cleanup)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
- E741: rename ambiguous variable 'l' to 'line' in list comprehensions (test_ingress_drop_probing.py, test_pfc_xoff_probing.py)
- F541: remove unnecessary f-string prefix from string without placeholders (test_pfc_xoff_probing.py)

Signed-off-by: Xu Chen <xuchen3@microsoft.com>
2467d69 to 1919215
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
yxieca left a comment
LGTM. AI agent on behalf of Ying.
… workflows (sonic-net#22546)

What is the motivation for this PR
qos refactoring

How did you do it
Implement comprehensive integration tests for complete probing workflows using simulation executors for reproducible end-to-end testing.

How did you verify/test it
Not specified in PR.

Signed-off-by
Signed-off-by: Xu Chen <xuchen3@microsoft.com>
Description of PR
Summary:
Implement comprehensive integration tests for complete probing workflows using simulation executors for reproducible end-to-end testing.
Test Infrastructure:
Integration Test Suites:
test_pfc_xoff_probing.py (883 lines):
test_ingress_drop_probing.py (575 lines):
test_headroom_pool_probing.py (632 lines):
All integration tests use simulation executors to ensure deterministic, reproducible results without requiring physical hardware, enabling CI/CD pipeline integration.
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
qos refactoring
How did you do it?
How did you verify/test it?
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation
relevant PRs:
[mmu probing] pr01.docs: Add MMU threshold probing framework design
[mmu probing] pr02.probe: Add core probing algorithms with essential data structures
[mmu probing] pr03.probe: Add probing executors and executor registry
[mmu probing] pr04.probe: Add observer pattern for metrics tracking
[mmu probing] pr05.probe: Add stream manager and buffer occupancy controller
[mmu probing] pr06.probe: Add base framework and all probing implementations
[mmu probing] pr07.test: Add comprehensive unit tests for probe framework
[mmu probing] pr08.test: Add integration tests for end-to-end probing workflows
[mmu probing] pr09.test: Add production probe test and infrastructure updates