feat: add fixture for disabling route check #16876
yejianquan merged 3 commits into sonic-net:master
Conversation
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
tests/conftest.py (outdated)
    yield

    if check_flag:
        logging.info("!!!!route check teardown!!!!!!")
😸 yep, let me delete it
tests/conftest.py (outdated)
    if check_flag:
        logging.info("!!!!route check teardown!!!!!!")
        with SafeThreadPoolExecutor(max_workers=8) as executor:
Should we do another route check after the test to make sure routes are healthy after the test module?
Good idea, let me add it. In that case, we probably need to use try...except...finally so that sudo monit start routeCheck always runs in the fixture teardown stage.
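For illustration only, the teardown idea discussed in this thread might be sketched as follows. This is not the code merged in the PR: `route_check_disabled`, `run_command`, and `check_routes` are hypothetical names, and a plain `contextlib.contextmanager` stands in for the pytest fixture so the sketch runs on its own. It assumes the monitor is stopped with `sudo monit stop routeCheck`, mirroring the start command quoted above.

```python
from contextlib import contextmanager


@contextmanager
def route_check_disabled(run_command, check_routes):
    """Stop the routeCheck monit job, run the test body, then always restart it.

    run_command and check_routes are hypothetical callables standing in for
    the DUT shell helper and the post-test route health check.
    """
    run_command("sudo monit stop routeCheck")
    try:
        yield
        # Post-test route health check suggested in the review.
        check_routes()
    finally:
        # Runs even if the test body or the health check raises.
        run_command("sudo monit start routeCheck")
```

The `finally` clause is what guarantees the monitor comes back even when the post-test route check fails.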
I don't see the change below from my original PR. It would be good to do this before the stop/start of the route check monit.
Hey @abdosi, the above code is already included in the
Description of PR
Add a module-level fixture for temporarily disabling route check for a test module
Summary:
Fixes # (issue) Microsoft ADO 31326413
Approach
What is the motivation for this PR?
In our recent Cisco T2 nightly run, we observed the following error syslog during some test modules:
E Failed: Processes "['analyze_logs--<MultiAsicSonicHost dut-lc1-1>']" failed with exit code "1"
E Exception:
E match: 1
E expected_match: 0
E expected_missing_match: 0
E
E Match Messages:
E 2025 Feb 3 03:03:29.550827 svcstr2-8800-lc1-1 ERR monit[914]: 'routeCheck' status failed (255) -- Failure results: {{#12 "asic1": {#12 "Unaccounted_ROUTE_ENTRY_TABLE_entries": [#12 "100.1.0.22/32",#12
After discussion, we decided to add a fixture that lets users disable route check for a test module they expect to produce this error syslog.
How did you do it?
How did you verify/test it?
I ran the updated code and can confirm it's working well.
co-authorized by: [email protected]
Cherry-pick PR to 202411: #16924
Cherry-pick PR to msft-202405: Azure/sonic-mgmt.msft#75
@cyw233 these route check failures only happen on T2 chassis, right? But the test module runs on all platforms. As far as I can tell, route check didn't fail on T0/T1 platforms, so could this hide potential risks on other platforms?
Change the temporarily_disable_route_check fixture logic to only apply to T2 topology for now.
What is the motivation for this PR?
The current disable-and-enable routeCheck monitor logic is causing test flakiness on some non-T2 platforms (see #16876 (comment)). Certain platforms require additional time to restart the routeCheck monitor, which can leave it inactive when the next test begins and result in false failures. We would like to address this issue urgently in this PR.
In a follow-up PR, I will properly enhance the temporarily_disable_route_check fixture so that:
- Users can choose which topologies apply the disable-and-enable routeCheck behavior
- The fixture uses a wait_until() timeout to verify the routeCheck status is as expected before proceeding to the next step
How did you do it?
How did you verify/test it?
I ran the updated logic on a non-T2 platform (Mx) and can confirm it's working well: https://elastictest.org/scheduler/testplan/693272f7392767e9bf67e930
I also verified on a T2 platform that the disable-and-enable logic still applies: https://elastictest.org/scheduler/testplan/6932767fbcc3fac23371a83c
Signed-off-by: Chenyang Wang <[email protected]>
Co-authored-by: Chenyang Wang <[email protected]>
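The two follow-up ideas in this message (choosing which topologies get the disable-and-enable behavior, and polling until the routeCheck status is as expected) could look roughly like the sketch below. It is an assumption-laden illustration, not the repository's implementation: `wait_until_sketch` and `should_toggle_route_check` are hypothetical helpers, the `"t2"` topology label is a placeholder, and the signature of the real `wait_until()` is not taken from the source.

```python
import time


def wait_until_sketch(timeout, interval, condition,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll condition() until it returns True or `timeout` seconds elapse.

    Hypothetical stand-in for a wait_until()-style helper; sleep and clock
    are injectable so the loop can be exercised without real delays.
    """
    deadline = clock() + timeout
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def should_toggle_route_check(topo_type, enabled_topos=("t2",)):
    """Return True when the disable-and-enable logic applies to this topology."""
    return topo_type in enabled_topos
```

A fixture built on these pieces would skip the stop/start entirely when `should_toggle_route_check()` is false, and otherwise wait for the monitor to report the expected state before moving on to the next step.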
### Description of PR
Change the `temporarily_disable_route_check` fixture logic to only apply to T2 topology for now.

Summary:
Fixes # (issue) Microsoft ADO 36101536

### Type of change
- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
  - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [x] 202505
- [x] 202511

### Approach
#### What is the motivation for this PR?
The current disable-and-enable routeCheck monitor logic is causing test flakiness on some non-T2 platforms (see sonic-net/sonic-mgmt#16876 (comment)). Certain platforms require additional time to restart the routeCheck monitor, which can leave it inactive when the next test begins and result in false failures. We would like to address this issue urgently in this PR.

In a follow-up PR, I will properly enhance the `temporarily_disable_route_check` fixture so that:
- Users can choose which topologies apply the disable-and-enable routeCheck behavior
- The fixture uses a `wait_until()` timeout to verify the routeCheck status is as expected before proceeding to the next step

#### How did you do it?

#### How did you verify/test it?
I ran the updated logic on a non-T2 platform (Mx) and can confirm it's working well: https://elastictest.org/scheduler/testplan/693272f7392767e9bf67e930

I also verified on a T2 platform that the disable-and-enable logic still applies: https://elastictest.org/scheduler/testplan/6932767fbcc3fac23371a83c

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation