feat: add fixture for disabling route check #16876

Merged
yejianquan merged 3 commits into sonic-net:master from cyw233:add-disable-route-check-fixture
Feb 12, 2025
@cyw233
Contributor

@cyw233 cyw233 commented Feb 10, 2025

Description of PR

Add a module-level fixture for temporarily disabling route check for a test module

Summary:
Fixes # (issue) Microsoft ADO 31326413

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405
  • 202411

Approach

What is the motivation for this PR?

In a recent Cisco T2 nightly run, we observed that log analysis failed in some test modules because of the following error syslog:

E               Failed: Processes "['analyze_logs--<MultiAsicSonicHost dut-lc1-1>']" failed with exit code "1"
E               Exception:
E               match: 1
E               expected_match: 0
E               expected_missing_match: 0
E               
E               Match Messages:
E               2025 Feb  3 03:03:29.550827 svcstr2-8800-lc1-1 ERR monit[914]: 'routeCheck' status failed (255) -- Failure results: {{#012    "asic1": {#012        "Unaccounted_ROUTE_ENTRY_TABLE_entries": [#012            "100.1.0.22/32",#012

After discussion, we decided to add a fixture that lets users disable route check for a test module that tends to produce this error syslog.

How did you do it?

How did you verify/test it?

I ran the updated code and can confirm it's working well.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@cyw233 cyw233 requested a review from prgeor as a code owner February 10, 2025 10:08
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@cyw233 cyw233 requested review from abdosi and yejianquan and removed request for prgeor February 10, 2025 22:51
yield

if check_flag:
    logging.info("!!!!route check teardown!!!!!!")
Collaborator

Is it a debug log?

Contributor Author

😸 yep, let me delete it


if check_flag:
    logging.info("!!!!route check teardown!!!!!!")
    with SafeThreadPoolExecutor(max_workers=8) as executor:
Collaborator

Should we run another route check after the test to make sure the routes are healthy once the test module finishes?

Contributor Author

Good idea, let me add it. In that case, we probably need try...except...finally so that sudo monit start routeCheck always runs in the fixture teardown stage.
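
The teardown ordering discussed in this thread can be sketched as follows. This is a minimal, self-contained illustration, not the merged fixture: FakeDut, route_check_disabled, and the sudo monit stop routeCheck command are assumptions made for the example (the thread itself only names sudo route_check.py and sudo monit start routeCheck), and a plain context manager stands in for the module-scoped pytest fixture.

```python
from contextlib import contextmanager


class FakeDut:
    """Hypothetical stand-in for a duthost: records commands instead of running them."""

    def __init__(self, hostname, route_check_rc=0):
        self.hostname = hostname
        self.route_check_rc = route_check_rc
        self.commands = []

    def shell(self, cmd, module_ignore_errors=False):
        self.commands.append(cmd)
        # Only the route check command can fail in this fake.
        rc = self.route_check_rc if cmd == "sudo route_check.py" else 0
        return {"rc": rc}


def run_route_check(duthost, stage):
    """Fail loudly if the on-box route check reports a mismatch."""
    result = duthost.shell("sudo route_check.py", module_ignore_errors=True)
    if result["rc"] != 0:
        raise AssertionError(
            f"route_check failed on {duthost.hostname} in {stage} stage"
        )


@contextmanager
def route_check_disabled(duthosts):
    # Verify routes are healthy before touching the monitor.
    for dut in duthosts:
        run_route_check(dut, "pre-setup")
    try:
        for dut in duthosts:
            dut.shell("sudo monit stop routeCheck")  # assumed stop command
        yield
        # Post-test health check; it may raise, hence the finally below.
        for dut in duthosts:
            run_route_check(dut, "teardown")
    finally:
        # Always re-enable monitoring, even if the test or post-check failed.
        for dut in duthosts:
            dut.shell("sudo monit start routeCheck")
```

Keeping the post-test check inside the try block means a failing check still reaches the finally clause, so the monitor is not left stopped for the next module.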

@abdosi
Contributor

abdosi commented Feb 11, 2025

I don't see the change below from my original PR. It would be good to run this before the stop/start of the route check monit.

rc = duthost.shell("sudo route_check.py", module_ignore_errors=True)
if rc['rc'] != 0:
    pytest.fail("route_check fail in test pre-setup stage")

@cyw233
Contributor Author

cyw233 commented Feb 11, 2025

Hey @abdosi, the above code is already included in the run_route_check() function, and we use Python multithreading to run it on all frontend nodes before the stop and start of the route check monit. Thanks

with SafeThreadPoolExecutor(max_workers=8) as executor:
    for duthost in duthosts.frontend_nodes:
        executor.submit(run_route_check, duthost)
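
For readers without the sonic-mgmt helpers at hand, the same fan-out pattern can be sketched with the standard library. This is an illustration under assumptions: SafeThreadPoolExecutor is a project helper whose internals are not shown in this thread, so the standard concurrent.futures.ThreadPoolExecutor is swapped in here, and run_on_all_hosts is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_all_hosts(hosts, check, max_workers=8):
    """Run check(host) for every host in parallel and surface worker failures."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(check, host) for host in hosts]
    # Leaving the with-block waits for all workers to finish. A plain
    # ThreadPoolExecutor keeps exceptions inside the futures, so result()
    # is called on each one to re-raise the first failure in the caller.
    return [future.result() for future in futures]
```

With the standard executor, an exception raised inside a submitted task stays hidden unless result() is called, which is why the sketch collects the futures instead of discarding them.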

Collaborator

@yejianquan yejianquan left a comment

LGTM

@yejianquan yejianquan merged commit 5427836 into sonic-net:master Feb 12, 2025
13 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Feb 12, 2025
@mssonicbld
Collaborator

Cherry-pick PR to 202411: #16924

@mssonicbld
Collaborator

Cherry-pick PR to msft-202405: Azure/sonic-mgmt.msft#75

mssonicbld pushed a commit that referenced this pull request Feb 13, 2025
nnelluri-cisco pushed a commit to nnelluri-cisco/sonic-mgmt that referenced this pull request Mar 15, 2025
@ZhaohuiS
Contributor

ZhaohuiS commented Dec 5, 2025

@cyw233 these route check failures only happen on T2 chassis, right? But the test module runs on all platforms. As far as I can tell, route check didn't fail on T0/T1 platforms, so could this hide potential risk on other platforms?
Can we enable this fixture only for the impacted T2 platform?
Same concern for #21320.

@cyw233 cyw233 mentioned this pull request Dec 6, 2025
12 tasks
StormLiangMS pushed a commit that referenced this pull request Dec 11, 2025
…1627)

What is the motivation for this PR?
The current disable-and-enable routeCheck monitor logic is causing test flakiness on some non-T2 platforms (see #16876 (comment)). Certain platforms require additional time to restart the routeCheck monitor, which can leave it inactive when the next test begins and result in false failures. We would like to address this issue urgently in this PR.

In a follow-up PR, I will properly enhance the temporarily_disable_route_check fixture so that:

  • Users can choose which topologies apply the disable-and-enable routeCheck behavior
  • The fixture uses a wait_until() timeout to verify the routeCheck status is as expected before proceeding to the next step

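
The two planned enhancements can be sketched roughly as below. Both helpers are hypothetical and written only for illustration: poll_until is a generic stand-in for the wait_until() helper mentioned above (its real signature is not shown in this thread), and should_toggle_route_check assumes topology names such as "t2" or "mx" are available as plain strings.

```python
import time


def poll_until(timeout, interval, condition, *args):
    """Call condition(*args) until it is truthy or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if condition(*args):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def should_toggle_route_check(topo_name, enabled_prefixes=("t2",)):
    """Return True only for topologies opted in to the disable-and-enable logic."""
    return topo_name.startswith(tuple(enabled_prefixes))
```

A fixture built on these would skip the stop/start steps when should_toggle_route_check() is false, and after restarting the monitor would call poll_until() with a condition that reports whether routeCheck is being monitored again, failing the teardown if it returns False.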
How did you do it?
How did you verify/test it?
I ran the updated logic on a non-T2 platform (Mx) and can confirm it's working well:
https://elastictest.org/scheduler/testplan/693272f7392767e9bf67e930

I also verified on a T2 platform that the disable-and-enable logic still applies: https://elastictest.org/scheduler/testplan/6932767fbcc3fac23371a83c

Any platform specific information?
wangxin pushed a commit that referenced this pull request Dec 12, 2025
Change the temporarily_disable_route_check fixture logic to only apply to T2 topology for now.

Signed-off-by: Chenyang Wang <[email protected]>
Co-authored-by: Chenyang Wang <[email protected]>
mssonicbld added a commit to mssonicbld/sonic-mgmt.msft that referenced this pull request Dec 12, 2025
### Description of PR
Change the `temporarily_disable_route_check` fixture logic to only apply to T2 topology for now.

Summary:
Fixes # (issue) Microsoft ADO 36101536

### Type of change

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [x] 202505
- [x] 202511

cyw233 pushed a commit to Azure/sonic-mgmt.msft that referenced this pull request Dec 12, 2025
lakshmi-nexthop pushed a commit to lakshmi-nexthop/sonic-mgmt that referenced this pull request Feb 11, 2026
…1613)

Signed-off-by: Chenyang Wang <[email protected]>
Co-authored-by: Chenyang Wang <[email protected]>
Signed-off-by: Lakshmi Yarramaneni <[email protected]>