
Fix flakiness in pfcwd/test_pfcwd_cli.py#17411

Merged
StormLiangMS merged 2 commits into sonic-net:master from vivekverma-arista:fix-pfcwd-cli
Mar 20, 2025

Conversation

@vivekverma-arista
Contributor

@vivekverma-arista commented Mar 7, 2025

Description of PR

Summary: Fix flakiness in pfcwd/test_pfcwd_cli.py
Fixes #383

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405
  • 202411

Approach

What is the motivation for this PR?

Sometimes the test ends up picking an egress interface that happens to be a member of a LAG. If the LAG has multiple members and only one of them is stormed, the drop/forward expectations do not take LAG hashing into account.

Some of the traffic is hashed to another LAG member that is not being stormed, so no drops occur for it.

How did you do it?

The proposed fix is to shut down the remaining LAG members.
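The idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the PR's actual code (the helper name `lag_members_to_shutdown` and the port names are made up); it only shows which members would be shut down so that LAG hashing cannot steer traffic away from the stormed port:

```python
def lag_members_to_shutdown(lag_members, stormed_port):
    """Return the LAG members to shut down so that the stormed port
    is the only member left up and every flow hashes to it.

    Hypothetical helper for illustration only.
    """
    if stormed_port not in lag_members:
        raise ValueError("%s is not a member of this LAG" % stormed_port)
    return [member for member in lag_members if member != stormed_port]


# With Ethernet88 stormed, the other members get shut down.
print(lag_members_to_shutdown(["Ethernet88", "Ethernet92", "Ethernet96"],
                              "Ethernet88"))
```

On the DUT those members would then be shut down, so the drop counters match the expectations regardless of how the LAG hashes each flow.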

How did you verify/test it?

Tested on Arista-7260CX3 with dualtor-120 topology and 202411 image.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@StormLiangMS StormLiangMS requested a review from lipxu March 13, 2025 01:58
@StormLiangMS
Collaborator

/azp run Azure.sonic-mgmt

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


```python
class TestPfcwdFunc(SetupPfcwdFunc):
    """ Test PFC function and supporting methods """
    def __shutdown_lag_members(self, duthost, selected_port):
```
Contributor

Thanks for fixing this issue, but I still don't understand the situation clearly.
Do you mean that if the selected interface is a LAG member, and only this interface has the storm, the drop count would mismatch?

Contributor Author

Yes, the drop count will mismatch because some of the traffic won't even make it to the port being stormed; it will successfully egress through the other LAG members.

Contributor

Thanks for the information. I checked one of the failure logs; the drop counts from "show interfaces counters" and "show pfcwd stats" seem to mismatch, but I'm not sure whether it is related to this issue.
Could you please share the test result? Will this fix make the case pass 100% of the time? Thanks.

show interfaces counters:

```
      IFACE    STATE    RX_OK     RX_BPS    RX_UTIL    RX_ERR    RX_DRP    RX_OVR    TX_OK      TX_BPS    TX_UTIL    TX_ERR    TX_DRP    TX_OVR
-----------  -------  -------  ---------  ---------  --------  --------  --------  -------  ----------  ---------  --------  --------  --------
 Ethernet88        U       11  14.38 MB/s      0.12%         0         0         0       19    1.24 B/s      0.00%         0     **1,014**         0
```

show pfcwd stats:

```
       QUEUE    STATUS    STORM DETECTED/RESTORED    TX OK/DROP    RX OK/DROP    TX LAST OK/DROP    RX LAST OK/DROP
------------  --------  -------------------------  ------------  ------------  -----------------  -----------------
Ethernet88:4   stormed                        1/0         0/507           0/0              0/**507**                0/0
```

Contributor Author

@vivekverma-arista commented Mar 14, 2025

This is unrelated to the issue fixed by the PR. I will share some counters data that will explain the kind of failure this PR fixes.

@StormLiangMS
Collaborator

hi @vivekverma-arista, could you check the SA failure?

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@vivekverma-arista
Contributor Author

> hi @vivekverma-arista, could you check the SA failure?

Done

Collaborator

@StormLiangMS left a comment

lgtm

@StormLiangMS StormLiangMS merged commit 14c3ff2 into sonic-net:master Mar 20, 2025
13 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Mar 20, 2025
@mssonicbld
Collaborator

Cherry-pick PR to 202411: #17619

@StormLiangMS
Collaborator

merged.

amulyan7 pushed a commit to amulyan7/sonic-mgmt that referenced this pull request Mar 31, 2025
OriTrabelsi pushed a commit to OriTrabelsi/sonic-mgmt that referenced this pull request Apr 1, 2025
@nhe-NV
Contributor

nhe-NV commented May 20, 2025

Hi @vivekverma-arista, after this PR the test fails when the selected port is in a port channel and there is more than one port in the port channel. I have opened a ticket; could you help check? #18496

yxieca pushed a commit that referenced this pull request Jul 1, 2025
nnelluri-cisco pushed a commit to nnelluri-cisco/sonic-mgmt that referenced this pull request Jul 1, 2025
sdszhang pushed a commit to sdszhang/sonic-mgmt that referenced this pull request Aug 2, 2025
Code sync sonic-net/sonic-mgmt:202411 => 202412

```
*   a1064ff (HEAD -> code-sync-202412, origin/code-sync-202412) r12f 250702:1620 - Merge remote-tracking branch 'base/202411' into code-sync-202412
|\
| * f98c8b2 (base/202411) jingwenxie 250513:1319 - Update logger to non user config table (sonic-net#18250)
| * 7958657 Chun'ang Li 250702:1223 - manual cherry pick PR https://github.com/sonic-net/sonic-mgmt/pull/19116/files (sonic-net#19322)
| * 14dda64 mssonicbld 250702:0533 - Fix flakiness in pfcwd/test_pfcwd_cli.py (sonic-net#17411) (sonic-net#17619)
```
StormLiangMS pushed a commit that referenced this pull request Aug 12, 2025
What is the motivation for this PR?
Recent fix: #17411

The test was flaky before this fix (and continues to be). When the test picks an egress interface that happens to be a member of a LAG with multiple members, only that member is stormed, and some of the traffic successfully egresses out of the other LAG members, leading to fewer drops than expected when PFCWD is triggered with the DROP action. The proposed fix was to shut down all but one LAG member after reducing min_links. But the same config was missing on cEOS, so the LAG does not come up after the other LAG members are shut down.

This is being rectified in this change for cEOS neighbors.

How did you do it?
The proposed fix is to change the min_links setting for the involved port channel on the cEOS side as well.

How did you verify/test it?
Stressed this test 10 times on dualtor-120 and t0-116 with Arista 7260CX3 platform.
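The neighbor-side change can be sketched as follows. `port-channel min-links` is standard EOS interface configuration syntax, but the helper below and the port-channel ID are illustrative assumptions, not the actual sonic-mgmt code:

```python
def ceos_min_links_config(portchannel_id, min_links=1):
    """Build the EOS config lines that lower min-links on the cEOS
    neighbor's port channel, so the LAG stays up after all but one
    DUT-side member is shut down.

    Hypothetical helper for illustration only.
    """
    return [
        "interface Port-Channel%d" % portchannel_id,
        "   port-channel min-links %d" % min_links,
    ]


# Print the config that would be pushed to the cEOS neighbor.
for line in ceos_min_links_config(101):
    print(line)
```

Lowering min-links on both sides keeps the port channel operational with a single member, which is exactly the state the test needs.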
ashutosh-agrawal pushed a commit to ashutosh-agrawal/sonic-mgmt that referenced this pull request Aug 14, 2025
yejianquan pushed a commit that referenced this pull request Aug 20, 2025
Description of PR
Summary: Fixes #714, #18496
StormLiangMS pushed a commit that referenced this pull request Aug 28, 2025
vidyac86 pushed a commit to vidyac86/sonic-mgmt that referenced this pull request Oct 23, 2025
opcoder0 pushed a commit to opcoder0/sonic-mgmt that referenced this pull request Dec 8, 2025
opcoder0 pushed a commit to opcoder0/sonic-mgmt that referenced this pull request Dec 8, 2025
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 16, 2025
AharonMalkin pushed a commit to AharonMalkin/sonic-mgmt that referenced this pull request Dec 16, 2025
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Jan 13, 2026
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Jan 26, 2026
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Jan 26, 2026
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Feb 2, 2026
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Mar 27, 2026