
Fix flakiness in pfcwd/test_pfcwd_cli.py#17411

Merged
StormLiangMS merged 2 commits into sonic-net:master from vivekverma-arista:fix-pfcwd-cli
Mar 20, 2025

Conversation

@vivekverma-arista
Contributor

@vivekverma-arista commented Mar 7, 2025

Description of PR

Summary: Fix flakiness in pfcwd/test_pfcwd_cli.py
Fixes #383

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202012
  • 202205
  • 202305
  • 202311
  • 202405
  • 202411

Approach

What is the motivation for this PR?

Sometimes the test ends up picking an egress interface that happens to be a member of a LAG. If the LAG has multiple members and only one of them is stormed, the drop/forward expectations do not take LAG hashing into account.

Some of the traffic is hashed to another LAG member that is not being stormed, so no drops occur for it.

How did you do it?

The proposed fix is to shut down the remaining LAG members.
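The idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the PR's actual code (the helper name `lag_members_to_shutdown` and the port names are made up); it only shows which members would be shut down so that LAG hashing cannot steer traffic away from the stormed port:

```python
def lag_members_to_shutdown(lag_members, stormed_port):
    """Return the LAG members to shut down so that the stormed port
    is the only member left up and every flow hashes to it.

    Hypothetical helper for illustration only.
    """
    if stormed_port not in lag_members:
        raise ValueError("%s is not a member of this LAG" % stormed_port)
    return [member for member in lag_members if member != stormed_port]


# With Ethernet88 stormed, the other members get shut down.
print(lag_members_to_shutdown(["Ethernet88", "Ethernet92", "Ethernet96"],
                              "Ethernet88"))
```

On the DUT those members would then be shut down, so the drop counters match the expectations regardless of how the LAG hashes each flow.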

How did you verify/test it?

Tested on Arista-7260CX3 with dualtor-120 topology and 202411 image.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@StormLiangMS StormLiangMS requested a review from lipxu March 13, 2025 01:58
@StormLiangMS
Collaborator

/azp run Azure.sonic-mgmt

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


```python
class TestPfcwdFunc(SetupPfcwdFunc):
    """ Test PFC function and supporting methods """
    def __shutdown_lag_members(self, duthost, selected_port):
```
Contributor

Thanks for fixing this issue, but I still don't understand the situation clearly.
Do you mean that if the selected interface is a LAG member, and only this interface has the storm, the drop count would mismatch?

Contributor Author

Yes, the drop count will mismatch because some of the traffic won't even make it to the port being stormed; it will successfully egress through the other LAG members.

Contributor

Thanks for the information. I checked one of the failure logs; the drop counts from "show interfaces counters" and "show pfcwd stats" seem to mismatch, but I'm not sure whether it is related to this issue.
Could you please share the test result? Will this fix make the case pass 100% of the time? Thanks.

show interfaces counters:

```
      IFACE    STATE    RX_OK     RX_BPS    RX_UTIL    RX_ERR    RX_DRP    RX_OVR    TX_OK      TX_BPS    TX_UTIL    TX_ERR    TX_DRP    TX_OVR
-----------  -------  -------  ---------  ---------  --------  --------  --------  -------  ----------  ---------  --------  --------  --------
 Ethernet88        U       11  14.38 MB/s      0.12%         0         0         0       19    1.24 B/s      0.00%         0     **1,014**         0
```

show pfcwd stats:

```
       QUEUE    STATUS    STORM DETECTED/RESTORED    TX OK/DROP    RX OK/DROP    TX LAST OK/DROP    RX LAST OK/DROP
------------  --------  -------------------------  ------------  ------------  -----------------  -----------------
Ethernet88:4   stormed                        1/0         0/507           0/0              0/**507**                0/0
```

Contributor Author

@vivekverma-arista commented Mar 14, 2025

This is unrelated to the issue fixed by the PR. I will share some counters data that will explain the kind of failure this PR fixes.

@StormLiangMS
Collaborator

hi @vivekverma-arista, could you check the SA failure?

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@vivekverma-arista
Contributor Author

> hi @vivekverma-arista, could you check the SA failure?

Done

Collaborator

@StormLiangMS left a comment

lgtm

@StormLiangMS StormLiangMS merged commit 14c3ff2 into sonic-net:master Mar 20, 2025
13 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Mar 20, 2025
@mssonicbld
Collaborator

Cherry-pick PR to 202411: #17619

@StormLiangMS
Collaborator

merged.

amulyan7 pushed a commit to amulyan7/sonic-mgmt that referenced this pull request Mar 31, 2025
OriTrabelsi pushed a commit to OriTrabelsi/sonic-mgmt that referenced this pull request Apr 1, 2025
@nhe-NV
Contributor

nhe-NV commented May 20, 2025

Hi @vivekverma-arista, after this PR the test fails when the selected port is in a port channel and there is more than one port in the port channel. I have opened a ticket; could you help check? #18496

yxieca pushed a commit that referenced this pull request Jul 1, 2025
nnelluri-cisco pushed a commit to nnelluri-cisco/sonic-mgmt that referenced this pull request Jul 1, 2025
sdszhang pushed a commit to sdszhang/sonic-mgmt that referenced this pull request Aug 2, 2025
Code sync sonic-net/sonic-mgmt:202411 => 202412

```
*   a1064ff (HEAD -> code-sync-202412, origin/code-sync-202412) r12f 250702:1620 - Merge remote-tracking branch 'base/202411' into code-sync-202412
|\
| * f98c8b2 (base/202411) jingwenxie 250513:1319 - Update logger to non user config table (sonic-net#18250)
| * 7958657 Chun'ang Li 250702:1223 - manual cherry pick PR https://github.com/sonic-net/sonic-mgmt/pull/19116/files (sonic-net#19322)
| * 14dda64 mssonicbld 250702:0533 - Fix flakiness in pfcwd/test_pfcwd_cli.py (sonic-net#17411) (sonic-net#17619)
```
StormLiangMS pushed a commit that referenced this pull request Aug 12, 2025
What is the motivation for this PR?
Recent fix: #17411

The test was flaky before this fix (and continues to be). When the test picks an egress interface that happens to be a member of a LAG with multiple members, only that member is stormed, and some of the traffic successfully egresses out of the other LAG members, leading to fewer drops than expected when PFCWD is triggered with the DROP action. The proposed fix was to shut down all but one LAG member after reducing min_links. But the same config was missing on cEOS, so the LAG does not come up after the other LAG members are shut down.

This is being rectified in this change for cEOS neighbors.

How did you do it?
The proposed fix is to change the min_links setting for the involved port channel on the cEOS side as well.

How did you verify/test it?
Stressed this test 10 times on dualtor-120 and t0-116 with Arista 7260CX3 platform.
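The neighbor-side change can be sketched as follows. `port-channel min-links` is standard EOS interface configuration syntax, but the helper below and the port-channel ID are illustrative assumptions, not the actual sonic-mgmt code:

```python
def ceos_min_links_config(portchannel_id, min_links=1):
    """Build the EOS config lines that lower min-links on the cEOS
    neighbor's port channel, so the LAG stays up after all but one
    DUT-side member is shut down.

    Hypothetical helper for illustration only.
    """
    return [
        "interface Port-Channel%d" % portchannel_id,
        "   port-channel min-links %d" % min_links,
    ]


# Print the config that would be pushed to the cEOS neighbor.
for line in ceos_min_links_config(101):
    print(line)
```

Lowering min-links on both sides keeps the port channel operational with a single member, which is exactly the state the test needs.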
ashutosh-agrawal pushed a commit to ashutosh-agrawal/sonic-mgmt that referenced this pull request Aug 14, 2025
yejianquan pushed a commit that referenced this pull request Aug 20, 2025
Description of PR
Summary: Fixes #714, #18496
StormLiangMS pushed a commit that referenced this pull request Aug 28, 2025
vidyac86 pushed a commit to vidyac86/sonic-mgmt that referenced this pull request Oct 23, 2025
opcoder0 pushed a commit to opcoder0/sonic-mgmt that referenced this pull request Dec 8, 2025
opcoder0 pushed a commit to opcoder0/sonic-mgmt that referenced this pull request Dec 8, 2025
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 16, 2025
AharonMalkin pushed a commit to AharonMalkin/sonic-mgmt that referenced this pull request Dec 16, 2025
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Jan 13, 2026
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Jan 26, 2026
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Jan 26, 2026
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Feb 2, 2026
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Mar 27, 2026