Fix flakiness in pfcwd/test_pfcwd_cli.py #17411
StormLiangMS merged 2 commits into sonic-net:master from
Conversation
/azp run

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

/azp run Azure.sonic-mgmt

Azure Pipelines successfully started running 1 pipeline(s).
```python
class TestPfcwdFunc(SetupPfcwdFunc):
    """ Test PFC function and supporting methods """
    def __shutdown_lag_members(self, duthost, selected_port):
```
Thanks for fixing this issue, but I still don't understand the situation clearly.
Do you mean that if the selected interface is a member of a LAG, and only this interface has the storm, the drop count would mismatch?
Yes, the drop count will mismatch because some of the traffic won't even make it to the port being stormed; it will successfully egress through the other LAG members.
Thanks for the information. I checked one of the failure logs, and the drop counts from "show interfaces counters" and "show pfcwd stats" seem to mismatch, but I'm not sure whether it is related to this issue.
Could you please share the test result? Will this fix make the case pass 100% of the time? Thanks.
show interfaces counters:

```
      IFACE    STATE    RX_OK      RX_BPS    RX_UTIL    RX_ERR    RX_DRP    RX_OVR    TX_OK    TX_BPS    TX_UTIL    TX_ERR     TX_DRP    TX_OVR
-----------  -------  -------  ----------  ---------  --------  --------  --------  -------  --------  ---------  --------  ---------  --------
 Ethernet88        U       11  14.38 MB/s      0.12%         0         0         0       19  1.24 B/s      0.00%         0  **1,014**         0
```

show pfcwd stats:

```
       QUEUE    STATUS    STORM DETECTED/RESTORED    TX OK/DROP    RX OK/DROP    TX LAST OK/DROP    RX LAST OK/DROP
------------  --------  -------------------------  ------------  ------------  -----------------  -----------------
Ethernet88:4   stormed                        1/0         0/507           0/0          0/**507**                0/0
```
This is unrelated to the issue fixed by the PR. I will share some counters data that will explain the kind of failure this PR fixes.
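The LAG-hashing effect behind this kind of mismatch can be illustrated with a small sketch. The function below is illustrative only (it is not from the test): assuming an even hash distribution across N LAG members, only roughly 1/N of the traffic lands on the stormed member and can be dropped; the rest egresses through healthy members.

```python
def expected_drops(total_packets, lag_member_count):
    """Rough expectation of dropped packets when only one LAG member
    is stormed and the hash spreads traffic evenly across members."""
    if lag_member_count < 1:
        raise ValueError("a LAG needs at least one member")
    # Only the share hashed to the stormed member can be dropped;
    # the remainder egresses through the non-stormed members.
    return total_packets // lag_member_count

# With a single active member (after shutting down the others),
# every packet hits the stormed port, so drops match the traffic count.
print(expected_drops(1000, 1))  # 1000
print(expected_drops(1000, 4))  # 250
```

This is why shutting down the other LAG members makes the drop expectation deterministic again.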
hi @vivekverma-arista, could you check the SA failure?

/azp run

Azure Pipelines successfully started running 1 pipeline(s).
Done |
What is the motivation for this PR?
Sometimes the test ends up picking an egress interface which happens to be a member of a LAG. If the LAG has multiple members and only one of them is stormed, the drop/forward expectations don't take LAG hashing into account. Some of the traffic is hashed to another LAG member which is not being stormed, and no drops will occur.
How did you do it?
The proposed fix is to shut down the remaining LAG members.
How did you verify/test it?
Tested on Arista-7260CX3 with dualtor-120 topology and the 202411 image.
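A minimal sketch of the approach, assuming the standard SONiC `config interface shutdown` CLI; the helper name is illustrative, not the PR's exact code, and on a real DUT the commands would be run through something like `duthost.shell(cmd)`:

```python
def lag_shutdown_commands(lag_members, selected_port):
    """Build 'config interface shutdown' commands for every LAG member
    except the stormed one, so all traffic hashes to that member."""
    if selected_port not in lag_members:
        raise ValueError(f"{selected_port} is not a member of this LAG")
    return [
        f"config interface shutdown {member}"
        for member in lag_members
        if member != selected_port
    ]

# Example: storm Ethernet88 and take its LAG sibling down.
cmds = lag_shutdown_commands(["Ethernet88", "Ethernet92"], "Ethernet88")
print(cmds)  # ['config interface shutdown Ethernet92']
```

With only one member left up, none of the stormed traffic can escape through the LAG.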
Cherry-pick PR to 202411: #17619

merged.
Hi @vivekverma-arista, after this PR the test fails when the selected port is in a portchannel and there is more than one port in the portchannel. I have opened a ticket, could you help check? #18496
Code sync sonic-net/sonic-mgmt:202411 => 202412

```
* a1064ff (HEAD -> code-sync-202412, origin/code-sync-202412) r12f 250702:1620 - Merge remote-tracking branch 'base/202411' into code-sync-202412
|\
| * f98c8b2 (base/202411) jingwenxie 250513:1319 - Update logger to non user config table (sonic-net#18250)
| * 7958657 Chun'ang Li 250702:1223 - manual cherry pick PR https://github.com/sonic-net/sonic-mgmt/pull/19116/files (sonic-net#19322)
| * 14dda64 mssonicbld 250702:0533 - Fix flakiness in pfcwd/test_pfcwd_cli.py (sonic-net#17411) (sonic-net#17619)
```
Description of PR
Summary: Fixes #714, #18496
Type of change
Bug fix
Test case improvement
Approach
What is the motivation for this PR?
Recent fix: #17411
The test was flaky before this fix (and continues to be so). When the test picks an egress interface which happens to be a member of a LAG with multiple members, only that member is stormed, and some of the traffic successfully egresses out of the other LAG members, leading to fewer drops than expected when PFCWD is triggered with the DROP action. The earlier fix was to shut down all but one LAG member by reducing the number of min_links. But the same config was missing on cEOS, so the LAG doesn't come up after shutting down the other LAG members. This change rectifies that for cEOS neighbors.
How did you do it?
The fix is to change the min_links setting for the involved port channel on the cEOS side as well.
How did you verify/test it?
Stressed this test 10 times on dualtor-120 and t0-116 with the Arista 7260CX3 platform.
signed-off-by: [email protected]
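The cEOS-side change can be sketched as building the EOS CLI lines that lower `min-links` on the neighbor's port channel; the port-channel ID and helper are illustrative (the real test drives this through the neighbor host object), but `port-channel min-links` is standard EOS syntax:

```python
def ceos_min_links_commands(portchannel_id, min_links=1):
    """EOS CLI lines to lower 'min-links' on the neighbor's port channel,
    so the LAG stays up after the extra members are shut down."""
    if min_links < 1:
        raise ValueError("min-links must be at least 1")
    return [
        "configure",
        f"interface Port-Channel{portchannel_id}",
        f"port-channel min-links {min_links}",
        "exit",
    ]

# Example: keep Port-Channel1 up with a single active member.
print(ceos_min_links_commands(1))
```

Without this, the port channel on the cEOS neighbor drops below its configured minimum and goes down as soon as the sibling members are shut.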
Description of PR
Summary: Fix flakiness in pfcwd/test_pfcwd_cli.py
Fixes #383
Type of change
Back port request
Approach
What is the motivation for this PR?
Sometimes the test ends up picking an egress interface which happens to be a member of a LAG. If the LAG has multiple members and only one of them is stormed, the drop/forward expectations don't take LAG hashing into account.
Some of the traffic is hashed to another LAG member which is not being stormed, and no drops will occur.
How did you do it?
The proposed fix is to shut down the remaining LAG members.
How did you verify/test it?
Tested on Arista-7260CX3 with dualtor-120 topology and 202411 image.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation