pfcwd: scale all-port-storm restore timeout by port count #23301
Merged
lolyu merged 1 commit into sonic-net:master on Mar 27, 2026
Conversation
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Collaborator yxieca previously approved these changes on Mar 26, 2026, commenting:
AI agent on behalf of Ying. Reviewed; no issues found.
Signed-off-by: Liping Xu <[email protected]> Co-authored-by: Copilot <[email protected]>
ravaliyel pushed a commit to ravaliyel/sonic-mgmt that referenced this pull request on Mar 27, 2026.
Scale the PFCwd restore phase timeout by port count (~2s per port) to prevent
false failures on high port-count platforms.
Description of PR
Root cause:
On platforms with many ports (e.g. Arista-7260CX3 with 92 ports in T0), pfc_gen_brcm_xgs.py — the PFC storm generator running on the Arista EOS fanout — processes fanout interfaces sequentially, delivering PFC frames to approximately one port every 2 seconds. On a 92-port switch, the last fanout interface only starts (or stops) receiving PFC frames ~184 seconds after the first.
This means the effective storm detection and restore windows are proportional to the port count, not the pfcwd restore_time config. The hardcoded 60s timeout causes the test to fail even though the DUT is functioning correctly — the storm simply hasn't fully started or stopped on all ports yet.
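The sequential-delivery arithmetic above can be sanity-checked with a back-of-envelope sketch. The ~2s-per-port rate and the 92-port count are taken from this PR's analysis; the helper name is illustrative, not from the sonic-mgmt code base.

```python
SECONDS_PER_PORT = 2  # approximate per-port rate of pfc_gen_brcm_xgs.py (from this PR)

def last_port_lag(num_ports, seconds_per_port=SECONDS_PER_PORT):
    """Approximate delay before the last fanout port sees the storm start/stop."""
    return num_ports * seconds_per_port

print(last_port_lag(92))  # 184 -> far beyond the old hardcoded 60s timeout
print(last_port_lag(16))  # 32  -> small testbeds stay well under 60s
```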
Observed in Elastictest plan 69c0ad1a8bf1d6da18056056.
Fix:
timeout = max(60, num_ports * 2) — gives 184s for 92 ports.
The LT2/FT2 floor of 120s is preserved via timeout = max(timeout, 120).
ADO tracking: https://msazure.visualstudio.com/One/_workitems/edit/37099434
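Combining the two formulas above, the scaled timeout can be sketched as follows. This is a minimal illustration of the fix described in this PR, not the exact sonic-mgmt code; `is_lt2_ft2` is a stand-in for however the test detects an LT2/FT2 topology.

```python
def restore_timeout(num_ports, is_lt2_ft2=False):
    """Restore-phase timeout scaled by port count (~2s per port, 60s floor)."""
    timeout = max(60, num_ports * 2)
    if is_lt2_ft2:
        timeout = max(timeout, 120)  # preserve the existing LT2/FT2 floor
    return timeout

print(restore_timeout(92))                   # 184
print(restore_timeout(10))                   # 60
print(restore_timeout(10, is_lt2_ft2=True))  # 120
```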
Type of change
Back port request
Approach
What is the motivation for this PR?
The restore timeout is hardcoded at 60s (120s for LT2/FT2). On Arista EOS fanouts, pfc_gen_brcm_xgs.py delivers PFC frames to each fanout interface sequentially at ~2s per port. With 92 ports, the last port stops receiving PFC frames ~184s after stop_storm() is called, so pfcwd cannot restore it until that point. The test was always timing out before the last ports could restore.
How did you do it?
- Compute num_ports from stormed_ports_list (restore phase) or selected_test_ports (storm phase)
- Set timeout = max(60, num_ports * 2) before the LT2/FT2 check
- Replace the LT2/FT2 check with timeout = max(timeout, 120) to preserve existing behavior
- Updated the code comment to explain the actual root cause (pfc_gen sequential processing)
How did you verify/test it?
Log analysis of Elastictest plan 69c45eb6664a140429da9224 (7260CX3, T0, 202511):
- After stop_storm(), ports restored gradually over ~190s (27 restored at t=9s, 38 still unreachable at t=57s, 1 remaining at t=190s)
- Pattern matches sequential fanout processing: first port in cycle restores immediately, last port (Ethernet254:3) takes the full ~184s cycle to complete
- New timeout for 92 ports: max(60, 92×2) = 184s covers all but edge cases at cycle boundary
Any platform specific information?
Primarily affects high port-count platforms (Arista-7260CX3, etc.) running T0 topology with Arista EOS fanouts using pfc_gen_brcm_xgs.py.
Supported testbed topology if it is a new test case?
Existing test — all topologies that run test_pfcwd_all_port_storm.
Documentation
N/A