
pfcwd: scale all-port-storm restore timeout by port count #23301

Merged
lolyu merged 1 commit into sonic-net:master from lipxu:fix/pfcwd-restore-timeout-scaling
Mar 27, 2026

Conversation

@lipxu (Contributor) commented Mar 25, 2026

Scale the PFCwd restore phase timeout by port count (~2s per port) to prevent
false failures on high port-count platforms.

Description of PR

Root cause:

On platforms with many ports (e.g. Arista-7260CX3 with 92 ports in T0),
pfc_gen_brcm_xgs.py — the PFC storm generator running on the Arista EOS fanout —
processes fanout interfaces sequentially, delivering PFC frames to approximately
one port every 2 seconds. On a 92-port switch, the last fanout interface only starts
(or stops) receiving PFC frames ~184 seconds after the first.

This means the effective storm detection and restore windows are proportional to the
port count, not the pfcwd restore_time config. The hardcoded 60s timeout causes the
test to fail even though the DUT is functioning correctly — the storm simply hasn't
fully started or stopped on all ports yet.
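As a back-of-the-envelope check, the sweep time grows linearly with port count (the ~2 s/port rate is the figure quoted above for pfc_gen_brcm_xgs.py, not measured here; the constants are illustrative):

```python
# Rough model of the fanout's sequential PFC-frame delivery.
PER_PORT_DELAY_S = 2   # approximate per-interface processing time quoted above
NUM_PORTS = 92         # Arista-7260CX3 T0 example from the description

full_sweep_s = NUM_PORTS * PER_PORT_DELAY_S  # time to touch every interface once
print(full_sweep_s)       # 184 -> last port lags the first by roughly this much
print(full_sweep_s > 60)  # True: a fixed 60 s window cannot cover the sweep
```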

Observed in Elastictest plan 69c0ad1a8bf1d6da18056056:

  • 92/92 ports entered storm state (detection phase PASSES with generous timing)
  • Only 46/92 (50%) restored within 60s → test FAILS
  • After ~184s, all ports restore — DUT is healthy, the timeout was too short

Fix: timeout = max(60, num_ports * 2) — gives 184s for 92 ports.
LT2/FT2 floor of 120s is preserved via timeout = max(timeout, 120).
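A minimal sketch of the scaling rule (the function name and the boolean flag are illustrative, not the actual test code):

```python
def scaled_restore_timeout(num_ports: int, is_lt2_or_ft2: bool = False) -> int:
    """Scale the restore timeout by port count (~2 s/port), keeping the
    original 60 s floor and the existing 120 s floor for LT2/FT2."""
    timeout = max(60, num_ports * 2)
    if is_lt2_or_ft2:
        timeout = max(timeout, 120)
    return timeout

print(scaled_restore_timeout(92))        # 184: covers the 92-port sweep
print(scaled_restore_timeout(10))        # 60: small platforms keep the old floor
print(scaled_restore_timeout(10, True))  # 120: LT2/FT2 floor preserved
```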

ADO tracking: https://msazure.visualstudio.com/One/_workitems/edit/37099434

Type of change

  • Bug fix

Back port request

  • 202511

Approach

What is the motivation for this PR?

The restore timeout is hardcoded at 60s (120s for LT2/FT2). On Arista EOS fanouts,
pfc_gen_brcm_xgs.py delivers PFC frames to each fanout interface sequentially at
~2s per port. With 92 ports, the last port stops receiving PFC frames ~184s after
stop_storm() is called, so pfcwd cannot restore it until that point. The test was
always timing out before the last ports could restore.

How did you do it?

  • Compute num_ports from stormed_ports_list (restore phase) or selected_test_ports (storm phase)
  • Set timeout = max(60, num_ports * 2) before the LT2/FT2 check
  • Replace the LT2/FT2 check with timeout = max(timeout, 120) to preserve existing behavior
  • Updated code comment to explain the actual root cause (pfc_gen sequential processing)
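The steps above can be sketched as follows; stormed_ports_list and selected_test_ports are the names from the PR description, while the function and its surrounding structure are hypothetical:

```python
def compute_timeout(phase: str,
                    stormed_ports_list: list,
                    selected_test_ports: list,
                    is_lt2_or_ft2: bool = False) -> int:
    # Pick the port list for the current phase, per the bullets above.
    ports = stormed_ports_list if phase == "restore" else selected_test_ports
    timeout = max(60, len(ports) * 2)   # ~2 s per port, 60 s floor
    if is_lt2_or_ft2:
        timeout = max(timeout, 120)     # preserve the existing LT2/FT2 floor
    return timeout

print(compute_timeout("restore", ["Ethernet%d" % i for i in range(92)], []))  # 184
```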

How did you verify/test it?

Log analysis of Elastictest plan 69c45eb6664a140429da9224 (7260CX3, T0, 202511):

  • After stop_storm(), ports restored gradually over ~190s (27 restored at t=9s, 38 still
    unreachable at t=57s, 1 remaining at t=190s)
  • Pattern matches sequential fanout processing: first port in cycle restores immediately,
    last port (Ethernet254:3) takes the full ~184s cycle to complete
  • New timeout for 92 ports: max(60, 92×2) = 184s covers all but edge cases at cycle boundary

Any platform specific information?

Primarily affects high port-count platforms (Arista-7260CX3, etc.) running T0 topology
with Arista EOS fanouts using pfc_gen_brcm_xgs.py.

Supported testbed topology if it is a new test case?

Existing test — all topologies that run test_pfcwd_all_port_storm.

Documentation

N/A

@mssonicbld (Collaborator)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

lipxu force-pushed the fix/pfcwd-restore-timeout-scaling branch from 807cf1e to f9edf98 on March 25, 2026 09:55

lipxu force-pushed the fix/pfcwd-restore-timeout-scaling branch from f9edf98 to ae0e579 on March 25, 2026 11:02

@yxieca (Collaborator) previously approved these changes Mar 26, 2026 and left a comment:

AI agent on behalf of Ying. Reviewed; no issues found.


lipxu force-pushed the fix/pfcwd-restore-timeout-scaling branch from 68dd0d4 to c80e307 on March 26, 2026 05:03

lipxu force-pushed the fix/pfcwd-restore-timeout-scaling branch from c80e307 to b2a69bd on March 26, 2026 06:19

lolyu enabled auto-merge (squash) on March 27, 2026 01:08
@lolyu (Collaborator) left a comment:
LGTM

lolyu merged commit 3cd82dc into sonic-net:master on Mar 27, 2026
18 of 19 checks passed
ravaliyel pushed a commit to ravaliyel/sonic-mgmt that referenced this pull request Mar 27, 2026