
pfcwd: scale all-port-storm restore timeout by port count #23301

Merged
lolyu merged 1 commit into sonic-net:master from lipxu:fix/pfcwd-restore-timeout-scaling
Mar 27, 2026

Conversation

@lipxu (Contributor) commented Mar 25, 2026

Scale the PFCwd restore phase timeout by port count (~2s per port) to prevent
false failures on high port-count platforms.

Description of PR

Root cause:

On platforms with many ports (e.g. Arista-7260CX3 with 92 ports in T0),
pfc_gen_brcm_xgs.py — the PFC storm generator running on the Arista EOS fanout —
processes fanout interfaces sequentially, delivering PFC frames to approximately
one port every 2 seconds. On a 92-port switch, the last fanout interface only starts
(or stops) receiving PFC frames ~184 seconds after the first.

This means the effective storm detection and restore windows are proportional to the
port count, not the pfcwd restore_time config. The hardcoded 60s timeout causes the
test to fail even though the DUT is functioning correctly — the storm simply hasn't
fully started or stopped on all ports yet.
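As a back-of-the-envelope check, the sweep time grows linearly with port count (the ~2 s/port rate is the figure quoted above for pfc_gen_brcm_xgs.py, not measured here; the constants are illustrative):

```python
# Rough model of the fanout's sequential PFC-frame delivery.
PER_PORT_DELAY_S = 2   # approximate per-interface processing time quoted above
NUM_PORTS = 92         # Arista-7260CX3 T0 example from the description

full_sweep_s = NUM_PORTS * PER_PORT_DELAY_S  # time to touch every interface once
print(full_sweep_s)       # 184 -> last port lags the first by roughly this much
print(full_sweep_s > 60)  # True: a fixed 60 s window cannot cover the sweep
```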

Observed in Elastictest plan 69c0ad1a8bf1d6da18056056:

  • 92/92 ports entered storm state (detection phase PASSES with generous timing)
  • Only 46/92 (50%) restored within 60s → test FAILS
  • After ~184s, all ports restore — DUT is healthy, the timeout was too short

Fix: timeout = max(60, num_ports * 2) — gives 184s for 92 ports.
LT2/FT2 floor of 120s is preserved via timeout = max(timeout, 120).
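A minimal sketch of the scaling rule (the function name and the boolean flag are illustrative, not the actual test code):

```python
def scaled_restore_timeout(num_ports: int, is_lt2_or_ft2: bool = False) -> int:
    """Scale the restore timeout by port count (~2 s/port), keeping the
    original 60 s floor and the existing 120 s floor for LT2/FT2."""
    timeout = max(60, num_ports * 2)
    if is_lt2_or_ft2:
        timeout = max(timeout, 120)
    return timeout

print(scaled_restore_timeout(92))        # 184: covers the 92-port sweep
print(scaled_restore_timeout(10))        # 60: small platforms keep the old floor
print(scaled_restore_timeout(10, True))  # 120: LT2/FT2 floor preserved
```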

ADO tracking: https://msazure.visualstudio.com/One/_workitems/edit/37099434

Type of change

  • Bug fix

Back port request

  • 202511

Approach

What is the motivation for this PR?

The restore timeout is hardcoded at 60s (120s for LT2/FT2). On Arista EOS fanouts,
pfc_gen_brcm_xgs.py delivers PFC frames to each fanout interface sequentially at
~2s per port. With 92 ports, the last port stops receiving PFC frames ~184s after
stop_storm() is called, so pfcwd cannot restore it until that point. The test was
always timing out before the last ports could restore.

How did you do it?

  • Compute num_ports from stormed_ports_list (restore phase) or selected_test_ports (storm phase)
  • Set timeout = max(60, num_ports * 2) before the LT2/FT2 check
  • Replace the LT2/FT2 check with timeout = max(timeout, 120) to preserve existing behavior
  • Updated code comment to explain the actual root cause (pfc_gen sequential processing)
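The steps above can be sketched as follows; stormed_ports_list and selected_test_ports are the names from the PR description, while the function and its surrounding structure are hypothetical:

```python
def compute_timeout(phase: str,
                    stormed_ports_list: list,
                    selected_test_ports: list,
                    is_lt2_or_ft2: bool = False) -> int:
    # Pick the port list for the current phase, per the bullets above.
    ports = stormed_ports_list if phase == "restore" else selected_test_ports
    timeout = max(60, len(ports) * 2)   # ~2 s per port, 60 s floor
    if is_lt2_or_ft2:
        timeout = max(timeout, 120)     # preserve the existing LT2/FT2 floor
    return timeout

print(compute_timeout("restore", ["Ethernet%d" % i for i in range(92)], []))  # 184
```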

How did you verify/test it?

Log analysis of Elastictest plan 69c45eb6664a140429da9224 (7260CX3, T0, 202511):

  • After stop_storm(), ports restored gradually over ~190s (27 restored at t=9s, 38 still
    unreachable at t=57s, 1 remaining at t=190s)
  • Pattern matches sequential fanout processing: first port in cycle restores immediately,
    last port (Ethernet254:3) takes the full ~184s cycle to complete
  • New timeout for 92 ports: max(60, 92×2) = 184s covers all but edge cases at cycle boundary

Any platform specific information?

Primarily affects high port-count platforms (Arista-7260CX3, etc.) running T0 topology
with Arista EOS fanouts using pfc_gen_brcm_xgs.py.

Supported testbed topology if it is a new test case?

Existing test — all topologies that run test_pfcwd_all_port_storm.

Documentation

N/A

@mssonicbld (Collaborator)

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

lipxu force-pushed the fix/pfcwd-restore-timeout-scaling branch from 807cf1e to f9edf98 on March 25, 2026 09:55

lipxu force-pushed the fix/pfcwd-restore-timeout-scaling branch from f9edf98 to ae0e579 on March 25, 2026 11:02

@yxieca (Collaborator) previously approved these changes Mar 26, 2026 and left a comment:

AI agent on behalf of Ying. Reviewed; no issues found.


lipxu force-pushed the fix/pfcwd-restore-timeout-scaling branch from 68dd0d4 to c80e307 on March 26, 2026 05:03

lipxu force-pushed the fix/pfcwd-restore-timeout-scaling branch from c80e307 to b2a69bd on March 26, 2026 06:19

lolyu enabled auto-merge (squash) on March 27, 2026 01:08
@lolyu (Collaborator) left a comment:
LGTM

lolyu merged commit 3cd82dc into sonic-net:master on Mar 27, 2026
18 of 19 checks passed
ravaliyel pushed a commit to ravaliyel/sonic-mgmt that referenced this pull request Mar 27, 2026