Skip to content

[HLD] Using PFCWD to estimate PFC Historical stats#1903

Merged
vmittal-msft merged 1 commit intosonic-net:masterfrom
peterbailey-arista:master
Jun 18, 2025
Merged

[HLD] Using PFCWD to estimate PFC Historical stats#1903
vmittal-msft merged 1 commit intosonic-net:masterfrom
peterbailey-arista:master

Conversation

@peterbailey-arista
Copy link
Contributor

@peterbailey-arista peterbailey-arista commented Jan 24, 2025

Originally presented in SONiC Hackathon 2024 by Hetav Pandya @arista-hpandya and Kenneth Cheung @kenneth-arista

Using software to extrapolate PFC activity from existing PFC RX counters. The project has been adapted to work with the PFCWD.

The tracking issue:
#1904

The other individual PRs:

Yang Model Changes:
sonic-net/sonic-buildimage#21848 [merged]

SWSS Orchagent and lua script
sonic-net/sonic-swss#3533

Utilities pfcstat --history option
sonic-net/sonic-utilities#3778 [merged]

Utilities config/show
sonic-net/sonic-utilities#3779 [merged]

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

No pipelines are associated with this pull request.


A new orch agent named "PfcHistoryOrch" will configure a flex counter group ("PFC\_STAT\_HISTORY") with a lua script plugin to poll COUNTERS\_DB at the configured interval and investigate if Pause Frames have been received on the interface.

`TOTAL_PAUSE_TRANSITIONS`: Running total number of transitions from unpaused to paused
Copy link
Contributor

@lguohan lguohan Feb 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

take a look at this stats, looks like same as definition. https://github.com/sonic-net/sonic-swss/blob/master/orchagent/pfc_detect_broadcom.lua#L43

SAI_PORT_STAT_PFC_RX_PKTS
SAI_PORT_STAT_PFC_ON2OFF_RX_PKTS

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in pfc watchdog logic, looks like we are already pulling these counters at ~200ms, so you might want to just reuse those counters instead of pulling your own.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback Guohan! I looked into your suggestion but think merging PFC history into the PFC watchdog would be impractical. Even though they both poll the same counters, the features are distinct and I believe should be individually configurable. Using the same flex counter would make this difficult.

For example, it should be possible to have history enabled while the watchdog is disabled and vice versa. Similarly it should be possible to enable them on different ports. It might additionally be misleading that configuring the polling interval for the watchdog would also correspond to the history feature.
Using the same Flex Counter by adding the pfc_stat_history.lua plugin to the watchdog (or by adjusting the pfc_detect_{platform}.lua script to estimate the historical data) would make these configurations significantly more cumbersome.

Looking at SAI_PORT_STAT_PFC_ON2OFF_RX_PKTS, it seems to be exclusive to broadcom platform and is not used or provided on other platforms, so it would only be usable in that situation regardless.

Please let me know what you think!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in pfc watchdog logic, looks like we are already pulling these counters at ~200ms, so you might want to just reuse those counters instead of pulling your own.

Upon further review I altered the implementation to use the PFC watchdog as suggested. The HLD, tracking issue, and code PRs have all been updated to match this + community feedback.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With above changes, does it mean history feature will go hand in hand with PFCWD ? Can you please share some examples of possible scenarios? As i asked in another comment, is this feature enabled by default or need to be manually configured ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's right, this now goes hand in hand with the PFCWD. Originally it was intended to be fully separate but the community feedback was to integrate the history estimation into the PFCWD's polling mechanism.
As a result, the feature is now enabled through the pfcwd cli and the logic runs in the pfcwd detect lua script.

By default this feature is disabled. it is configured on a per port basis using the pfcwd config cli.

A few scenarios might look like the following:

PFCWD started on a port but history disabled:
config pfcwd start Ethernet0 600

    PORT    ACTION    DETECTION TIME    RESTORATION TIME    HISTORY
---------  --------  ----------------  ------------------  ---------
Ethernet0      drop               600                1200    disable

PFCWD started on 2 ports and history enabled on only one port:
config pfcwd start Ethernet0 600
config pfcwd pfc_stat_history enable Ethernet0

    PORT    ACTION    DETECTION TIME    RESTORATION TIME    HISTORY
---------  --------  ----------------  ------------------  ---------
Ethernet0      drop               600                1200     enable
Ethernet4      drop               600                1200    disable

PFCWD started on all ports with history enabled
config pfcwd start all 600 --pfc-stat-history

    PORT    ACTION    DETECTION TIME    RESTORATION TIME    HISTORY
---------  --------  ----------------  ------------------  ---------
Ethernet0      drop               600                1200     enable
Ethernet4      drop               600                1200     enable

PFCWD NOT started on any ports but history enabled on a port:
config pfcwd pfc_stat_history enable Ethernet0

     PORT    ACTION    DETECTION TIME    RESTORATION TIME    HISTORY
---------  --------  ----------------  ------------------  ---------
Ethernet0      drop               N/A            infinite     enable

To be clear in this final scenario, the PFCWD polling/detect/restore will not be started on the port until a config pfcwd start Ethernet0 … is used. That means in addition, no history estimation will occur until the start command is run. But once it is started history will be enabled already.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for explanation. I believe this is really useful feature. With PFC history available along with PFCWD data can help solve packet loss issues due to congestion in network. Wondering why don't we enable it by default? So that data is ready for us to analyze if an issue in the network rather than enabling this after the fact.

Copy link
Contributor Author

@peterbailey-arista peterbailey-arista Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for reviewing! I replied to your question about enabling by default in my comment below. It can be done but I did have a couple concerns about doing so.

@peterbailey-arista peterbailey-arista marked this pull request as draft February 27, 2025 18:09
@zhangyanzhao
Copy link
Collaborator

@peterbailey-arista this PR has been reviewed by community on 2/25/2025. but the status is still in Draft. Can you please check? Thanks.

@zhangyanzhao
Copy link
Collaborator

@peterbailey-arista
Copy link
Contributor Author

@peterbailey-arista this PR has been reviewed by community on 2/25/2025. but the status is still in Draft. Can you please check? Thanks.

@zhangyanzhao Thank you for the comment, in response to community feedback during that meeting I have been updating the design and implementation to instead use the PFC Watchdog. I expect to have the updated HLD and PRs up for review soon

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@peterbailey-arista peterbailey-arista marked this pull request as ready for review April 8, 2025 20:16
@peterbailey-arista peterbailey-arista changed the title [HLD] orchagent for PFC Historical stats [HLD] Using PFCWD to estimate PFC Historical stats Apr 9, 2025
@peterbailey-arista
Copy link
Contributor Author

peterbailey-arista commented May 9, 2025

@peterbailey-arista this PR has been reviewed by community on 2/25/2025. but the status is still in Draft. Can you please check? Thanks.

@zhangyanzhao I am looking for reviews on my PRS as they have now passed the automated pipeline steps. Can you help me in finding reviewers for these?

The tracking issue:
#1904

The other individual PRs:

Yang Model Changes:
sonic-net/sonic-buildimage#21848 [merged]

SWSS Orchagent and lua script
sonic-net/sonic-swss#3533

Utilities pfcstat --history option
sonic-net/sonic-utilities#3778 [merged]

Utilities config/show
sonic-net/sonic-utilities#3779 [merged]

@vmittal-msft vmittal-msft self-requested a review May 14, 2025 17:19
@rlhui
Copy link
Contributor

rlhui commented May 14, 2025

please add the associated code PRs for this feature in the PR title section thanks.

@peterbailey-arista
Copy link
Contributor Author

Hi @vmittal-msft I'm just looking to re-request review, I am pushing for this feature in 202511 instead of 202505 now. Thanks!


The current implementation is for **Broadcom platforms only** but the descriptions provided in this document are intended to provide the details necessary to port this feature to other platforms.

When pfc stat history is enabled, approximate PFC historical statistics will be extrapolated by polling the existing SAI supported PFC stats stored in COUNTERS\_DB. These existing statistics are incremented and adjusted when pause frames are received at the granularity of priorities 0-7 for each configured interface on the switch.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterbailey-arista What is default setting of this history configuration ? Is it enabled or disabled ? Also will history capture start by default or user need to turn on ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No the user needs to explicitly start the PFCWD on the port that they want history tracking, in addition to enabling history on that port. By default history is disabled even when the pfcwd monitoring has been started.

The order is irrelevant so there are 3 possible ways to do start historical estimations:

config pfcwd start <port> <detection time>
config pfcwd pfc_stat_history enable <port>

OR

config pfcwd pfc_stat_history enable <port>
config pfcwd start <port> <detection time>

OR

config pfcwd start <port> <detection time> --pfc-stat-history

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

`sonic-clear pfc` will now also cache the current total time and total number of transitions in the same way it did with the existing counters, meaning \--history will show the historical data since the last clear.

```
Port Priority RX Pause Transitions Total RX Pause Time US Recent RX Pause Time US Recent RX Pause Timestamp
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to maintain this information for non-lossless priorities? As we use only 3, 4 as lossless where pfc frames are expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I understand these queues are configurable they could (theoretically) be changed at some point?
In addition to that, the existing PFC SAI counters will increment across all priority queues as well and the PFCWD detection script also loops over all queues regardless of if we only consider 3 and 4 to be lossless.

For those reasons the feature is indiscriminate and tracks all 8 queues.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Do you think if we enable this by default, will it cause high system overhead ? If so, that overhead can be reduced if we monitor lossless priority only. Please comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disabled this by default because I had a hard time measuring the performance overhead of the script additions since they are running on Redis and I didn't want to lock other platforms into having to use the feature by default. My change where history is estimated is in the PFC Detect Broadcom Lua script

You can also see in that same file, the script iterates over every queue provided by the flex counter which takes it out of my hands exactly what queues the watchdog is monitoring.

If you feel this should be enabled by default I can certainly do so, I would see the changes to my code being primarily in the CLI and along the lines of

  1. Changing the config pfcwd start default to insert pfc_stat_history="enable" instead of "disable"
  2. Change the --pfc-stat-history flag option to instead be a named argument that defaults to "enable", so to start pfcwd without history you would need to do config pfcwd start all --pfc-stat-history=disable

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. Lets keep disabled for now. But if you can help get some data about performance overhead that will be helpful so we can make a decision to enable this in future without affecting other platforms.

The PFCWD will have a new option to estimate PFC statistics using SAI-provided counters.
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@vmittal-msft
Copy link
Contributor

@abdosi This would be useful for Cisco devices as well.

@vmittal-msft vmittal-msft requested review from abdosi June 18, 2025 01:13
@vmittal-msft
Copy link
Contributor

@peterbailey-arista it looks good to me. Please answer few more queries and we can merge this.

@vmittal-msft vmittal-msft merged commit 5f67e1c into sonic-net:master Jun 18, 2025
1 check passed
prsunny pushed a commit to sonic-net/sonic-swss that referenced this pull request Aug 21, 2025
pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on.

This implementation was made in accordance with comments from the community HLD review
https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm

HLD: sonic-net/SONiC#1903

What I did
Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking.

Why I did it
Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904
a114j0y pushed a commit to a114j0y/sonic-swss that referenced this pull request Aug 28, 2025
…#3533)

pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on.

This implementation was made in accordance with comments from the community HLD review
https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm

HLD: sonic-net/SONiC#1903

What I did
Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking.

Why I did it
Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904
a114j0y pushed a commit to a114j0y/sonic-swss that referenced this pull request Aug 29, 2025
…#3533)

pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on.

This implementation was made in accordance with comments from the community HLD review
https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm

HLD: sonic-net/SONiC#1903

What I did
Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking.

Why I did it
Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904
Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025
…#3533)

pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on.

This implementation was made in accordance with comments from the community HLD review
https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm

HLD: sonic-net/SONiC#1903

What I did
Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking.

Why I did it
Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904
balanokia pushed a commit to balanokia/sonic-swss that referenced this pull request Nov 17, 2025
…#3533)

pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on.

This implementation was made in accordance with comments from the community HLD review
https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm

HLD: sonic-net/SONiC#1903

What I did
Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking.

Why I did it
Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Feb 4, 2026
…#3533)

pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on.

This implementation was made in accordance with comments from the community HLD review
https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm

HLD: sonic-net/SONiC#1903

What I did
Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking.

Why I did it
Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904

Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
baorliu pushed a commit to baorliu/sonic-swss that referenced this pull request Feb 23, 2026
…#3533)

pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on.

This implementation was made in accordance with comments from the community HLD review
https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm

HLD: sonic-net/SONiC#1903

What I did
Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking.

Why I did it
Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904

Signed-off-by: Baorong Liu <96146196+baorliu@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

6 participants