[HLD] Using PFCWD to estimate PFC Historical stats#1903
[HLD] Using PFCWD to estimate PFC Historical stats#1903vmittal-msft merged 1 commit intosonic-net:masterfrom
Conversation
|
/azp run |
|
No pipelines are associated with this pull request. |
|
/azp run |
|
No pipelines are associated with this pull request. |
22db32b to
58f08f5
Compare
|
/azp run |
|
No pipelines are associated with this pull request. |
58f08f5 to
acca53d
Compare
|
/azp run |
|
No pipelines are associated with this pull request. |
acca53d to
8eadba7
Compare
|
/azp run |
|
No pipelines are associated with this pull request. |
|
|
||
| A new orch agent named "PfcHistoryOrch" will configure a flex counter group ("PFC\_STAT\_HISTORY") with a lua script plugin to poll COUNTERS\_DB at the configured interval and investigate if Pause Frames have been received on the interface. | ||
|
|
||
| `TOTAL_PAUSE_TRANSITIONS`: Running total number of transitions from unpaused to paused |
There was a problem hiding this comment.
take a look at this stats, looks like same as definition. https://github.com/sonic-net/sonic-swss/blob/master/orchagent/pfc_detect_broadcom.lua#L43
SAI_PORT_STAT_PFC_RX_PKTS
SAI_PORT_STAT_PFC_ON2OFF_RX_PKTS
There was a problem hiding this comment.
in pfc watchdog logic, looks like we are already pulling these counters at ~200ms, so you might want to just reuse those counters instead of pulling your own.
There was a problem hiding this comment.
Thank you for the feedback Guohan! I looked into your suggestion but think merging PFC history into the PFC watchdog would be impractical. Even though they both poll the same counters, the features are distinct and I believe should be individually configurable. Using the same flex counter would make this difficult.
For example, it should be possible to have history enabled while the watchdog is disabled and vice versa. Similarly it should be possible to enable them on different ports. It might additionally be misleading that configuring the polling interval for the watchdog would also correspond to the history feature.
Using the same Flex Counter by adding the pfc_stat_history.lua plugin to the watchdog (or by adjusting the pfc_detect_{platform}.lua script to estimate the historical data) would make these configurations significantly more cumbersome.
Looking at SAI_PORT_STAT_PFC_ON2OFF_RX_PKTS, it seems to be exclusive to broadcom platform and is not used or provided on other platforms, so it would only be usable in that situation regardless.
Please let me know what you think!
There was a problem hiding this comment.
in pfc watchdog logic, looks like we are already pulling these counters at ~200ms, so you might want to just reuse those counters instead of pulling your own.
Upon further review I altered the implementation to use the PFC watchdog as suggested. The HLD, tracking issue, and code PRs have all been updated to match this + community feedback.
There was a problem hiding this comment.
With above changes, does it mean history feature will go hand in hand with PFCWD ? Can you please share some examples of possible scenarios? As i asked in another comment, is this feature enabled by default or need to be manually configured ?
There was a problem hiding this comment.
Yes that's right, this now goes hand in hand with the PFCWD. Originally it was intended to be fully separate but the community feedback was to integrate the history estimation into the PFCWD's polling mechanism.
As a result, the feature is now enabled through the pfcwd cli and the logic runs in the pfcwd detect lua script.
By default this feature is disabled. it is configured on a per port basis using the pfcwd config cli.
A few scenarios might look like the following:
PFCWD started on a port but history disabled:
config pfcwd start Ethernet0 600
PORT ACTION DETECTION TIME RESTORATION TIME HISTORY
--------- -------- ---------------- ------------------ ---------
Ethernet0 drop 600 1200 disable
PFCWD started on 2 ports and history enabled on only one port:
config pfcwd start Ethernet0 600
config pfcwd pfc_stat_history enable Ethernet0
PORT ACTION DETECTION TIME RESTORATION TIME HISTORY
--------- -------- ---------------- ------------------ ---------
Ethernet0 drop 600 1200 enable
Ethernet4 drop 600 1200 disable
PFCWD started on all ports with history enabled
config pfcwd start all 600 --pfc-stat-history
PORT ACTION DETECTION TIME RESTORATION TIME HISTORY
--------- -------- ---------------- ------------------ ---------
Ethernet0 drop 600 1200 enable
Ethernet4 drop 600 1200 enable
PFCWD NOT started on any ports but history enabled on a port:
config pfcwd pfc_stat_history enable Ethernet0
PORT ACTION DETECTION TIME RESTORATION TIME HISTORY
--------- -------- ---------------- ------------------ ---------
Ethernet0 drop N/A infinite enable
To be clear in this final scenario, the PFCWD polling/detect/restore will not be started on the port until a config pfcwd start Ethernet0 … is used. That means in addition, no history estimation will occur until the start command is run. But once it is started history will be enabled already.
There was a problem hiding this comment.
Thanks for explanation. I believe this is really useful feature. With PFC history available along with PFCWD data can help solve packet loss issues due to congestion in network. Wondering why don't we enable it by default? So that data is ready for us to analyze if an issue in the network rather than enabling this after the fact.
There was a problem hiding this comment.
Thank you for reviewing! I replied to your question about enabling by default in my comment below. It can be done but I did have a couple concerns about doing so.
|
@peterbailey-arista this PR has been reviewed by community on 2/25/2025. but the status is still in Draft. Can you please check? Thanks. |
|
Community review recording https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm |
@zhangyanzhao Thank you for the comment, in response to community feedback during that meeting I have been updating the design and implementation to instead use the PFC Watchdog. I expect to have the updated HLD and PRs up for review soon |
8eadba7 to
817837a
Compare
|
/azp run |
|
No pipelines are associated with this pull request. |
@zhangyanzhao I am looking for reviews on my PRS as they have now passed the automated pipeline steps. Can you help me in finding reviewers for these? The tracking issue: The other individual PRs: Yang Model Changes: SWSS Orchagent and lua script Utilities pfcstat --history option Utilities config/show |
|
please add the associated code PRs for this feature in the PR title section thanks. |
|
Hi @vmittal-msft I'm just looking to re-request review, I am pushing for this feature in 202511 instead of 202505 now. Thanks! |
|
|
||
| The current implementation is for **Broadcom platforms only** but the descriptions provided in this document are intended to provide the details necessary to port this feature to other platforms. | ||
|
|
||
| When pfc stat history is enabled, approximate PFC historical statistics will be extrapolated by polling the existing SAI supported PFC stats stored in COUNTERS\_DB. These existing statistics are incremented and adjusted when pause frames are received at the granularity of priorities 0-7 for each configured interface on the switch. |
There was a problem hiding this comment.
@peterbailey-arista What is default setting of this history configuration ? Is it enabled or disabled ? Also will history capture start by default or user need to turn on ?
There was a problem hiding this comment.
No the user needs to explicitly start the PFCWD on the port that they want history tracking, in addition to enabling history on that port. By default history is disabled even when the pfcwd monitoring has been started.
The order is irrelevant so there are 3 possible ways to do start historical estimations:
config pfcwd start <port> <detection time>
config pfcwd pfc_stat_history enable <port>
OR
config pfcwd pfc_stat_history enable <port>
config pfcwd start <port> <detection time>
OR
config pfcwd start <port> <detection time> --pfc-stat-history
| `sonic-clear pfc` will now also cache the current total time and total number of transitions in the same way it did with the existing counters, meaning \--history will show the historical data since the last clear. | ||
|
|
||
| ``` | ||
| Port Priority RX Pause Transitions Total RX Pause Time US Recent RX Pause Time US Recent RX Pause Timestamp |
There was a problem hiding this comment.
Do we really need to maintain this information for non-lossless priorities? As we use only 3, 4 as lossless where pfc frames are expected.
There was a problem hiding this comment.
As far as I understand these queues are configurable they could (theoretically) be changed at some point?
In addition to that, the existing PFC SAI counters will increment across all priority queues as well and the PFCWD detection script also loops over all queues regardless of if we only consider 3 and 4 to be lossless.
For those reasons the feature is indiscriminate and tracks all 8 queues.
There was a problem hiding this comment.
Ok. Do you think if we enable this by default, will it cause high system overhead ? If so, that overhead can be reduced if we monitor lossless priority only. Please comment.
There was a problem hiding this comment.
I disabled this by default because I had a hard time measuring the performance overhead of the script additions since they are running on Redis and I didn't want to lock other platforms into having to use the feature by default. My change where history is estimated is in the PFC Detect Broadcom Lua script
You can also see in that same file, the script iterates over every queue provided by the flex counter which takes it out of my hands exactly what queues the watchdog is monitoring.
If you feel this should be enabled by default I can certainly do so, I would see the changes to my code being primarily in the CLI and along the lines of
- Changing the
config pfcwd start defaultto insert pfc_stat_history="enable" instead of "disable" - Change the --pfc-stat-history flag option to instead be a named argument that defaults to "enable", so to start pfcwd without history you would need to do
config pfcwd start all --pfc-stat-history=disable
There was a problem hiding this comment.
Make sense. Lets keep disabled for now. But if you can help get some data about performance overhead that will be helpful so we can make a decision to enable this in future without affecting other platforms.
The PFCWD will have a new option to estimate PFC statistics using SAI-provided counters.
817837a to
b99f026
Compare
|
/azp run |
|
No pipelines are associated with this pull request. |
|
@abdosi This would be useful for Cisco devices as well. |
|
@peterbailey-arista it looks good to me. Please answer few more queries and we can merge this. |
pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on. This implementation was made in accordance with comments from the community HLD review https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm HLD: sonic-net/SONiC#1903 What I did Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking. Why I did it Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904
…#3533) pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on. This implementation was made in accordance with comments from the community HLD review https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm HLD: sonic-net/SONiC#1903 What I did Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking. Why I did it Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904
…#3533) pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on. This implementation was made in accordance with comments from the community HLD review https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm HLD: sonic-net/SONiC#1903 What I did Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking. Why I did it Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904
…#3533) pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on. This implementation was made in accordance with comments from the community HLD review https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm HLD: sonic-net/SONiC#1903 What I did Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking. Why I did it Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904
…#3533) pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on. This implementation was made in accordance with comments from the community HLD review https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm HLD: sonic-net/SONiC#1903 What I did Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking. Why I did it Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904
…#3533) pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on. This implementation was made in accordance with comments from the community HLD review https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm HLD: sonic-net/SONiC#1903 What I did Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking. Why I did it Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904 Signed-off-by: Lawrence Lee <lawlee@microsoft.com>
…#3533) pfcwdorch has been updated to allow pfc historical statistic estimation. The orch allows the feature to be enabled or disabled and the pfc_detect_broadcom.lua script will perform the estimation and update counters_db for ports that tracking is enabled on. This implementation was made in accordance with comments from the community HLD review https://zoom.us/rec/share/jWkZRs51QULsDQvrSlm-5qC4OZ3gkG6RVdB2k_vqwgLcnezsaG1XtX5Aqk6xBYCZ.YMhHymrP4jbpkLbm HLD: sonic-net/SONiC#1903 What I did Edited the pfcwdorch and the pfc_detect_broadcom.lua to allow enable/disable-able PFC statistical history tracking. Why I did it Part of the PFC Historical Statistics Feature: sonic-net/SONiC#1904 Signed-off-by: Baorong Liu <96146196+baorliu@users.noreply.github.com>
Originally presented in SONiC Hackathon 2024 by Hetav Pandya @arista-hpandya and Kenneth Cheung @kenneth-arista
Using software to extrapolate PFC activity from existing PFC RX counters. The project has been adapted to work with the PFCWD.
The tracking issue:
#1904
The other individual PRs:
Yang Model Changes:
sonic-net/sonic-buildimage#21848[merged]SWSS Orchagent and lua script
sonic-net/sonic-swss#3533
Utilities pfcstat --history option
sonic-net/sonic-utilities#3778[merged]Utilities config/show
sonic-net/sonic-utilities#3779[merged]