[action] [PR:3391] Optimize counter polling interval by making it more accurate #3500

Merged
mssonicbld merged 1 commit into sonic-net:202411 from mssonicbld:cherry/202411/3391
Feb 7, 2025

@mssonicbld
Collaborator

What I did

Optimize the counter-polling performance in terms of polling interval accuracy

  1. Enable bulk counter-polling to run at a smaller chunk size.
    There is one counter-polling thread for each counter group. All of these threads can compete for the critical sections at the vendor SAI level, so a counter-polling thread may have to wait for a critical section while another thread holds it, which introduces latency for the waiting counter group.
    An example is the competition between the PFC watchdog and the port counter groups.
    The port counter group contains many counters and is polled in bulk mode, which takes a relatively long time. The PFC watchdog counter group contains only a few counters and is polled quickly. Sometimes the PFC watchdog counters must wait before they can be polled, which makes the polling interval inaccurate and prevents a PFC storm from being detected in time.
    To resolve this issue, we can reduce the chunk size of the port counter group. By default, the port counter group polls the counters of all ports in a single bulk operation. With a smaller chunk size, it polls the counters in several bulk operations, each of which polls the counters of a subset of the ports (subset size = chunk size). Furthermore, we support setting the chunk size on a per-counter-ID basis.
    By doing so, the port counter group stays in the critical section for a shorter time, and the PFC watchdog is more likely to be scheduled to poll its counters and detect a PFC storm in time.

  2. Collect the timestamp immediately after the vendor SAI API returns.
    Currently, many counter groups run a Lua plugin that depends on the polling interval to calculate rates, detect certain events, etc.
    For example, the PFC watchdog counter group uses one to detect a PFC storm. In this case, the polling interval is calculated as the difference between the timestamps of the current and the last poll, to avoid deviation due to scheduling latency. However, the timestamp is collected in the Lua plugin, which runs several steps after the SAI API returns and is executed in a different context (redis-server). Both introduce even larger deviations. To overcome this, we collect the timestamp immediately after the SAI API returns.
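
To illustrate the two points above, here is a minimal Python sketch, not the actual sonic-swss or sonic-sairedis code. It assumes a hypothetical `bulk_get` callable standing in for the vendor SAI bulk counter call and models the vendor-level critical section with a plain `threading.Lock`; the names `split_into_chunks` and `poll_in_chunks` are made up for this example. The lock is released between chunks so that another polling thread (such as the PFC watchdog) gets a chance to run, and the timestamp is taken right after each stand-in call returns. Taking one timestamp per chunk is a choice of this sketch, not something the PR states.

```python
import threading
import time
from typing import Callable, Dict, List, Sequence, Tuple


def split_into_chunks(object_ids: Sequence[str], chunk_size: int) -> List[List[str]]:
    """Split object IDs into consecutive chunks.

    A chunk_size <= 0 (assumed here to mean "not configured") or a chunk_size
    covering all objects yields a single bulk operation.
    """
    ids = list(object_ids)
    if not ids:
        return []
    if chunk_size <= 0 or chunk_size >= len(ids):
        return [ids]
    return [ids[i:i + chunk_size] for i in range(0, len(ids), chunk_size)]


def poll_in_chunks(
    object_ids: Sequence[str],
    chunk_size: int,
    bulk_get: Callable[[List[str]], List[int]],  # hypothetical stand-in for the SAI call
    sai_lock: threading.Lock,                    # models the vendor-level critical section
) -> Dict[str, Tuple[int, int]]:
    """Poll counters chunk by chunk; return {object_id: (value, timestamp_ns)}."""
    results: Dict[str, Tuple[int, int]] = {}
    for chunk in split_into_chunks(object_ids, chunk_size):
        with sai_lock:
            values = bulk_get(chunk)
            # Collect the timestamp immediately after the call returns,
            # before any post-processing happens.
            stamp_ns = time.monotonic_ns()
        # The lock is released here, so a waiting thread can enter
        # the critical section before the next chunk is polled.
        for oid, value in zip(chunk, values):
            results[oid] = (value, stamp_ns)
    return results
```

With five ports and a chunk size of 2, this sketch issues three bulk calls (2, 2, and 1 ports) instead of one, so each stay in the critical section is shorter.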

Depends on

  1. Add field for bulk chunk size in flex counter sonic-swss-common#950
  2. Define bulk chunk size and bulk chunk size per counter ID sonic-sairedis#1519

Why I did it

How I verified it

Ran the regression tests and observed counter-polling performance.

A comparison test shows very good results when any or all of the above optimizations are applied.

Details if related

For 2, each counter group contains more than one counter context, based on the type of objects. A counter context is mapped from (group, object type). However, the counters fetched by different counter groups are pushed into the same entry for the same object.
E.g., the PFC_WD group contains counters of ports and queues, the PORT group contains counters of ports, and the QUEUE_STAT group contains counters of queues.
Both the PFC_WD and PORT groups push counter data into the item representing a port, but each counter group has its own polling interval, which means counter IDs polled by different counter groups can be polled with different timestamps.
We use the name of a counter group to identify the timestamp of that counter group.
E.g., in a port counter entry, PORT_timestamp represents the last time the port counter group polled the counters, and PFC_WD_timestamp represents the last time the PFC watchdog counter group polled the counters.
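
The per-group timestamp fields can be sketched as follows. This is only an illustration in Python, with a plain dict standing in for the shared port counter entry. The `<group>_timestamp` field naming comes from the description above; the helper names, the millisecond unit, and the sample values are assumptions of this sketch.

```python
from typing import Dict, Optional


def timestamp_field(group: str) -> str:
    """Field name that holds the last poll time of a counter group."""
    return f"{group}_timestamp"


def record_poll(
    entry: Dict[str, int],
    group: str,
    counters: Dict[str, int],
    timestamp_ms: int,  # unit is an assumption of this sketch
) -> Optional[int]:
    """Store counters and the group's timestamp in the shared entry.

    Returns the interval since this group's previous poll, or None on the
    first poll. Other groups' timestamps in the same entry are left untouched.
    """
    field = timestamp_field(group)
    previous = entry.get(field)
    entry.update(counters)
    entry[field] = timestamp_ms
    return None if previous is None else timestamp_ms - previous


# Usage: two groups write into the same port entry at different times.
port_entry: Dict[str, int] = {}
record_poll(port_entry, "PORT", {"SAI_PORT_STAT_IF_IN_OCTETS": 100}, 1000)
record_poll(port_entry, "PFC_WD", {"SAI_PORT_STAT_PFC_3_RX_PKTS": 1}, 1050)
pfc_wd_interval = record_poll(port_entry, "PFC_WD", {"SAI_PORT_STAT_PFC_3_RX_PKTS": 4}, 1250)
```

Because each group reads and writes only its own timestamp field, the PFC_WD interval in this sketch is computed from PFC_WD polls alone, regardless of when the PORT group last wrote to the same entry.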

@mssonicbld
Collaborator Author

Original PR: #3391

@mssonicbld
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld mssonicbld merged commit 337c9a1 into sonic-net:202411 Feb 7, 2025
4 of 7 checks passed
dgsudharsan pushed a commit that referenced this pull request Feb 25, 2025
```
* c939888e - (HEAD -> 202412) Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412 (2025-02-08) [Sonic Automation]
* e967711 - (origin/202411) Remove RIF from m_rifsToAdd before deleting it (#3499) (2025-02-07) [mssonicbld]
* 337c9a1 - Optimize counter polling interval by making it more accurate (#3500) (2025-02-07) [mssonicbld]
```
dgsudharsan pushed a commit that referenced this pull request Feb 25, 2025
```
* 8230dd56 - (HEAD -> 202412) Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412 (2025-02-10) [Sonic Automation]
* e967711 - (origin/202411) Remove RIF from m_rifsToAdd before deleting it (#3499) (2025-02-07) [mssonicbld]
* 337c9a1 - Optimize counter polling interval by making it more accurate (#3500) (2025-02-07) [mssonicbld]
```
dgsudharsan pushed a commit that referenced this pull request Feb 25, 2025
```
* 25634cc1 - (HEAD -> 202412) Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412 (2025-02-11) [Sonic Automation]
* fe98176 - (origin/202411) Add a delay between killing teamd processes (#3510) (2025-02-11) [mssonicbld]
* e967711 - Remove RIF from m_rifsToAdd before deleting it (#3499) (2025-02-07) [mssonicbld]
* 337c9a1 - Optimize counter polling interval by making it more accurate (#3500) (2025-02-07) [mssonicbld]
```
dgsudharsan pushed a commit that referenced this pull request Feb 25, 2025
```
* c93c0eec - (HEAD -> 202412) Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412 (2025-02-12) [Sonic Automation]
* 5031aad - (origin/202411) Capability query for MACSEC ACL attribute (#3511) (2025-02-12) [mssonicbld]
* 4b357e5 - Fix VRF update handling for loopback interfaces in IntfsOrch (#3512) (2025-02-12) [mssonicbld]
* fe98176 - Add a delay between killing teamd processes (#3510) (2025-02-11) [mssonicbld]
* e967711 - Remove RIF from m_rifsToAdd before deleting it (#3499) (2025-02-07) [mssonicbld]
* 337c9a1 - Optimize counter polling interval by making it more accurate (#3500) (2025-02-07) [mssonicbld]
```
dgsudharsan pushed a commit that referenced this pull request Feb 25, 2025
```
* 7532d469 - (HEAD -> 202412) Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412 (2025-02-13) [Sonic Automation]
* 5031aad - (origin/202411) Capability query for MACSEC ACL attribute (#3511) (2025-02-12) [mssonicbld]
* 4b357e5 - Fix VRF update handling for loopback interfaces in IntfsOrch (#3512) (2025-02-12) [mssonicbld]
* fe98176 - Add a delay between killing teamd processes (#3510) (2025-02-11) [mssonicbld]
* e967711 - Remove RIF from m_rifsToAdd before deleting it (#3499) (2025-02-07) [mssonicbld]
* 337c9a1 - Optimize counter polling interval by making it more accurate (#3500) (2025-02-07) [mssonicbld]
```
dgsudharsan pushed a commit that referenced this pull request Feb 25, 2025
```
* 44417f65 - (HEAD -> 202412) Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412 (2025-02-14) [Sonic Automation]
* 5031aad - (origin/202411) Capability query for MACSEC ACL attribute (#3511) (2025-02-12) [mssonicbld]
* 4b357e5 - Fix VRF update handling for loopback interfaces in IntfsOrch (#3512) (2025-02-12) [mssonicbld]
* fe98176 - Add a delay between killing teamd processes (#3510) (2025-02-11) [mssonicbld]
* e967711 - Remove RIF from m_rifsToAdd before deleting it (#3499) (2025-02-07) [mssonicbld]
* 337c9a1 - Optimize counter polling interval by making it more accurate (#3500) (2025-02-07) [mssonicbld]
```
dgsudharsan pushed a commit that referenced this pull request Feb 25, 2025
```
* aaf061fc - (HEAD -> 202412) Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412 (2025-02-15) [Sonic Automation]
* 5031aad - (origin/202411) Capability query for MACSEC ACL attribute (#3511) (2025-02-12) [mssonicbld]
* 4b357e5 - Fix VRF update handling for loopback interfaces in IntfsOrch (#3512) (2025-02-12) [mssonicbld]
* fe98176 - Add a delay between killing teamd processes (#3510) (2025-02-11) [mssonicbld]
* e967711 - Remove RIF from m_rifsToAdd before deleting it (#3499) (2025-02-07) [mssonicbld]
* 337c9a1 - Optimize counter polling interval by making it more accurate (#3500) (2025-02-07) [mssonicbld]
```
dgsudharsan pushed a commit that referenced this pull request Feb 25, 2025
```
* c97d84dd - (HEAD -> 202412) Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412 (2025-02-16) [Sonic Automation]
* 5031aad - (origin/202411) Capability query for MACSEC ACL attribute (#3511) (2025-02-12) [mssonicbld]
* 4b357e5 - Fix VRF update handling for loopback interfaces in IntfsOrch (#3512) (2025-02-12) [mssonicbld]
* fe98176 - Add a delay between killing teamd processes (#3510) (2025-02-11) [mssonicbld]
* e967711 - Remove RIF from m_rifsToAdd before deleting it (#3499) (2025-02-07) [mssonicbld]
* 337c9a1 - Optimize counter polling interval by making it more accurate (#3500) (2025-02-07) [mssonicbld]
```
dgsudharsan pushed a commit that referenced this pull request Feb 25, 2025
```
* f69aaaf1 - (HEAD -> 202412) Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412 (2025-02-17) [Sonic Automation]
* 5031aad - (origin/202411) Capability query for MACSEC ACL attribute (#3511) (2025-02-12) [mssonicbld]
* 4b357e5 - Fix VRF update handling for loopback interfaces in IntfsOrch (#3512) (2025-02-12) [mssonicbld]
* fe98176 - Add a delay between killing teamd processes (#3510) (2025-02-11) [mssonicbld]
* e967711 - Remove RIF from m_rifsToAdd before deleting it (#3499) (2025-02-07) [mssonicbld]
* 337c9a1 - Optimize counter polling interval by making it more accurate (#3500) (2025-02-07) [mssonicbld]
```
dgsudharsan pushed a commit that referenced this pull request Feb 25, 2025
```
* 22d8d147 - (HEAD -> 202412) Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412 (2025-02-18) [Sonic Automation]
* 5031aad - (origin/202411) Capability query for MACSEC ACL attribute (#3511) (2025-02-12) [mssonicbld]
* 4b357e5 - Fix VRF update handling for loopback interfaces in IntfsOrch (#3512) (2025-02-12) [mssonicbld]
* fe98176 - Add a delay between killing teamd processes (#3510) (2025-02-11) [mssonicbld]
* e967711 - Remove RIF from m_rifsToAdd before deleting it (#3499) (2025-02-07) [mssonicbld]
* 337c9a1 - Optimize counter polling interval by making it more accurate (#3500) (2025-02-07) [mssonicbld]
```
kperumalbfn added a commit that referenced this pull request Apr 2, 2025
kperumalbfn added a commit that referenced this pull request Apr 2, 2025
DavidZagury pushed a commit to DavidZagury/sonic-swss that referenced this pull request Apr 28, 2025
```
* bbce4b4 - (HEAD -> 202412) Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412 (2025-04-14) [Sonic Automation]
* fceb196 - (origin/202411) Revert "Optimize counter polling interval by making it more accurate (2025-04-03) [Kumaresh Perumal]
* d88b694 - (origin/kperumal/check) Remove unused string variables in flexcounterorch.cpp (2025-04-02) [Kumaresh Perumal]
|\
| * 1e4315e - (origin/kperumalbfn-patch-1) Remove unused string variables in flexcounterorch.cpp (2025-04-02) [Kumaresh Perumal]
|/
* e82f8c2 - Revert "Optimize counter polling interval by making it more accurate (sonic-net#3500)" (2025-04-02) [Kumaresh Perumal]
```
bradh352 pushed a commit to bradh352/sonic-swss that referenced this pull request May 3, 2025
liuh-80 pushed a commit to liuh-80/sonic-swss that referenced this pull request Jun 10, 2025
…into 202503 sonic-net#73

Merge branch '202412' of https://github.com/sonic-net/sonic-swss.msft into 202503

Merge from 202412.

9c96d25 Initialize the last fec ber computed values if not found (sonic-net#71)
b567ab5 Merge pull request sonic-net#70 from mssonicbld/sonicbld/202412-merge
19ed87b Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412
4383d39 Fix the missed port status notifications issue (sonic-net#3616)
4f6d557 Merge pull request sonic-net#69 from r12f/user/r12f/fix-build
fcbe392 Revert "Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412"
9e9442d [orchagent] Fix issue: typo in high BER (sonic-net#68)
7498ae9 Merge pull request sonic-net#67 from mssonicbld/sonicbld/202412-merge
bbce4b4 Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412
9e97862 Merge pull request sonic-net#66 from mssonicbld/sonicbld/202412-merge
fceb196  Revert "Optimize counter polling interval by making it more accurate
5206c2b Merge branch '202411' of https://github.com/sonic-net/sonic-swss into 202412
d88b694 Remove unused string variables in flexcounterorch.cpp
1e4315e Remove unused string variables in flexcounterorch.cpp
e82f8c2 Revert "Optimize counter polling interval by making it more accurate (sonic-net#3500)"
ca017d0 [vstest]: Fix MACsec test in the kernel 5.15 (sonic-net#3573)