Conversation
5aabbcd to
69981f9
Compare
| ", holding off work", | ||
| __func__, prov->dp_name, curr); | ||
| counter = 0; | ||
| } else if (out_curr >= (uint64_t)limit) { |
There was a problem hiding this comment.
Can it be a situation where the input and output queue is hitting the limit?
There was a problem hiding this comment.
Yeah but it doesn't matter from my perspective. Both cases we do the same thing. It's sufficient to just state the first one encountered in a debug and move on.
| } | ||
| } | ||
|
|
||
|
|
There was a problem hiding this comment.
err. I intentionally added a newline to space things out a bit. So I am a bit confused
zebra/zebra_dplane.c
Outdated
| * Let's limit the work to how what can be put on the | ||
| * in or out queue without going over | ||
| */ | ||
| limit -= (curr > out_curr) ? curr : out_curr; |
There was a problem hiding this comment.
nit: limit -= max(curr, out_curr);?
zebra/zebra_dplane.c
Outdated
| * in or out queue without going over | ||
| */ | ||
| limit -= (curr > out_curr) ? curr : out_curr; | ||
| if ((uint32_t)limit != zdplane_info.dg_updates_per_cycle) |
There was a problem hiding this comment.
dg_updates_per_cycle is uint32, do we need a cast here?
There was a problem hiding this comment.
limit is not a uint32_t and as such it will complain about the size comparisons
| if (curr >= (uint64_t)limit) { | ||
| reschedule = true; | ||
| if (IS_ZEBRA_DEBUG_DPLANE_DETAIL) | ||
| zlog_debug("%s: Next Provider(%s) Input queue is %" PRIu64 |
There was a problem hiding this comment.
IMO, printing next_prov->dp_name would be useful here, otherwise hard to distinguish?
3261d90 to
490cfa2
Compare
…rns (#19717) Added the below patches which are part of BGP Zebra back pressure feature required to keep the memory usage in check during route churns How I did it New patches that were added: Patch FRR Pull request 0030-zebra-backpressure-Zebra-push-back-on-Buffer-Stream-.patch FRRouting/frr#15411 0031-bgpd-backpressure-Add-a-typesafe-list-for-Zebra-Anno.patch FRRouting/frr#15524 0032-bgpd-fix-flushing-ipv6-flowspec-entries-when-peering.patch FRRouting/frr#15326 0033-bgpd-backpressure-cleanup-bgp_zebra_XX-func-args.patch FRRouting/frr#15524 0034-gpd-backpressure-Handle-BGP-Zebra-Install-evt-Creat.patch FRRouting/frr#15524 0035-bgpd-backpressure-Handle-BGP-Zebra-EPVN-Install-evt-.patch FRRouting/frr#15624 0036-zebra-backpressure-Fix-Null-ptr-access-Coverity-Issu.patch FRRouting/frr#15728 0037-bgpd-Increase-install-uninstall-speed-of-evpn-vpn-vn.patch FRRouting/frr#15727 0038-zebra-Actually-display-I-O-buffer-sizes.patch FRRouting/frr#15708 0039-zebra-Actually-display-I-O-buffer-sizes-part-2.patch FRRouting/frr#15769 0040-bgpd-backpressure-Fix-to-withdraw-evpn-type-5-routes.patch FRRouting/frr#16034 0041-bgpd-backpressure-Fix-to-avoid-CPU-hog.patch FRRouting/frr#16035 0042-zebra-Use-built-in-data-structure-counter.patch FRRouting/frr#16221 0043-zebra-Use-the-ctx-queue-counters.patch FRRouting/frr#16220 0044-zebra-Modify-dplane-loop-to-allow-backpressure-to-fi.patch FRRouting/frr#16220 0045-zebra-Limit-queue-depth-in-dplane_fpm_nl.patch FRRouting/frr#16220 0046-zebra-Modify-show-zebra-dplane-providers-to-give-mor.patch FRRouting/frr#16220 0047-bgpd-backpressure-fix-evpn-route-sync-to-zebra.patch FRRouting/frr#16234 0048-bgpd-backpressure-fix-to-properly-remove-dest-for-bg.patch FRRouting/frr#16368 0049-bgpd-backpressure-Improve-debuggability.patch FRRouting/frr#16368 0050-bgpd-backpressure-Avoid-use-after-free.patch FRRouting/frr#16437 0051-bgpd-backpressure-fix-ret-value-evpn_route_select_in.patch FRRouting/frr#16416 0052-bgpd-backpressure-log-error-for-evpn-when-route-inst.patch FRRouting/frr#16416
490cfa2 to
7bb5cb7
Compare
7bb5cb7 to
5eda763
Compare
The ctx queue data structures already have a counter associated with them. Let's just use them instead. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
Currently when the dplane_thread_loop is run, it moves contexts from the dg_update_list and puts the contexts on the input queue of the first provider. This provider is given a chance to run and then the items on the output queue are pulled off and placed on the input queue of the next provider. Rinse/Repeat down through the entire list of providers. Now imagine that we have a list of multiple providers and the last provider is getting backed up. Contexts will end up sticking in the input Queue of the `slow` provider. This can grow without bounds. This is a real problem when you have a situation where an interface is flapping and an upper level protocol is sending a continous stream of route updates to reflect the change in ecmp. You can end up with a very very large backlog of contexts. This is bad because zebra can easily grow to a very very large memory size and on restricted systems you can run out of memory. Fortunately for us, the MetaQ already participates with this process by not doing more route processing until the dg_update_list goes below the working limit of dg_updates_per_cycle. Thus if FRR modifies the behavior of this loop to not move more contexts onto the input queue if either the input queue or output queue of the next provider has reached this limit. FRR will naturaly start auto handling backpressure for the dplane context system and memory will not go out of control. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The dplane providers have a concept of input queues and output queues. These queues are chained together during normal operation. The code in zebra also has a feedback mechanism where the MetaQ will not run when the first input queue is backed up. Having the dplane_fpm_nl code grab all contexts when it is backed up prevents this system from behaving appropriately. Modify the code to not add to the dplane_fpm_nl's internal queue when it is already full. This will allow the backpressure to work appropriately in zebra proper. Signed-off-by: Donald Sharp <sharpd@nvidia.com>
The show zebra dplane provider command was ommitting the input and output queues to the dplane itself. It would be nice to have this insight as well. New output: r1# show zebra dplane providers dataplane Incoming Queue from Zebra: 100 Zebra dataplane providers: Kernel (1): in: 6, q: 0, q_max: 3, out: 6, q: 14, q_max: 3 dplane_fpm_nl (2): in: 6, q: 10, q_max: 3, out: 6, q: 0, q_max: 3 dataplane Outgoing Queue to Zebra: 43 r1# Signed-off-by: Donald Sharp <sharpd@nvidia.com>
5eda763 to
98b11de
Compare
mjstapp
left a comment
There was a problem hiding this comment.
Looks good to me now, thanks
…rns (sonic-net#19717) Added the below patches which are part of BGP Zebra back pressure feature required to keep the memory usage in check during route churns How I did it New patches that were added: Patch FRR Pull request 0030-zebra-backpressure-Zebra-push-back-on-Buffer-Stream-.patch FRRouting/frr#15411 0031-bgpd-backpressure-Add-a-typesafe-list-for-Zebra-Anno.patch FRRouting/frr#15524 0032-bgpd-fix-flushing-ipv6-flowspec-entries-when-peering.patch FRRouting/frr#15326 0033-bgpd-backpressure-cleanup-bgp_zebra_XX-func-args.patch FRRouting/frr#15524 0034-gpd-backpressure-Handle-BGP-Zebra-Install-evt-Creat.patch FRRouting/frr#15524 0035-bgpd-backpressure-Handle-BGP-Zebra-EPVN-Install-evt-.patch FRRouting/frr#15624 0036-zebra-backpressure-Fix-Null-ptr-access-Coverity-Issu.patch FRRouting/frr#15728 0037-bgpd-Increase-install-uninstall-speed-of-evpn-vpn-vn.patch FRRouting/frr#15727 0038-zebra-Actually-display-I-O-buffer-sizes.patch FRRouting/frr#15708 0039-zebra-Actually-display-I-O-buffer-sizes-part-2.patch FRRouting/frr#15769 0040-bgpd-backpressure-Fix-to-withdraw-evpn-type-5-routes.patch FRRouting/frr#16034 0041-bgpd-backpressure-Fix-to-avoid-CPU-hog.patch FRRouting/frr#16035 0042-zebra-Use-built-in-data-structure-counter.patch FRRouting/frr#16221 0043-zebra-Use-the-ctx-queue-counters.patch FRRouting/frr#16220 0044-zebra-Modify-dplane-loop-to-allow-backpressure-to-fi.patch FRRouting/frr#16220 0045-zebra-Limit-queue-depth-in-dplane_fpm_nl.patch FRRouting/frr#16220 0046-zebra-Modify-show-zebra-dplane-providers-to-give-mor.patch FRRouting/frr#16220 0047-bgpd-backpressure-fix-evpn-route-sync-to-zebra.patch FRRouting/frr#16234 0048-bgpd-backpressure-fix-to-properly-remove-dest-for-bg.patch FRRouting/frr#16368 0049-bgpd-backpressure-Improve-debuggability.patch FRRouting/frr#16368 0050-bgpd-backpressure-Avoid-use-after-free.patch FRRouting/frr#16437 0051-bgpd-backpressure-fix-ret-value-evpn_route_select_in.patch FRRouting/frr#16416 0052-bgpd-backpressure-log-error-for-evpn-when-route-inst.patch FRRouting/frr#16416
…rns (#19717) Added the below patches which are part of BGP Zebra back pressure feature required to keep the memory usage in check during route churns How I did it New patches that were added: Patch FRR Pull request 0030-zebra-backpressure-Zebra-push-back-on-Buffer-Stream-.patch FRRouting/frr#15411 0031-bgpd-backpressure-Add-a-typesafe-list-for-Zebra-Anno.patch FRRouting/frr#15524 0032-bgpd-fix-flushing-ipv6-flowspec-entries-when-peering.patch FRRouting/frr#15326 0033-bgpd-backpressure-cleanup-bgp_zebra_XX-func-args.patch FRRouting/frr#15524 0034-gpd-backpressure-Handle-BGP-Zebra-Install-evt-Creat.patch FRRouting/frr#15524 0035-bgpd-backpressure-Handle-BGP-Zebra-EPVN-Install-evt-.patch FRRouting/frr#15624 0036-zebra-backpressure-Fix-Null-ptr-access-Coverity-Issu.patch FRRouting/frr#15728 0037-bgpd-Increase-install-uninstall-speed-of-evpn-vpn-vn.patch FRRouting/frr#15727 0038-zebra-Actually-display-I-O-buffer-sizes.patch FRRouting/frr#15708 0039-zebra-Actually-display-I-O-buffer-sizes-part-2.patch FRRouting/frr#15769 0040-bgpd-backpressure-Fix-to-withdraw-evpn-type-5-routes.patch FRRouting/frr#16034 0041-bgpd-backpressure-Fix-to-avoid-CPU-hog.patch FRRouting/frr#16035 0042-zebra-Use-built-in-data-structure-counter.patch FRRouting/frr#16221 0043-zebra-Use-the-ctx-queue-counters.patch FRRouting/frr#16220 0044-zebra-Modify-dplane-loop-to-allow-backpressure-to-fi.patch FRRouting/frr#16220 0045-zebra-Limit-queue-depth-in-dplane_fpm_nl.patch FRRouting/frr#16220 0046-zebra-Modify-show-zebra-dplane-providers-to-give-mor.patch FRRouting/frr#16220 0047-bgpd-backpressure-fix-evpn-route-sync-to-zebra.patch FRRouting/frr#16234 0048-bgpd-backpressure-fix-to-properly-remove-dest-for-bg.patch FRRouting/frr#16368 0049-bgpd-backpressure-Improve-debuggability.patch FRRouting/frr#16368 0050-bgpd-backpressure-Avoid-use-after-free.patch FRRouting/frr#16437 0051-bgpd-backpressure-fix-ret-value-evpn_route_select_in.patch FRRouting/frr#16416 0052-bgpd-backpressure-log-error-for-evpn-when-route-inst.patch FRRouting/frr#16416
Apply the following patch to dplane_fpm_sonic: * Zebra fpm backpressure (FRRouting/frr#16220) Signed-off-by: cscarpitta <cscarpit@cisco.com>
…#21146) Why I did it Reduce high CPU usage on zebra after performing port toggle on all interfaces simultaneously How I did it Apply zebra fpm backpressure patches from FRR mainline to dplane_fpm_sonic: zebra: Use built in data structure counter (zebra: Use built in data structure counter FRRouting/frr#16221) Zebra fpm backpressure (Zebra fpm backpressure FRRouting/frr#16220) Signed-off-by: cscarpitta <cscarpit@cisco.com>
<!--
Please make sure you've read and understood our contributing guidelines:
https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md
** Make sure all your commits include a signature generated with `git commit -s` **
If this is a bug fix, make sure your description includes "fixes #xxxx", or
"closes #xxxx" or "resolves #xxxx"
Please provide the following information:
-->
#### Why I did it
Reduce high CPU usage on zebra after performing port toggle on all interfaces simultaneously
#### How I did it
Apply zebra fpm backpressure patches from FRR mainline to dplane_fpm_sonic:
* zebra: Use built in data structure counter (FRRouting/frr#16221)
* Zebra fpm backpressure (FRRouting/frr#16220)
<!--
#### How to verify it
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->
<!--
#### Which release branch to backport (provide reason below if selected)
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305
-->
<!--
#### Tested branch (Please provide the tested image version)
- Please provide tested image version
- e.g.
- [x] 20201231.100
- [ ]
- [ ]
-->
<!--
#### Description for the changelog
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->
<!--
Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->
<!--
#### Link to config_db schema for YANG module changes
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->
<!--
#### A picture of a cute animal (not mandatory but encouraged)
-->
<!-- Please make sure you've read and understood our contributing guidelines: https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md failure_prs.log skip_prs.log Make sure all your commits include a signature generated with `git commit -s` ** If this is a bug fix, make sure your description includes "fixes #xxxx", or "closes #xxxx" or "resolves #xxxx" Please provide the following information: --> #### Why I did it Reduce high CPU usage on zebra after performing port toggle on all interfaces simultaneously #### How I did it Apply zebra fpm backpressure patches from FRR mainline to dplane_fpm_sonic: * zebra: Use built in data structure counter (FRRouting/frr#16221) * Zebra fpm backpressure (FRRouting/frr#16220) <!-- #### How to verify it If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012. --> <!-- #### Which release branch to backport (provide reason below if selected) - Note we only backport fixes to a release branch, *not* features! - Please also provide a reason for the backporting below. - e.g. - [x] 202006 - [ ] 201811 - [ ] 201911 - [ ] 202006 - [ ] 202012 - [ ] 202106 - [ ] 202111 - [ ] 202205 - [ ] 202211 - [ ] 202305 --> <!-- #### Tested branch (Please provide the tested image version) - Please provide tested image version - e.g. - [x] 20201231.100 - [ ] - [ ] --> <!-- #### Description for the changelog Write a short (one line) summary that describes the changes in this pull request for inclusion in the changelog: --> <!-- Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU. --> <!-- #### Link to config_db schema for YANG module changes Provide a link to config_db schema for the table for which YANG model is defined Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md --> <!-- #### A picture of a cute animal (not mandatory but encouraged) -->
…sonic-net#21146) Why I did it Reduce high CPU usage on zebra after performing port toggle on all interfaces simultaneously How I did it Apply zebra fpm backpressure patches from FRR mainline to dplane_fpm_sonic: zebra: Use built in data structure counter (zebra: Use built in data structure counter FRRouting/frr#16221) Zebra fpm backpressure (Zebra fpm backpressure FRRouting/frr#16220) Signed-off-by: cscarpitta <cscarpit@cisco.com>
See individual commits