Skip to content

bgpd: check rmac and nh of evpn imported routes#18372

Closed
dawkopagh wants to merge 1 commit intoFRRouting:masterfrom
dawkopagh:check_rmac_nh_for_evpn_imported_route
Closed

bgpd: check rmac and nh of evpn imported routes#18372
dawkopagh wants to merge 1 commit intoFRRouting:masterfrom
dawkopagh:check_rmac_nh_for_evpn_imported_route

Conversation

@dawkopagh
Copy link
Copy Markdown

Fix for an issue which has been described here: #18240
Basically in a setup where multiple peers advertise identical evpn type-5 rotues, if we clear session with any of them (and at least one is still up), we observe route withdrawals from kernel (even though we receive those routes from remaining peers). If the amount of routes is significant it causes a noticeable downtime before routes are readded to kernel.

I'd appreciate a feedback on why a route withdrawal is necessary in such case (per the comment above I can assume that rmac entry/nh might be lingering somewhere, but where? Change which introduced withdrawal has been merged long time ago, is it still necessary?)

Also menitoned behavior (where routes are being withdrawn immediately even if other peers advertise them) has started to occur after backpressure bgp zebra client #15524.

@donaldsharp
Copy link
Copy Markdown
Member

could you please add the verbiage you added to the PR to the actual commit message? I do not want to loose this data

Check rmac and nh of new bestpath for evpn imported prefix before withdrawal. Currently when
new bestpath is designated, evpn imported routes are being withdrawn from kernel causing
downtime which length depends on amount of routes to process.

Basically in a setup where multiple peers advertise identical evpn type-5 routes,
if we clear session with any of them (and at least one is still up), we observe
route withdrawals from kernel (even though we receive those routes from remaining peers).
If the amount of routes is significant it causes a noticeable downtime before routes are
readded to kernel.

Also menitoned behavior (where routes are being withdrawn immediately
even if other peers advertise them) has started to occur after
backpressure bgp zebra client FRRouting#15524

Let's check rmac entry and nh of new selected bestpath and do not actually withdraw them from kernel if
those two are the same. This fixes and issue where the same routes are being advertised by multiple
peers and we clear session with one of them FRRouting#18240

Signed-off-by: Dawid Kopec <[email protected]>
@dawkopagh dawkopagh force-pushed the check_rmac_nh_for_evpn_imported_route branch from 8833051 to 4865a82 Compare March 12, 2025 18:04
@ton31337
Copy link
Copy Markdown
Member

@Mergifyio backport stable/10.3 stable/10.2 stable/10.1

@mergify
Copy link
Copy Markdown

mergify bot commented Mar 12, 2025

backport stable/10.3 stable/10.2 stable/10.1

🟠 Waiting for conditions to match

Details
  • merged [📌 backport requirement]

@ton31337
Copy link
Copy Markdown
Member

@raja-rajasekar opinion?

@raja-rajasekar
Copy link
Copy Markdown
Contributor

raja-rajasekar commented Mar 12, 2025

The withdraw actual shouldn't be at this point here(An action item I had in my plate for some time but never got a chance to do it), and https://github.com/FRRouting/frr/pull/18158/files addressed that issue (Not yet merged).

This is because we can always end up in a situation where we immediately withdraw the route while the install is pending for later (backpressure), thereby blackhole-ing the traffic.

So I am not sure why we need to patch this code, and rather maybe directly check with the fix in https://github.com/FRRouting/frr/pull/18158/files (Of course this does some other things as well which I haven't looked at it yet, however the commit which removes this block is what I am talking about)

@dawkopagh
Copy link
Copy Markdown
Author

Indeed #18158 is a better fix and would be ideal to have. I have tested my case with frr build from that PR and there is no downtime (which is expected as the snippet with route withdrawal has been deleted).

However there were concerns with breaking a standard pattern in zebra and there were no updates under the PR so I thought that fix might not eventually make it to master and thought of less invasive fix addressing the case where we have an update with old and new bestpaths with the same next-hop resulting in a downtime due to kernel add/del operations.

@github-actions
Copy link
Copy Markdown

This pull request has conflicts, please resolve those before we can evaluate the pull request.

@dawkopagh
Copy link
Copy Markdown
Author

Closing this as #18158 has been merged and resolves the issue addressed by this PR

@dawkopagh dawkopagh closed this Apr 28, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants