[FRR] send EOR during GR only when fib install complete#24174
stepanblyschak wants to merge 5 commits into sonic-net:master
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
hi @stepanblyschak @dgsudharsan I'm trying to assess the impact of this issue. We typically apply FIB suppression on T1 devices when they're in GR (Graceful Restart) and covered. After that, we send the RIB to its peer, but send EOR before transmitting all the route information. This can result in some routes being deleted and re-added, which might cause an unnecessary partial route flap on the peer device, but it shouldn't cause a black hole, correct? Also, there's a test case (test_bgp_gr_helper) that seems to have failed to catch this issue. Should we consider this a test gap? Lastly, there appears to be a conflict with the PR. Could you help address it? |
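To make the question above concrete, here is a minimal sketch of what a GR helper does with stale routes when the End-of-RIB marker arrives, following the general behaviour described in RFC 4724. It is a toy model, not FRR code: the `HelperRib` class, the `replay` function, and the two prefixes are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class HelperRib:
    """Toy model of a GR helper's routes learned from one restarting peer."""
    routes: dict = field(default_factory=dict)  # prefix -> next hop
    stale: set = field(default_factory=set)     # prefixes retained across the restart

    def peer_restarted(self):
        # Session dropped: keep forwarding, but mark every route as stale.
        self.stale = set(self.routes)

    def on_update(self, prefix, nexthop):
        # A re-advertised route is refreshed and is no longer stale.
        self.routes[prefix] = nexthop
        self.stale.discard(prefix)

    def on_eor(self):
        # End-of-RIB: anything still stale is treated as withdrawn and flushed.
        flushed = sorted(self.stale)
        for prefix in flushed:
            del self.routes[prefix]
        self.stale.clear()
        return flushed


def replay(events):
    """Replay (kind, prefix) events; return (flushed prefixes, final prefixes)."""
    rib = HelperRib(routes={"10.0.0.0/24": "peer", "10.0.1.0/24": "peer"})
    rib.peer_restarted()
    flushed = []
    for kind, prefix in events:
        if kind == "update":
            rib.on_update(prefix, "peer")
        else:
            flushed += rib.on_eor()
    return flushed, sorted(rib.routes)


# EOR arrives before the second prefix is re-advertised.
early = replay([("update", "10.0.0.0/24"), ("eor", None), ("update", "10.0.1.0/24")])
# EOR arrives after both prefixes are re-advertised.
late = replay([("update", "10.0.0.0/24"), ("update", "10.0.1.0/24"), ("eor", None)])
```

In the `early` run, `10.0.1.0/24` is flushed at EOR and only comes back with the later update, so the helper has no route for it in between. In the `late` run nothing is flushed. Whether that window shows up as a flap or as packet loss depends on whether the helper has another path for the prefix, which this model does not cover.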
|
@StormLiangMS @volodymyrsamotiy to help resolve the conflict. |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@StormLiangMS The issue is only with T0 devices (T1 is not affected). Note that, due to a delay in PortChannel IP configuration, the upstream BGP sessions are established with a significant delay. Yes, it's a test gap in sonic-mgmt (even with sonic-net/sonic-swss#3782, all the sonic-mgmt tests we run pass). It was found by another test. |
Hi @stepanblyschak, do we observe this issue in the T1 use case? I don't believe we've ever enabled FIB suppression on T0. We could merge this fix into the master branch, but I'd prefer to keep it out of 202505 unless it has an impact on a real scenario. Could you also open a GitHub issue to track this test gap? |
|
@StormLiangMS We don't observe issues with the T1 use case. OK, not taking it to 202505. |
|
Should this FRR change be committed to the FRR 10.3 and 10.4 release branches first? 202505 currently uses FRR 10.3 as its base and this patch makes non-trivial changes in bgp, so I would be seriously concerned about future maintainability if this commit were in SONiC's 10.3 but not in FRR's 10.3. For 202511, we are rebasing to 10.4. Again, we can't take this change until the commit is in FRR stable/10.4. That would help us keep release patching maintainable in the long term. |
eddieruan-alibaba
left a comment
Need to make sure it is committed to FRR stable/10.3, stable/10.4 first, before adding to SONiC patch list.
I believe this should be taken to the FRR community and not to SONiC. Once it is there, the SONiC FRR maintainers will bring it into SONiC, right? @StormLiangMS, do we have approval to make it part of master/202511? As the branch is out soon, let's try to close it this week, please. |
|
I'm OK with taking this in 202511, but I would like a test-gap issue filed to address this. @stepanblyschak, could you help file one? We can ask the community to help. For this patch, what Eddie suggested is to have it committed to FRR stable/10.4. @stepanblyschak, could you confirm whether we would still need this patch after upgrading to FRR stable/10.4? If so, could we ask the FRR community to cherry-pick this patch first, before adding it to SONiC? @dgsudharsan @eddieruan-alibaba @liat-grozovik FYI. |
|
@StormLiangMS Test gap issue - sonic-net/sonic-mgmt#21249 |
|
Thanks @stepanblyschak. Do you think we can get this into FRR stable/10.4 as a fix, so that we pick it up through the upgrade to stable/10.4 that is in progress for the 202511 release? |
Signed-off-by: Stepan Blyschak <[email protected]>
01a88d1 to 74778c7
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@StormLiangMS I am not aware of FRR plans. This is a backport of the change done in master. |
There are two main reasons for adopting this approach:
1. Expertise in PR review: The FRR community is better positioned to review this PR, given their deeper familiarity with the codebase. Since the changes in this particular PR are non-trivial, it is better to get the FRR team's approval to add confidence in their quality and correctness.
2. Simplified future cherry-picks: Submitting the change upstream to FRR would streamline the process of backporting it to SONiC later. If conflicts arise during cherry-picking from an FRR release branch, resolving them will be easier when the change already exists in the upstream repository.
Therefore, we put the following statement in https://github.com/sonic-net/SONiC/blob/master/tsc/frr/sonic_frr_update_process.md: "Note: Currently, SONiC FRR maintainers are NOT responsible for cherry-picking patches across different SONiC releases. For example, applying critical patches from the FRR 10.4 branch to the SONiC 202505 branch. Such patches must first be merged upstream into FRR." |
…r-fix Signed-off-by: Stepan Blyschak <[email protected]>
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azpw run |
|
/AzurePipelines run |
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Seems unrelated error: |
|
@dgsudharsan to take over this PR |
Why I did it
When suppress-fib-pending is enabled, the EOR was sent before all routes had been advertised. This leads to packet loss, because the peer performs route reconciliation against an incomplete RIB.
BACKPORT of upstream FRR fix - FRRouting/frr#19522
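The ordering problem described above can be sketched with a small model of the restarting speaker. This is an illustration only and not the FRR implementation: the `RestartingSpeaker` class, the `defer_eor` flag, and the event names are hypothetical. The model assumes that, with FIB suppression, a route is advertised only once its FIB install is confirmed.

```python
class RestartingSpeaker:
    """Toy model: UPDATEs are sent only after FIB install is confirmed."""

    def __init__(self, prefixes, defer_eor):
        self.pending = set(prefixes)   # routes still awaiting FIB install
        self.defer_eor = defer_eor     # True models the behaviour after the fix
        self.selection_done = False
        self.log = []                  # ordered messages sent to the peer

    def _send_eor_if_allowed(self):
        if "EOR" in self.log or not self.selection_done:
            return
        if self.defer_eor and self.pending:
            return                     # hold EOR until every route is installed
        self.log.append("EOR")

    def on_selection_done(self):
        # Path selection after the restart has finished.
        self.selection_done = True
        self._send_eor_if_allowed()

    def on_fib_installed(self, prefix):
        # Install confirmed: the route may now be advertised.
        self.pending.discard(prefix)
        self.log.append(f"UPDATE {prefix}")
        self._send_eor_if_allowed()


def run(defer_eor):
    speaker = RestartingSpeaker(["10.0.0.0/24", "10.0.1.0/24"], defer_eor)
    speaker.on_selection_done()
    speaker.on_fib_installed("10.0.0.0/24")
    speaker.on_fib_installed("10.0.1.0/24")
    return speaker.log


before_fix = run(defer_eor=False)
after_fix = run(defer_eor=True)
```

With `defer_eor=False` the EOR goes out ahead of both updates, which is the symptom reported here. With `defer_eor=True` it is the last message sent.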
Work item tracking
How I did it
Backport FRR fix.
How to verify it
Set up a topology like this. Set up a regular router interface between DUT and AUX, and set up a BGP session between them as well as between the switches and IXIA. Run bidirectional traffic towards the prefix advertised by IXIA1. Observe no packet loss during warm-reboot.
Relevant logs:
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)