[frr]: update next hop group support by metadata value with disabled as the default value#23500
Conversation
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Hi @lipxu Can you please clarify why this is needed in master branch? Is there any issue observed? |
|
Should we discuss this PR in this week's WG meeting to understand the motivation for this change ? |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azpw run Azure.sonic-buildimage |
|
/AzurePipelines run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Yes, @dgsudharsan , as we synced in the meeting, there is an issue in production. |
| echo "fpm address 127.0.0.1" >> $FILE_NAME | ||
| } | ||
|
|
||
| grep -q '^no zebra nexthop kernel enable' $FILE_NAME || { |
There was a problem hiding this comment.
This logic will create problem when zebra_nexthop is enabled
There was a problem hiding this comment.
Thanks for reminder @dgsudharsan , this script should only be called during initialization. and the enable->disable case should not occur in production. Please correct me if anything wrong, thanks
There was a problem hiding this comment.
But what if we have the config in /etc/sonic/config_db.json
DEVICE_METADATA['localhost']['zebra_nexthop'] == 'enabled'
This logic will override will result in having both 'no zebra nexthop kernel enable' and 'zebra nexthop kernel enable' in the file. Can you please confirm?
There was a problem hiding this comment.
thanks @dgsudharsan , the initial config is generated based on the metadata, so I don't think there should be a chance for both commands to appear in the config. It should behave the same way as the nexthop_group , thanks
There was a problem hiding this comment.
@dgsudharsan Could you please help to review the PR again when free, thank you very much
There was a problem hiding this comment.
Just to understand, this script will run at init irrespective of the value in device metadata after the zebra.json is rendered.
This means the file before this step will have "zebra nexthop kernel enable"
Now this check will return false for the grep and the 2nd part would execute appending "no zebra nexthop kernel enable". Can you please test this flow?
There was a problem hiding this comment.
hi @lipxu kind of agree with Sud, we may consider to use only zebra.conf.j2 to generate the configuration we need other than having it in the docker_init.sh also?
There was a problem hiding this comment.
Thanks @dgsudharsan and @StormLiangMS , You are absolutely right. I checked the history, and this code appears to be legacy from the cherry-pick “Force disable next hop group support.” The tests all passed because we didn’t have metadata now. I’ve removed this code. Thanks for your reminder.
By the way, "no fpm use-next-hop-groups" likely has the same issue and should also be removed after syncing with the owner. Thanks.
grep -q '^no fpm use-next-hop-groups' $FILE_NAME || {
echo "no fpm use-next-hop-groups" >> $FILE_NAME
echo "fpm address 127.0.0.1" >> $FILE_NAME
}
| default "false"; | ||
| } | ||
|
|
||
| leaf zebra_nexthop { |
There was a problem hiding this comment.
Why can't we reuse nexthop_group that is already available in device metadata?
There was a problem hiding this comment.
Thanks, I am not sure what the nexthop_group flag is for, does it indicate the whole feature status? Can we also use it for the zebra disable nexthop case? thanks a lot
There was a problem hiding this comment.
@eddieruan-alibaba can clarify this but i believe it is to indicate the whole feature sonic-net/SONiC#1425
There was a problem hiding this comment.
@dgsudharsan @eddieruan-alibaba , could you please help to confirm why we could reuse the nexthop_group , thanks a lot
There was a problem hiding this comment.
@ntt-omw @eddieruan-alibaba can you please clarify the above question
There was a problem hiding this comment.
@dgsudharsan The nexthop_group feature requires FRR’s fpm use-next-hop-groups to be enabled. For backward compatibility, this feature is disabled by default. The enable/disable control was implemented upon request from the Routing WG to allow more flexible control of the feature status.
How to enable / disable using NHG. The conclusion is to use zebra command line arguments to enable/disable this feature. No run time change is allow. For SONiC deployment, the configuration would be driven from config db to provide proper launch commands for zebra and fpmsyncd.
https://lists.sonicfoundation.dev/g/sonic-wg-routing/wiki/34834
There was a problem hiding this comment.
Thanks @dgsudharsan @nakano-omw ,
So could I understand we still need to keep the separate metadata parameter to control the FRR command, correct? thanks
fpm use-next-hop-groups
no zebra nexthop kernel enable
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull Request Overview
This PR disables next hop group support in FRR (Free Range Routing) to address netscan loss issues observed on production devices. The change introduces a new configuration option zebra_nexthop that defaults to disabled, forcing the use of legacy routing behavior.
- Added
zebra_nexthopconfiguration field to device metadata withdisabledas the default value - Updated zebra configuration template to conditionally disable next hop group kernel support
- Added test coverage for the new configuration option
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/sonic-yang-models/yang-models/sonic-device_metadata.yang | Added zebra_nexthop enum field with enabled/disabled values, defaulting to disabled |
| src/sonic-yang-models/tests/yang_model_tests/tests_config/device_metadata.json | Added test case configuration for valid zebra_nexthop setting |
| src/sonic-yang-models/tests/yang_model_tests/tests/device_metadata.json | Added test case descriptor for zebra_nexthop validation |
| src/sonic-yang-models/tests/files/sample_config_db.json | Updated sample config to include zebra_nexthop: disabled |
| src/sonic-config-engine/tests/sample_output/py3/*.conf | Updated test output files to include the nexthop kernel disable directive |
| src/sonic-config-engine/tests/sample_output/py2/*.conf | Updated test output files to include the nexthop kernel disable directive |
| src/sonic-bgpcfgd/tests/data/sonic-cfggen/zebra/zebra.conf | Updated test data with nexthop kernel disable directive |
| platform/vs/docker-sonic-vs/frr/zebra.conf | Added nexthop kernel disable directive to virtual switch configuration |
| dockers/docker-fpm-frr/frr/zebra/zebra.conf.j2 | Added Jinja template logic to conditionally enable/disable zebra nexthop kernel support |
| dockers/docker-fpm-frr/docker_init.sh | Added fallback configuration to ensure nexthop kernel disable is present |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| } | ||
|
|
||
| leaf zebra_nexthop { | ||
| description "Enable or disable next hop group support. This value only takes effect during boot time"; |
There was a problem hiding this comment.
The description mentions 'next hop group support' but the field is named zebra_nexthop which controls 'nexthop kernel enable'. Consider clarifying that this controls zebra's kernel nexthop support specifically, not FPM's nexthop groups (which is controlled by the separate nexthop_group field).
| description "Enable or disable next hop group support. This value only takes effect during boot time"; | |
| description "Enable or disable zebra's kernel nexthop support. This value only takes effect during boot time. Note: This does not control FPM's nexthop group support, which is managed by the separate 'nexthop_group' field."; |
| {% if (('localhost' in DEVICE_METADATA) and ('zebra_nexthop' in DEVICE_METADATA['localhost']) and | ||
| (DEVICE_METADATA['localhost']['zebra_nexthop'] == 'enabled')) %} |
There was a problem hiding this comment.
[nitpick] The condition checks are duplicated between the zebra_nexthop block (lines 10-11) and the existing fpm block (lines 20-21). Consider extracting the metadata access pattern into a Jinja variable for better maintainability and consistency.
|
@dgsudharsan , I've updated the comments and merged the latest master。 Could you please help review it again, thanks a lot |
Hi, @dgsudharsan , Just want to know if you have a chance to review the PR again, thanks a lot. |
| echo "fpm address 127.0.0.1" >> $FILE_NAME | ||
| } | ||
|
|
||
| grep -q '^no zebra nexthop kernel enable' $FILE_NAME || { |
There was a problem hiding this comment.
Just to understand, this script will run at init irrespective of the value in device metadata after the zebra.json is rendered.
This means the file before this step will have "zebra nexthop kernel enable"
Now this check will return false for the grep and the 2nd part would execute appending "no zebra nexthop kernel enable". Can you please test this flow?
4398426 to
995d268
Compare
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
hi @eddieruan-alibaba would you help to take a look? We'd like to have one option to unblock current issue on high density BGP sessions platform and Chassis, to buy us some time to root cause this one. |
|
when no zebra nexthop kernel enable is configured, would we still need to allow enable nexthop group for fpm? {% block fpm %} |
@eddieruan-alibaba From what nakano-omw, to have nexthop_group would allow finer granularity on the control and required from routing WG. We'd like to keep it as separate for now, since the "no zebra kernel nexthop enable" should be removed eventually after we finalized the solution to fix the rejected routes issue. |
Sounds good. So we treat "no zebra kernel nexthop enable" as a workaround until we finalize the solution. |
…as the default value (sonic-net#23500) * [202411][frr]: Force disable next hop group support (sonic-net#23292) Signed-off-by: Liping Xu <xuliping@microsoft.com> --------- Signed-off-by: Liping Xu <xuliping@microsoft.com>
|
Create a PR for 202511 manually due to conflict #24997 |
|
Fixing tag as manual pick is merged. |
|
Cherry-pick PR to 202511: #25088 |
…as the default value (sonic-net#23500) * [202411][frr]: Force disable next hop group support (sonic-net#23292) [202411][frr]: Force disable next hop group support Signed-off-by: Liping Xu <xuliping@microsoft.com> * update Signed-off-by: Liping Xu <xuliping@microsoft.com> * update Signed-off-by: Liping Xu <xuliping@microsoft.com> * update Signed-off-by: Liping Xu <xuliping@microsoft.com> * update metadata Signed-off-by: Liping Xu <xuliping@microsoft.com> * remove default set in docker init Signed-off-by: Liping Xu <xuliping@microsoft.com> --------- Signed-off-by: Liping Xu <xuliping@microsoft.com>
…as the default value (sonic-net#23500) * [202411][frr]: Force disable next hop group support (sonic-net#23292) [202411][frr]: Force disable next hop group support Signed-off-by: Liping Xu <xuliping@microsoft.com> * update Signed-off-by: Liping Xu <xuliping@microsoft.com> * update Signed-off-by: Liping Xu <xuliping@microsoft.com> * update Signed-off-by: Liping Xu <xuliping@microsoft.com> * update metadata Signed-off-by: Liping Xu <xuliping@microsoft.com> * remove default set in docker init Signed-off-by: Liping Xu <xuliping@microsoft.com> --------- Signed-off-by: Liping Xu <xuliping@microsoft.com> Signed-off-by: Feng Pan <fenpan@microsoft.com>
…as the default value (#23500) * [202411][frr]: Force disable next hop group support (#23292) [202411][frr]: Force disable next hop group support Signed-off-by: Liping Xu <xuliping@microsoft.com> * update Signed-off-by: Liping Xu <xuliping@microsoft.com> * update Signed-off-by: Liping Xu <xuliping@microsoft.com> * update Signed-off-by: Liping Xu <xuliping@microsoft.com> * update metadata Signed-off-by: Liping Xu <xuliping@microsoft.com> * remove default set in docker init Signed-off-by: Liping Xu <xuliping@microsoft.com> --------- Signed-off-by: Liping Xu <xuliping@microsoft.com> Signed-off-by: dprital <drorp@nvidia.com>
update next hop group support by metadata value with disabled as the default value
Why I did it
Netscan loss observed on production devices
Work item tracking
33406776
How I did it
update next hop group support by metadata value with disabled as the default value
How to verify it
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)