[frr]: update next hop group support by metadata value with disabled as the default value #23500

Merged
eddieruan-alibaba merged 6 commits into sonic-net:master from lipxu:20250728_publicMaster_zebraNextHopGroup
Dec 23, 2025

@lipxu
Contributor

@lipxu lipxu commented Jul 28, 2025

Update next hop group support so that it is controlled by a metadata value, with disabled as the default value.

Why I did it

Netscan loss was observed on production devices.

Work item tracking
  • Microsoft ADO (number only):
    33406776

How I did it

Update next hop group support so that it is controlled by a metadata value, with disabled as the default value.

How to verify it

Which release branch to backport (provide reason below if selected)

  • 202205
  • 202211
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@lipxu lipxu requested a review from dgsudharsan July 28, 2025 08:52
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@dgsudharsan
Collaborator

Hi @lipxu, can you please clarify why this is needed in the master branch? Was any issue observed?
Nexthop group support in SONiC is currently being worked on. @hasan-brcm @eddieruan-alibaba FYI

@eddieruan-alibaba
Collaborator

Should we discuss this PR in this week's WG meeting to understand the motivation for this change?

@lipxu @hasan-brcm @dgsudharsan

@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@lipxu
Contributor Author

lipxu commented Aug 6, 2025

/azpw run Azure.sonic-buildimage

@mssonicbld
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@lipxu lipxu requested a review from qiluo-msft as a code owner August 7, 2025 03:24
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@lipxu
Contributor Author

lipxu commented Aug 14, 2025

> Hi @lipxu Can you please clarify why this is needed in master branch? Is there any issue observed? nexthop group support in SONiC is being currently worked on. @hasan-brcm @eddieruan-alibaba FYI

Yes, @dgsudharsan, as we synced in the meeting, there is an issue in production.
As agreed in the meeting, we plan to use a metadata config and initialize nexthop group support based on that configuration.
Please help review the PR. Thanks a lot.

echo "fpm address 127.0.0.1" >> $FILE_NAME
}

grep -q '^no zebra nexthop kernel enable' $FILE_NAME || {
Collaborator

This logic will create a problem when zebra_nexthop is enabled.

Contributor Author

Thanks for the reminder, @dgsudharsan. This script should only be called during initialization, and the enable->disable case should not occur in production. Please correct me if anything is wrong. Thanks.

Collaborator

But what if we have this config in /etc/sonic/config_db.json?
DEVICE_METADATA['localhost']['zebra_nexthop'] == 'enabled'

This logic will then result in having both 'no zebra nexthop kernel enable' and 'zebra nexthop kernel enable' in the file. Can you please confirm?

Contributor Author

Thanks @dgsudharsan. The initial config is generated based on the metadata, so I don't think there is a chance for both commands to appear in the config. It should behave the same way as nexthop_group. Thanks.

Contributor Author

@dgsudharsan Could you please help to review the PR again when free, thank you very much

Collaborator

Just to understand, this script will run at init irrespective of the value in device metadata after the zebra.json is rendered.
This means the file before this step will have "zebra nexthop kernel enable"
Now this check will return false for the grep and the 2nd part would execute appending "no zebra nexthop kernel enable". Can you please test this flow?
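
The flow described above can be reproduced on a scratch file. This is a minimal sketch, not the actual docker_init.sh: the temporary file stands in for the rendered zebra config, and the single pre-existing line is an assumption about what the template emits when zebra_nexthop is enabled.

```shell
# Scratch file standing in for the rendered zebra config (assumption).
FILE_NAME=$(mktemp)
trap 'rm -f "$FILE_NAME"' EXIT

# Assumed template output when zebra_nexthop == 'enabled'.
echo "zebra nexthop kernel enable" > "$FILE_NAME"

# The guard under review: the pattern is anchored on "no ", so it does not
# match the positive directive above and the fallback branch runs.
grep -q '^no zebra nexthop kernel enable' "$FILE_NAME" || {
    echo "no zebra nexthop kernel enable" >> "$FILE_NAME"
}

# Count lines carrying either form of the directive, then show the file.
DIRECTIVE_COUNT=$(grep -c 'zebra nexthop kernel enable' "$FILE_NAME")
cat "$FILE_NAME"
```

Because the pattern is anchored with `^no `, the positive directive never satisfies the guard, so the scratch file ends up containing both forms.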

Contributor

Hi @lipxu, I tend to agree with Sud. Could we consider using only zebra.conf.j2 to generate the configuration we need, rather than also having it in docker_init.sh?

Contributor Author

Thanks @dgsudharsan and @StormLiangMS, you are absolutely right. I checked the history, and this code appears to be a leftover from the cherry-pick “Force disable next hop group support.” The tests all passed because no metadata is set at the moment. I’ve removed this code. Thanks for the reminder.

By the way, "no fpm use-next-hop-groups" likely has the same issue and should also be removed after syncing with the owner. Thanks.

    grep -q '^no fpm use-next-hop-groups' $FILE_NAME || {
        echo "no fpm use-next-hop-groups" >> $FILE_NAME
        echo "fpm address 127.0.0.1" >> $FILE_NAME
    }
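
If a fallback like this were kept rather than removed, one possible guard would match either form of the directive before appending. The sketch below is only an illustration under that assumption; `append_if_unset` is a hypothetical helper, not code from this PR, which removed the zebra fallback instead.

```shell
# Hypothetical helper: append the negated fallback only when the file has
# neither the positive nor the negated form of the directive.
append_if_unset() {
    file=$1
    directive=$2   # positive form, e.g. "fpm use-next-hop-groups"
    if ! grep -Eq "^(no )?${directive}\$" "$file"; then
        echo "no ${directive}" >> "$file"
    fi
}

# Scratch file standing in for a config that already enables the feature.
FILE_NAME=$(mktemp)
trap 'rm -f "$FILE_NAME"' EXIT
echo "fpm use-next-hop-groups" > "$FILE_NAME"

append_if_unset "$FILE_NAME" "fpm use-next-hop-groups"

# Number of lines in the file after the guarded append.
LINES_AFTER=$(grep -c '' "$FILE_NAME")
```

With this guard the existing positive directive is left alone, while a file that has neither form still receives the negated default.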

default "false";
}

leaf zebra_nexthop {
Collaborator

Why can't we reuse nexthop_group that is already available in device metadata?

Contributor Author

Thanks. I am not sure what the nexthop_group flag is for. Does it indicate the status of the whole feature? Can we also use it for the zebra disable-nexthop case? Thanks a lot.

Collaborator

@eddieruan-alibaba can clarify this, but I believe it is meant to indicate the whole feature: sonic-net/SONiC#1425

Contributor Author

@dgsudharsan @eddieruan-alibaba, could you please help confirm whether we could reuse nexthop_group? Thanks a lot.

Collaborator

@ntt-omw @eddieruan-alibaba can you please clarify the above question

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dgsudharsan The nexthop_group feature requires FRR’s fpm use-next-hop-groups to be enabled. For backward compatibility, this feature is disabled by default. The enable/disable control was implemented at the request of the Routing WG to allow more flexible control of the feature status.

How to enable / disable using NHG: the conclusion is to use zebra command line arguments to enable/disable this feature. No runtime change is allowed. For SONiC deployment, the configuration would be driven from config db to provide proper launch commands for zebra and fpmsyncd.
https://lists.sonicfoundation.dev/g/sonic-wg-routing/wiki/34834

Contributor Author

Thanks @dgsudharsan @nakano-omw,
So, to confirm my understanding: we still need to keep separate metadata parameters to control these FRR commands, correct? Thanks.
fpm use-next-hop-groups
no zebra nexthop kernel enable

Copilot AI review requested due to automatic review settings October 20, 2025 23:31
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull Request Overview

This PR disables next hop group support in FRR (Free Range Routing) to address netscan loss issues observed on production devices. The change introduces a new configuration option zebra_nexthop that defaults to disabled, forcing the use of legacy routing behavior.

  • Added zebra_nexthop configuration field to device metadata with disabled as the default value
  • Updated zebra configuration template to conditionally disable next hop group kernel support
  • Added test coverage for the new configuration option

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| `src/sonic-yang-models/yang-models/sonic-device_metadata.yang` | Added zebra_nexthop enum field with enabled/disabled values, defaulting to disabled |
| `src/sonic-yang-models/tests/yang_model_tests/tests_config/device_metadata.json` | Added test case configuration for valid zebra_nexthop setting |
| `src/sonic-yang-models/tests/yang_model_tests/tests/device_metadata.json` | Added test case descriptor for zebra_nexthop validation |
| `src/sonic-yang-models/tests/files/sample_config_db.json` | Updated sample config to include zebra_nexthop: disabled |
| `src/sonic-config-engine/tests/sample_output/py3/*.conf` | Updated test output files to include the nexthop kernel disable directive |
| `src/sonic-config-engine/tests/sample_output/py2/*.conf` | Updated test output files to include the nexthop kernel disable directive |
| `src/sonic-bgpcfgd/tests/data/sonic-cfggen/zebra/zebra.conf` | Updated test data with nexthop kernel disable directive |
| `platform/vs/docker-sonic-vs/frr/zebra.conf` | Added nexthop kernel disable directive to virtual switch configuration |
| `dockers/docker-fpm-frr/frr/zebra/zebra.conf.j2` | Added Jinja template logic to conditionally enable/disable zebra nexthop kernel support |
| `dockers/docker-fpm-frr/docker_init.sh` | Added fallback configuration to ensure nexthop kernel disable is present |

}

leaf zebra_nexthop {
description "Enable or disable next hop group support. This value only takes effect during boot time";
Copilot AI Oct 20, 2025

The description mentions 'next hop group support' but the field is named zebra_nexthop which controls 'nexthop kernel enable'. Consider clarifying that this controls zebra's kernel nexthop support specifically, not FPM's nexthop groups (which is controlled by the separate nexthop_group field).

Suggested change
description "Enable or disable next hop group support. This value only takes effect during boot time";
description "Enable or disable zebra's kernel nexthop support. This value only takes effect during boot time. Note: This does not control FPM's nexthop group support, which is managed by the separate 'nexthop_group' field.";

Comment on lines +10 to +11
{% if (('localhost' in DEVICE_METADATA) and ('zebra_nexthop' in DEVICE_METADATA['localhost']) and
(DEVICE_METADATA['localhost']['zebra_nexthop'] == 'enabled')) %}
Copilot AI Oct 20, 2025

[nitpick] The condition checks are duplicated between the zebra_nexthop block (lines 10-11) and the existing fpm block (lines 20-21). Consider extracting the metadata access pattern into a Jinja variable for better maintainability and consistency.
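
As a rough shell analogue of the refactoring suggested above (the template itself is Jinja, which is not reproduced here), the repeated check can be factored into one helper that both knobs share. `render_directive` and the variable names are hypothetical, and the negated `no ...` output for the non-enabled case is an assumption based on the directives quoted in this thread.

```shell
# Hypothetical helper mirroring the template condition: print the positive
# directive only when the value is exactly "enabled"; print the negated form
# for anything else, including an absent (empty) value.
render_directive() {
    value=$1
    directive=$2
    if [ "$value" = "enabled" ]; then
        echo "$directive"
    else
        echo "no $directive"
    fi
}

# Illustrative metadata values; in SONiC these would come from
# DEVICE_METADATA['localhost'] in config_db.
ZEBRA_NEXTHOP=""           # field absent, so the disabled default applies
NEXTHOP_GROUP="enabled"

ZEBRA_LINE=$(render_directive "$ZEBRA_NEXTHOP" "zebra nexthop kernel enable")
FPM_LINE=$(render_directive "$NEXTHOP_GROUP" "fpm use-next-hop-groups")
printf '%s\n%s\n' "$ZEBRA_LINE" "$FPM_LINE"
```

Keeping the two values as separate inputs reflects that zebra_nexthop and nexthop_group are treated as independent settings in this discussion.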

@lipxu lipxu changed the title from "[frr]: Force disable next hop group support" to "[frr]: update next hop group support by metadata value with disabled as the default value" Oct 20, 2025
@lipxu
Contributor Author

lipxu commented Oct 29, 2025

@dgsudharsan, I've addressed the comments and merged the latest master. Could you please help review it again? Thanks a lot.

@lipxu
Contributor Author

lipxu commented Nov 24, 2025

> @dgsudharsan, I've updated the comments and merged the latest master. Could you please help review it again, thanks a lot

Hi, @dgsudharsan , Just want to know if you have a chance to review the PR again, thanks a lot.

@lipxu lipxu force-pushed the 20250728_publicMaster_zebraNextHopGroup branch from 4398426 to 995d268 Compare December 22, 2025 03:40
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@StormLiangMS StormLiangMS left a comment

LGTM

@StormLiangMS
Contributor

Hi @eddieruan-alibaba, would you help take a look? We'd like to have an option to unblock the current issue on platforms with high-density BGP sessions and on chassis, to buy us some time to root-cause it.

@eddieruan-alibaba
Collaborator

@StormLiangMS

When no zebra nexthop kernel enable is configured, do we still need to allow enabling nexthop group for fpm?

{% block fpm %}
{% if ( ('localhost' in DEVICE_METADATA) and ('nexthop_group' in DEVICE_METADATA['localhost']) and
(DEVICE_METADATA['localhost']['nexthop_group'] == 'enabled') ) %}

@StormLiangMS
Contributor

hi @eddieruan-alibaba

> @StormLiangMS
>
> when no zebra nexthop kernel enable is configured, would we still need to allow enable nexthop group for fpm?
>
> {% block fpm %} {% if ( ('localhost' in DEVICE_METADATA) and ('nexthop_group' in DEVICE_METADATA['localhost']) and (DEVICE_METADATA['localhost']['nexthop_group'] == 'enabled') ) %}

@eddieruan-alibaba From what nakano-omw said, having nexthop_group allows finer-grained control and is required by the routing WG. We'd like to keep it separate for now, since "no zebra nexthop kernel enable" should be removed eventually, once we have finalized the solution for the rejected routes issue.

@eddieruan-alibaba
Collaborator

> hi @eddieruan-alibaba
>
> > @StormLiangMS
> > when no zebra nexthop kernel enable is configured, would we still need to allow enable nexthop group for fpm?
> > {% block fpm %} {% if ( ('localhost' in DEVICE_METADATA) and ('nexthop_group' in DEVICE_METADATA['localhost']) and (DEVICE_METADATA['localhost']['nexthop_group'] == 'enabled') ) %}
>
> @eddieruan-alibaba From what nakano-omw said, having nexthop_group allows finer-grained control and is required by the routing WG. We'd like to keep it separate for now, since "no zebra nexthop kernel enable" should be removed eventually, once we have finalized the solution for the rejected routes issue.

Sounds good. So we treat "no zebra nexthop kernel enable" as a workaround until we finalize the solution.

@eddieruan-alibaba eddieruan-alibaba merged commit 1d8797b into sonic-net:master Dec 23, 2025
23 checks passed
lipxu added a commit to lipxu/sonic-buildimage that referenced this pull request Jan 7, 2026
…as the default value (sonic-net#23500)

* [202411][frr]: Force disable next hop group support (sonic-net#23292)

Signed-off-by: Liping Xu <xuliping@microsoft.com>

---------

Signed-off-by: Liping Xu <xuliping@microsoft.com>
@lipxu
Contributor Author

lipxu commented Jan 7, 2026

Created a PR for 202511 manually due to a conflict: #24997

vmittal-msft pushed a commit that referenced this pull request Jan 9, 2026
…as the default value (#23500) (#24997)

* [202411][frr]: Force disable next hop group support (#23292)



---------

Signed-off-by: Liping Xu <xuliping@microsoft.com>
@r12f
Contributor

r12f commented Jan 15, 2026

Fixing tag as manual pick is merged.

@mssonicbld
Collaborator

Cherry-pick PR to 202511: #25088

jasonbridges pushed a commit to jasonbridges/sonic-buildimage that referenced this pull request Jan 22, 2026
…as the default value (sonic-net#23500)

* [202411][frr]: Force disable next hop group support (sonic-net#23292)

[202411][frr]: Force disable next hop group support

Signed-off-by: Liping Xu <xuliping@microsoft.com>

* update

Signed-off-by: Liping Xu <xuliping@microsoft.com>

* update

Signed-off-by: Liping Xu <xuliping@microsoft.com>

* update

Signed-off-by: Liping Xu <xuliping@microsoft.com>

* update metadata

Signed-off-by: Liping Xu <xuliping@microsoft.com>

* remove default set in docker init

Signed-off-by: Liping Xu <xuliping@microsoft.com>

---------

Signed-off-by: Liping Xu <xuliping@microsoft.com>
FengPan-Frank pushed a commit to FengPan-Frank/sonic-buildimage that referenced this pull request Mar 6, 2026
…as the default value (sonic-net#23500)

Signed-off-by: Liping Xu <xuliping@microsoft.com>
Signed-off-by: Feng Pan <fenpan@microsoft.com>
dprital pushed a commit that referenced this pull request Mar 19, 2026
…as the default value (#23500)

Signed-off-by: Liping Xu <xuliping@microsoft.com>
Signed-off-by: dprital <drorp@nvidia.com>

8 participants