Fix test_crm_nexthop_group: split neighbor/route into two phases by xixuej · Pull Request #23129 · sonic-net/sonic-mgmt

xixuej · 2026-03-19T07:23:14Z

Description of PR

Summary:
Fixes # (issue)
This fixes the race conditions observed on Nvidia switches and should also address #20563.

The configure_nexthop_groups() function had two problems:

Chunk batching bug: ip_batch[1:] was intended to skip only the first IP (2.0.0.1, the base neighbor), but when batching with chunk_size=200, it skipped the first element of EVERY batch, silently losing ~9 neighbors and their routes.
Race condition: neighbor and route creation were interleaved in the same for-loop, so a route could reference a nexthop before its neighbor was fully programmed in HW.
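The chunk batching bug can be reproduced in isolation. The following is a minimal sketch, not the code from tests/crm/test_crm.py: the helper names (`chunked`, `buggy_neighbors`, `intended_neighbors`) are hypothetical, and the list size of 1891 is borrowed from the failure message purely for illustration. It shows how slicing `[1:]` inside the batch loop drops one IP per batch instead of one IP overall.

```python
import ipaddress


def chunked(items, chunk_size):
    """Yield consecutive slices of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


def buggy_neighbors(ip_list, chunk_size=200):
    """Slice inside the loop: drops the first element of EVERY batch."""
    kept = []
    for ip_batch in chunked(ip_list, chunk_size):
        kept.extend(ip_batch[1:])
    return kept


def intended_neighbors(ip_list, chunk_size=200):
    """Slice once before batching: drops only the base neighbor."""
    kept = []
    for ip_batch in chunked(ip_list[1:], chunk_size):
        kept.extend(ip_batch)
    return kept


# Illustrative input: 1891 consecutive addresses starting at 2.0.0.1.
base = ipaddress.IPv4Address("2.0.0.1")
ip_list = [str(base + i) for i in range(1891)]

buggy = buggy_neighbors(ip_list)
intended = intended_neighbors(ip_list)
lost = sorted(set(intended) - set(buggy), key=ipaddress.IPv4Address)

print(len(intended) - len(buggy))  # number of neighbors silently lost
print(lost[0])                     # first address that was wrongly skipped
```

With 1891 addresses and a chunk size of 200 there are 10 batches, so the buggy variant drops 10 addresses where only 1 was meant to be dropped.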

Fix by separating into two phases and removing the chunk batching mechanism (no longer needed with the two-phase approach):

Phase 1: add all neighbors in one shot, then poll CRM ipv4_neighbor counter to confirm they are programmed in HW
Phase 2: add all routes in one shot after neighbors are confirmed
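The two-phase flow can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: `add_neighbors`, `add_routes` and `read_neighbor_used` are hypothetical callables standing in for whatever the test uses to push configuration and read the CRM ipv4_neighbor counter, and `FakeDut` is a simulated device that exists only to make the example runnable.

```python
import time


def wait_for_counter(read_used, expected, timeout=60.0, interval=1.0):
    """Poll read_used() until it reaches expected or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if read_used() >= expected:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def configure_two_phase(add_neighbors, add_routes, read_neighbor_used,
                        neighbor_ips, timeout=60.0, interval=1.0):
    """Phase 1: neighbors, then wait. Phase 2: routes."""
    baseline = read_neighbor_used()
    add_neighbors(neighbor_ips)  # phase 1: all neighbors in one shot
    expected = baseline + len(neighbor_ips)
    if not wait_for_counter(read_neighbor_used, expected, timeout, interval):
        raise RuntimeError("neighbors not programmed within timeout")
    add_routes(neighbor_ips)     # phase 2: only after neighbors are confirmed


class FakeDut:
    """Simulated device (hypothetical) that programs a few neighbors per poll."""

    def __init__(self, per_poll):
        self.per_poll = per_poll
        self.pending = 0
        self.used = 0
        self.routes = 0
        self.pending_at_route_time = None

    def add_neighbors(self, ips):
        self.pending += len(ips)

    def read_neighbor_used(self):
        step = min(self.per_poll, self.pending)
        self.pending -= step
        self.used += step
        return self.used

    def add_routes(self, ips):
        self.pending_at_route_time = self.pending
        self.routes += len(ips)


dut = FakeDut(per_poll=4)
ips = ["2.0.0.{}".format(i) for i in range(2, 12)]
configure_two_phase(dut.add_neighbors, dut.add_routes,
                    dut.read_neighbor_used, ips, timeout=5.0, interval=0.0)
print(dut.routes, dut.pending_at_route_time)
```

Polling against a baseline plus the number of added neighbors, instead of an absolute value, keeps the check valid when other neighbors already exist on the device.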

Type of change

Back port request

Approach

What is the motivation for this PR?

test_crm_nexthop_group[group_member=False] fails intermittently on msn4600 and msn4700 platforms with:
CRM counter did not reach expected value within 60 seconds.
Expected: used >= 1891, Actual: used=1807

How did you do it?

How did you verify/test it?

Any platform specific information?

Observed on Mellanox SN4700 and SN4600C, platforms with large NHG resource pools (~180K+). The large pool causes the test to create ~1800+ nexthop groups, which widens the race window.

Supported testbed topology if it's a new test case?

Documentation

mssonicbld · 2026-03-19T07:23:23Z

/azp run

azure-pipelines · 2026-03-19T07:23:37Z

Azure Pipelines successfully started running 1 pipeline(s).

xixuej · 2026-03-21T14:39:42Z

/azpw run

mssonicbld · 2026-03-21T14:39:45Z

/AzurePipelines run

azure-pipelines · 2026-03-21T14:39:58Z

Azure Pipelines successfully started running 1 pipeline(s).

The configure_nexthop_groups() function had two problems:

1. Chunk batching bug: ip_batch[1:] was intended to skip only the first IP (2.0.0.1, the base neighbor), but when batching with chunk_size=200, it skipped the first element of EVERY batch, silently losing ~9 neighbors and their routes.
2. Race condition: neighbor and route creation were interleaved in the same for-loop, so a route could reference a nexthop before its neighbor was fully programmed in HW.

Fix by separating into two phases and removing the chunk batching mechanism (no longer needed with the two-phase approach):

- Phase 1: add all neighbors in one shot, then poll CRM ipv4_neighbor counter to confirm they are programmed in HW
- Phase 2: add all routes in one shot after neighbors are confirmed

Signed-off-by: Xixue Jia <[email protected]>

tests/crm/test_crm.py

+
    del_template = Template(del_template)
-    add_template = Template(add_template)
+    neigh_template = Template(neigh_template)


tests/crm/test_crm.py

    del_template = Template(del_template)
-    add_template = Template(add_template)
+    neigh_template = Template(neigh_template)
+    route_template = Template(route_template)


tests/crm/test_crm.py

-    add_template = Template(add_template)
+    neigh_template = Template(neigh_template)
+    route_template = Template(route_template)
+    init_template = Template(init_template)


github-actions bot requested review from arlakshm, xwjiang-ms and yutongzhang-microsoft March 19, 2026 07:23

Copilot AI mentioned this pull request Mar 19, 2026

Review human-authored PRs opened in the past 24 hours #23134

Draft

nhe-NV previously approved these changes Mar 21, 2026

nhe-NV added the "Request for 202511 branch" label (request to backport a change to the 202511 branch) Mar 21, 2026

xixuej dismissed nhe-NV’s stale review via d1154f0 March 23, 2026 04:49

xixuej force-pushed the crm_fix branch from 5565a11 to d1154f0 Compare March 23, 2026 04:49

github-advanced-security bot found potential problems Mar 23, 2026

xixuej wants to merge 1 commit into sonic-net:master from xixuej:crm_fix

3 participants
