Skip to content

Fix test_crm_nexthop_group: split neighbor/route into two phases#23129

Open
xixuej wants to merge 1 commit intosonic-net:masterfrom
xixuej:crm_fix
Open

Fix test_crm_nexthop_group: split neighbor/route into two phases#23129
xixuej wants to merge 1 commit intosonic-net:masterfrom
xixuej:crm_fix

Conversation

@xixuej
Copy link
Contributor

@xixuej xixuej commented Mar 19, 2026

Description of PR

Summary:
Fixes # (issue) This fixes the race conditions that were observed on Nvidia switches, this should also address #20563

The configure_nexthop_groups() function had two problems:

  1. Chunk batching bug: ip_batch[1:] was intended to skip only the first IP (2.0.0.1, the base neighbor), but when batching with chunk_size=200, it skipped the first element of EVERY batch, silently losing ~9 neighbors and their routes.

  2. Race condition: neighbor and route creation were interleaved in the same for-loop, so a route could reference a nexthop before its neighbor was fully programmed in HW.

Fix by separating into two phases and removing the chunk batching mechanism (no longer needed with the two-phase approach):

  • Phase 1: add all neighbors in one shot, then poll CRM ipv4_neighbor counter to confirm they are programmed in HW
  • Phase 2: add all routes in one shot after neighbors are confirmed

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

test_crm_nexthop_group[group_member=False] fails intermittently on msn4600 and msn4700 platforms with:
CRM counter did not reach expected value within 60 seconds.
Expected: used >= 1891, Actual: used=1807

How did you do it?

How did you verify/test it?

Any platform specific information?

Observed on Mellanox LSN4700 and SN4600C — platforms with large NHG resource pools (~180K+) that cause the test to create ~1800+ nexthop groups, widening the race window.

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

nhe-NV
nhe-NV previously approved these changes Mar 21, 2026
@nhe-NV nhe-NV added the Request for 202511 branch Request to backport a change to 202511 branch label Mar 21, 2026
@xixuej
Copy link
Contributor Author

xixuej commented Mar 21, 2026

/azpw run

@mssonicbld
Copy link
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

The configure_nexthop_groups() function had two problems:

1. Chunk batching bug: ip_batch[1:] was intended to skip only the first
   IP (2.0.0.1, the base neighbor), but when batching with
   chunk_size=200, it skipped the first element of EVERY batch, silently
   losing ~9 neighbors and their routes.

2. Race condition: neighbor and route creation were interleaved in the
   same for-loop, so a route could reference a nexthop before its
   neighbor was fully programmed in HW.

Fix by separating into two phases and removing the chunk batching
mechanism (no longer needed with the two-phase approach):
- Phase 1: add all neighbors in one shot, then poll CRM
  ipv4_neighbor counter to confirm they are programmed in HW
- Phase 2: add all routes in one shot after neighbors are confirmed

Signed-off-by: Xixue Jia <[email protected]>

del_template = Template(del_template)
add_template = Template(add_template)
neigh_template = Template(neigh_template)

Check warning

Code scanning / CodeQL

Jinja2 templating with autoescape=False Medium test

Using jinja2 templates with autoescape=False can potentially allow XSS attacks.
del_template = Template(del_template)
add_template = Template(add_template)
neigh_template = Template(neigh_template)
route_template = Template(route_template)

Check warning

Code scanning / CodeQL

Jinja2 templating with autoescape=False Medium test

Using jinja2 templates with autoescape=False can potentially allow XSS attacks.
add_template = Template(add_template)
neigh_template = Template(neigh_template)
route_template = Template(route_template)
init_template = Template(init_template)

Check warning

Code scanning / CodeQL

Jinja2 templating with autoescape=False Medium test

Using jinja2 templates with autoescape=False can potentially allow XSS attacks.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Request for 202511 branch Request to backport a change to 202511 branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants