[action] [PR:19003] [crm] Fix test failures on SN4700 by batching neighbor creation #20138

Merged
mssonicbld merged 1 commit into sonic-net:202505 from
mssonicbld:cherry/202505/19003
Aug 8, 2025
@mssonicbld
Collaborator

Description of PR

test_crm_nexthop_group was failing on sn4700 testbeds due to a known orchagent crash. The crash is caused by a few factors:

  • An expected neighbor entry is missing from the kernel, causing the tunnel-route to be programmed for an unresolved nexthop.
  • The tunnel-route is removed when the nexthop group is deleted.
  • The route bulker ends early due to ITEM_NOT_FOUND (the remaining bulk removes are not processed).
  • The next-hop group delete fails due to OBJECT_IN_USE.

This PR addresses the first issue.

During neighbor generation for the test, some of the neighbor updates were being missed from APPL_DB on sn4700 devices. Previously, the neighbors were programmed in the kernel in a single ip command. This fix splits the neighbor programming into batches to prevent updates from being dropped.
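The batching idea can be sketched roughly as follows. This is an illustration only, not the code from the PR: the helper names (`chunked`, `build_neighbor_batches`), the batch size of 100, and the exact `ip neigh replace` command form are assumptions made for the example.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_neighbor_batches(ips: List[str], mac: str, iface: str,
                           batch_size: int = 100) -> List[str]:
    """Return one shell command string per batch of neighbors.

    Hypothetical helper: instead of a single long command that programs
    every neighbor at once, each returned string programs at most
    `batch_size` neighbors. The batch size is an assumed value.
    """
    batches = []
    for chunk in chunked(ips, batch_size):
        cmds = [
            "ip neigh replace {} lladdr {} dev {}".format(ip, mac, iface)
            for ip in chunk
        ]
        # Chain the commands of one batch; separate batches are run separately.
        batches.append(" && ".join(cmds))
    return batches
```

Each returned string would then be issued to the DUT as its own command, so no single invocation floods the kernel with every neighbor at once.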

Summary:
Fixes sonic-net/sonic-buildimage#21243

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
  • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

The CRM test was failing on dualtor sn4700 platforms.

How did you do it?

Batched neighbor creation to prevent updates from being dropped.

How did you verify/test it?

Ran the test case on an sn4700 dualtor testbed.

Any platform specific information?

Only changes how the test runs on sn4700.

Documentation

ado: #33294907

…c-net#19003)

* [crm] Fix test failures on SN4700 by batching neighbor creation

test_crm_nexthop_group was failing on sn4700 testbeds due to a known
orchagent crash. The crash is caused by a few factors:

- Expected neighbor entry missing from kernel, causing tunnel-route to be programmed
  for unresolved nexthop.
- tunnel-route being removed when nexthop group was deleted
- route bulker ending early due to ITEM_NOT_FOUND (not processing
  remaining bulk removes)
- next-hop group delete failing due to OBJECT_IN_USE.

This PR addresses the first issue.

During neighbor generation for the test, some of the neighbor updates
were being missed from APPL_DB on sn4700 devices. Previously, the
neighbors were programmed in the kernel in a single ip command, this fix splits
up the neighbor programming into batches in order to prevent updates
from being dropped.

Signed-off-by: Nikola Dancejic <[email protected]>

* fix pre-commit errors

* fix pre-commit

---------

Signed-off-by: Nikola Dancejic <[email protected]>
@mssonicbld
Collaborator Author

Original PR: #19003

@mssonicbld
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld mssonicbld merged commit 47a9160 into sonic-net:202505 Aug 8, 2025
12 checks passed