[crm] Fix test failures on SN4700 by batching neighbor creation#19003

Merged
bingwang-ms merged 3 commits into sonic-net:master from Ndancejic:batch_neighbors
Aug 6, 2025
Conversation

@Ndancejic
Contributor

Description of PR

test_crm_nexthop_group was failing on sn4700 testbeds due to a known orchagent crash. The crash is caused by a few factors:

  • An expected neighbor entry is missing from the kernel, causing a tunnel route to be programmed for an unresolved nexthop.
  • The tunnel route is removed when the nexthop group is deleted.
  • The route bulker ends early on ITEM_NOT_FOUND, leaving the remaining bulk removes unprocessed.
  • The nexthop group delete fails with OBJECT_IN_USE.

This PR addresses the first issue.

During neighbor generation for the test, some of the neighbor updates were missing from APPL_DB on sn4700 devices. Previously, the neighbors were programmed in the kernel with a single ip command; this fix splits the neighbor programming into batches to prevent updates from being dropped.
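
As a rough illustration of the batching idea described above, the sketch below splits a list of neighbor IPs into fixed-size chunks and builds one shell command string per chunk. The helper names (`chunked`, `build_neigh_batches`), the batch size of 200, and the MAC address are assumptions made for this example; the diff itself is not shown on this page, so this is not necessarily how the PR implements it.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_neigh_batches(ips: List[str], iface: str, mac: str,
                        batch_size: int = 200) -> List[str]:
    """Return one shell command string per batch of neighbors.

    Each string chains `ip neigh replace` invocations with `&&`, so a
    caller can run the batches one at a time instead of pushing every
    neighbor to the kernel in a single command. The batch size is a
    hypothetical default, not a value taken from the PR.
    """
    commands = []
    for batch in chunked(ips, batch_size):
        commands.append(" && ".join(
            "ip neigh replace {} lladdr {} dev {}".format(ip, mac, iface)
            for ip in batch
        ))
    return commands


if __name__ == "__main__":
    # Roughly the scale mentioned in the discussion (~2000 neighbors).
    ips = ["10.0.{}.{}".format(i // 250, i % 250 + 1) for i in range(2000)]
    batches = build_neigh_batches(ips, "Ethernet0", "11:22:33:44:55:66")
    print("number of batches:", len(batches))
```

A caller would run each returned string on the DUT in turn, which gives consumers of the kernel's neighbor notifications a chance to keep up between batches.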

Summary:
Fixes sonic-net/sonic-buildimage#21243

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

CRM test was failing on dualtor sn4700 platforms

How did you do it?

Batched neighbor creation to prevent updates from being dropped.

How did you verify/test it?

Ran the test case on an sn4700 dualtor testbed.

Any platform specific information?

The change only affects test runs on sn4700.

Documentation

ado: #33294907

Signed-off-by: Nikola Dancejic <[email protected]>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@lolyu lolyu requested a review from bingwang-ms July 3, 2025 05:08
@bingwang-ms
Collaborator

@Ndancejic Can you help me understand "During neighbor generation for the test, some of the neighbor updates were being missed from APPL_DB on sn4700 devices."? Is it possible that doing it in batches hides some real issue?

@Ndancejic
Contributor Author

> @Ndancejic Can you help me understand "During neighbor generation for the test, some of the neighbor updates were being missed from APPL_DB on sn4700 devices."? Is it possible that doing it in batches hides some real issue?

I noticed that some of the neighbor entries added at the start of the test were not showing up in APPL_DB. APPL_DB learns about these from netlink notifications when neighbors are programmed through the kernel. From what I understand, the kernel can sometimes drop these notifications when the volume is high. In this case we were programming ~2000 entries in a single ip neigh replace command.
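
One way to check for the symptom described above is to compare the neighbors the test meant to create against the keys that actually appear in APPL_DB. The sketch below is a minimal example under stated assumptions: it assumes keys of the form `NEIGH_TABLE:<interface>:<ip>`, and `find_missing_neighbors` is a hypothetical helper, not a function from sonic-mgmt. How the key list is fetched from the DUT is left to the caller.

```python
from typing import Iterable, List


def find_missing_neighbors(expected_ips: Iterable[str],
                           appl_db_keys: Iterable[str],
                           iface: str) -> List[str]:
    """Return the expected neighbor IPs that have no APPL_DB key on `iface`.

    Assumes keys look like "NEIGH_TABLE:<interface>:<ip>". The key is split
    at most twice so that IPv6 addresses, which contain colons themselves,
    stay in one piece.
    """
    seen = set()
    for key in appl_db_keys:
        parts = key.split(":", 2)
        if len(parts) == 3 and parts[0] == "NEIGH_TABLE" and parts[1] == iface:
            seen.add(parts[2])
    return sorted(set(expected_ips) - seen)


if __name__ == "__main__":
    expected = ["10.0.0.1", "10.0.0.2", "fc00::1"]
    keys = [
        "NEIGH_TABLE:Ethernet0:10.0.0.1",
        "NEIGH_TABLE:Ethernet0:fc00::1",
        "NEIGH_TABLE:Ethernet4:10.0.0.2",  # same IP, different interface
    ]
    print(find_missing_neighbors(expected, keys, "Ethernet0"))
```

A non-empty result after neighbor programming would point to dropped updates rather than to a problem in the test's later steps.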

Collaborator

@lolyu lolyu left a comment

LGTM

@bingwang-ms bingwang-ms merged commit 69a65ac into sonic-net:master Aug 6, 2025
13 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Aug 6, 2025
…c-net#19003)

* [crm] Fix test failures on SN4700 by batching neighbor creation

test_crm_nexthop_group was failing on sn4700 testbeds due to a known
orchagent crash. The crash is caused by a few factors:

- Expected neighbor entry missing from kernel, causing tunnel-route to be programmed
  for unresolved nexthop.
- tunnel-route being removed when nexthop group was deleted
- route bulker ending early due to ITEM_NOT_FOUND (not processing
  remaining bulk removes)
- next-hop group delete failing due to OBJECT_IN_USE.

This PR addresses the first issue.

During neighbor generation for the test, some of the neighbor updates
were being missed from APPL_DB on sn4700 devices. Previously, the
neighbors were programmed in the kernel in a single ip command, this fix splits
up the neighbor programming into batches in order to prevent updates
from being dropped.

Signed-off-by: Nikola Dancejic <[email protected]>

* fix pre-commit errors

* fix pre-commit

---------

Signed-off-by: Nikola Dancejic <[email protected]>
@mssonicbld
Collaborator

Cherry-pick PR to 202411: #20092

nissampa pushed a commit to nissampa/sonic-mgmt_dpu_test that referenced this pull request Aug 7, 2025
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Aug 8, 2025
@mssonicbld
Collaborator

Cherry-pick PR to 202505: #20138

mssonicbld pushed a commit that referenced this pull request Aug 8, 2025
ashutosh-agrawal pushed a commit to ashutosh-agrawal/sonic-mgmt that referenced this pull request Aug 14, 2025
bingwang-ms pushed a commit that referenced this pull request Aug 19, 2025
…) (#20092)
vidyac86 pushed a commit to vidyac86/sonic-mgmt that referenced this pull request Oct 23, 2025
opcoder0 pushed a commit to opcoder0/sonic-mgmt that referenced this pull request Dec 8, 2025
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 16, 2025
AharonMalkin pushed a commit to AharonMalkin/sonic-mgmt that referenced this pull request Dec 16, 2025
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Dec 21, 2025
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Jan 13, 2026
gshemesh2 pushed a commit to gshemesh2/sonic-mgmt that referenced this pull request Jan 26, 2026
ytzur1 pushed a commit to ytzur1/sonic-mgmt that referenced this pull request Feb 2, 2026
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Mar 27, 2026
Development

Successfully merging this pull request may close these issues.

[dualtor] CRM test fails on test_crm_nexthop_group when tunnel route created for PortChannel neighbor

5 participants