[crm] Fix test failures on SN4700 by batching neighbor creation#19003
bingwang-ms merged 3 commits into sonic-net:master
test_crm_nexthop_group was failing on sn4700 testbeds due to a known orchagent crash. The crash is caused by a few factors:
- Expected neighbor entry missing from kernel, causing tunnel-route to be programmed for unresolved nexthop.
- tunnel-route being removed when nexthop group was deleted.
- route bulker ending early due to ITEM_NOT_FOUND (not processing remaining bulk removes).
- next-hop group delete failing due to OBJECT_IN_USE.
This PR addresses the first issue.
During neighbor generation for the test, some of the neighbor updates were being missed from APPL_DB on sn4700 devices. Previously, the neighbors were programmed in the kernel in a single ip command; this fix splits the neighbor programming into batches to prevent updates from being dropped.
Signed-off-by: Nikola Dancejic
@Ndancejic Can you help me understand
I noticed that some of the neighbor entries added at the start of the test were not showing up in APPL_DB. APPL_DB gets these from netlink when neighbors are programmed through the kernel. From what I understand, at high volume the kernel can sometimes drop these notifications. In this case we were programming ~2000 entries in a single
Cherry-pick PR to 202411: #20092
Cherry-pick PR to 202505: #20138
Description of PR
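test_crm_nexthop_group was failing on sn4700 testbeds due to a known orchagent crash. The crash is caused by a few factors:
- Expected neighbor entry missing from kernel, causing tunnel-route to be programmed for unresolved nexthop.
- tunnel-route being removed when nexthop group was deleted.
- route bulker ending early due to ITEM_NOT_FOUND (not processing remaining bulk removes).
- next-hop group delete failing due to OBJECT_IN_USE.
This PR addresses the first issue.
During neighbor generation for the test, some of the neighbor updates were being missed from APPL_DB on sn4700 devices. Previously, the neighbors were programmed in the kernel in a single ip command; this fix splits the neighbor programming into batches to prevent updates from being dropped.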
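One way to confirm that neighbor updates are missing from APPL_DB is to compare the addresses the test programmed against the neighbor keys actually present. The sketch below is illustrative only and is not taken from this PR: the function name `missing_neighbors` is hypothetical, and it assumes neighbor keys have the form `NEIGH_TABLE:<interface>:<ip>` and that the caller has already fetched the key list.

```python
def missing_neighbors(expected_ips, appl_db_keys, table="NEIGH_TABLE"):
    """Return the expected neighbor IPs that have no matching APPL_DB key.

    Assumption: keys look like "NEIGH_TABLE:<interface>:<ip>". Addresses are
    compared as plain strings, so both sides must use the same textual form.
    """
    present = set()
    for key in appl_db_keys:
        # Split at most twice so IPv6 addresses keep their own colons.
        parts = key.split(":", 2)
        if len(parts) == 3 and parts[0] == table:
            present.add(parts[2])
    return sorted(set(expected_ips) - present)


if __name__ == "__main__":
    keys = [
        "NEIGH_TABLE:Ethernet0:10.0.0.1",
        "NEIGH_TABLE:Ethernet0:fc00::1",
    ]
    print(missing_neighbors(["10.0.0.1", "10.0.0.2", "fc00::1"], keys))
```

A non-empty result after neighbor generation would indicate the dropped-update condition described above.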
Summary:
Fixes sonic-net/sonic-buildimage#21243
Type of change
Back port request
Approach
What is the motivation for this PR?
The CRM test was failing on dualtor sn4700 platforms.
How did you do it?
Batched neighbor creation to prevent updates from being dropped.
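The batching idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the helper names `chunked` and `build_neighbor_commands`, the batch size of 100, and the choice to join commands with `&&` are all hypothetical. Only the `ip neigh replace <ip> lladdr <mac> dev <dev>` syntax is standard iproute2.

```python
def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_neighbor_commands(neighbors, dev, batch_size=100):
    """Build one shell command string per batch of (ip, mac) pairs.

    Assumption: the caller runs each returned string as a separate command,
    so the kernel receives several smaller bursts of neighbor updates
    instead of one large burst.
    """
    commands = []
    for batch in chunked(neighbors, batch_size):
        parts = [
            "ip neigh replace {} lladdr {} dev {}".format(ip, mac, dev)
            for ip, mac in batch
        ]
        commands.append(" && ".join(parts))
    return commands


if __name__ == "__main__":
    # Five example neighbors with made-up addresses, split into batches of two.
    neighbors = [
        ("2.0.0.{}".format(i), "11:22:33:44:55:{:02x}".format(i))
        for i in range(1, 6)
    ]
    for cmd in build_neighbor_commands(neighbors, "Ethernet0", batch_size=2):
        print(cmd)
```

A suitable batch size depends on the platform; the value here is a placeholder, not a measured threshold.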
How did you verify/test it?
Ran the test case on an sn4700 dualtor testbed.
Any platform specific information?
Only changes the test run on sn4700.
Documentation
ado: #33294907