[action] [PR:23970] [arp_update] Resolve neighbors from config_db #24226
Merged
mssonicbld merged 1 commit into sonic-net:202411 on Oct 8, 2025
#### What changed
- This change updates the `arp_update` script so that neighbor resolution works correctly after firmware upgrades and system repaves.
- The change was originally developed and validated on the 202205 and 202211 images (sonic-net#15006) and is now being backported to 202305 and newer so that neighbor resolution behaves consistently across SONiC versions.
- The `arp_update` script now defines the kernel neighbor lists (`KERNEIGH4` and `KERNEIGH6`) according to the device subtype so that DualToR is handled properly.
#### Why I did it
After a firmware upgrade or repave, the kernel neighbor table can be empty or missing entries. Devices then fail to resolve those neighbors, and traffic destined to them is dropped.
#### What is being fixed
- On non-DualToR devices, `FAILED/INCOMPLETE` neighbors are excluded because this state indicates a connection issue.
- On DualToR devices, servers are connected to two ToR switches, but only one path is active at a time. When a neighbor is reachable through the peer ToR switch, the local ToR switch has `FAILED/INCOMPLETE` entries for it, which is expected behavior.
- The original code excluded `FAILED/INCOMPLETE` neighbors for all device types, which causes issues on DualToR devices: neighbors that should be reachable via the peer switch but are `FAILED` in the kernel were not detected as mismatches.
- With the fix (post_upgrade), the standby ToR includes `FAILED/INCOMPLETE` neighbors in mismatch checking, so the script can detect the mismatch between the kernel `FAILED` state and the APPL_DB entries and include those neighbors in synchronization processing.
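
The subtype-dependent filtering described in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual `arp_update` code: the function name `filter_kernel_neighbors` is hypothetical, and it assumes the subtype string has already been read (for example from the `subtype` field of `DEVICE_METADATA|localhost` in CONFIG_DB, which is an assumption about where the script gets it).

```shell
# Hypothetical helper for illustration only; not taken from arp_update.
# Reads `ip neigh show`-style lines on stdin and prints the entries that
# should take part in mismatch checking for the given device subtype.
filter_kernel_neighbors() {
    local subtype
    # Normalize case so "DualToR" and "dualtor" are treated alike.
    subtype=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    if [ "$subtype" = "dualtor" ]; then
        # DualToR: keep every entry, including FAILED/INCOMPLETE ones,
        # since those may be neighbors reachable via the peer ToR.
        cat
    else
        # Non-DualToR: drop FAILED/INCOMPLETE entries.
        # `|| true` keeps the function from failing when nothing is left.
        grep -v -E 'FAILED|INCOMPLETE' || true
    fi
}

# Example invocation on a switch (assumed usage, not run here):
#   ip -4 neigh show | filter_kernel_neighbors "DualToR"
```

Keeping the subtype as a plain argument makes the filter easy to exercise with canned `ip neigh` output, independent of how the real script looks up the device metadata.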
#### Example
```
# Immediately after system repave on DualToR standby switch
$ sonic-db-cli APPL_DB keys NEIGH_TABLE:Vlan100:*
NEIGH_TABLE:Vlan100:192.168.1.100
NEIGH_TABLE:Vlan100:192.168.1.101
NEIGH_TABLE:Vlan100:192.168.1.102
# Kernel starts with empty/failed entries
$ ip -4 neigh show | grep Vlan100
192.168.1.100 dev Vlan100 FAILED
192.168.1.101 dev Vlan100 FAILED
192.168.1.102 dev Vlan100 FAILED
# With enhanced arp_update script:
# 1. Includes FAILED entries in mismatch detection
# 2. Compares with APPL_DB entries
# 3. Triggers appropriate resolution (ping/tunnel route setup)
# 4. Results in proper neighbor state restoration
# Final state after arp_update processing:
$ ip -4 neigh show | grep Vlan100
192.168.1.100 dev Vlan100 lladdr 00:00:00:00:00:00 PERMANENT # Zero MAC for peer-reachable
192.168.1.101 dev Vlan100 lladdr aa:bb:cc:dd:ee:ff REACHABLE # Direct reachable
192.168.1.102 dev Vlan100 lladdr 00:00:00:00:00:00 PERMANENT # Zero MAC for peer-reachable
```
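
The mismatch detection in steps 1 and 2 of the example can be approximated with the sketch below. It is an assumption-laden illustration rather than the script's real logic: `find_unresolved_neighbors` is a hypothetical name, the two inputs are files holding the `sonic-db-cli APPL_DB keys` output and the `ip neigh show` output, and a neighbor counts as unresolved when the kernel has no entry for it or only a `FAILED`/`INCOMPLETE` one.

```shell
# Hypothetical helper for illustration only; not taken from arp_update.
# Usage: find_unresolved_neighbors <appl_db_keys_file> <kernel_neigh_file>
# Prints the IP of every APPL_DB neighbor that the kernel either does not
# know about or holds only in FAILED/INCOMPLETE state.
find_unresolved_neighbors() {
    awk '
        # Kernel file: remember IPs whose last field (the state) is usable.
        kernel {
            if ($NF != "FAILED" && $NF != "INCOMPLETE") resolved[$1] = 1
            next
        }
        # APPL_DB file: keys look like NEIGH_TABLE:<intf>:<ip>.
        {
            ip = $0
            sub(/^NEIGH_TABLE:[^:]*:/, "", ip)
            if (!(ip in resolved)) print ip
        }
    ' kernel=1 "$2" kernel=0 "$1"
}
```

The `kernel=1` / `kernel=0` assignments between the file arguments tell awk which input it is reading, which keeps the logic correct even when the kernel table is completely empty right after a repave. What happens to each reported IP afterwards (ping or tunnel route setup, per step 3) is left to the script and is not modeled here.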
##### Work item tracking
- Microsoft ADO **(number only)**:
#### How to verify it
#### Which release branch to backport (provide reason below if selected)
- [ ] 202205
- [ ] 202211
- [x] 202305
- [x] 202311
- [x] 202405
- [x] 202411
- [x] 202505
#### Tested branch (Please provide the tested image version)
- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->
#### Description for the changelog
#### Link to config_db schema for YANG module changes
#### A picture of a cute animal (not mandatory but encouraged)
Original PR: #23970

/azp run Azure.sonic-buildimage

Azure Pipelines successfully started running 1 pipeline(s).
noaOrMlnx pushed a commit to noaOrMlnx/sonic-buildimage that referenced this pull request on Nov 24, 2025:

…-net#1745) Code sync sonic-net/sonic-buildimage:202411 => 202412
```
* e1f0789 (HEAD -> code-sync-202412, origin/code-sync-202412) r12f 251022:1851 - Merge remote-tracking branch 'base/202411' into code-sync-202412
|\
| * 5e45f5b (base/202411) mssonicbld 251008:1012 - [arp_update]Resolve neighbors from config_db (sonic-net#24226)
| * fb53c97 mssonicbld 251003:0612 - Increase egress and ingress buffer pool sizes on Arista-7050CX3-32S-C28S4 (sonic-net#24192)
| * 80fb6ed mssonicbld 250927:1912 - [YANG] Change VXLAN tunnel YANG model to support 2 tunnels + string validation (sonic-net#23999)
| * 3435dda mssonicbld 250909:2212 - [submodule] Update submodule sonic-host-services to the latest HEAD automatically (sonic-net#23930)
| * 7fccf4e mssonicbld 250907:1612 - [submodule] Update submodule sonic-host-services to the latest HEAD automatically (sonic-net#22178)
| * 4a7b293 mssonicbld 250820:0713 - Fix Debian repos used for Bullseye-based containers (sonic-net#23759)
| * 8eb0c1e Aravind-Subbaroyan 250818:2012 - Update cisco-8000-smartswitch.ini (sonic-net#23746)
| * d6a9627 mssonicbld 250819:0512 - [TACACS] Fix memory leak when authenticating using tacacs (sonic-net#23150)
| * dacee85 mssonicbld 250815:1614 - [submodule] Update submodule sonic-swss to the latest HEAD automatically (sonic-net#23608)
| * 420e5fd mssonicbld 250813:1712 - [submodule] Update submodule sonic-linux-kernel to the latest HEAD automatically (sonic-net#23554)
| * 39eb0b9 mssonicbld 250807:1913 - [action] [PR:22553] [build] Fix kdump build failure (Fixes 5097 17023) (sonic-net#23035)
| * cb2dff4 mssonicbld 250807:0612 - [submodule] Update submodule sonic-sairedis to the latest HEAD automatically (sonic-net#23606)
| * ec66359 mssonicbld 250805:2212 - Bullseye is EOL, use the archive repo (sonic-net#23589)
| * 124de26 anamehra 250801:1018 - Update cisco-8000.ini to 202411.1.0.12 (sonic-net#23547)
| * 8469d6f Gagan Punathil Ellath 250731:1507 - [202411][mellanox] Use the PF interface for the midplane communication with the DPU and rshim updates (sonic-net#23533)
| * d51a6d3 rameshraghupathy 250729:0950 - Update 202411 branch cisco-8000-smartswitch.ini with 202411.1.0.12 (sonic-net#23504)
| * 5763622 mssonicbld 250724:1611 - [submodule] Update submodule sonic-gnmi to the latest HEAD automatically (sonic-net#23452)
| * 4a14f10 mssonicbld 250724:1412 - Improve GNMI_CLIENT_CERT table to support multiple roles. (sonic-net#23448)
| * c3bd9e0 mssonicbld 250718:1012 - [submodule] Update submodule sonic-utilities to the latest HEAD automatically (sonic-net#23274)
| * c6c864a mssonicbld 250717:0712 - [Mellanox] [Smartswitch] Fix sensors file for Smartswitch (sonic-net#23342)
| * da76d4d Gagan Punathil Ellath 250715:1221 - [202411][Smartswitch] Fix SN4280 SKU pmon daemon control to skip chassisd
| * 12aa5e0 Liping Xu 250715:1215 - [202411][frr]: Force disable next hop group support (sonic-net#23292)
```