[master] Resolve neighbors from config_db#15006

Merged
prsunny merged 1 commit into sonic-net:master from anish-n:arp_update_master
May 17, 2023
anish-n (Contributor) commented May 10, 2023

Why I did it

To resolve NEIGH table entries present in CONFIG_DB. Without this change, ARP/NDP entries that are configured via CONFIG_DB and that we want resolved are never resolved.

Work item tracking
  • Microsoft ADO (number only):

How I did it

Modified the arp_update script to take NEIGH entries from CONFIG_DB for resolution. For entries that fail to resolve, the script triggers a ping or ping6 command.
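
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `neigh_keys_to_pings` is hypothetical, it assumes CONFIG_DB keys of the form `NEIGH|<interface>|<address>`, and it only prints the ping commands instead of running them.

```shell
# Hypothetical helper (not from the PR): read CONFIG_DB NEIGH keys on stdin
# and print one ping command per neighbor.
# Assumption: keys have the form "NEIGH|<interface>|<address>".
neigh_keys_to_pings() {
    while IFS= read -r key; do
        # Skip anything that is not a three-part NEIGH key.
        case "$key" in
            "NEIGH|"*"|"*) ;;
            *) continue ;;
        esac
        rest=${key#*"|"}     # drop the "NEIGH|" prefix
        intf=${rest%%"|"*}   # text before the next '|'
        addr=${rest#*"|"}    # text after it
        [ -n "$addr" ] || continue
        case "$addr" in
            *:*) echo "ping6 -I $intf -c 1 -W 1 $addr" ;;  # IPv6 addresses contain ':'
            *)   echo "ping -I $intf -c 1 -W 1 $addr" ;;
        esac
    done
}
```

On a device, the input would presumably come from something like `sonic-db-cli CONFIG_DB keys 'NEIGH|*'`, and the printed command would be run only for entries that the kernel reports as failed.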

How to verify it

Configure the NEIGH table in CONFIG_DB and check that the entries get resolved.
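
One way to script this check is sketched below. The helper `neigh_is_resolved` is hypothetical and only inspects text in the format printed by `ip neigh show`; it assumes an entry counts as resolved when it carries a link-layer address and is not in the FAILED or INCOMPLETE state.

```shell
# Hypothetical helper: exit 0 when the `ip neigh show` output on stdin
# contains a resolved entry for the address given as $1.
neigh_is_resolved() {
    awk -v addr="$1" '
        # Resolved here means: has an lladdr and is not FAILED/INCOMPLETE.
        $1 == addr && /lladdr/ && !/FAILED|INCOMPLETE/ { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, after adding a NEIGH entry for 192.168.1.100 on Vlan100 (both values are placeholders), `ip neigh show dev Vlan100 | neigh_is_resolved 192.168.1.100 && echo resolved` would report whether the kernel now holds a usable entry.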

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

anish-n requested a review from lguohan as a code owner May 10, 2023 17:53
linux-foundation-easycla bot commented May 10, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: anish-n / name: Anish Narsian (13fef74)

anish-n force-pushed the arp_update_master branch from 04362ba to 13fef74 May 10, 2023 18:03
prsunny (Contributor) commented May 16, 2023

/azp run Azure.sonic-buildimage

azure-pipelines commented

Azure Pipelines successfully started running 1 pipeline(s).

prsunny merged commit 05a85b5 into sonic-net:master May 17, 2023
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request May 17, 2023
* To resolve NEIGH table entries present in CONFIG_DB. Without this change arp/ndp entries which we wish to resolve, and configured via CONFIG_DB are not resolved.
mssonicbld pushed a commit to mssonicbld/sonic-buildimage that referenced this pull request May 17, 2023
* To resolve NEIGH table entries present in CONFIG_DB. Without this change arp/ndp entries which we wish to resolve, and configured via CONFIG_DB are not resolved.
mssonicbld (Collaborator) commented

Cherry-pick PR to 202211: #15123

mssonicbld (Collaborator) commented

Cherry-pick PR to 202205: #15124

lguohan pushed a commit that referenced this pull request Sep 26, 2025
What changed
This change updates the arp_update script so that neighbor resolution works properly after firmware upgrades and system repaves.
This change was originally developed and validated on the 202205 and 202211 images ([master] Resolve neighbors from config_db #15006) and is now being backported to 202305 and newer versions to keep neighbor resolution consistent across SONiC image versions.
The arp_update script now defines the kernel neighbors (KERNEIGH4 and KERNEIGH6) according to the device subtype so that DualToR devices are handled properly.

Why I did it
After firmware upgrades or repaves, devices can experience neighbor resolution issues because the kernel neighbor table can be empty or missing entries, so traffic to the affected neighbors is dropped.

What is being fixed
On non-DualToR devices, FAILED/INCOMPLETE neighbors are excluded because this state indicates a connectivity problem.
On DualToR devices, servers are connected to two ToR switches but only one path is active at a time. When a neighbor is reachable through the peer ToR switch, the local ToR switch has FAILED/INCOMPLETE entries for it, which is expected behavior.
The original code excluded FAILED/INCOMPLETE neighbors for all device types, which causes issues on DualToR devices: neighbors that should be reachable via the peer switch but are FAILED in the kernel were not detected as mismatches.
With the fix (post_upgrade), the standby ToR includes FAILED/INCOMPLETE neighbors in mismatch checking, and these neighbors are included in synchronization processing because the script can detect the mismatch between the kernel FAILED state and the APPL_DB entries.
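
The subtype-dependent filtering described above might look roughly like the sketch below. It is an illustration under stated assumptions, not the script's actual code: the function name `kernel_neighbors` is hypothetical, and the device subtype is assumed to be passed in as the string `DualToR` for DualToR devices.

```shell
# Hypothetical filter: print the kernel neighbor addresses that take part
# in mismatch checking. $1 is the device subtype; `ip neigh show` output
# is read from stdin.
kernel_neighbors() {
    if [ "$1" = "DualToR" ]; then
        # DualToR: keep every entry, including FAILED/INCOMPLETE ones,
        # which are expected for neighbors reachable via the peer ToR.
        awk '{ print $1 }'
    else
        # Other subtypes: drop FAILED/INCOMPLETE entries.
        awk '!/FAILED|INCOMPLETE/ { print $1 }'
    fi
}
```

A caller could build the IPv4 list with something like `ip -4 neigh show | kernel_neighbors "$subtype"`, where `$subtype` is a placeholder for however the script obtains the device subtype.
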
Example
# Immediately after system repave on DualToR standby switch
$ sonic-db-cli APPL_DB keys NEIGH_TABLE:Vlan100:*
NEIGH_TABLE:Vlan100:192.168.1.100
NEIGH_TABLE:Vlan100:192.168.1.101
NEIGH_TABLE:Vlan100:192.168.1.102

# Kernel starts with empty/failed entries
$ ip -4 neigh show | grep Vlan100
192.168.1.100 dev Vlan100 FAILED
192.168.1.101 dev Vlan100 FAILED  
192.168.1.102 dev Vlan100 FAILED

# With enhanced arp_update script:
# 1. Includes FAILED entries in mismatch detection
# 2. Compares with APPL_DB entries
# 3. Triggers appropriate resolution (ping/tunnel route setup)
# 4. Results in proper neighbor state restoration

# Final state after arp_update processing:
$ ip -4 neigh show | grep Vlan100
192.168.1.100 dev Vlan100 lladdr 00:00:00:00:00:00 PERMANENT  # Zero MAC for peer-reachable
192.168.1.101 dev Vlan100 lladdr aa:bb:cc:dd:ee:ff REACHABLE  # Direct reachable
192.168.1.102 dev Vlan100 lladdr 00:00:00:00:00:00 PERMANENT  # Zero MAC for peer-reachable
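
The comparison step in the example above (kernel entries versus APPL_DB entries) can be illustrated with a small sketch. The helper `unresolved_appl_neighbors` is hypothetical and is not the script's actual logic; it assumes APPL_DB keys of the form `NEIGH_TABLE:<interface>:<address>` as shown in the example, and it treats a neighbor as mismatched when the kernel has no entry for it or the entry is FAILED/INCOMPLETE.

```shell
# Hypothetical helper: print APPL_DB neighbors that the kernel has not resolved.
# $1: file holding `ip neigh show` output
# $2: file holding APPL_DB NEIGH_TABLE keys
unresolved_appl_neighbors() {
    awk '
        FILENAME == ARGV[1] {
            # Remember kernel entries that are in a usable state.
            if ($0 !~ /FAILED|INCOMPLETE/) resolved[$1] = 1
            next
        }
        {
            # Strip "NEIGH_TABLE:<interface>:" to leave the address.
            addr = $0
            sub(/^NEIGH_TABLE:[^:]*:/, "", addr)
            if (!(addr in resolved)) print addr
        }
    ' "$1" "$2"
}
```

Each printed address is a candidate for the resolution step (ping or tunnel route setup) mentioned in the example.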


Co-authored-by: anish-n <[email protected]>
mssonicbld added a commit to mssonicbld/sonic-buildimage that referenced this pull request Oct 7, 2025
mssonicbld added a commit that referenced this pull request Oct 8, 2025
mssonicbld added a commit to mssonicbld/sonic-buildimage that referenced this pull request Oct 10, 2025
#### What changed
- This change updates the `arp_update` script so that neighbor resolution works properly after firmware upgrades and system repaves.
- This change was originally developed and validated on the 202205 and 202211 images (sonic-net#15006) and is now being backported to 202411 and newer versions to keep neighbor resolution consistent across SONiC image versions.
- The `arp_update` script now defines the kernel neighbors (`KERNEIGH4` and `KERNEIGH6`) according to the device subtype so that DualToR devices are handled properly.

mssonicbld added a commit that referenced this pull request Oct 10, 2025
FengPan-Frank pushed a commit to FengPan-Frank/sonic-buildimage that referenced this pull request Dec 4, 2025
Co-authored-by: anish-n <[email protected]>
Signed-off-by: Feng Pan <[email protected]>
r12f pushed a commit to Azure/sonic-buildimage-msft that referenced this pull request Dec 17, 2025