Add lldpd patch to fix incomplete interface detection #25436
yejianquan merged 2 commits into sonic-net:master
Signed-off-by: Zhaohui Sun <[email protected]>
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
Adds an upstream lldpd netlink fix (for 1.0.16 packaging in sonic-buildimage) to prevent missing interfaces during initial RTM_GETLINK/RTM_GETADDR dumps when concurrent RTM_NEWLINK events arrive during boot/config reload.
Changes:
- Updates the lldpd patch series metadata and includes a new patch in the series.
- Adds a patch that separates netlink “query/dump” traffic from “change notification” traffic by using two netlink sockets.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/lldpd/patch/series | Registers the new lldpd netlink fix patch in the applied patch stack. |
| src/lldpd/patch/0002-use-a-different-socket-for-changes-and-queries.patch | Implements the upstream netlink socket separation fix to avoid incomplete interface enumeration. |
      }
    - cfg->g_netlink->nl_socket = s;
    +
    + /* Opening Netlink socket to for queries */
Typo in the added comment text: “Opening Netlink socket to for queries” should be “Opening Netlink socket for queries” to avoid confusion in the upstreamed patch.
    - /* Opening Netlink socket to for queries */
    + /* Opening Netlink socket for queries */

    + /* Open Netlink socket for subscriptions */
    + log_debug("netlink", "opening netlink sockets");
    + s1 = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
    + if (s1 == -1) {
    + log_warn("netlink", "unable to open netlink socket for changes");
    + goto error;
The comment says “Open Netlink socket for subscriptions”, but this socket is specifically used for change notifications (and the log/error strings call it “changes”). Consider updating the comment wording to match the actual role so future readers don’t confuse it with the query socket.
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
@r12f could you please help review?
Cherry-pick PR to msft-202412: Azure/sonic-buildimage-msft#1995
Cherry-pick PR to 202505: #25500 |
Fixes sonic-net#22376. Adds a new test case that verifies LLDP neighbors are correctly restored after config reload, catching regressions where lldpd fails to detect all interfaces during boot (e.g. the race condition fixed in sonic-net/sonic-buildimage#25436).
The test validates:
1. All LLDP neighbor entries present before reload are restored after
2. Chassis ID type is 'mac' (not hostname fallback from missing eth0)
3. All interfaces appear in 'lldpcli show interfaces'
4. Checks syslog for 'cannot find port' / lldpmgrd errors (warning only)
Signed-off-by: Rustiqly <[email protected]>
Signed-off-by: Sonicly1G <[email protected]>
Verify LLDP neighbors are fully restored after config reload, including:
- All pre-reload LLDP neighbors are present post-reload
- Neighbor names match pre-reload state
- Chassis ID type is MAC (not hostname)
- Chassis MAC matches management interface MAC
- lldpcli show interfaces lists all expected ports
- Syslog checked for 'cannot find port' errors
Addresses test gap issue sonic-net#22376. Related PR: sonic-net/sonic-buildimage#25436
Signed-off-by: Ying Xie <[email protected]>
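The two chassis checks in this list can be sketched as a small pure function. This is only an illustration, not code from the test: `check_chassis_id` and its arguments are hypothetical names, and the `"local"` type string used for the hostname fallback in the example is an assumption about how the failure shows up.

```python
def check_chassis_id(id_type, id_value, mgmt_mac):
    """Return a list of problems with the advertised chassis ID (empty if OK).

    id_type / id_value: chassis ID subtype and value as reported by LLDP
    (hypothetical inputs; how they are collected is up to the caller).
    mgmt_mac: MAC address of the management interface.
    """
    problems = []
    if id_type != "mac":
        # Hostname fallback: the chassis ID is not a MAC address at all.
        problems.append(f"chassis ID type is {id_type!r}, expected 'mac'")
    elif id_value.lower() != mgmt_mac.lower():
        # MAC addresses are compared case-insensitively.
        problems.append(f"chassis MAC {id_value} != management MAC {mgmt_mac}")
    return problems


ok = check_chassis_id("mac", "AA:BB:CC:00:11:22", "aa:bb:cc:00:11:22")
bad = check_chassis_id("local", "sonic-host", "aa:bb:cc:00:11:22")
print(ok)
print(bad)
```

Returning a list of problems rather than asserting directly lets the caller decide whether a finding fails the test or is only logged.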
What is the motivation for this PR?
Add a new test case to verify LLDP neighbors are fully restored after config reload. Addresses test gap issue #22376. Related fix PR: sonic-net/sonic-buildimage#25436
How did you do it?
Added test_lldp_after_config_reload to tests/lldp/test_lldp.py that:
1. Records LLDP neighbors before config reload
2. Performs config reload (safe_reload)
3. Waits for LLDP neighbors to be restored
4. Verifies all neighbors present with matching names
5. Verifies Chassis ID type is MAC
6. Verifies Chassis MAC matches management interface MAC
7. Checks lldpcli show interfaces and syslog for errors
How did you verify/test it?
lldp/test_lldp.py::test_lldp_after_config_reload[vlab-01-None] PASSED
1 passed, 83 warnings in 221.39s (0:03:41)
Tested on KVM testbed (T0, converged peers).
Signed-off-by: Ying Xie <[email protected]>
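Steps 1 and 4 above boil down to comparing two snapshots of the neighbor table. A minimal sketch of that comparison, assuming each snapshot is a plain dict from local port name to neighbor name (the helper name `diff_neighbors` and the sample data are made up for illustration and are not taken from the test):

```python
def diff_neighbors(before, after):
    """Compare pre-reload and post-reload LLDP neighbor snapshots.

    Both arguments map local port name -> neighbor system name.
    Returns (missing, changed): ports that lost their neighbor entry,
    and ports whose neighbor name differs after the reload.
    """
    missing = sorted(port for port in before if port not in after)
    changed = sorted(
        port for port in before
        if port in after and after[port] != before[port]
    )
    return missing, changed


# Illustrative snapshots only.
before = {"Ethernet0": "peer-a", "Ethernet4": "peer-b", "Ethernet8": "peer-c"}
after = {"Ethernet0": "peer-a", "Ethernet8": "peer-x"}

missing, changed = diff_neighbors(before, after)
print("missing:", missing)
print("changed:", changed)
```

In a real test this comparison would typically run inside a retry loop (step 3), since neighbors reappear gradually after the reload.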
@vmittal-msft could you please help approve the cherry-pick to 202511?
Why I did it
On the 5640 full-topology testbed, 457 interfaces (456 Ethernet + eth0) come up during system boot while lldpd is initializing.
During initialization, lldpd sends an RTM_GETLINK dump request to get all interfaces. While the dump is in progress, new interfaces are still coming up, and lldpd also subscribes to asynchronous netlink update notifications (levent_iface_subscribe).
Because queries and change notifications share the same socket, cfg->g_netlink->nl_socket, processing of the RTM_GETLINK dump (netlink_recv) is disturbed by the handling of newly arriving RTM_NEWLINK messages (netlink_change_cb).
As a result, 200+ interfaces are missing from the LLDP neighbor table. The 200+ interfaces that are present are those whose RTM_NEWLINK arrived after lldpd initialization.
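To make the idea behind the fix concrete, here is a toy model of the two-socket layout. It is not lldpd code and does not reproduce the actual failure: `ToyKernel`, `read_dump`, and `drain_changes` are invented for illustration, and each deque stands in for one netlink socket. The property it relies on is that a netlink socket only receives multicast notifications for groups it has joined, so a socket used purely for queries carries only replies.

```python
from collections import deque


class ToyKernel:
    """Toy stand-in for the kernel side, with one queue per socket."""

    def __init__(self, links):
        self.links = list(links)
        self.query_sock = deque()   # replies to our dump request only
        self.change_sock = deque()  # RTM_NEWLINK-style notifications only

    def start_dump(self):
        # Snapshot of the links known right now, terminated by a DONE marker.
        for name in self.links:
            self.query_sock.append(("dump", name))
        self.query_sock.append(("done", None))

    def link_appears(self, name):
        # A new interface comes up; only the change socket hears about it.
        self.links.append(name)
        self.change_sock.append(("newlink", name))


def read_dump(sock):
    """Read dump replies until the DONE marker."""
    names = []
    while sock:
        kind, name = sock.popleft()
        if kind == "done":
            break
        names.append(name)
    return names


def drain_changes(sock):
    """Consume all pending change notifications."""
    names = []
    while sock:
        _, name = sock.popleft()
        names.append(name)
    return names


kernel = ToyKernel(["eth0", "Ethernet0", "Ethernet4"])
kernel.start_dump()
kernel.link_appears("Ethernet8")  # arrives before the dump has been read

known = read_dump(kernel.query_sock) + drain_changes(kernel.change_sock)
print(known)
```

With separate queues the dump reader never has to tell replies and notifications apart, so the late interface is picked up from the change path without affecting the dump.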
Phenomenon:
1. Incorrect Chassis ID: the Chassis ID should be the MAC address, but the hostname is shown instead.
a. eth0 cannot be found, so the Chassis ID falls back to the hostname.
WARNING lldp#lldpcli[29]: cannot find port eth0
2. lldpcli config failure: the port is already up, but lldpd later cannot find it, so the port-up event is missed and the port never recovers. The symptom is that both sides are missing LLDP entries.
a. <11>2026-02-05T04:18:42.052245+00:00 ATL21-0101-0014-12BT0 ERR lldp#lldpmgrd[38]: Command failed '['lldpcli', 'configure', 'ports', 'Ethernet501', 'lldp', 'portidsubtype', 'local', 'etp63f', 'description', 'ATL210101580129:A1.PORT8']': 2026-02-05T04:18:42 [WARN/lldpctl] cannot find port Ethernet501#012 - command was failed 6 times, disabling retry
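Both symptoms leave a "cannot find port" message in syslog, so a quick way to list the affected ports is to scan for it. A minimal sketch, assuming the log lines look like the two quoted above (`missing_ports` is a hypothetical helper; `#012` is the escaped newline that syslog appends to the port name in the second message):

```python
import re

# Port name follows the message; stop at the escaped newline (#012),
# at whitespace, or at the end of the line.
CANNOT_FIND_PORT = re.compile(r"cannot find port (\S+?)(?:#012|\s|$)")


def missing_ports(lines):
    """Return the set of port names reported as 'cannot find port'."""
    ports = set()
    for line in lines:
        for match in CANNOT_FIND_PORT.finditer(line):
            ports.add(match.group(1))
    return ports


# Excerpts of the messages quoted in this description.
sample = [
    "WARNING lldp#lldpcli[29]: cannot find port eth0",
    "2026-02-05T04:18:42 [WARN/lldpctl] cannot find port Ethernet501#012 "
    "- command was failed 6 times, disabling retry",
]

print(sorted(missing_ports(sample)))
```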
Work item tracking
Microsoft ADO 36610002:
How I did it
SONiC currently uses lldpd 1.0.16.
There is a known issue in the lldpd community: In some cases lldpd cannot get all interfaces · Issue #611 · lldpd/lldpd
It has been fixed upstream, but the fix is not available in a 1.0.16 tag: daemon/netlink: use a different socket for changes and queries · lldpd/lldpd@88fe3fa
Add this commit as a new patch for SONiC to fix this issue.
How to verify it
"config reload" reproduces this issue easily on the 5640 full-topology testbed.
Run "config reload" and verify that all LLDP neighbors are up.
Signed-off-by: Zhaohui Sun <[email protected]>
Signed-off-by: Feng Pan <[email protected]>
Hi @ZhaohuiS is there a 202511 cherry-pick PR for this?
@tirupatihemanth Yes, we need it in 202511, @vmittal-msft please help approve this cherrypick.
Cherry-pick PR to 202511: #26011 |
…ncomplete scenarios (#22420)
What is the motivation for this PR?
To add 2 new test cases that check whether all LLDP neighbors exist and whether all interfaces appear in lldpcli show interfaces. We don't have a test case that covers the issue found in PR sonic-net/sonic-buildimage#25436. #22562 covers some scenarios, but not all of them, so I removed the test case from #22562 and added 2 test cases with more checks in this PR.
How did you do it?
1. test_lldp_interfaces
Purpose: Verify LLDP functionality across all interfaces without config reload
Key Features:
- Validates LLDP table completeness
- Verifies lldpcli interface discovery
- Checks lldpctl_facts consistency
- Validates chassis ID and capabilities
- Active syslog monitoring via loganalyzer
Test Steps:
- Recording Phase: Capture all interfaces from show interface status
- LLDP Table Verification: Compare LLDP table with interface status (admin up, no PortChannels)
- lldpcli Verification: Compare lldpcli output with interface status (all interfaces, no PortChannels)
- lldpctl_facts Verification: Compare lldpctl_facts with interface status (admin up, no PortChannels)
- Consistency Check: Verify all lldpctl_facts interfaces exist in lldpcli
- Chassis Verification: Validate chassis MAC address and capabilities (Bridge: on, Router: on, Wlan: off, Station: off)
2. test_lldp_interface_config_reload
Purpose: Verify LLDP functionality persists correctly after config reload
Key Features:
- Tests LLDP recovery after config reload
- Validates interface persistence
- Ensures neighbor rediscovery
- Monitors for LLDP-specific errors while ignoring expected reload errors
- Addresses the core issue from GitHub issue "[test gap] check lldp neighbors after config reload" #22376
Test Steps:
- Pre-Reload Recording: Capture all interfaces before config reload
- Config Reload: Perform safe config reload with interface checks
- Stabilization Wait: Wait for critical services and LLDP convergence
- LLDP Table Verification: Compare LLDP table with pre-reload interfaces (admin up, no PortChannels)
- lldpcli Verification: Compare lldpcli output with pre-reload interfaces (all interfaces, no PortChannels)
- lldpctl_facts Verification: Compare lldpctl_facts with pre-reload interfaces (admin up, no PortChannels)
- Consistency Check: Verify all lldpctl_facts interfaces exist in lldpcli
- Chassis Verification: Validate chassis MAC and capabilities remain correct
How did you verify/test it?
Run it on testbed.
Signed-off-by: Zhaohui Sun <[email protected]>
Co-authored-by: Copilot <[email protected]>
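The verification steps in both tests apply the same filter to the interface list before comparing it with each LLDP view. A rough sketch of that filtering and of the consistency check, assuming interface status is available as a dict of name to `{"admin": "up" | "down"}` (this shape, the helper names, and the sample data are assumptions for illustration, not the structures used in sonic-mgmt):

```python
def expected_ports(intf_status, admin_up_only):
    """Ports that an LLDP view is expected to contain.

    PortChannels are always skipped; admin-down ports are skipped only
    for views that list admin-up ports (LLDP table, lldpctl_facts).
    """
    ports = set()
    for name, info in intf_status.items():
        if name.startswith("PortChannel"):
            continue
        if admin_up_only and info.get("admin") != "up":
            continue
        ports.add(name)
    return ports


def missing_from(expected, observed):
    """Sorted list of expected ports that the observed view lacks."""
    return sorted(set(expected) - set(observed))


# Illustrative data only.
status = {
    "Ethernet0": {"admin": "up"},
    "Ethernet4": {"admin": "down"},
    "PortChannel101": {"admin": "up"},
}
lldp_table = {"Ethernet0"}
lldpcli_interfaces = {"Ethernet0"}
lldpctl_ports = {"Ethernet0"}

print(missing_from(expected_ports(status, True), lldp_table))
print(missing_from(expected_ports(status, False), lldpcli_interfaces))
# Consistency check: every lldpctl_facts port must also be known to lldpcli.
print(missing_from(lldpctl_ports, lldpcli_interfaces))
```

Here the second comparison flags Ethernet4: lldpcli is expected to list every interface, including admin-down ones, which is the gap the original bug exposed.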