[nbrmgr]: Reconcile pending NEIGH_RESOLVE_TABLE entries on restart by prsunny · Pull Request #14 · prsunny/sonic-swss

prsunny · 2026-03-17T00:10:14Z

On nbrmgr restart, pre-existing entries in NEIGH_RESOLVE_TABLE were not being processed because ConsumerStateTable only delivers entries written after subscription. Added reconciliation logic that scans the table at startup and resolves any pending neighbor entries.

Changes:

cfgmgr/nbrmgr.cpp/h: Added reconcileNeighResolveTable() called from constructor
tests/mock_tests/nbrmgrd/nbrmgr_ut.cpp: Unit tests for empty, IPv4, and IPv6 reconciliation
tests/mock_tests/Makefile.am: Added tests_nbrmgrd test binary

…ction. Signed-off-by: Bibhuprasad Singh <bibhuprasad@google.com>

…branch3 [P4Orch] Add P4MulticastRouterInterfaceEntry that use the no_action action.

On nbrmgr restart, pre-existing entries in NEIGH_RESOLVE_TABLE were not being processed because ConsumerStateTable only delivers entries written after subscription. Added reconciliation logic that scans the table at startup and resolves any pending neighbor entries. - Added reconcileNeighResolveTable() method called from constructor - Added unit tests for empty, IPv4, and IPv6 reconciliation scenarios Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Prince Sunny <prince.sunny@microsoft.com>

…nic-net#4340) The test_sub_port_intf tests assume there is exactly one virtual router (the default VRF) in ASIC_DB at startup. This assumption breaks when other features (VRFs, VNETs, loopback-action, or prior tests) create additional virtual routers, causing get_default_vrf_oid() to fail with: AssertionError: Wrong # of default vrfs: 2, expected #: 1. This has been an open issue since July 2022 (sonic-net#2365) and continues to block unrelated PRs across repos. Fix: - Query SAI_SWITCH_ATTR_DEFAULT_VIRTUAL_ROUTER_ID from the switch object in ASIC_DB to reliably identify the default VRF, regardless of how many virtual routers exist. - Track initial_vrf_count at setup and use relative counts for wait_for_n_keys assertions instead of hardcoded 1/2 values. Fixes: sonic-net#2365 Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>

What I did It's possible for teamd to start up, create the port channel interface in the kernel, and then later exit because some failure condition was hit, or initialization took too long (and so the parent process killed it), or something else. When this happens, teamsyncd will get a notification from the kernel saying that the port channel interface has been created, and will start to add it into STATE_DB. If this happens before teamd goes down (and the port channel interface gets removed from the kernel), then anything depending on that STATE_DB entry will begin its processing, not realizing that that interface will get removed. This will result in dependent applications thinking everything has been processed and configs have been applied, but they haven't really been applied. (In the case of LAGs, this will be intfmgrd adding the IP address to the interface.) This is functionally a race condition between teamd creating and then deleting the interface (due to a failure condition), teamsyncd acting too fast, and dependent applications assuming all setup is complete. This race condition is more visible on weaker systems. Therefore, to try to prevent it, before adding an entry in STATE_DB, make sure that teamsyncd can get information from the kernel about the port channel interface, and then directly connect to teamd and make sure that the teamd is processing requests. If both of these succeed, then it can be assumed that all setup is done, and that teamd won't be exiting soon

* [swsconfig] Add custom ZMQ endpoint Signed-off-by: Vivek Reddy <vkarri@nvidia.com> * Use DB based on endpoint Signed-off-by: Vivek Reddy <vkarri@nvidia.com> --------- Signed-off-by: Vivek Reddy <vkarri@nvidia.com> Co-authored-by: Liat Grozovik <44433539+liat-grozovik@users.noreply.github.com> Co-authored-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>

…onic-net#4015) What I did This PR updates bufferorch logic for VOQ-based systems to: Stop performing per-queue port reference counting when applying buffer queue configurations on VOQ switches (system ports are static and do not require dynamic ref tracking). Correct the appl DB key transformation for VOQ buffer ready list initialization to use the standard delimiter consistently instead of mixing config_db_key_delimiter and delimiter. Why I did it To avoid ref count issues when there is port speed change

- Added tests for invalid key format and setNeighbor failure paths - Fixed key validation to check for ':' delimiter directly - Added nlmsg_alloc wrap to simulate netlink allocation failure Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Prince Sunny <prince.sunny@microsoft.com>

What I did Increase VRF pool size from 999 to 4096 to support hardware platforms with higher VRF scale capabilities. Why I did it Current VRF pool is limited to 999 instances (VRF_TABLE_END=2000, VRF_TABLE_START=1001) which artificially restricts VRF scale even on hardware platforms that support more VRFs.

What I did advance to the next netlink message using 4-byte alignment instead of raw nlmsg_len keep payload extraction bounded by the original message length (exclude padding) add a unit test that covers multiple messages in one recv buffer where the first message requires netlink padding Why Linux netlink messages are 4-byte aligned, and the canonical NLMSG_NEXT() macro advances by NLMSG_ALIGN(nlmsg_len). Advancing by raw nlmsg_len can misread the next header when a recv buffer contains multiple messages and one message length is not already aligned.

…SAI instead (sonic-net#4304) What I did Removed the IPFIX template buffer size estimation logic in HFTelProfile::updateTemplates() and replaced it with the standard SAI two-phase query pattern: first query the required size with count=0 / list=nullptr, then allocate and fetch. Why I did it The previous estimation logic used hardcoded constants to predict the IPFIX template buffer size, but could underestimate (e.g., estimated 65535 bytes vs. actual 119352 bytes). This caused the SDK to log ERR-level messages on the first attempt: ERR syncd#SDK: [SAI_TAM.ERR] mlnx_generate_ipfix_templates: Buffer size is too small to hold IPFIX template [size:65535, required:119352]. ERR syncd#SDK: [SAI_TAM.ERR] mlnx_tam_tel_type_get_ipfix_templates: Failed to generate IPFIX templates. ERR syncd#SDK: [SAI_TAM.ERR] mlnx_tam_tel_type_attrib_get: Failed to get attribute. ERR syncd#SDK: [SAI_UTILS.ERR] get_dispatch_attribs_handler: Failed Get #0, IPFIX_TEMPLATES, key:TAM_TEL_TYPE Although the retry with SAI_STATUS_BUFFER_OVERFLOW worked correctly, these error logs caused test_hft_full_queue_counters to fail in LogAnalyzer teardown. By always querying the size first (count=0, list=nullptr), we avoid the unreliable estimation entirely and follow the idiomatic SAI pattern for variable-length attributes.

…#4041) What I did Expand/Double SAI_REDIS_SYNC_OPERATION_RESPONSE_TIMEOUT for armhf Nokia-7215 Why I did it Fix for issue sonic-net/sonic-buildimage#24765

- Wrap reconciliation loop body in try-catch for invalid IP handling - Use getTableNameSeparator() instead of hardcoded ':' Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Prince Sunny <prince.sunny@microsoft.com>

) This PR follows up on prior review feedback by refactoring serdes (signal integrity) programming in PortsOrch::doPortTask() to reuse a single helper for ASIC and gearbox cases, and by fixing a unit-test SAI stub to correctly clear port→serdes state on serdes removal. Changes: Refactor serdes programming into a local helper lambda to share admin-down + programming logic across ASIC and gearbox line/system sides. Update the unit-test SAI remove_port_serdes stub to erase any port→serdes mappings referencing the removed serdes OID.

…okup (sonic-net#4347) What I did Added a missing end() check before dereferencing a find() result in VNetRouteOrch::doRouteTask(). Why I did it syncd_tunnel_routes_[vnet].find(ipPrefix) is used without checking the iterator against end(). If the route is absent from syncd_tunnel_routes_ while custom_monitor_ep_updated is true (e.g. due to a race between route deletion and monitor update), the unguarded dereference of it_route->second causes undefined behavior and can crash orchagent.

Move -Wl,-wrap flags from CXXFLAGS to LDFLAGS to prevent compilation issues. These are linker flags and should not be passed to the compiler. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Prince Sunny <prince.sunny@microsoft.com>

[P4Orch] Add P4MulticastRouterInterfaceEntry that use the no_action a…

9923d9c

…ction. Signed-off-by: Bibhuprasad Singh <bibhuprasad@google.com>

prsunny force-pushed the nbrmgr-neigh-resolve-reconcile branch from 94694c0 to 3ddfebe Compare March 17, 2026 00:20

StephenWangGoogle and others added 2 commits March 16, 2026 18:14

Merge pull request sonic-net#4230 from bibhuprasad-hcl/bibhu_swss_p4_…

f7ea4a8

…branch3 [P4Orch] Add P4MulticastRouterInterfaceEntry that use the no_action action.

prsunny force-pushed the nbrmgr-neigh-resolve-reconcile branch from 3ddfebe to e662bac Compare March 17, 2026 01:15

rustiqly and others added 6 commits March 17, 2026 09:11

Merge branch 'master' into nbrmgr-neigh-resolve-reconcile

49abb41

prsunny force-pushed the nbrmgr-neigh-resolve-reconcile branch from 5c08a0f to ea9bb6d Compare March 18, 2026 20:11

prsunny and others added 6 commits March 18, 2026 13:14

Merge branch 'master' into nbrmgr-neigh-resolve-reconcile

0339cbd

SAI_REDIS_SYNC_OPERATION_RESPONSE_TIMEOUT expand for armhf (sonic-net…

c20ded7

…#4041) What I did Expand/Double SAI_REDIS_SYNC_OPERATION_RESPONSE_TIMEOUT for armhf Nokia-7215 Why I did it Fix for issue sonic-net/sonic-buildimage#24765

Fix makefile duplicate entries

70c60b7

prsunny force-pushed the nbrmgr-neigh-resolve-reconcile branch from 27741f3 to 70c60b7 Compare March 19, 2026 17:18

prsunny and others added 7 commits March 19, 2026 18:35

Merge branch 'master' into nbrmgr-neigh-resolve-reconcile

64f8b47

Merge branch 'master' into nbrmgr-neigh-resolve-reconcile

6a90f14

Fix unit tests, revert CXXFLAGS

102cfa9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[nbrmgr]: Reconcile pending NEIGH_RESOLVE_TABLE entries on restart#14

[nbrmgr]: Reconcile pending NEIGH_RESOLVE_TABLE entries on restart#14
prsunny wants to merge 22 commits intomasterfrom
nbrmgr-neigh-resolve-reconcile

prsunny commented Mar 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants

Conversation

prsunny commented Mar 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

12 participants