Connect to teamd before adding the lag to STATE_DB #3984
prsunny merged 16 commits into sonic-net:master
Conversation
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Pull Request Overview
This PR addresses a race condition where teamd might create a port channel interface and then exit due to failure before fully initializing, causing dependent applications to think configuration is complete when it's not. The fix adds a verification step to connect to teamd and verify it's running before adding the LAG entry to STATE_DB.
Key changes:
- Added teamdctl connection verification before STATE_DB entry creation
- Included libteamdctl library dependency for daemon communication
- Reorganized error logging to occur before retry check
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| teamsyncd/teamsync.h | Added teamdctl.h header for teamd daemon communication |
| teamsyncd/teamsync.cpp | Added teamdctl connection and verification logic to ensure teamd is responsive before STATE_DB updates |
| teamsyncd/Makefile.am | Added -lteamdctl library dependency to link against libteamdctl |
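As an illustration of the Makefile.am change, the link line would gain the new library roughly like this (a hypothetical excerpt; the actual variable name and the other flags in sonic-swss may differ):

```makefile
# Illustrative only: link teamsyncd against libteamdctl in addition to libteam
teamsyncd_LDADD = -lteam -lteamdctl -lswsscommon
```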
```cpp
#include <team.h>
#include <teamdctl.h>
```
Maintainability: These includes are already present in teamsync.h (lines 12-13), which is included on line 13. The duplicate includes are redundant and should be removed.
```cpp
#include <team.h>
#include <teamdctl.h>
```
This is a stylistic choice to explicitly include what I'm using in this file, so that if the header files change, this file doesn't require changing.
```cpp
}

char *response;
err = teamdctl_config_get_raw_direct(m_teamdctl, &response);
```
Bug: Memory leak - the response pointer returned by teamdctl_config_get_raw_direct is not freed. According to libteamdctl documentation, the response needs to be freed using teamdctl_config_get_raw_direct_free(response) or free(response) to prevent memory leaks.
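A minimal, self-contained sketch of the fix pattern this comment suggests. `get_config_raw` below is a hypothetical stand-in for `teamdctl_config_get_raw_direct` (which hands back a heap-allocated string through an out-parameter); the point is that the caller copies the data out and then frees the buffer exactly once:

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical stand-in for teamdctl_config_get_raw_direct(): returns 0 on
// success and hands back a heap-allocated string that the caller must free.
static int get_config_raw(char **response)
{
    *response = strdup("{\"device\": \"PortChannel0001\"}");
    return *response ? 0 : -1;
}

// Copy the response into a std::string, then free the C buffer so the
// allocation does not leak.
static std::string fetch_config()
{
    char *response = nullptr;
    if (get_config_raw(&response) != 0)
    {
        return "";
    }
    std::string config(response);
    free(response);  // release the buffer returned via the out-parameter
    return config;
}
```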
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
…5/sonic-swss into fix-teamdsyncd-race-condition
```cpp
    "Unable to register port change event");
}

struct teamdctl *m_teamdctl = teamdctl_alloc();
```
If teamd exits, shouldn't we get a netlink event for the interface going down, since we registered for netlink events earlier in teamsyncd?
netlink.registerGroup(RTNLGRP_LINK);
We will, but by the time we get the netlink event, portsorch may have already started processing the port channel interface creation.
…dsyncd-race-condition
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
…5/sonic-swss into fix-teamdsyncd-race-condition
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@judyjoseph Please re-review.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@judyjoseph, could you sign off?
What I did
It's possible for teamd to start up, create the port channel interface in the kernel, and then later exit because some failure condition was hit, or initialization took too long (and so the parent process killed it), or something else. When this happens, teamsyncd will get a notification from the kernel saying that the port channel interface has been created, and will start to add it into STATE_DB. If this happens before teamd goes down (and the port channel interface gets removed from the kernel), then anything depending on that STATE_DB entry will begin its processing, not realizing that that interface will get removed. This will result in dependent applications thinking everything has been processed and configs have been applied, but they haven't really been applied. (In the case of LAGs, this will be intfmgrd adding the IP address to the interface.)
This is functionally a race condition between teamd creating and then deleting the interface (due to a failure condition), teamsyncd acting too fast, and dependent applications assuming all setup is complete. This race condition is more visible on weaker systems.
Therefore, to try to prevent it, before adding an entry in STATE_DB, make sure that teamsyncd can get information from the kernel about the port channel interface, and then directly connect to teamd and make sure that the teamd is processing requests. If both of these succeed, then it can be assumed that all setup is done, and that teamd won't be exiting soon.
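The check described above can be sketched as a small retry loop (simplified and self-contained; `teamd_alive` stands in for the actual teamdctl connect-and-query calls, which are not reproduced here, and the function name is illustrative, not the PR's actual code):

```cpp
#include <functional>

// Before publishing the LAG to STATE_DB, probe teamd up to max_attempts
// times and only report success once a probe answers. In the real daemon
// the probe would be a teamdctl connection plus a config query; here it is
// an injected callable so the control flow can be shown in isolation.
static bool verify_teamd_before_publish(const std::function<bool()> &teamd_alive,
                                        int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; ++attempt)
    {
        if (teamd_alive())
        {
            return true;  // teamd is responsive; safe to write STATE_DB
        }
    }
    return false;  // teamd never answered; skip the STATE_DB entry
}
```

If the probe never succeeds, teamsyncd simply refrains from writing the entry, so dependent applications such as intfmgrd never see a LAG that is about to disappear.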
Why I did it
How I verified it
Performed several config reloads on weaker hardware to make sure there are no cases of the LAG entry being added to STATE_DB and then teamd exiting afterwards.
Details if related