Connect to teamd before adding the lag to STATE_DB #3984
prsunny merged 16 commits into sonic-net:master
Conversation
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Pull Request Overview
This PR addresses a race condition where teamd might create a port channel interface and then exit due to failure before fully initializing, causing dependent applications to think configuration is complete when it's not. The fix adds a verification step to connect to teamd and verify it's running before adding the LAG entry to STATE_DB.
Key changes:
- Added teamdctl connection verification before STATE_DB entry creation
- Included libteamdctl library dependency for daemon communication
- Reorganized error logging to occur before retry check
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| teamsyncd/teamsync.h | Added teamdctl.h header for teamd daemon communication |
| teamsyncd/teamsync.cpp | Added teamdctl connection and verification logic to ensure teamd is responsive before STATE_DB updates |
| teamsyncd/Makefile.am | Added -lteamdctl library dependency to link against libteamdctl |
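As an illustration of the Makefile.am change, the link line would gain the new library roughly like this (a hypothetical excerpt; the actual variable name and the other flags in sonic-swss may differ):

```makefile
# Illustrative only: link teamsyncd against libteamdctl in addition to libteam
teamsyncd_LDADD = -lteam -lteamdctl -lswsscommon
```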
```cpp
#include <team.h>
#include <teamdctl.h>
```
Maintainability: These includes are already present in teamsync.h (lines 12-13), which is included on line 13. The duplicate includes are redundant and should be removed.
```cpp
#include <team.h>
#include <teamdctl.h>
```
This is a stylistic choice to explicitly include what I'm using in this file, so that if the header files change, this file doesn't require changing.
```cpp
}

char *response;
err = teamdctl_config_get_raw_direct(m_teamdctl, &response);
```
Bug: Memory leak - the response pointer returned by teamdctl_config_get_raw_direct is not freed. According to libteamdctl documentation, the response needs to be freed using teamdctl_config_get_raw_direct_free(response) or free(response) to prevent memory leaks.
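A minimal, self-contained sketch of the fix pattern this comment suggests. `get_config_raw` below is a hypothetical stand-in for `teamdctl_config_get_raw_direct` (which hands back a heap-allocated string through an out-parameter); the point is that the caller copies the data out and then frees the buffer exactly once:

```cpp
#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical stand-in for teamdctl_config_get_raw_direct(): returns 0 on
// success and hands back a heap-allocated string that the caller must free.
static int get_config_raw(char **response)
{
    *response = strdup("{\"device\": \"PortChannel0001\"}");
    return *response ? 0 : -1;
}

// Copy the response into a std::string, then free the C buffer so the
// allocation does not leak.
static std::string fetch_config()
{
    char *response = nullptr;
    if (get_config_raw(&response) != 0)
    {
        return "";
    }
    std::string config(response);
    free(response);  // release the buffer returned via the out-parameter
    return config;
}
```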
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
…5/sonic-swss into fix-teamdsyncd-race-condition
```cpp
    "Unable to register port change event");
}

struct teamdctl *m_teamdctl = teamdctl_alloc();
```
If teamd exits, shouldn't we get a netlink event for the interface going down, since we registered for netlink events earlier in teamsyncd?
netlink.registerGroup(RTNLGRP_LINK);
We will, but by the time we get the netlink event, portsorch may have already started processing the port channel interface creation.
…dsyncd-race-condition
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
…5/sonic-swss into fix-teamdsyncd-race-condition
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@judyjoseph Please re-review.
Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@judyjoseph, could you sign off?
What I did
It's possible for teamd to start up, create the port channel interface in the kernel, and then later exit because some failure condition was hit, or initialization took too long (and so the parent process killed it), or something else. When this happens, teamsyncd will get a notification from the kernel saying that the port channel interface has been created, and will start to add it into STATE_DB. If this happens before teamd goes down (and the port channel interface gets removed from the kernel), then anything depending on that STATE_DB entry will begin its processing, not realizing that that interface will get removed. This will result in dependent applications thinking everything has been processed and configs have been applied, but they haven't really been applied. (In the case of LAGs, this will be intfmgrd adding the IP address to the interface.)
This is functionally a race condition between teamd creating and then deleting the interface (due to a failure condition), teamsyncd acting too fast, and dependent applications assuming all setup is complete. This race condition is more visible on weaker systems.
Therefore, to try to prevent it, before adding an entry in STATE_DB, make sure that teamsyncd can get information from the kernel about the port channel interface, and then directly connect to teamd and make sure that the teamd is processing requests. If both of these succeed, then it can be assumed that all setup is done, and that teamd won't be exiting soon.
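The check described above can be sketched as a small retry loop (simplified and self-contained; `teamd_alive` stands in for the actual teamdctl connect-and-query calls, which are not reproduced here, and the function name is illustrative, not the PR's actual code):

```cpp
#include <functional>

// Before publishing the LAG to STATE_DB, probe teamd up to max_attempts
// times and only report success once a probe answers. In the real daemon
// the probe would be a teamdctl connection plus a config query; here it is
// an injected callable so the control flow can be shown in isolation.
static bool verify_teamd_before_publish(const std::function<bool()> &teamd_alive,
                                        int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; ++attempt)
    {
        if (teamd_alive())
        {
            return true;  // teamd is responsive; safe to write STATE_DB
        }
    }
    return false;  // teamd never answered; skip the STATE_DB entry
}
```

If the probe never succeeds, teamsyncd simply refrains from writing the entry, so dependent applications such as intfmgrd never see a LAG that is about to disappear.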
Why I did it
How I verified it
Performed several config reloads on weaker hardware to make sure there are no cases of the LAG entry being added to STATE_DB and then teamd exiting afterwards.
Details if related