
Add patch to teamd to exit immediately if signalling the parent fails #23615

Open
saiarcot895 wants to merge 2 commits into sonic-net:master from saiarcot895:master-fix-teamsyncd-race-condition

@saiarcot895
Contributor

@saiarcot895 saiarcot895 commented Aug 6, 2025

Why I did it

When teamd starts on a weaker or overloaded system, it may take a long time to fully initialize. When that happens, teammgrd might get a non-zero exit code from the parent teamd process (because the parent has not received a success signal from the child teamd process in time) and begin restarting teamd for that port channel. Meanwhile, the child teamd process might finish initialization, send a success signal, and stay up and process other requests. This can cause race conditions if other applications start their configuration because they think that teamd is up.
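
A minimal sketch of why the late success signal can fail. It assumes the startup status travels from child to parent over a pipe and that the parent closes its end once it gives up; that mechanism is an assumption for illustration and is not stated in this PR. The function name `send_status_after_parent_gave_up` is hypothetical.

```c
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/* Hypothetical demo: returns 0 if the status write succeeds, or a negative
 * errno value if the "parent" has already closed its read end. */
static int send_status_after_parent_gave_up(int parent_gave_up)
{
	int fds[2];
	int status = 0;
	int ret = 0;

	if (pipe(fds) < 0)
		return -errno;

	/* Ignore SIGPIPE so write() reports EPIPE instead of killing the process. */
	signal(SIGPIPE, SIG_IGN);

	/* Simulate the parent timing out and closing its end of the pipe. */
	if (parent_gave_up)
		close(fds[0]);

	/* The "child" reports success only now, after the parent's decision. */
	if (write(fds[1], &status, sizeof(status)) < 0)
		ret = -errno;

	if (!parent_gave_up)
		close(fds[0]);
	close(fds[1]);
	return ret;
}
```

With a reader still attached the write succeeds; once the reader is gone it fails, which is the condition the patch treats as "the parent already reported failure".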

This is related to sonic-net/sonic-swss#3984.

Work item tracking
  • Microsoft ADO (number only):

How I did it

Avoid the above issue by having the child teamd process exit immediately if it fails to signal the parent process that it has completed initialization. Such a failure suggests that initialization took too long and that a failure result was already reported to teammgrd before this point.
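
The control flow described above can be sketched roughly as follows. This is not the patch itself: `start_after_init` and the toy callbacks are hypothetical stand-ins, used here in place of the real `daemon_retval_send(0)` and `teamd_run_loop_run(ctx)` calls inside `teamd_start()`.

```c
#include <stdio.h>

/* Hypothetical stand-ins for illustration only. */
typedef int (*notify_parent_fn)(void);
typedef int (*run_loop_fn)(void);

/* Notify the parent first; enter the main loop only if that succeeded. */
static int start_after_init(notify_parent_fn notify_parent, run_loop_fn run_loop)
{
	int err = notify_parent();

	if (err) {
		/* The parent is assumed to have given up already, so do not stay up. */
		fprintf(stderr, "unable to notify parent, exiting\n");
		return err;
	}
	return run_loop();
}

/* Toy callbacks used to exercise both paths. */
static int loop_runs;
static int notify_ok(void)   { return 0; }
static int notify_fail(void) { return -1; }
static int toy_loop(void)    { loop_runs++; return 0; }
```

Calling `start_after_init(notify_fail, toy_loop)` returns the error without ever invoking the loop, whereas `start_after_init(notify_ok, toy_loop)` runs it once.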

How to verify it

Which release branch to backport (provide reason below if selected)

  • 202205
  • 202211
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@saiarcot895 saiarcot895 marked this pull request as ready for review February 10, 2026 00:20
@saiarcot895 saiarcot895 requested a review from lguohan as a code owner February 10, 2026 00:20
Copilot AI review requested due to automatic review settings February 10, 2026 00:20
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

This PR adds a SONiC-specific libteam/teamd patch to prevent a race during slow teamd initialization by ensuring the child process does not continue running if it can’t signal successful startup to its parent.

Changes:

  • Add a new libteam patch to skip entering the processing loop when daemon_retval_send(0) fails.
  • Register the new patch in the libteam patch series so it is applied during build.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/libteam/patch/series | Adds the new patch to the quilt series list. |
| src/libteam/patch/0018-exit-on-failure-to-signal.patch | Updates teamd_start() to avoid running the main loop if notifying the parent fails. |


When the teamd daemon initialization completes, if signalling the parent
process fails, then immediately exit. For SONiC's purposes, if the
signalling fails, then teamdmgr will assume it to have failed and will

Copilot AI Feb 10, 2026

The patch description refers to "teamdmgr" but the SONiC daemon appears to be named "teammgrd". Please fix the name here to avoid confusion when someone reads this patch header later.

Suggested change
- signalling fails, then teamdmgr will assume it to have failed and will
+ signalling fails, then teammgrd will assume it to have failed and will

Comment on lines +31 to +36
-	err = teamd_run_loop_run(ctx);
+	err = daemon_retval_send(0);
+	if (err) {
+		teamd_log_err("unable to send response, likely timed out.");
+	} else {
+		teamd_log_info(PACKAGE_VERSION" successfully started.");

Copilot AI Feb 10, 2026

When daemon_retval_send(0) fails, this logs a speculative message ("likely timed out") and leaves err as the raw return value (often -1), which loses the underlying errno and may not match the rest of teamd_start's negative-errno error convention. Consider logging the actual error (e.g., including errno/strerror) and translating the return to a meaningful negative errno value so callers/exit handling get a consistent error code.
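
One way the suggestion above might look, as a sketch only. `retval_to_neg_errno` is a hypothetical helper that is not part of teamd or libdaemon. It assumes the failing call leaves a meaningful `errno` that the caller saves immediately, and it uses `EIO` as an arbitrary fallback when none was recorded.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: map a "non-zero return with errno set" result onto a
 * negative errno code and log the underlying reason. */
static int retval_to_neg_errno(int ret, int saved_errno)
{
	if (ret == 0)
		return 0;
	if (saved_errno == 0)
		saved_errno = EIO; /* assumed fallback when no errno was recorded */
	fprintf(stderr, "unable to send response to parent: %s\n",
		strerror(saved_errno));
	return -saved_errno;
}
```

A caller would capture `errno` right after the failing call, for example `err = retval_to_neg_errno(ret, errno);`, so that later library calls cannot overwrite it before it is logged.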

@saiarcot895
Contributor Author

@judyjoseph Please review.

Contributor

@judyjoseph judyjoseph left a comment

LGTM

@judyjoseph
Contributor

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@saiarcot895
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
