[swss] Restart countersyncd after unexpected exit by Pterosaur · Pull Request #26147 · sonic-net/sonic-buildimage

Pterosaur · 2026-03-12T13:15:11Z

Why I did it

swss:countersyncd is currently started through dependent-startup, but its supervisor stanza uses autorestart=false.
As a result, when countersyncd exits unexpectedly, supervisor leaves it in EXITED instead of recovering it.
That behavior was reproduced on str4-sn5640-2, where HFT telemetry validation could pass while countersyncd still stayed down afterward.

Work item tracking

Microsoft ADO (number only): N/A

How I did it

- change countersyncd from autorestart=false to autorestart=unexpected
- add startsecs=10
- add startretries=3

This keeps the existing dependent-startup flow, allows supervisor to recover countersyncd after unexpected exits, and gives startretries a meaningful guardrail for short-lived crash loops before a process is considered stably started.
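
The interaction of the three settings can be sketched as a small decision function. This is a simplified model of documented supervisord behavior, not code from this PR: the `Policy` class and `on_exit` function are hypothetical names, and the default `exitcodes=(0,)` is an assumption (it is the supervisor 4.x default; the actual stanza may differ).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical container for the supervisor settings discussed above."""
    autorestart: str = "unexpected"  # "false", "unexpected", or "true"
    startsecs: int = 10              # seconds a process must stay up to count as started
    exitcodes: tuple = (0,)          # assumption: exit codes treated as "expected"


def on_exit(policy: Policy, uptime_secs: float, exit_code: int) -> str:
    """Return the state this simplified model expects after the process exits."""
    if uptime_secs < policy.startsecs:
        # The process never reached RUNNING, so the exit is a failed start.
        # Failed starts are retried and counted against startretries,
        # independently of the autorestart setting.
        return "BACKOFF"
    # The process had reached RUNNING, so autorestart decides what happens.
    if policy.autorestart == "true":
        return "RESTART"
    if policy.autorestart == "unexpected" and exit_code not in policy.exitcodes:
        return "RESTART"
    return "EXITED"
```

Under this model, `on_exit(Policy(autorestart="false"), 60, 1)` gives `"EXITED"` (the old behavior), while `on_exit(Policy(), 60, 1)` gives `"RESTART"` and `on_exit(Policy(), 2, 1)` gives `"BACKOFF"`, which is the case that startretries bounds.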

How to verify it

Apply the countersyncd supervisor stanza change on the DUT swss container.

Run:

./run_tests.sh -u -n vms70-t0-sn5640-2 -i ../ansible/str4,../ansible/veos -l info -m individual -e --skip_sanity -e --disable_loganalyzer -c "high_frequency_telemetry/test_high_frequency_telemetry.py::test_hft_port_counters"

Confirm:
- HFT telemetry validation still passes
- docker exec swss supervisorctl status countersyncd reports RUNNING after the test
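
The second check can be scripted. The sketch below runs the same `docker exec swss supervisorctl status countersyncd` command and extracts the state column. The function names are hypothetical, and the parser assumes the usual `supervisorctl status` layout of `<name> <STATE> <details>` on one line.

```python
import subprocess


def parse_state(status_line: str) -> str:
    """Extract the state (e.g. RUNNING, EXITED) from one supervisorctl status line."""
    parts = status_line.split()
    if len(parts) < 2:
        raise ValueError(f"unrecognized status line: {status_line!r}")
    return parts[1]


def countersyncd_state() -> str:
    """Query supervisor inside the swss container; intended to run on the DUT."""
    # check=False: supervisorctl may return a non-zero exit status when the
    # process is not RUNNING, and we still want to read the reported state.
    result = subprocess.run(
        ["docker", "exec", "swss", "supervisorctl", "status", "countersyncd"],
        capture_output=True,
        text=True,
        check=False,
    )
    lines = result.stdout.strip().splitlines()
    if not lines:
        raise RuntimeError(f"no status output: {result.stderr.strip()}")
    return parse_state(lines[0])
```

On the DUT, calling `countersyncd_state()` after the test run and comparing the result with `"RUNNING"` covers the second confirmation step.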

Observed result from live DUT validation:

DUT: str4-sn5640-2
Platform: x86_64-nvidia_sn5640-r0 / Mellanox-SN5640-C512S2
Image: 20241212.54
Test result: 1 passed

Which release branch to backport (provide reason below if selected)

Backport reason:

This is a bug fix for service recovery behavior in swss.
The same PR is also requested for msft-202412 via label because unexpected countersyncd exits leave the supervisor-managed service down until manual intervention.

Tested branch (Please provide the tested image version)

20241212.54 (live-patched DUT validation on str4-sn5640-2)

Description for the changelog

Restart countersyncd automatically after unexpected exits in swss.

Link to config_db schema for YANG module changes

N/A

A picture of a cute animal (not mandatory but encouraged)

🦦

Signed-off-by: Ze Gan <ganze718@gmail.com>

Copilot

Pull request overview

Adjusts the swss container’s supervisor configuration so countersyncd can recover automatically after unexpected exits, improving telemetry robustness without changing the dependent-startup ordering.

Changes:

- Switch countersyncd from autorestart=false to autorestart=unexpected
- Add startsecs=1 and startretries=3 to constrain restart behavior for immediate start failures

dockers/docker-orchagent/supervisord.conf.j2

Signed-off-by: Ze Gan <ganze718@gmail.com>

banidoru

Clean, well-scoped fix. The three supervisor settings work correctly together:

autorestart=unexpected recovers countersyncd after non-zero exits without interfering with intentional stops.
startsecs=10 ensures short-lived crash loops consume startretries rather than being treated as successful-then-crashed (which would restart indefinitely).
startretries=3 caps recovery attempts before supervisor gives up, preventing infinite restart loops.

The earlier feedback on startsecs has been addressed. No concerns with this change.

banidoru

Clean, minimal fix. The three supervisor knobs (autorestart=unexpected, startsecs=10, startretries=3) work correctly together: crashes within the first 10 s consume a retry (capped at 3), while crashes after 10 s trigger an immediate restart without burning retries. This matches the stated goal of recovering from unexpected exits without infinite crash loops. The prior review feedback on startsecs has been addressed. No concerns.

banidoru

Clean, minimal fix. Changing autorestart=false → autorestart=unexpected correctly allows supervisord to recover countersyncd after unexpected exits while still respecting intentional stops. startsecs=10 (updated from the original 1s per review feedback) ensures that short-lived crash loops consume the 3 retries before supervisor gives up, preventing infinite restart storms. No concerns with correctness, security, or design.

banidoru

All reviewers approved. LGTM.

Pterosaur · 2026-03-12T22:33:05Z

/azpw run

mssonicbld · 2026-03-12T22:33:07Z

/AzurePipelines run

azure-pipelines · 2026-03-12T22:33:21Z

Azure Pipelines successfully started running 1 pipeline(s).

Pterosaur · 2026-03-13T01:31:39Z

Created 202412 backport PR: Azure/sonic-buildimage-msft#2056

#### Why I did it

Backport public SONiC PR sonic-net/sonic-buildimage#26147 to the `202412` branch.

`swss:countersyncd` in orchagent is started through dependent-startup, but the current supervisor stanza leaves it at `EXITED` after an unexpected exit instead of recovering it.

##### Work item tracking

- Microsoft ADO **(number only)**: N/A

#### How I did it

- cherry-pick the original backport-safe changes from sonic-net/sonic-buildimage#26147
- change `countersyncd` supervisor policy from `autorestart=false` to `autorestart=unexpected`
- set `startsecs=10`
- set `startretries=3`

#### How to verify it

Validated on 202412 DUT image `20241212.54`.

Equivalent live-patched supervisor settings were applied on DUT `str4-sn5640-2`, then HFT telemetry validation was rerun successfully.

Relevant result:

- `high_frequency_telemetry/test_high_frequency_telemetry.py::test_hft_port_counters`
- result: `1 passed`
- `swss:countersyncd` remained `RUNNING` after telemetry capture

#### Which release branch to backport (provide reason below if selected)

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

Base branch of this PR is already `202412`.

#### Tested branch (Please provide the tested image version)

- [x] 20241212.54

#### Description for the changelog

Restart `swss:countersyncd` automatically after unexpected exits.

#### Link to config_db schema for YANG module changes

N/A

#### A picture of a cute animal (not mandatory but encouraged)

🦦

---------

Signed-off-by: Ze Gan <ganze718@gmail.com>

kperumalbfn · 2026-03-13T16:51:27Z

@Pterosaur what are the reasons for unexpected exit of countersyncd??

mssonicbld · 2026-03-13T16:52:26Z

Cherry-pick PR to msft-202412:

Pterosaur · 2026-03-13T21:36:41Z

> @Pterosaur what are the reasons for unexpected exit of countersyncd??

Hi @kperumalbfn, if a downstream service, such as the otel collector, was not started or has exited, we proactively terminate countersyncd to drain the counter data.

swss: restart countersyncd after unexpected exit

7490f67

Signed-off-by: Ze Gan <ganze718@gmail.com>

Pterosaur added the "Request for msft-202412 Branch" and "Request for 202511 Branch" labels Mar 12, 2026

Pterosaur requested a review from r12f March 12, 2026 13:17

Pterosaur marked this pull request as ready for review March 12, 2026 13:17

Pterosaur requested a review from lguohan as a code owner March 12, 2026 13:17

Copilot AI review requested due to automatic review settings March 12, 2026 13:17

Copilot started reviewing on behalf of Pterosaur March 12, 2026 13:18

Copilot AI reviewed Mar 12, 2026

dockers/docker-orchagent/supervisord.conf.j2 (review thread outdated and resolved)

swss: raise countersyncd startsecs to 10s

14dc70b

Signed-off-by: Ze Gan <ganze718@gmail.com>

banidoru approved these changes Mar 12, 2026

r12f approved these changes Mar 12, 2026

Pterosaur mentioned this pull request Mar 13, 2026

[swss] Restart countersyncd after unexpected exit Azure/sonic-buildimage-msft#2056

Merged

10 tasks

Pterosaur added the Created PR to msft-202412 Branch label Mar 13, 2026

Pterosaur added the "Approved for msft-202412 Branch" and "Included in msft-202412 Branch" labels and removed the "Created PR to msft-202412 Branch" label Mar 13, 2026

mssonicbld added the Cherry Pick Conflict_msft-202412 label Mar 13, 2026

Pterosaur removed the Cherry Pick Conflict_msft-202412 label Mar 13, 2026

kperumalbfn approved these changes Mar 13, 2026

kperumalbfn merged commit 1cdcadb into sonic-net:master Mar 13, 2026
19 checks passed

mssonicbld added the "Cherry Pick Conflict_msft-202412" and "Created PR to msft-202412 Branch" labels Mar 13, 2026

Pterosaur removed the "Cherry Pick Conflict_msft-202412" and "Created PR to msft-202412 Branch" labels Mar 13, 2026
