[swss] Restart countersyncd after unexpected exit #26147

Merged
kperumalbfn merged 2 commits into sonic-net:master from Pterosaur:claw/countersyncd-autorestart
Mar 13, 2026

@Pterosaur Pterosaur commented Mar 12, 2026

Why I did it

swss:countersyncd is currently started through dependent-startup, but its supervisor stanza uses autorestart=false.
As a result, when countersyncd exits unexpectedly, supervisor leaves it in EXITED instead of recovering it.
That behavior was reproduced on str4-sn5640-2, where HFT telemetry validation could pass while countersyncd still stayed down afterward.

Work item tracking
  • Microsoft ADO (number only): N/A

How I did it

  • change countersyncd from autorestart=false to autorestart=unexpected
  • add startsecs=10
  • add startretries=3

This keeps the existing dependent-startup flow and lets supervisor recover countersyncd after unexpected exits. Because startsecs=10 sets how long the process must stay up before it counts as started, startretries=3 becomes a meaningful guardrail against short-lived crash loops.
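
As a rough illustration, the sketch below parses a stanza carrying only the three changed keys and reads them back. The stanza text is a hypothetical stand-in, not the exact file from the repository: the section name `program:countersyncd` follows supervisor's usual naming, and the command and dependent-startup keys are left out. `load_policy` is an illustrative helper, not part of SONiC.

```python
import configparser

# Hypothetical stanza: only the three settings named in this PR are shown.
# The real stanza also carries a command and dependent-startup keys.
STANZA = """
[program:countersyncd]
autorestart=unexpected
startsecs=10
startretries=3
"""


def load_policy(text):
    """Parse a supervisor-style stanza and return the restart-related settings."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["program:countersyncd"]
    return {
        "autorestart": section.get("autorestart"),
        "startsecs": section.getint("startsecs"),
        "startretries": section.getint("startretries"),
    }


policy = load_policy(STANZA)
print(policy)
```

A check like this could be pointed at the rendered supervisor config inside the container to confirm the patched values are the ones in effect.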

How to verify it

  1. Apply the countersyncd supervisor stanza change on the DUT swss container.
  2. Run:
    ./run_tests.sh -u -n vms70-t0-sn5640-2 -i ../ansible/str4,../ansible/veos -l info -m individual -e --skip_sanity -e --disable_loganalyzer -c "high_frequency_telemetry/test_high_frequency_telemetry.py::test_hft_port_counters"
  3. Confirm:
    • HFT telemetry validation still passes
    • docker exec swss supervisorctl status countersyncd reports RUNNING after the test
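
The final check can also be scripted. The following is a minimal Python sketch, assuming `supervisorctl status` prints the process name followed by its state as whitespace-separated fields. `parse_state` and `countersyncd_state` are hypothetical helper names, and the second one only works on a DUT where the `swss` container exists.

```python
import subprocess


def parse_state(status_line):
    """Extract the state column (e.g. RUNNING, EXITED) from one status line."""
    fields = status_line.split()
    if len(fields) < 2:
        raise ValueError(f"unexpected supervisorctl output: {status_line!r}")
    return fields[1]


def countersyncd_state():
    """Run the status command from step 3 and return the reported state."""
    result = subprocess.run(
        ["docker", "exec", "swss", "supervisorctl", "status", "countersyncd"],
        capture_output=True,
        text=True,
        check=False,  # supervisorctl may return non-zero when the process is not RUNNING
    )
    return parse_state(result.stdout.strip())
```

After the test run, `countersyncd_state() == "RUNNING"` would correspond to the second confirmation bullet.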

Observed result from live DUT validation:

  • DUT: str4-sn5640-2
  • Platform: x86_64-nvidia_sn5640-r0 / Mellanox-SN5640-C512S2
  • Image: 20241212.54
  • Test result: 1 passed

Which release branch to backport (provide reason below if selected)

  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Backport reason:

  • This is a bug fix for service recovery behavior in swss.
  • The same PR is also requested for msft-202412 via label because unexpected countersyncd exits leave the supervisor-managed service down until manual intervention.

Tested branch (Please provide the tested image version)

  • 20241212.54 (live-patched DUT validation on str4-sn5640-2)

Description for the changelog

Restart countersyncd automatically after unexpected exits in swss.

Link to config_db schema for YANG module changes

N/A

A picture of a cute animal (not mandatory but encouraged)

🦦

Signed-off-by: Ze Gan <ganze718@gmail.com>
@Pterosaur Pterosaur requested a review from r12f March 12, 2026 13:17
@Pterosaur Pterosaur marked this pull request as ready for review March 12, 2026 13:17
@Pterosaur Pterosaur requested a review from lguohan as a code owner March 12, 2026 13:17
Copilot AI review requested due to automatic review settings March 12, 2026 13:17

Copilot AI left a comment

Pull request overview

Adjusts the swss container’s supervisor configuration so countersyncd can recover automatically after unexpected exits, improving telemetry robustness without changing the dependent-startup ordering.

Changes:

  • Switch countersyncd from autorestart=false to autorestart=unexpected
  • Add startsecs=1 and startretries=3 to constrain restart behavior for immediate start failures

Signed-off-by: Ze Gan <ganze718@gmail.com>

@banidoru banidoru left a comment

Clean, well-scoped fix. The three supervisor settings work correctly together:

  • autorestart=unexpected recovers countersyncd after non-zero exits without interfering with intentional stops.
  • startsecs=10 ensures short-lived crash loops consume startretries rather than being treated as successful-then-crashed (which would restart indefinitely).
  • startretries=3 caps recovery attempts before supervisor gives up, preventing infinite restart loops.

The earlier feedback on startsecs has been addressed. No concerns with this change.

@banidoru banidoru left a comment

Clean, minimal fix. The three supervisor knobs (autorestart=unexpected, startsecs=10, startretries=3) work correctly together: crashes within the first 10 s consume a retry (capped at 3), while crashes after 10 s trigger an immediate restart without burning retries. This matches the stated goal of recovering from unexpected exits without infinite crash loops. The prior review feedback on startsecs has been addressed. No concerns.
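
To make the interaction concrete, here is a deliberately simplified Python model of the behavior described above. It is a sketch, not supervisord's actual code. It assumes that an exit before `startsecs` counts as a failed start, that supervisor gives up once consecutive failed starts exceed `startretries`, and that a stable start resets the counter. The exact attempt accounting should be checked against the supervisor documentation. `simulate` and its parameters are illustrative names.

```python
def simulate(runs, startsecs=10, startretries=3, expected_exitcodes=(0,)):
    """Return the state a simplified autorestart=unexpected model ends in.

    runs: list of (seconds_alive, exit_code) tuples, one per spawn, in order.
    Returns "FATAL" if supervisor gives up, "EXITED" after a clean exit,
    or "RUNNING" if every listed exit was followed by another spawn.
    """
    failed_starts = 0
    for seconds_alive, exit_code in runs:
        if seconds_alive < startsecs:
            # Exited too quickly: counts against startretries.
            failed_starts += 1
            if failed_starts > startretries:
                return "FATAL"
            continue  # back off, then spawn again
        # Stayed up long enough to be considered started: counter resets.
        failed_starts = 0
        if exit_code in expected_exitcodes:
            return "EXITED"  # expected exit, no restart
        # Unexpected exit after a stable start: restart without using a retry.
    return "RUNNING"


print(simulate([(2, 1)] * 4))    # short-lived crash loop
print(simulate([(60, 1)] * 10))  # repeated crashes after a stable start
```

Under these assumptions, the first call models a crash loop that exhausts the retries, while the second models crashes after 10 s that are restarted each time.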

@banidoru banidoru left a comment

Clean, minimal fix. Changing autorestart=false to autorestart=unexpected correctly allows supervisord to recover countersyncd after unexpected exits while still respecting intentional stops. startsecs=10 (updated from the original 1s per review feedback) ensures that short-lived crash loops consume the 3 retries before supervisor gives up, preventing infinite restart storms. No concerns with correctness, security, or design.

@banidoru banidoru left a comment

All reviewers approved. LGTM.

@Pterosaur
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@Pterosaur
Contributor Author

Created 202412 backport PR: Azure/sonic-buildimage-msft#2056

Pterosaur added a commit to Azure/sonic-buildimage-msft that referenced this pull request Mar 13, 2026
#### Why I did it

Backport public SONiC PR sonic-net/sonic-buildimage#26147 to the
`202412` branch.

`swss:countersyncd` in orchagent is started through dependent-startup,
but the current supervisor stanza leaves it at `EXITED` after an
unexpected exit instead of recovering it.

##### Work item tracking
- Microsoft ADO **(number only)**: N/A

#### How I did it

- cherry-pick the original backport-safe changes from
sonic-net/sonic-buildimage#26147
- change `countersyncd` supervisor policy from `autorestart=false` to
`autorestart=unexpected`
- set `startsecs=10`
- set `startretries=3`

#### How to verify it

Validated on 202412 DUT image `20241212.54`.

Equivalent live-patched supervisor settings were applied on DUT
`str4-sn5640-2`, then HFT telemetry validation was rerun successfully.

Relevant result:
- `high_frequency_telemetry/test_high_frequency_telemetry.py::test_hft_port_counters`
- result: `1 passed`
- `swss:countersyncd` remained `RUNNING` after telemetry capture

#### Which release branch to backport (provide reason below if selected)

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

Base branch of this PR is already `202412`.

#### Tested branch (Please provide the tested image version)

- [x] 20241212.54

#### Description for the changelog
Restart `swss:countersyncd` automatically after unexpected exits.

#### Link to config_db schema for YANG module changes
N/A

#### A picture of a cute animal (not mandatory but encouraged)
🦦

---------

Signed-off-by: Ze Gan <ganze718@gmail.com>
@kperumalbfn
Contributor

@Pterosaur what are the reasons for unexpected exit of countersyncd??

@kperumalbfn kperumalbfn merged commit 1cdcadb into sonic-net:master Mar 13, 2026
19 checks passed
@mssonicbld
Collaborator

Cherry-pick PR to msft-202412:

@Pterosaur
Contributor Author

> @Pterosaur what are the reasons for unexpected exit of countersyncd??

Hi @kperumalbfn, if a downstream service, such as the otel collector, has not started or has exited, we proactively terminate countersyncd to drain the counter data.
