[swss] Restart countersyncd after unexpected exit #26147

kperumalbfn merged 2 commits into sonic-net:master from
Conversation
Signed-off-by: Ze Gan <ganze718@gmail.com>
Pull request overview
Adjusts the swss container’s supervisor configuration so countersyncd can recover automatically after unexpected exits, improving telemetry robustness without changing the dependent-startup ordering.
Changes:
- Switch `countersyncd` from `autorestart=false` to `autorestart=unexpected`
- Add `startsecs=1` and `startretries=3` to constrain restart behavior for immediate start failures
Signed-off-by: Ze Gan <ganze718@gmail.com>
banidoru left a comment:
Clean, well-scoped fix. The three supervisor settings work correctly together:
- `autorestart=unexpected` recovers countersyncd after non-zero exits without interfering with intentional stops.
- `startsecs=10` ensures short-lived crash loops consume `startretries` rather than being treated as successful-then-crashed (which would restart indefinitely).
- `startretries=3` caps recovery attempts before supervisor gives up, preventing infinite restart loops.
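For reference, a minimal sketch of what the resulting supervisord stanza could look like. The three directives match those discussed in this PR; the `command` path and comments are illustrative assumptions, not copied from the diff:

```ini
[program:countersyncd]
; Hypothetical command path - the real stanza lives in the swss container config
command=/usr/bin/countersyncd
; Restart only on unexpected exit codes; intentional stops are respected
autorestart=unexpected
; A start only counts as successful once the process survives 10 seconds
startsecs=10
; Give up (FATAL) after 3 consecutive failed start attempts
startretries=3
```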
The earlier feedback on startsecs has been addressed. No concerns with this change.
banidoru left a comment:
Clean, minimal fix. The three supervisor knobs (`autorestart=unexpected`, `startsecs=10`, `startretries=3`) work correctly together: crashes within the first 10 s consume a retry (capped at 3), while crashes after 10 s trigger an immediate restart without burning retries. This matches the stated goal of recovering from unexpected exits without infinite crash loops. The prior review feedback on `startsecs` has been addressed. No concerns.
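The retry accounting described above can be sketched as a small simulation. This is a simplified model of how `startsecs` and `startretries` interact under `autorestart=unexpected`, not supervisord's actual implementation; the function and parameter names are illustrative:

```python
def simulate(exit_uptimes, startsecs=10, startretries=3):
    """Model supervisord's retry accounting for a crashing program.

    exit_uptimes: seconds each process run survived before crashing
    unexpectedly. Returns 'FATAL' if consecutive early failures
    exhaust the retries, else 'RUNNING' (still being restarted).
    """
    failures = 0
    for uptime in exit_uptimes:
        if uptime < startsecs:
            # Crash before startsecs: the start attempt "failed",
            # consuming one of the limited retry attempts.
            failures += 1
            if failures > startretries:
                return "FATAL"
        else:
            # Ran long enough to count as successfully started:
            # restart again, resetting the failure counter.
            failures = 0
    return "RUNNING"
```

For example, four consecutive sub-10 s crashes exceed the 3 retries and end in `FATAL`, while a crash after a long stable run restarts cleanly without burning retries.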
banidoru left a comment:
Clean, minimal fix. Changing `autorestart=false` → `autorestart=unexpected` correctly allows supervisord to recover countersyncd after unexpected exits while still respecting intentional stops. `startsecs=10` (updated from the original 1 s per review feedback) ensures that short-lived crash loops consume the 3 retries before supervisor gives up, preventing infinite restart storms. No concerns with correctness, security, or design.
/azpw run

/AzurePipelines run

Azure Pipelines successfully started running 1 pipeline(s).

Created 202412 backport PR: Azure/sonic-buildimage-msft#2056
#### Why I did it
Backport public SONiC PR sonic-net/sonic-buildimage#26147 to the `202412` branch. `swss:countersyncd` in orchagent is started through dependent-startup, but the current supervisor stanza leaves it at `EXITED` after an unexpected exit instead of recovering it.

##### Work item tracking
- Microsoft ADO **(number only)**: N/A

#### How I did it
- cherry-pick the original backport-safe changes from sonic-net/sonic-buildimage#26147
- change `countersyncd` supervisor policy from `autorestart=false` to `autorestart=unexpected`
- set `startsecs=10`
- set `startretries=3`

#### How to verify it
Validated on 202412 DUT image `20241212.54`. Equivalent live-patched supervisor settings were applied on DUT `str4-sn5640-2`, then HFT telemetry validation was rerun successfully. Relevant result:
- `high_frequency_telemetry/test_high_frequency_telemetry.py::test_hft_port_counters` - result: `1 passed`
- `swss:countersyncd` remained `RUNNING` after telemetry capture

#### Which release branch to backport (provide reason below if selected)
- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

Base branch of this PR is already `202412`.

#### Tested branch (Please provide the tested image version)
- [x] 20241212.54

#### Description for the changelog
Restart `swss:countersyncd` automatically after unexpected exits.

#### Link to config_db schema for YANG module changes
N/A

#### A picture of a cute animal (not mandatory but encouraged)
🦦

---------

Signed-off-by: Ze Gan <ganze718@gmail.com>
@Pterosaur what are the reasons for unexpected exit of countersyncd?

Cherry-pick PR to msft-202412:

Hi @kperumalbfn, if a downstream service, such as the otel collector, was not started or has exited, we proactively terminate countersyncd to drain the counters data.
Why I did it
`swss:countersyncd` is currently started through dependent-startup, but its supervisor stanza uses `autorestart=false`. As a result, when `countersyncd` exits unexpectedly, supervisor leaves it in `EXITED` instead of recovering it. That behavior was reproduced on `str4-sn5640-2`, where HFT telemetry validation could pass while `countersyncd` still stayed down afterward.

Work item tracking
How I did it
- Switch `countersyncd` from `autorestart=false` to `autorestart=unexpected`
- Set `startsecs=10`
- Set `startretries=3`

This keeps the existing dependent-startup flow, allows supervisor to recover `countersyncd` after unexpected exits, and gives `startretries` a meaningful guardrail for short-lived crash loops before a process is considered stably started.

How to verify it
- Apply the `countersyncd` supervisor stanza change on the DUT's `swss` container.
- Run `./run_tests.sh -u -n vms70-t0-sn5640-2 -i ../ansible/str4,../ansible/veos -l info -m individual -e --skip_sanity -e --disable_loganalyzer -c "high_frequency_telemetry/test_high_frequency_telemetry.py::test_hft_port_counters"`
- Confirm `docker exec swss supervisorctl status countersyncd` reports `RUNNING` after the test.

Observed result from live DUT validation:
- DUT: `str4-sn5640-2`
- Platform / HWSKU: `x86_64-nvidia_sn5640-r0` / `Mellanox-SN5640-C512S2`
- Image: `20241212.54`
- Result: `1 passed`

Which release branch to backport (provide reason below if selected)
Backport reason: `msft-202412` via label, because unexpected `countersyncd` exits leave the supervisor-managed `swss` service down until manual intervention.
(`str4-sn5640-2`)

Description for the changelog
Restart `countersyncd` automatically after unexpected exits in `swss`.

Link to config_db schema for YANG module changes
N/A
A picture of a cute animal (not mandatory but encouraged)
🦦