
Add missing [Install] section to container service templates #25932

Closed
StormLiangMS wants to merge 1 commit into sonic-net:master from StormLiangMS:fix-service-install-section-sonic-target

Conversation

@StormLiangMS
Contributor

Why I did it

After the systemd-sonic-generator rework (PR #23340), the generator only creates sonic.target.wants/ symlinks for services that have an explicit [Install] section with WantedBy=. Nine container services (pmon, lldp, gnmi, snmp, telemetry, otel, sflow, bmp, mgmt-framework) use BindsTo=sonic.target in [Unit] but lacked an [Install] section, so the generator skipped creating symlinks for them.

This caused two problems:

  1. _reset_failed_services() in sonic-utilities iterates systemctl list-dependencies --plain sonic.target and never resets rate limits for these services, causing start-limit-hit after multiple config reloads.
  2. featured daemon checks unit_file_state == 'enabled' but these services now report static (no [Install] = static), causing redundant systemctl start calls on every reload.

The most visible symptom is pmon hitting start-limit-hit during tests that perform multiple config reloads (e.g., test_load_minigraph_with_golden_config).
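
To make the first problem concrete, here is a minimal sketch of how a helper that walks `systemctl list-dependencies --plain sonic.target` would miss pmon. This is not the sonic-utilities implementation; `parse_plain_dependencies` and the sample output are hypothetical and only illustrate the mechanism.

```python
def parse_plain_dependencies(output: str) -> list:
    """Return unit names from `systemctl list-dependencies --plain <unit>` output.

    The first non-empty line is the queried unit itself; the remaining
    lines are its dependencies.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return lines[1:]


# Hypothetical output on an affected image: pmon.service is not listed,
# because no sonic.target.wants/ symlink was generated for it.
sample = """sonic.target
  swss.service
  syncd.service
  teamd.service
"""

deps = parse_plain_dependencies(sample)
services_to_reset = [unit for unit in deps if unit.endswith(".service")]

# Any loop over services_to_reset never touches pmon.service, so its
# start rate-limit counter is never cleared between config reloads.
pmon_is_covered = "pmon.service" in services_to_reset
```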

Fixes #25931

Work item tracking
  • Microsoft ADO: 36811868

How I did it

Added [Install] section with WantedBy=sonic.target to all 9 affected service templates, consistent with other container services (dhcp_relay, swss, syncd, teamd, etc.) that already have this section.

Affected templates:

  • files/build_templates/pmon.service.j2
  • files/build_templates/gnmi.service.j2
  • files/build_templates/snmp.service.j2
  • files/build_templates/telemetry.service.j2
  • files/build_templates/otel.service.j2
  • files/build_templates/sflow.service.j2
  • files/build_templates/mgmt-framework.service.j2
  • files/build_templates/per_namespace/lldp.service.j2
  • files/build_templates/per_namespace/bmp.service.j2
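
As a rough way to audit rendered unit files for the missing section, the sketch below collects `WantedBy=` values from `[Install]`. The helper name `wanted_by_targets` and the sample unit text are assumptions for illustration (only `BindsTo=sonic.target`, `RestartSec=30`, and the `[Install]` lines come from this PR), and the parser is deliberately simpler than systemd's own.

```python
def wanted_by_targets(unit_text: str) -> list:
    """Collect WantedBy= values from the [Install] section of a unit file."""
    targets = []
    section = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        # Track the current section header, e.g. [Unit] or [Install].
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Install" and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "WantedBy":
                targets.extend(value.split())
    return targets


# Hypothetical rendered unit before this change: BindsTo only, no [Install].
before = """[Unit]
BindsTo=sonic.target

[Service]
RestartSec=30
"""

# The same unit with the section this PR adds.
after = before + """
[Install]
WantedBy=sonic.target
"""

missing_install = "sonic.target" not in wanted_by_targets(before)
has_install = "sonic.target" in wanted_by_targets(after)
```

The helper expects rendered unit text; running it directly on a `.j2` template would also work for these simple cases, since Jinja control lines contain no `WantedBy=` key.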

How to verify it

  1. Build an image with this change
  2. On a DUT, verify services appear in sonic.target dependencies:
    systemctl list-dependencies --plain sonic.target | grep pmon
  3. Verify UnitFileState is no longer static:
    systemctl show pmon.service --property=UnitFileState
  4. Run test_load_minigraph_with_golden_config — pmon should not hit start-limit-hit
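
Step 3 can also be scripted. The sketch below wraps the same `systemctl show` command; the function names are hypothetical, and the expected value after the fix (`enabled`) is taken from the description above rather than verified here.

```python
import subprocess
from typing import Optional


def parse_property(output: str, name: str) -> Optional[str]:
    """Extract the value of `name` from `systemctl show` KEY=VALUE output."""
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == name:
            return value.strip()
    return None


def unit_file_state(unit: str) -> Optional[str]:
    """Query UnitFileState for a unit on the local system (requires systemd)."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=UnitFileState"],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_property(result.stdout, "UnitFileState")


# On a DUT one would call, for example:
#   state = unit_file_state("pmon.service")
# and expect the state to no longer be "static" once the fix is in the image.
```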

Workaround verified on testbed: Manually creating the symlinks on a live DUT confirmed the fix resolves the issue.

Which release branch to backport (provide reason below if selected)

  • 202511

The bug was introduced by the systemd-sonic-generator rework cherry-picked to 202511 via PR #24988.

Tested branch (Please provide the tested image version)

  • Workaround verified on 20251110.12 (202511)

Description for the changelog

Add missing [Install] WantedBy=sonic.target to 9 container service templates to fix start-limit-hit failures after config reloads.

A picture of a cute animal (not mandatory but encouraged)

🦔

After the systemd-sonic-generator rework (PR sonic-net#23340), the generator
only creates sonic.target.wants/ symlinks for services that have an
explicit [Install] section with WantedBy=. Container services (pmon,
lldp, gnmi, snmp, telemetry, otel, sflow, bmp, mgmt-framework) use
BindsTo=sonic.target but lacked an [Install] section, so the generator
skipped creating symlinks for them.

This caused two problems:
1. _reset_failed_services() iterates sonic.target dependencies and
   never resets rate limits for these services, causing start-limit-hit
   after multiple config reloads.
2. featured checks unit_file_state == 'enabled' but these services
   report 'static', causing redundant start attempts on every reload.

Fix by adding [Install] WantedBy=sonic.target to all affected service
templates, consistent with other container services like dhcp_relay,
swss, syncd, and teamd that already have this section.

Fixes: sonic-net#25931
Signed-off-by: Storm Liang <[email protected]>
@StormLiangMS StormLiangMS requested a review from lguohan as a code owner March 6, 2026 09:10
Copilot AI review requested due to automatic review settings March 6, 2026 09:10
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

This PR adds the missing [Install] section with WantedBy=sonic.target to 9 container service templates that were affected by the systemd-sonic-generator rework (PR #23340, cherry-picked to 202511 in PR #24988). Without this section, the generator couldn't create sonic.target.wants/ symlinks for these services, leading to start-limit-hit failures after multiple config reloads (issue #25931).

Changes:

  • Added [Install] section with WantedBy=sonic.target to 9 container service templates (pmon, gnmi, snmp, telemetry, otel, sflow, mgmt-framework, lldp, bmp) to match the pattern already used by other container services like dhcp_relay, swss, syncd, and teamd.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| files/build_templates/pmon.service.j2 | Add [Install] section with WantedBy=sonic.target |
| files/build_templates/gnmi.service.j2 | Add [Install] section with WantedBy=sonic.target |
| files/build_templates/snmp.service.j2 | Add [Install] section with WantedBy=sonic.target |
| files/build_templates/telemetry.service.j2 | Add [Install] section with WantedBy=sonic.target |
| files/build_templates/otel.service.j2 | Add [Install] section with WantedBy=sonic.target |
| files/build_templates/sflow.service.j2 | Add [Install] section with WantedBy=sonic.target |
| files/build_templates/mgmt-framework.service.j2 | Add [Install] section with WantedBy=sonic.target |
| files/build_templates/per_namespace/lldp.service.j2 | Add [Install] section with WantedBy=sonic.target |
| files/build_templates/per_namespace/bmp.service.j2 | Add [Install] section with WantedBy=sonic.target |

@StormLiangMS
Contributor Author

Cross-reference: Related fix in sonic-utilities

Related PR: sonic-net/sonic-utilities#4314 by @stephenxs fixes the same symptom from the sonic-utilities side by adding --reverse dependencies to _reset_failed_services().

Root cause analysis

After deeper investigation, we found the issue has three contributing factors:

  1. Missing [Install] section (fixed by this PR): 9 container service templates (pmon, lldp, gnmi, snmp, telemetry, otel, sflow, bmp, mgmt-framework) have BindsTo=sonic.target but no [Install] WantedBy=sonic.target. After the systemd-sonic-generator rework (PR #23340, "Trixie base image upgrade"), the generator only creates sonic.target.wants/ symlinks based on [Install] targets, so these services are no longer listed as sonic.target dependencies.

  2. _reset_failed_services() misses these services (fixed by sonic-utilities#4314, "Fix issue: pmon services's restart count is not cleared during config reload"): It iterates systemctl list-dependencies --plain sonic.target, which no longer includes these services. Rate limit counters are never reset between config reloads.

  3. featured daemon issues redundant systemctl start calls (fixed by this PR as a side effect): On 202511, enable_feature() checks if unit_file_state == 'enabled': continue — but since UnitFileState is static (no [Install]), the check fails. Featured then runs systemctl enable (which fails silently due to the raise_exception=False change for Trixie compatibility) and proceeds to systemctl start — adding an extra start attempt on every config reload.

Why 202505 doesn't have this issue

On 202505, featured's enable_feature() also fails on systemctl enable for static services, but it uses raise_exception=True → the exception is caught → feature state is set to FAILED → systemctl start is never reached. On 202511, the raise_exception=False change (commit for Trixie compatibility) causes the enable failure to be silently ignored, so systemctl start proceeds — adding extra start attempts that push pmon past StartLimitBurst=3.
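
The difference between the two branches can be modeled with a small simulation. This is a sketch under stated assumptions, not the featured source: `make_runner`, `CommandFailed`, the `raise_on_enable` parameter, and the `ENABLED` state name are hypothetical; only the `raise_exception` flag, the `FAILED` state, and the enable-then-start ordering come from the analysis above.

```python
class CommandFailed(Exception):
    """Raised by the fake runner when a failing command must not be ignored."""


def make_runner(log, failing):
    """Build a fake run_cmd that records commands and fails the ones in `failing`."""
    def run_cmd(cmd, raise_exception):
        log.append(cmd)
        if cmd in failing and raise_exception:
            raise CommandFailed(cmd)
        # With raise_exception=False the failure is silently ignored.
    return run_cmd


def enable_feature(unit, run_cmd, raise_on_enable):
    """Simplified enable-then-start flow; returns the resulting feature state."""
    try:
        run_cmd(f"systemctl enable {unit}", raise_exception=raise_on_enable)
        run_cmd(f"systemctl start {unit}", raise_exception=True)
    except CommandFailed:
        return "FAILED"
    return "ENABLED"  # hypothetical state name


# `systemctl enable` fails for a static unit (no [Install] section).
failing = {"systemctl enable pmon.service"}

log_202505 = []
state_202505 = enable_feature("pmon.service", make_runner(log_202505, failing), raise_on_enable=True)

log_202511 = []
state_202511 = enable_feature("pmon.service", make_runner(log_202511, failing), raise_on_enable=False)
```

In the 202505-style run the log stops at the enable command, while the 202511-style run also records `systemctl start pmon.service`, which is the extra start attempt described above.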

How the two PRs complement each other

| Fix | What it addresses |
| --- | --- |
| This PR (sonic-buildimage#25932) | Adds [Install] WantedBy=sonic.target → services become proper sonic.target dependencies, UnitFileState becomes enabled, featured skips already-enabled services, _reset_failed_services covers them |
| sonic-utilities#4314 | Adds --reverse to _get_sonic_services() → _reset_failed_services resets rate limits for services with BindsTo=sonic.target even without [Install] |

Both PRs are valid fixes. This PR is the more complete fix (addresses all 3 factors), while sonic-utilities#4314 provides defense-in-depth for the _reset_failed_services path. Ideally both should merge.
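
For reference, the idea behind the `--reverse` approach can be sketched as a union of forward and reverse dependencies of sonic.target. How sonic-utilities#4314 actually implements it is not shown in this thread; `merge_dependencies` and the sample lists below are hypothetical.

```python
def merge_dependencies(forward, reverse):
    """Union of forward and reverse dependency lists, services only, order preserved."""
    seen = set()
    merged = []
    for unit in list(forward) + list(reverse):
        if unit.endswith(".service") and unit not in seen:
            seen.add(unit)
            merged.append(unit)
    return merged


# Hypothetical results of:
#   systemctl list-dependencies --plain sonic.target
forward = ["swss.service", "syncd.service", "teamd.service"]
#   systemctl list-dependencies --plain --reverse sonic.target
# Units with BindsTo=sonic.target show up here even without [Install].
reverse = ["pmon.service", "lldp.service", "swss.service"]

services_to_reset = merge_dependencies(forward, reverse)
```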

cc @stephenxs @saiarcot895

@StormLiangMS StormLiangMS requested a review from saiarcot895 March 6, 2026 14:21
StormLiangMS added a commit to StormLiangMS/sonic-mgmt that referenced this pull request Mar 6, 2026
Skip test_load_minigraph_with_golden_config when issue #25931 is open.
This test performs 4 consecutive config reloads which causes pmon to hit
start-limit-hit due to missing sonic.target.wants/ symlinks after the
systemd-sonic-generator rework (sonic-net/sonic-buildimage#23340).

The test leaves pmon in a bad state (start-limit-hit), which can affect
subsequent tests in the nightly run.

Fix PRs:
- sonic-net/sonic-buildimage#25932 (add [Install] to service templates)
- sonic-net/sonic-utilities#4314 (fix _reset_failed_services)

The skip will auto-resolve when sonic-net/sonic-buildimage#25931 is closed.

Signed-off-by: Storm Liang <[email protected]>
RestartSec=30

[Install]
WantedBy=sonic.target
Contributor

Some of the services that don't have WantedBy=sonic.target appear to be intentionally delayed until after port init (i.e. they start after everything else has started). Adding WantedBy=sonic.target might break that delay.

/var/log/syslog.2.gz:2026 Mar  5 17:27:43.300869 vlab-01 INFO featured: Feature is gnmi delayed for port init
/var/log/syslog.2.gz:2026 Mar  5 17:27:43.951951 vlab-01 INFO featured: Feature is lldp delayed for port init
/var/log/syslog.2.gz:2026 Mar  5 17:27:45.212249 vlab-01 INFO featured: Feature is mgmt-framework delayed for port init
/var/log/syslog.2.gz:2026 Mar  5 17:27:47.239979 vlab-01 INFO featured: Feature is pmon delayed for port init
/var/log/syslog.2.gz:2026 Mar  5 17:27:48.117732 vlab-01 INFO featured: Feature is sflow delayed for port init
/var/log/syslog.2.gz:2026 Mar  5 17:27:48.526676 vlab-01 INFO featured: Feature is snmp delayed for port init
/var/log/syslog.2.gz:2026 Mar  5 17:27:50.220405 vlab-01 INFO featured: Updating delayed features after port initializatio

@saiarcot895
Contributor

After the systemd-sonic-generator rework (PR #23340), the generator only creates sonic.target.wants/ symlinks for services that have an explicit [Install] section with WantedBy=.

featured daemon checks unit_file_state == 'enabled' but these services now report static (no [Install] = static)

Both of these match 202505 behavior, and are not new for 202511.

@StormLiangMS
Contributor Author

Thanks @saiarcot895 for the review — both points are valid.

On the 202505 behavior being the same

You're correct. After deeper investigation we confirmed that on 202505:

  • UnitFileState=static (same as 202511)
  • pmon is NOT in sonic.target dependencies (same as 202511)
  • _reset_failed_services() also misses pmon (same code, same behavior)

The actual differentiator between 202505 and 202511 is in featured's enable_feature():

  • 202505 (lines 420-428): run_cmd(cmd, raise_exception=True) for all commands → systemctl enable pmon fails (static service) → exception caught → sets state to FAILED → systemctl start is never reached → no extra start attempt
  • 202511 (lines 440-452): raise_exception=False for enable commands (Trixie compatibility) → systemctl enable fails silently → proceeds to systemctl start → extra start attempt on every config reload → exceeds StartLimitBurst=3 after 3+ reloads
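
To illustrate why extra start attempts matter, here is a simplified fixed-window model of a start rate limit with a burst of 3. It is only a sketch: the 600-second interval is a hypothetical value (the thread does not state StartLimitIntervalSec), and the class is a rough approximation of the behavior, not systemd's implementation.

```python
class StartRateLimit:
    """Simplified fixed-window start rate limit (burst starts per interval)."""

    def __init__(self, burst, interval_sec):
        self.burst = burst
        self.interval_sec = interval_sec
        self.window_start = None
        self.count = 0

    def try_start(self, now):
        """Return True if a start at time `now` is allowed, False if rate-limited."""
        if self.window_start is None or now - self.window_start > self.interval_sec:
            # Open a fresh window.
            self.window_start = now
            self.count = 0
        if self.count < self.burst:
            self.count += 1
            return True
        return False

    def reset_failed(self):
        """Model of `systemctl reset-failed`: flush the start counter."""
        self.window_start = None
        self.count = 0


# Four starts in quick succession without any reset: the fourth is refused.
limiter = StartRateLimit(burst=3, interval_sec=600.0)  # interval is hypothetical
without_reset = [limiter.try_start(t) for t in (0.0, 60.0, 120.0, 180.0)]

# The same starts with a reset before each one: all are allowed.
limiter = StartRateLimit(burst=3, interval_sec=600.0)
with_reset = []
for t in (0.0, 60.0, 120.0, 180.0):
    limiter.reset_failed()
    with_reset.append(limiter.try_start(t))
```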

On the port-init delay concern

Good catch, this is a valid concern. These services are intentionally delayed by featured until PortInitDone. Adding WantedBy=sonic.target could cause systemd to start them immediately as part of sonic.target, bypassing the port-init delay logic.

Given these points, I think the safer approach is:

  1. sonic-utilities#4314 (by @stephenxs, "Fix issue: pmon services's restart count is not cleared during config reload"): fix _reset_failed_services() to reset rate limits for all SONiC services
  2. Possibly fix featured: either skip the redundant start for static services, or handle the enable failure differently

I'll close this PR in favor of the sonic-utilities approach. Thanks for the thorough review!

wangxin pushed a commit to sonic-net/sonic-mgmt that referenced this pull request Mar 7, 2026
Skip test_load_minigraph_with_golden_config when issue #25931 is open.
This test performs 4 consecutive config reloads which causes pmon to hit
start-limit-hit due to missing sonic.target.wants/ symlinks after the
systemd-sonic-generator rework (sonic-net/sonic-buildimage#23340).

The test leaves pmon in a bad state (start-limit-hit), which can affect
subsequent tests in the nightly run.

Fix PRs:
- sonic-net/sonic-buildimage#25932 (add [Install] to service templates)
- sonic-net/sonic-utilities#4314 (fix _reset_failed_services)

The skip will auto-resolve when sonic-net/sonic-buildimage#25931 is closed.

Signed-off-by: Storm Liang <[email protected]>
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Mar 7, 2026
…c-net#22775)

Skip test_load_minigraph_with_golden_config when issue #25931 is open.
This test performs 4 consecutive config reloads which causes pmon to hit
start-limit-hit due to missing sonic.target.wants/ symlinks after the
systemd-sonic-generator rework (sonic-net/sonic-buildimage#23340).

The test leaves pmon in a bad state (start-limit-hit), which can affect
subsequent tests in the nightly run.

Fix PRs:
- sonic-net/sonic-buildimage#25932 (add [Install] to service templates)
- sonic-net/sonic-utilities#4314 (fix _reset_failed_services)

The skip will auto-resolve when sonic-net/sonic-buildimage#25931 is closed.

Signed-off-by: Storm Liang <[email protected]>
Signed-off-by: mssonicbld <[email protected]>
StormLiangMS added a commit to StormLiangMS/sonic-mgmt that referenced this pull request Mar 10, 2026
Cherry-pick of PR sonic-net#22775 to 202511, rebased on latest 202511 to include
the duplicate key fix from PR sonic-net#22796.

The test performs multiple config reloads that cause pmon start-limit-hit
due to missing sonic.target symlinks after systemd-sonic-generator rework.

Fix PRs: sonic-net/sonic-buildimage#25932, sonic-net/sonic-utilities#4314
Tracking issue: sonic-net/sonic-buildimage#25931

Signed-off-by: Storm Liang <[email protected]>

Co-authored-by: Copilot <[email protected]>
StormLiangMS added a commit to sonic-net/sonic-mgmt that referenced this pull request Mar 10, 2026
…fig (#22830)

Cherry-pick of PR #22775 to 202511, rebased on latest 202511 to include
the duplicate key fix from PR #22796.

The test performs multiple config reloads that cause pmon start-limit-hit
due to missing sonic.target symlinks after systemd-sonic-generator rework.

Fix PRs: sonic-net/sonic-buildimage#25932, sonic-net/sonic-utilities#4314
Tracking issue: sonic-net/sonic-buildimage#25931

Co-authored-by: Copilot <[email protected]>
ksravani-hcl pushed a commit to ksravani-hcl/sonic-mgmt that referenced this pull request Mar 10, 2026
…c-net#22775)

Skip test_load_minigraph_with_golden_config when issue #25931 is open.
This test performs 4 consecutive config reloads which causes pmon to hit
start-limit-hit due to missing sonic.target.wants/ symlinks after the
systemd-sonic-generator rework (sonic-net/sonic-buildimage#23340).

The test leaves pmon in a bad state (start-limit-hit), which can affect
subsequent tests in the nightly run.

Fix PRs:
- sonic-net/sonic-buildimage#25932 (add [Install] to service templates)
- sonic-net/sonic-utilities#4314 (fix _reset_failed_services)

The skip will auto-resolve when sonic-net/sonic-buildimage#25931 is closed.

Signed-off-by: Storm Liang <[email protected]>
aronovic pushed a commit to aronovic/sonic-mgmt that referenced this pull request Mar 10, 2026
…c-net#22775)

Skip test_load_minigraph_with_golden_config when issue #25931 is open.
This test performs 4 consecutive config reloads which causes pmon to hit
start-limit-hit due to missing sonic.target.wants/ symlinks after the
systemd-sonic-generator rework (sonic-net/sonic-buildimage#23340).

The test leaves pmon in a bad state (start-limit-hit), which can affect
subsequent tests in the nightly run.

Fix PRs:
- sonic-net/sonic-buildimage#25932 (add [Install] to service templates)
- sonic-net/sonic-utilities#4314 (fix _reset_failed_services)

The skip will auto-resolve when sonic-net/sonic-buildimage#25931 is closed.

Signed-off-by: Storm Liang <[email protected]>
Signed-off-by: Mihut Aronovici <[email protected]>
selldinesh pushed a commit to selldinesh/sonic-mgmt that referenced this pull request Mar 16, 2026
…c-net#22775)

Skip test_load_minigraph_with_golden_config when issue #25931 is open.
This test performs 4 consecutive config reloads which causes pmon to hit
start-limit-hit due to missing sonic.target.wants/ symlinks after the
systemd-sonic-generator rework (sonic-net/sonic-buildimage#23340).

The test leaves pmon in a bad state (start-limit-hit), which can affect
subsequent tests in the nightly run.

Fix PRs:
- sonic-net/sonic-buildimage#25932 (add [Install] to service templates)
- sonic-net/sonic-utilities#4314 (fix _reset_failed_services)

The skip will auto-resolve when sonic-net/sonic-buildimage#25931 is closed.

Signed-off-by: Storm Liang <[email protected]>
Signed-off-by: selldinesh <[email protected]>
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Mar 17, 2026
…c-net#22775)

Skip test_load_minigraph_with_golden_config when issue #25931 is open.
This test performs 4 consecutive config reloads which causes pmon to hit
start-limit-hit due to missing sonic.target.wants/ symlinks after the
systemd-sonic-generator rework (sonic-net/sonic-buildimage#23340).

The test leaves pmon in a bad state (start-limit-hit), which can affect
subsequent tests in the nightly run.

Fix PRs:
- sonic-net/sonic-buildimage#25932 (add [Install] to service templates)
- sonic-net/sonic-utilities#4314 (fix _reset_failed_services)

The skip will auto-resolve when sonic-net/sonic-buildimage#25931 is closed.

Signed-off-by: Storm Liang <[email protected]>
Signed-off-by: Abhishek <[email protected]>
vrajeshe pushed a commit to vrajeshe/sonic-mgmt that referenced this pull request Mar 23, 2026
…c-net#22775)

Skip test_load_minigraph_with_golden_config when issue #25931 is open.
This test performs 4 consecutive config reloads which causes pmon to hit
start-limit-hit due to missing sonic.target.wants/ symlinks after the
systemd-sonic-generator rework (sonic-net/sonic-buildimage#23340).

The test leaves pmon in a bad state (start-limit-hit), which can affect
subsequent tests in the nightly run.

Fix PRs:
- sonic-net/sonic-buildimage#25932 (add [Install] to service templates)
- sonic-net/sonic-utilities#4314 (fix _reset_failed_services)

The skip will auto-resolve when sonic-net/sonic-buildimage#25931 is closed.

Signed-off-by: Storm Liang <[email protected]>
Signed-off-by: Venkata Gouri Rajesh Etla <[email protected]>

Development

Successfully merging this pull request may close these issues.

Bug: systemd-sonic-generator rework causes container services to hit start-limit-hit after multiple config reloads (202511)

4 participants