Fix rsyslogd memory growth in syncd swss containers over long term #14

Closed
tirupatihemanth wants to merge 13 commits into master from rsyslogd_fix
@tirupatihemanth commented Mar 3, 2026

Why I did it

  1. We observed long-term rsyslogd memory growth in the syncd container.
  2. Deep diagnostics (impstats) showed imuxsock.ratelimit.numratelimiters growing continuously (about 2/min) while queue depth stayed near zero, indicating sender/PID churn rather than queue backlog.
  3. phcsync.sh runs every 60 seconds and repeatedly invokes phc_ctl for /dev/ptp* devices. These short-lived process invocations contribute to new sender identities seen by imuxsock. This correlates with ratelimiter-state growth and with memory increasing over time, because rsyslogd stores rate-limiting data structures per sender.

Work item tracking
  • Microsoft ADO (number only):

How I did it

  • Updated phcsync.sh in SONiC to keep successful phc_ctl execution silent:
      • Use phc_ctl -q -Q ... >/dev/null 2>&1
      • Keep explicit error handling and error logs on non-zero exit.
  • Added stable logger identity in service debug helpers:
      • logger -i "$$" -- "$1" in syncd_common.sh and swss.sh. This reduces per-call sender churn during script execution phases (start/wait/stop).
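
The two changes above can be sketched as follows. This is a minimal illustration, not the actual phcsync.sh or swss.sh code: the helper names `log_error` and `run_quiet` and the `phcsync` tag are hypothetical. It also assumes the util-linux `logger`, whose documented way to pin the logged PID is the long option `--id=<pid>`; the PR text writes this as `-i "$$"`.

```shell
# Sketch only: helper names and the "phcsync" tag are hypothetical.

# Log an error under the calling script's PID, so every message from one
# script run shares a single sender identity (util-linux logger --id=<pid>).
log_error() {
    logger --id="$$" -p user.err -t phcsync -- "$1"
}

# Run a command with stdout and stderr discarded; log only when it fails.
run_quiet() {
    run_quiet_rc=0
    "$@" >/dev/null 2>&1 || run_quiet_rc=$?
    if [ "$run_quiet_rc" -ne 0 ]; then
        log_error "command failed (rc=$run_quiet_rc): $*"
    fi
    return "$run_quiet_rc"
}

# Example call shape (the arguments after -q -Q are elided in the PR text):
#   run_quiet phc_ctl -q -Q ...
```

With this shape a successful run emits no syslog traffic at all, and a failing run emits one message tagged with the PID of the long-lived script rather than of a short-lived child.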

syncd
On every phcsync.sh run, syncd currently emits the following log, and each occurrence creates a new ratelimiter context because it comes from a new PID:

syslog.1:15477:2026 Mar  2 22:25:01.754471 sonic NOTICE syncd#phc_ctl: [561375.455] set clock time to 1772490301.754287973 or Mon Mar  2 22:25:01 2026

logger commands
Before

Mar 04 03:55:44 sonic root[1775781]: Starting swss service...
Mar 04 03:55:44 sonic root[1775785]: Locking /tmp/swss-syncd-lock from swss service
Mar 04 03:55:44 sonic root[1775792]: Locked /tmp/swss-syncd-lock (10) from swss service
Mar 04 03:55:44 sonic root[1775816]: Warm boot flag: swss false.
Mar 04 03:55:44 sonic root[1775822]: Flushing APP, ASIC, COUNTER, CONFIG, and partial STATE databases ...
Mar 04 03:55:45 sonic root[1776045]: Started swss service...
Mar 04 03:55:45 sonic root[1776051]: Unlocking /tmp/swss-syncd-lock (10) from swss service

After

Mar 04 03:58:52 sonic root[1891651]: Starting swss service...
Mar 04 03:58:52 sonic root[1891651]: Locking /tmp/swss-syncd-lock from swss service
Mar 04 03:58:52 sonic root[1891651]: Locked /tmp/swss-syncd-lock (10) from swss service
Mar 04 03:58:52 sonic root[1891651]: Warm boot flag: swss false.
Mar 04 03:58:52 sonic root[1891651]: Flushing APP, ASIC, COUNTER, CONFIG, and partial STATE databases ...
Mar 04 03:58:53 sonic root[1891651]: Started swss service...
Mar 04 03:58:53 sonic root[1891651]: Unlocking /tmp/swss-syncd-lock (10) from swss service
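
A quick way to quantify the difference between the two excerpts is to count distinct sender PIDs. The helper below is a hypothetical sketch that assumes the `tag[pid]: message` layout shown in the logs above.

```shell
# Sketch: count distinct sender PIDs in syslog/journal lines read from stdin.
# Assumes each line carries a "tag[pid]: " prefix as in the excerpts above.
count_sender_pids() {
    sed -n 's/.*\[\([0-9][0-9]*\)\]: .*/\1/p' | LC_ALL=C sort -u | wc -l | tr -d ' '
}

# Example: journalctl -t root --since "-5min" | count_sender_pids
```

Applied to the excerpts above, the "Before" lines would count seven senders and the "After" lines one.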

How to verify it

  • imuxsock.ratelimit.numratelimiters in syncd should stop continuous growth (or reduce drastically).
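
A minimal sketch of that check, assuming impstats is logging in its plain key=value format (lines containing `imuxsock` and `ratelimit.numratelimiters=N`); the function names are hypothetical and the log path is left to the reader.

```shell
# Sketch: pull the numratelimiters samples out of impstats lines on stdin.
# Assumes the key=value impstats format; adjust the pattern for JSON/CEE output.
extract_numratelimiters() {
    sed -n 's/.*imuxsock.*ratelimit\.numratelimiters=\([0-9][0-9]*\).*/\1/p'
}

# Print last sample minus first sample; prints nothing if there were no samples.
ratelimiter_growth() {
    awk 'NR == 1 { first = $1 } { last = $1 } END { if (NR > 0) print last - first }'
}

# Example: grep imuxsock <impstats log> | extract_numratelimiters | ratelimiter_growth
```

A growth value that stays at or near zero over a long observation window is what the fix is expected to produce.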

Which release branch to backport (provide reason below if selected)

  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

@tirupatihemanth changed the title from "Fix rsyslogd memory growth in syncd wjh swss containers over long term" to "Fix rsyslogd memory growth in syncd swss containers over long term" Mar 3, 2026
@tirupatihemanth force-pushed the rsyslogd_fix branch 2 times, most recently from eb207a6 to a72dfb4 March 4, 2026 04:01
tirupatihemanth pushed a commit that referenced this pull request Mar 13, 2026
…net#25643)

* [build] Add build timing report and dependency analysis tools

Add three scripts for build performance instrumentation:

- scripts/build-timing-report.sh: Parse per-package timing from build
  logs (HEADER/FOOTER timestamps), generate sorted duration table,
  phase breakdown, parallelism timeline, and CSV export.

- scripts/build-dep-graph.py: Parse rules/*.mk dependency graph,
  compute critical path, fan-out/fan-in bottleneck analysis, and
  generate DOT/JSON output for visualization.

- scripts/build-resource-monitor.sh: Sample CPU, memory, disk I/O,
  and Docker container count during builds for resource utilization
  analysis.

Add "make build-report" target to slave.mk that runs the timing
report and dependency analysis after a build completes.

Example output from a VS build on 24-core/30GB machine:
- 210 packages built in 53m wall time (173m CPU)
- Max concurrency: 5 (with SONIC_CONFIG_BUILD_JOBS=4)
- Critical path: 14 packages deep (libnl -> libswsscommon -> utilities)
- Top bottleneck: LIBSWSSCOMMON with 48 downstream dependents

Signed-off-by: Rustiqly <[email protected]>

* Address Copilot review: fix 17 bugs in build analysis scripts

- Use free -m with division instead of free -g to avoid rounding (#1)
- Add = and ?= to Makefile dependency regex patterns (#2, #7)
- CPU calculation now uses /proc/stat delta (two reads) (#3, #14)
- Fix misleading 'critical path estimate' comment (#4)
- Fix parallelism timeline comment (60s not 10s) (#5)
- Include after-relationship packages in fan stats (#6)
- Guard disk I/O division by zero when INTERVAL<=1 (#8)
- Remove unused elapsed_line variable (#9)
- Remove redundant LIBSWSSCOMMON_DBG check (#10)
- Remove active_make_jobs from CSV header comment (#11)
- Wire up _RDEPENDS parsing to build reverse deps (#12)
- Remove unnecessary 'if v' filter on rdeps JSON (#13)
- Remove unused REPORT_FORMAT parameter (#15)
- Add cycle detection to critical path algorithm (#16)
- Add execute permission check for companion scripts (#17)
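
As an illustration of the /proc/stat delta approach named in (#3, #14), the sketch below computes busy CPU percentage from two aggregate "cpu" lines. It is not the code from build-resource-monitor.sh; the function name is hypothetical and the field layout follows proc(5).

```shell
# Sketch: busy CPU percentage between two aggregate "cpu ..." lines of /proc/stat.
# Fields per proc(5): user nice system idle iowait irq softirq steal [guest ...].
# guest/guest_nice are skipped because they are already included in user/nice.
cpu_busy_percent() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            idle = $5 + $6                     # idle + iowait
            if (NR == 1) { t0 = total; idle0 = idle }
            else         { t1 = total; idle1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print 0; exit }     # guard against a zero-length interval
            printf "%d\n", 100 * (dt - (idle1 - idle0)) / dt
        }'
}

# Example: a=$(head -n 1 /proc/stat); sleep 1; b=$(head -n 1 /proc/stat)
#          cpu_busy_percent "$a" "$b"
```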

Signed-off-by: Rustiqly <[email protected]>

---------

Signed-off-by: Rustiqly <[email protected]>
Co-authored-by: Rustiqly <[email protected]>
auspham and others added 13 commits March 17, 2026 11:54
Why I did it
After sonic-net#25876, some vulnerabilities still remain. The majority are in gnoic and the Go library. Some new ones were introduced by the upgrade to protobuf==6.31.1.

This PR tries to address them.
Signed-off-by: Austin Pham <[email protected]>
Why I did it
Gate LLDP and SNMP docker features behind INCLUDE_LLDP and INCLUDE_SNMP build flags, consistent with how other optional features (e.g., SFLOW, NAT, TEAMD) are already gated. This allows platforms like nvidia-bluefield and pensando to properly disable these features via their DISABLED_FEATURE_FLAGS.

A follow-up change needed after sonic-net#25032.

Fixes: sonic-net#25891

Work item tracking
Microsoft ADO (number only):
How I did it
Removed lldp and snmp from the static features list in init_cfg.json.j2 and added them as conditional features gated by include_lldp and include_snmp variables.
Added INCLUDE_SNMP ?= y and INCLUDE_LLDP ?= y config flags in rules/config (defaulting to enabled).
Exported include_snmp and include_lldp variables in slave.mk so they are available during image build.
Added INCLUDE_SNMP and INCLUDE_LLDP to DISABLED_FEATURE_FLAGS in nvidia-bluefield and pensando platform makefiles.
How to verify it
Build nvidia-bluefield or pensando images and verify LLDP/SNMP are disabled.
… image (sonic-net#26014)

* BROADCOM_LEGACY_SAI_COMPAT: Fix sai_get_stats_ext crash on TH1 legacy image

Add SAI_STATS_EXT_SWITCH_SUPPORTED=0 to sai.profile for Arista 7060cx
(BCM56960/Tomahawk-1) to disable sai_get_stats_ext for switch objects.
The legacy SAI binary crashes when FlexCounter calls sai_get_stats_ext
on switch objects during polling.

The runtime guard is implemented in sonic-sairedis PR sonic-net#1789.

Signed-off-by: Liping Xu <[email protected]>

* BROADCOM_LEGACY_SAI_COMPAT: Add missing Q32 HWSKU sai.profile.j2 keys for TH1

Arista-7060CX-32S-Q32 uses a Jinja2 template (sai.profile.j2) rather than
a static sai.profile. Add SAI_STATS_ST_CAPABILITY_SUPPORTED=0 and
SAI_STATS_EXT_SWITCH_SUPPORTED=0 to cover Q32 as well as the other
Arista-7060CX-32S HWSKUs.

Signed-off-by: Liping Xu <[email protected]>

* BROADCOM_LEGACY_SAI_COMPAT: Fix Q32 sai.profile.j2 - restore single quotes, keep SAI keys

Restore original Jinja2 single-quote style (changed unintentionally in the
previous commit). Only intended change is adding SAI_STATS_ST_CAPABILITY_SUPPORTED=0
and SAI_STATS_EXT_SWITCH_SUPPORTED=0 for TH1/BCM56960.

Signed-off-by: Liping Xu <[email protected]>

---------

Signed-off-by: Liping Xu <[email protected]>
…net#25357)

Why I did it: Improve ACL YANG model by enforcing that TCP_FLAGS can only be used in ACL table types that explicitly support this match field, ensuring correct model behavior and configuration validation.

How I did it: Updated the ACL YANG model to add a must constraint for TCP_FLAGS. Added/updated test cases and configuration to verify the new constraint.

How to verify it: Validate the ACL YANG model with various configs. Incorrect usage of TCP_FLAGS now triggers a must constraint error.

Signed-off-by: Xincun Li <[email protected]>
…t#25963)

What is the motivation for this PR:
- Add CI build target for aspeed-arm64 so sonic-aspeed-arm64.bin is produced for AST2700 BMC; platform support exists but no CI target.

How did you do it:
- Added aspeed-arm64 build target in azure-pipelines.yml with PLATFORM_NAME=aspeed and PLATFORM_ARCH=arm64 on sonicso1ES-arm64 pool.

How did you verify/test it:
- Pipeline run produced sonic-aspeed-arm64.bin; installed on NextHop B27 and verified 7 containers running.

Signed-off-by: zitingguo <[email protected]>
…e missing (sonic-net#26180)

Why I did it
Currently, if any of restapi's root CA cert, server cert, or server key are missing, restapi_watchdog reports the restapi's status as unhealthy. This can cause some issues in production (e.g., the restapi container might be repeatedly restarted).

Work item tracking
Microsoft ADO (number only): 37113632
How I did it
restapi_watchdog first looks at restapi's cert and key paths in CONFIG DB. If there is an error (very unlikely) or any of these paths are not defined, it assumes that the certs are missing. Otherwise, it checks if they exist. If any of the certs or the server key are missing, it returns an HTTP 200 OK response. Otherwise, it will check the restapi's status.
Note: restapi_watchdog can only check the /etc/sonic/credentials/ directory. If the cert or key paths point to files in other directories, then restapi_watchdog assumes that they exist.
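
The decision flow described above can be sketched in shell as follows. This only illustrates the logic, not the watchdog's actual implementation: the function names are hypothetical, and the three paths are assumed to have been read from CONFIG DB already (an empty string stands for "undefined or lookup error").

```shell
# Sketch of the decision logic only; names are hypothetical.
CRED_DIR="/etc/sonic/credentials"   # the only directory the watchdog can inspect

# Return 0 if the file should be treated as present, 1 if missing.
cert_considered_present() {
    path=$1
    [ -n "$path" ] || return 1             # undefined path: treat as missing
    case "$path" in
        "$CRED_DIR"/*) [ -e "$path" ] ;;   # inspectable: must actually exist
        *) return 0 ;;                     # elsewhere: assumed to exist
    esac
}

# Print "skip" (answer HTTP 200 OK without probing restapi) if any of the
# three files is missing, otherwise "probe" (go on to check restapi status).
watchdog_decision() {
    for p in "$1" "$2" "$3"; do
        cert_considered_present "$p" || { echo skip; return 0; }
    done
    echo probe
}
```

Returning "skip" for missing certificates is what prevents an unprovisioned switch from being reported as unhealthy.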

How to verify it
Build the restapi_watchdog docker image and create a restapi_watchdog container on your switch:

$ docker load < docker-restapi-watchdog.gz
$ docker create --net=host -t -v /etc/localtime:/etc/localtime:ro -v /etc/sonic/credentials:/etc/sonic/credentials:ro --name="restapi_watchdog" docker-restapi-watchdog:latest
$ docker start restapi_watchdog
…onic-net#26144)

* [lldp] Fix transient MAC address Port ID during LLDP daemon startup

When lldpd starts (or restarts), it defaults to using MAC addresses as
Port IDs for all interfaces. The lldpmgrd daemon later reconfigures each
port with the correct interface alias via lldpcli, but there is a 2-3
second window where lldpd has already auto-resumed and sent LLDP frames
with MAC-based Port IDs.

This causes peers to see a transient MSAP change: first a neighbor entry
with MAC Port ID appears, then a shutdown frame, then a new entry with
the correct interface name. This neighbor flapping can trigger monitoring
alerts and confuse network management systems.

Root cause: lldpd internally auto-resumes after processing all config
file lines. The existing pause directive in lldpd.conf is redundant
(lldpd starts paused by default) and gets overridden by the auto-resume.
Since no port ID configs exist in the initial config, the first frames
use the default MAC-based Port IDs.

Fix: Add portidsubtype configuration for all front-panel ports directly
in the lldpd.conf.j2 Jinja2 template. These configs are processed during
lldpd startup config loading, before the auto-resume fires, ensuring the
very first LLDP frame carries the correct interface alias as Port ID.

Verified on Arista-7260CX3-C64 (SONiC.20251110.15) with tcpdump capture
showing the first LLDP frame after restart has Subtype Local (7) with
correct port alias (e.g., Ethernet1/1) instead of MAC address.

- Sort PORT iteration with |sort for deterministic config ordering
- Filter out special ports (Ethernet-IB/Rec/BP) matching lldpmgrd behavior
- Fallback to port_name when alias is missing or empty
- Add unit test with PORT table covering aliases, special ports, and
  missing alias scenarios
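
The port selection rules listed above (sorted iteration, special-port filtering, alias fallback) can be illustrated outside Jinja2 with a small shell sketch. The function name is hypothetical, the input is assumed to be "port alias" pairs on stdin, and the emitted directive follows the lldpcli form `configure ports <port> lldp portidsubtype local <id>`, which is an assumption about what the template renders.

```shell
# Sketch: turn "PORT [ALIAS]" lines on stdin into lldpd.conf port ID directives.
gen_portid_config() {
    LC_ALL=C sort | while read -r port alias; do
        [ -n "$port" ] || continue                  # skip blank lines
        case "$port" in
            Ethernet-IB*|Ethernet-Rec*|Ethernet-BP*) continue ;;  # special ports
        esac
        # Fall back to the port name when the alias is missing or empty.
        echo "configure ports $port lldp portidsubtype local ${alias:-$port}"
    done
}
```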

Signed-off-by: Zhaohui Sun <[email protected]>
How I did it
562f89b (HEAD -> main, origin/main, origin/HEAD) [H6-128] Change HWSKU name and remove port I2C clock setting (sonic-net#44)
1babc9b [H6-64] Update thermal algo, FW upgrade, PSU modules (sonic-net#43)
d742aeb [H6-128] Add support for H6-128 Platform (sonic-net#42)
f2dbdd7 Fix the UnboundLocalError to address thermalctld stacktrace
54e453a Add _cpm_firmware_upgrade_reboot_IMMs method
908d62a Fix the Unable_to_reach_CPM reboot-cause show issue

Signed-off-by: y7zhou <[email protected]>
Why I did it
arp_update logs for static routes showed entries like:
98 static route nexthop not resolved ...
which could be misinterpreted as 98 unreachable servers.

This change modifies the logger() function to remove PID argument.

Work item tracking
Microsoft ADO (number only):
How I did it
PID in arp_update logs is not needed.
Keeping only tag makes logs cleaner while preserving filterability (-t arp_update).
How to verify it

Signed-off-by: Janet Cui <[email protected]>
… automatically (sonic-net#26219)

#### Why I did it
src/sonic-platform-common
```
* 972ff46 - (HEAD -> master, origin/master, origin/HEAD) Add bank parameter to SfpBase and SfpOptoeBase class (sonic-net#632) (23 hours ago) [Bobby McGonigle]
```
#### How I did it
#### How to verify it
#### Description for the changelog
…atically (sonic-net#26224)

#### Why I did it
src/sonic-utilities
```
* 9a408e61 - (HEAD -> master, origin/master, origin/HEAD) [packet manager] Handle enabled/disabled action for generated/transient units (sonic-net#4272) (8 hours ago) [Nazarii Hnydyn]
```
#### How I did it
#### How to verify it
#### Description for the changelog
…tically (sonic-net#26221)

#### Why I did it
src/sonic-sairedis
```
* 4758c3cc - (HEAD -> master, origin/master, origin/HEAD) BROADCOM_LEGACY_SAI_COMPAT: Fix sai_get_stats_ext crash on Tomahawk-1 (BCM56960) legacy platforms (sonic-net#1789) (20 hours ago) [Liping Xu]
* 78470325 - BROADCOM_LEGACY_SAI_COMPAT: Fix sai_query_stats_st_capability crash on Tomahawk-1 (BCM56960) legacy platforms (sonic-net#1788) (23 hours ago) [Liping Xu]
```
#### How I did it
#### How to verify it
#### Description for the changelog