From 2ea3e1d3d5da9b942fb345df259ef2f6adf65586 Mon Sep 17 00:00:00 2001
From: Sonic Build Admin
Date: Fri, 20 Mar 2026 20:57:59 +0000
Subject: [PATCH] Fix rsyslogd memory growth in syncd swss containers over long term

#### Why I did it
1. We observed long-term rsyslogd memory growth in the syncd container.
2. Deep diagnostics (impstats) showed imuxsock.ratelimit.numratelimiters growing continuously (about 2/min) while queue depth stayed near zero, indicating sender/PID churn rather than queue backlog.
3. phcsync.sh runs every 60 seconds and repeatedly invokes phc_ctl for /dev/ptp* devices. These short-lived invocations contribute new sender identities seen by imuxsock, which correlates with ratelimiter-state growth and with memory increase over time, because rsyslogd stores ratelimiting data structures for the senders it sees.

##### Work item tracking
- Microsoft ADO **(number only)**:

#### How I did it
- Updated phcsync.sh in SONiC to keep successful phc_ctl execution silent:
  - Use `phc_ctl -q -Q ... >/dev/null`
  - Keep explicit error handling and error logs on non-zero exit.
- Added stable logger identity in service debug helpers:
  - `logger --id=$$ -- "$1"` in syncd_common.sh and swss.sh.

This reduces per-call sender churn during script execution phases (start/wait/stop).

syncd

On every sync cycle we currently see the following log from syncd, and it creates a new ratelimiter context in rsyslogd because phc_ctl runs with a new PID each time
```
syslog.1:15477:2026 Mar 2 22:25:01.754471 sonic NOTICE syncd#phc_ctl: [561375.455] set clock time to 1772490301.754287973 or Mon Mar 2 22:25:01 2026
```

logger commands before
```
Mar 04 03:55:44 sonic root[1775781]: Starting swss service...
Mar 04 03:55:44 sonic root[1775785]: Locking /tmp/swss-syncd-lock from swss service
Mar 04 03:55:44 sonic root[1775792]: Locked /tmp/swss-syncd-lock (10) from swss service
Mar 04 03:55:44 sonic root[1775816]: Warm boot flag: swss false.
Mar 04 03:55:44 sonic root[1775822]: Flushing APP, ASIC, COUNTER, CONFIG, and partial STATE databases ...
Mar 04 03:55:45 sonic root[1776045]: Started swss service...
Mar 04 03:55:45 sonic root[1776051]: Unlocking /tmp/swss-syncd-lock (10) from swss service
```

After
```
Mar 04 03:58:52 sonic root[1891651]: Starting swss service...
Mar 04 03:58:52 sonic root[1891651]: Locking /tmp/swss-syncd-lock from swss service
Mar 04 03:58:52 sonic root[1891651]: Locked /tmp/swss-syncd-lock (10) from swss service
Mar 04 03:58:52 sonic root[1891651]: Warm boot flag: swss false.
Mar 04 03:58:52 sonic root[1891651]: Flushing APP, ASIC, COUNTER, CONFIG, and partial STATE databases ...
Mar 04 03:58:53 sonic root[1891651]: Started swss service...
Mar 04 03:58:53 sonic root[1891651]: Unlocking /tmp/swss-syncd-lock (10) from swss service
```

#### How to verify it
- imuxsock.ratelimit.numratelimiters in syncd should stop continuous growth (or reduce drastically).

#### Which release branch to backport (provide reason below if selected)
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505
- [X] 202511

Signed-off-by: Sonic Build Admin
---
 files/scripts/swss.sh                          | 4 +++-
 files/scripts/syncd_common.sh                  | 4 +++-
 platform/mellanox/docker-syncd-mlnx/phcsync.sh | 3 ++-
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/files/scripts/swss.sh b/files/scripts/swss.sh
index 5f4376e7ced..70ceb40521e 100755
--- a/files/scripts/swss.sh
+++ b/files/scripts/swss.sh
@@ -14,7 +14,9 @@ TSA_TSB_SERVICE="startup_tsa_tsb.service"
 
 function debug()
 {
-    /usr/bin/logger $1
+    # Use --id=$$ so all messages from this script share the parent shell's PID,
+    # preventing rsyslog imuxsock ratelimiter memory growth.
+    /usr/bin/logger --id=$$ -- "$1"
     /bin/echo `date` "- $1" >> ${DEBUGLOG}
 }
 
diff --git a/files/scripts/syncd_common.sh b/files/scripts/syncd_common.sh
index d3a8b0df7c4..9d62ea03b8e 100755
--- a/files/scripts/syncd_common.sh
+++ b/files/scripts/syncd_common.sh
@@ -15,7 +15,9 @@
 
 function debug()
 {
-    /usr/bin/logger $1
+    # Use --id=$$ so all messages from this script share the parent shell's PID,
+    # preventing rsyslog imuxsock ratelimiter memory growth.
+    /usr/bin/logger --id=$$ -- "$1"
     /bin/echo `date` "- $1" >> ${DEBUGLOG}
 }
 
diff --git a/platform/mellanox/docker-syncd-mlnx/phcsync.sh b/platform/mellanox/docker-syncd-mlnx/phcsync.sh
index 910399b0218..2923e4e5371 100755
--- a/platform/mellanox/docker-syncd-mlnx/phcsync.sh
+++ b/platform/mellanox/docker-syncd-mlnx/phcsync.sh
@@ -82,7 +82,8 @@ while :; do
 
         if [[ "$clock_name" != "mlx5_ptp" ]]; then
             # set CLOCK_REALTIME
-            "$PHC_CTL" "$dev" set 2>/dev/null
+            # Keep successful syncs silent to avoid rsyslogd ratelimit memory issue due to PID churn.
+            "$PHC_CTL" -q -Q "$dev" set >/dev/null
             PHC_CTL_EXIT_CODE=$?
             if [[ $PHC_CTL_EXIT_CODE -ne 0 ]]; then
                 echo "Error: Failed to sync clock for $dev (phc_ctl exit code: $PHC_CTL_EXIT_CODE)" >&2