Fix rsyslogd memory growth in syncd swss containers over long term #14
Closed
tirupatihemanth wants to merge 13 commits into master
8236e78 to 17efd90
eb207a6 to a72dfb4
vivekrnv approved these changes Mar 4, 2026
tirupatihemanth pushed a commit that referenced this pull request Mar 13, 2026
…net#25643)

* [build] Add build timing report and dependency analysis tools

Add three scripts for build performance instrumentation:
- scripts/build-timing-report.sh: Parse per-package timing from build logs (HEADER/FOOTER timestamps), generate sorted duration table, phase breakdown, parallelism timeline, and CSV export.
- scripts/build-dep-graph.py: Parse rules/*.mk dependency graph, compute critical path, fan-out/fan-in bottleneck analysis, and generate DOT/JSON output for visualization.
- scripts/build-resource-monitor.sh: Sample CPU, memory, disk I/O, and Docker container count during builds for resource utilization analysis.

Add "make build-report" target to slave.mk that runs the timing report and dependency analysis after a build completes.

Example output from a VS build on 24-core/30GB machine:
- 210 packages built in 53m wall time (173m CPU)
- Max concurrency: 5 (with SONIC_CONFIG_BUILD_JOBS=4)
- Critical path: 14 packages deep (libnl -> libswsscommon -> utilities)
- Top bottleneck: LIBSWSSCOMMON with 48 downstream dependents

Signed-off-by: Rustiqly <[email protected]>

* Address Copilot review: fix 17 bugs in build analysis scripts

- Use free -m with division instead of free -g to avoid rounding (#1)
- Add = and ?= to Makefile dependency regex patterns (#2, #7)
- CPU calculation now uses /proc/stat delta (two reads) (#3, #14)
- Fix misleading 'critical path estimate' comment (#4)
- Fix parallelism timeline comment (60s not 10s) (#5)
- Include after-relationship packages in fan stats (#6)
- Guard disk I/O division by zero when INTERVAL<=1 (#8)
- Remove unused elapsed_line variable (#9)
- Remove redundant LIBSWSSCOMMON_DBG check (#10)
- Remove active_make_jobs from CSV header comment (#11)
- Wire up _RDEPENDS parsing to build reverse deps (#12)
- Remove unnecessary 'if v' filter on rdeps JSON (#13)
- Remove unused REPORT_FORMAT parameter (#15)
- Add cycle detection to critical path algorithm (#16)
- Add execute permission check for companion scripts (#17)

Signed-off-by: Rustiqly <[email protected]>

---------

Signed-off-by: Rustiqly <[email protected]>
Co-authored-by: Rustiqly <[email protected]>
Why I did it
After sonic-net#25876, some vulnerabilities remain. Most are in gnoic and the Go library. Some were newly introduced by the upgrade to protobuf==6.31.1. This PR tries to address them.
Signed-off-by: Austin Pham <[email protected]>
Why I did it
Gate LLDP and SNMP docker features behind INCLUDE_LLDP and INCLUDE_SNMP build flags, consistent with how other optional features (e.g., SFLOW, NAT, TEAMD) are already gated. This allows platforms like nvidia-bluefield and pensando to properly disable these features via their DISABLED_FEATURE_FLAGS. This is a follow-up change needed after sonic-net#25032.
Fixes: sonic-net#25891
Work item tracking
Microsoft ADO (number only):
How I did it
- Removed lldp and snmp from the static features list in init_cfg.json.j2 and added them as conditional features gated by include_lldp and include_snmp variables.
- Added INCLUDE_SNMP ?= y and INCLUDE_LLDP ?= y config flags in rules/config (defaulting to enabled).
- Exported include_snmp and include_lldp variables in slave.mk so they are available during image build.
- Added INCLUDE_SNMP and INCLUDE_LLDP to DISABLED_FEATURE_FLAGS in nvidia-bluefield and pensando platform makefiles.
How to verify it
Build nvidia-bluefield or pensando images and verify LLDP/SNMP are disabled.
… image (sonic-net#26014)

* BROADCOM_LEGACY_SAI_COMPAT: Fix sai_get_stats_ext crash on TH1 legacy image

Add SAI_STATS_EXT_SWITCH_SUPPORTED=0 to sai.profile for Arista 7060cx (BCM56960/Tomahawk-1) to disable sai_get_stats_ext for switch objects. The legacy SAI binary crashes when FlexCounter calls sai_get_stats_ext on switch objects during polling. The runtime guard is implemented in sonic-sairedis PR sonic-net#1789.

Signed-off-by: Liping Xu <[email protected]>

* BROADCOM_LEGACY_SAI_COMPAT: Add missing Q32 HWSKU sai.profile.j2 keys for TH1

Arista-7060CX-32S-Q32 uses a Jinja2 template (sai.profile.j2) rather than a static sai.profile. Add SAI_STATS_ST_CAPABILITY_SUPPORTED=0 and SAI_STATS_EXT_SWITCH_SUPPORTED=0 to cover Q32 as well as the other Arista-7060CX-32S HWSKUs.

Signed-off-by: Liping Xu <[email protected]>

* BROADCOM_LEGACY_SAI_COMPAT: Fix Q32 sai.profile.j2 - restore single quotes, keep SAI keys

Restore original Jinja2 single-quote style (changed unintentionally in the previous commit). Only intended change is adding SAI_STATS_ST_CAPABILITY_SUPPORTED=0 and SAI_STATS_EXT_SWITCH_SUPPORTED=0 for TH1/BCM56960.

Signed-off-by: Liping Xu <[email protected]>

---------

Signed-off-by: Liping Xu <[email protected]>
…net#25357)

Why I did it:
Improve ACL YANG model by enforcing that TCP_FLAGS can only be used in ACL table types that explicitly support this match field, ensuring correct model behavior and configuration validation.

How I did it:
Updated the ACL YANG model to add a must constraint for TCP_FLAGS. Added/updated test cases and configuration to verify the new constraint.

How to verify it:
Validate the ACL YANG model with various configs. Incorrect usage of TCP_FLAGS now triggers a must constraint error.

Signed-off-by: Xincun Li <[email protected]>
…t#25963)

What is the motivation for this PR:
- Add CI build target for aspeed-arm64 so sonic-aspeed-arm64.bin is produced for AST2700 BMC; platform support exists but no CI target.

How did you do it:
- Added aspeed-arm64 build target in azure-pipelines.yml with PLATFORM_NAME=aspeed and PLATFORM_ARCH=arm64 on sonicso1ES-arm64 pool.

How did you verify/test it:
- Pipeline run produced sonic-aspeed-arm64.bin; installed on NextHop B27 and verified 7 containers running.

Signed-off-by: zitingguo <[email protected]>
…e missing (sonic-net#26180)

Why I did it
Currently, if any of restapi's root CA cert, server cert, or server key are missing, restapi_watchdog reports the restapi's status as unhealthy. This can cause some issues in production (e.g., the restapi container might be repeatedly restarted).

Work item tracking
Microsoft ADO (number only): 37113632

How I did it
restapi_watchdog first looks at restapi's cert and key paths in CONFIG DB. If there is an error (very unlikely) or any of these paths are not defined, it assumes that the certs are missing. Otherwise, it checks if they exist. If any of the certs or the server key are missing, it returns an HTTP 200 OK response. Otherwise, it will check the restapi's status.

Note: restapi_watchdog can only check the /etc/sonic/credentials/ directory. If the cert or key paths point to files in other directories, then restapi_watchdog assumes that they exist.

How to verify it
Build the restapi_watchdog docker image and create a restapi_watchdog container on your switch:
$ docker load < docker-restapi-watchdog.gz
$ docker create --net=host -t -v /etc/localtime:/etc/localtime:ro -v /etc/sonic/credentials:/etc/sonic/credentials:ro --name="restapi_watchdog" docker-restapi-watchdog:latest
$ docker start restapi_watchdog
…onic-net#26144)

* [lldp] Fix transient MAC address Port ID during LLDP daemon startup

When lldpd starts (or restarts), it defaults to using MAC addresses as Port IDs for all interfaces. The lldpmgrd daemon later reconfigures each port with the correct interface alias via lldpcli, but there is a 2-3 second window where lldpd has already auto-resumed and sent LLDP frames with MAC-based Port IDs.

This causes peers to see a transient MSAP change: first a neighbor entry with MAC Port ID appears, then a shutdown frame, then a new entry with the correct interface name. This neighbor flapping can trigger monitoring alerts and confuse network management systems.

Root cause: lldpd internally auto-resumes after processing all config file lines. The existing pause directive in lldpd.conf is redundant (lldpd starts paused by default) and gets overridden by the auto-resume. Since no port ID configs exist in the initial config, the first frames use the default MAC-based Port IDs.

Fix: Add portidsubtype configuration for all front-panel ports directly in the lldpd.conf.j2 Jinja2 template. These configs are processed during lldpd startup config loading, before the auto-resume fires, ensuring the very first LLDP frame carries the correct interface alias as Port ID.

Verified on Arista-7260CX3-C64 (SONiC.20251110.15) with tcpdump capture showing the first LLDP frame after restart has Subtype Local (7) with correct port alias (e.g., Ethernet1/1) instead of MAC address.

- Sort PORT iteration with |sort for deterministic config ordering
- Filter out special ports (Ethernet-IB/Rec/BP) matching lldpmgrd behavior
- Fallback to port_name when alias is missing or empty
- Add unit test with PORT table covering aliases, special ports, and missing alias scenarios

Signed-off-by: Zhaohui Sun <[email protected]>
How I did it
562f89b (HEAD -> main, origin/main, origin/HEAD) [H6-128] Change HWSKU name and remove port I2C clock setting (sonic-net#44)
1babc9b [H6-64] Update thermal algo, FW upgrade, PSU modules (sonic-net#43)
d742aeb [H6-128] Add support for H6-128 Platform (sonic-net#42)
f2dbdd7 Fix the UnboundLocalError to address thermalctld stacktrace
54e453a Add _cpm_firmware_upgrade_reboot_IMMs method
908d62a Fix the Unable_to_reach_CPM reboot-cause show issue
Signed-off-by: y7zhou <[email protected]>
Why I did it
arp_update logs for static routes showed entries like:
98 static route nexthop not resolved ...
which could be misinterpreted as 98 unreachable servers. This change modifies the logger() function to remove the PID argument.
Work item tracking
Microsoft ADO (number only):
How I did it
The PID in arp_update logs is not needed. Keeping only the tag makes logs cleaner while preserving filterability (-t arp_update).
How to verify it
Signed-off-by: Janet Cui <[email protected]>
… automatically (sonic-net#26219)

#### Why I did it
src/sonic-platform-common
```
* 972ff46 - (HEAD -> master, origin/master, origin/HEAD) Add bank parameter to SfpBase and SfpOptoeBase class (sonic-net#632) (23 hours ago) [Bobby McGonigle]
```
#### How I did it
#### How to verify it
#### Description for the changelog
…atically (sonic-net#26224)

#### Why I did it
src/sonic-utilities
```
* 9a408e61 - (HEAD -> master, origin/master, origin/HEAD) [packet manager] Handle enabled/disabled action for generated/transient units (sonic-net#4272) (8 hours ago) [Nazarii Hnydyn]
```
#### How I did it
#### How to verify it
#### Description for the changelog
…tically (sonic-net#26221)

#### Why I did it
src/sonic-sairedis
```
* 4758c3cc - (HEAD -> master, origin/master, origin/HEAD) BROADCOM_LEGACY_SAI_COMPAT: Fix sai_get_stats_ext crash on Tomahawk-1 (BCM56960) legacy platforms (sonic-net#1789) (20 hours ago) [Liping Xu]
* 78470325 - BROADCOM_LEGACY_SAI_COMPAT: Fix sai_query_stats_st_capability crash on Tomahawk-1 (BCM56960) legacy platforms (sonic-net#1788) (23 hours ago) [Liping Xu]
```
#### How I did it
#### How to verify it
#### Description for the changelog
Signed-off-by: Hemanth Kumar Tirupati <[email protected]>
a72dfb4 to 2ef93c5
Why I did it
Work item tracking
How I did it
phc_ctl -q -Q ... >/dev/null 2>&1
logger -i "$$" -- "$1" in syncd_common.sh and swss.sh. This reduces per-call sender churn during script execution phases (start/wait/stop).
syncd
Currently, syncd emits a log line every second, and each one creates a new ratelimiter context because it arrives with a new PID.
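A minimal sketch of the stable-PID idea, assuming util-linux `logger`, whose man page documents the optional ID as `--id[=id]`. The helper name `log_msg`, the tag `example_script`, and the `LOGGER_BIN` override are hypothetical and are not taken from syncd_common.sh or swss.sh. Each message is tagged with the script's own PID (`$$`), so the receiver should see one sender for the whole script run rather than one per `logger` invocation.

```shell
#!/bin/bash
# Hypothetical helper for illustration only; not the code from this PR.
# LOGGER_BIN is an assumed override hook so the helper can be dry-run.
LOGGER_BIN="${LOGGER_BIN:-logger}"

log_msg() {
    # --id=$$ asks logger to report the calling script's PID instead of
    # the PID of each short-lived logger process (util-linux syntax).
    # "--" ends option parsing so a message starting with "-" is safe.
    "$LOGGER_BIN" --id="$$" -t "example_script" -- "$1"
}

# Usage (sends to syslog when LOGGER_BIN is left at its default):
#   log_msg "Starting service"
```

Setting `LOGGER_BIN=echo` before calling `log_msg` prints the arguments instead of sending them to syslog, which is a quick way to check the PID that would be attached.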
logger commands
Before
After
How to verify it
Which release branch to backport (provide reason below if selected)