[syncd.sh,pmon.service] Prevent pmon from starting ahead of syncd#8
Merged
stephenxs merged 1 commit intopmon-dependency-syncdfrom Sep 19, 2019
Merged
Conversation
During system starting, pmon isn't supposed to start ahead of syncd starting in order to avoid racing condition between syncd and pmon. Currently it is done by killing pmon if is alive when syncd is starting. However such implementation is still risky. Consider the following flow: 1. pmon is inactive when syncd.sh is checking. but syncd.sh is scheduled out somehow just ahead of "chipdown" called 2. systemd is switched in and starts pmon service 3. at this point, pmon and syncd are running simultaneously, critical section broken and racing condition formed To prevent that issue, ony solution is to add syncd as "After" in pmon.service, which ensure that whenever pmon starts syncd has been started. However, dong so requires to defer starting pmon.service after syncd.service has fully started otherwise a deadlock is formed as following: 1. syncd.sh starts pmon ahead of itself fully started, while 2. pmon not being able to start due to syncd, one of its "After", not fully started. 3. as a result, syncd and pmon have to wait for each other forever To solve that, move starting pmon.service to "wait()" so that pmon is started after syncd fully started, breaking the deadlock.
|
@stephenxs Maybe it is a good idea to have the entire synchronization flow based on systemd dependencies and remove all the rest stuff from syncd.sh? The pmon stop for warm-fast flows can be done in a dedicated scripts. Could you please check sych an option and compare it to existing approach? If you have any concerns we can discuss them. |
|
The idea to have Wants=pmon.service in syncd.service and After=syncd.service with Requires=syncd.service in pmon.service. This will definitely simplify the entire flow. What do you think? |
stephenxs
pushed a commit
that referenced
this pull request
Dec 10, 2020
This update brings in the following commits. 86c1108 Enable arm architecture to build in addition to amd64 (#37) 4acb2c3 fix bugs and enhance Transformer (#35) 49e5a22 ygot related enhancements and fixes (#34) 51224de Fix ietf yang search path for cvl schema builds (#32) 3c6cdb3 CVL Changes #8: 'must' and 'when' expression evaluation (#31) dabf231 CVL Changes #7: 'leafref' evaluation (#28) 6f9535f CVL Changes #6: Customized Xpath Engine integration (#27) 5e2466b DB-Layer fixes/enhancements (#26) 9a27302 CVL Changes #4: Implementation of new CVL APIs (#22) dbf1093 Translib support for authorization, yang versioning and Delete flag (#21) 80f369e CVL Changes #5: YParser enhancement (#23) 904ce18 CVL Changes #3: Multi-db instance support (#20) 9d24a34 CVL Changes #2: YValidator infra changes for evaluating xpath expression (#19) f3fc40f CVL Changes #1: Initial CVL code reorganization and common infra changes (#18) 4922601 Bulk and RPC API support in translib (#16) 1d730df RFC7895 yang module library implementation (#15)
stephenxs
pushed a commit
that referenced
this pull request
Nov 16, 2021
Update Barefoot platform support for Bullseye and 5.10 kernel, and add python3-venv.
stephenxs
pushed a commit
that referenced
this pull request
Feb 10, 2022
* [BFN] Updated platform APIs impl Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * Extended BFN platform SFP APIs implementation * Update sfp.py * [BFN] Extended SFP platform plugin implementation Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * [BFN] Extended Fans platform plugin implementation * [BFN] divided classes Fan and FanDrawer into 2 files * Signed-off-by: Vadym Yashchenko <vadymx.yashchenko@intel.com> What I did Add get_model() function Add get_low_critical_threshold() function Change __get(...) function. How I did it Differnece from previous implementation of __get(...) function is return real value or -9999.9 if value is not provided by thrift API * Add get_presence() function and revised __get() function Signed-off-by: Vadym Yashchenko <vadymx.yashchenko@intel.com> * [BFN] Updated PSU platform APIs impl Signed-off-by: Dmytro Lytvynenko <dmytrox.lytvynenko@intel.com> * Added BFN PSU cache (#9) Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * [BFN] Fans and Fantray platform APIs update (#7) * [BFN] Updated SFP platform APIs (#10) Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com> * [BFN] Updated platform API for thermal (#8) * Signed-off-by: Vadym Yashchenko <vadymx.yashchenko@intel.com> * Revert "[BFN] Fans and Fantray platform APIs update (#7)" (#11) This reverts commit c62a733. * Add support health monitor system (#15) Signed-off-by: Petro Bratash <petrox.bratash@intel.com> * Update chassis.py * [BFN] Updated FANs and FAN Tray platform API (#14) * Fix fix_alignment (#17) Signed-off-by: Petro Bratash <petrox.bratash@intel.com> * [BFN] Improvement show environment (#16) * Added PSU temperature skip into platform.json (#18) Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * Do not skip psud on Newport Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * [BFN] fix fan status from Not OK to Ok (#19) * [BFN] Updated SFP platform plugin (#13) Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com> * [DPB] Fix typo for Ethernet0 2x200G[100G,40G] breakout mode (#21) Signed-off-by: Mykola Gerasymenko <mykolax.gerasymenko@intel.com> * [barefoot] Tmp fix vendor_rev (#22) Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com> * Fixed python issues in sonic_platform/fan_drawer.py Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * Updated fan_drawer.py * Fixing trailing white spaces in fan_drawer.py * [BFN] Fix thrift for SFPs API Signed-off-by: Volodymyr Boyko <volodymyrx.boiko@intel.com> * In platform.json, replaced 'false' with '0' to workaround ast.literal_eval() issue Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * [Newport] Thermal manager (#23) * Signed-off-by: Vadym Yashchenko <vadymx.yashchenko@intel.com> * Revert "In platform.json, replaced 'false' with '0' to workaround ast.literal_eval() issue" This reverts commit 1e73127. * Removed 'controllable' options from platform.json to fix factory default config generation Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> * Update thermal_manager.py * Migrated SFP plugin to sonic_xcvr API (#30) Signed-off-by: Andriy Kokhan <andriyx.kokhan@intel.com> Co-authored-by: KostiantynYarovyiBf <kostiantynx.yarovyi@intel.com> Co-authored-by: Vadym Yashchenko <vadymx.yashchenko@intel.com> Co-authored-by: Dmytro Lytvynenko <dmytrox.lytvynenko@intel.com> Co-authored-by: Volodymyr Boiko <volodymyrx.boiko@intel.com> Co-authored-by: Petro Bratash <petrox.bratash@intel.com> Co-authored-by: Mykola Gerasymenko <mykolax.gerasymenko@intel.com>
stephenxs
pushed a commit
that referenced
this pull request
Dec 23, 2024
…et#21095) Adding the below fix from FRR FRRouting/frr#17297 This is to fix the following crash which is a statistical issue [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp'. Program terminated with signal SIGABRT, Aborted. #0 0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6 [Current thread is 1 (Thread 0x7fccd6faf7c0 (LWP 36))] (gdb) bt #0 0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007fccd7302fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x00007fccd72ed472 in abort () from /lib/x86_64-linux-gnu/libc.so.6 #3 0x00007fccd75bb3a9 in _zlog_assert_failed (xref=xref@entry=0x7fccd7652380 <_xref.16>, extra=extra@entry=0x0) at ../lib/zlog.c:678 #4 0x00007fccd759b2fe in route_node_delete (node=<optimized out>) at ../lib/table.c:352 #5 0x00007fccd759b445 in route_unlock_node (node=0x0) at ../lib/table.h:258 #6 route_next (node=<optimized out>) at ../lib/table.c:436 #7 route_next (node=node@entry=0x56029d89e560) at ../lib/table.c:410 #8 0x000056029b6b6b7a in if_lookup_by_name_per_ns (ns=ns@entry=0x56029d873d90, ifname=ifname@entry=0x7fccc0029340 "PortChannel1020") at ../zebra/interface.c:312 #9 0x000056029b6b8b36 in zebra_if_dplane_ifp_handling (ctx=0x7fccc0029310) at ../zebra/interface.c:1867 #10 zebra_if_dplane_result (ctx=0x7fccc0029310) at ../zebra/interface.c:2221 #11 0x000056029b7137a9 in rib_process_dplane_results (thread=<optimized out>) at ../zebra/zebra_rib.c:4810 #12 0x00007fccd75a0e0d in thread_call (thread=thread@entry=0x7ffe8e553cc0) at ../lib/thread.c:1990 #13 0x00007fccd7559368 in frr_run (master=0x56029d65a040) at ../lib/libfrr.c:1198 #14 0x000056029b6ac317 in main (argc=9, argv=0x7ffe8e5540d8) at ../zebra/main.c:478
stephenxs
pushed a commit
that referenced
this pull request
Feb 19, 2025
…et#21405) <!-- Please make sure you've read and understood our contributing guidelines: https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md failure_prs.log skip_prs.log Make sure all your commits include a signature generated with `git commit -s` ** If this is a bug fix, make sure your description includes "fixes #xxxx", or "closes #xxxx" or "resolves #xxxx" Please provide the following information: --> #### Why I did it Adding the below fix from FRR FRRouting/frr#17297 This is to fix the following crash which is a statistical issue ``` [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". Core was generated by `/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp'. Program terminated with signal SIGABRT, Aborted. #0 0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6 [Current thread is 1 (Thread 0x7fccd6faf7c0 (LWP 36))] (gdb) bt #0 0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6 #1 0x00007fccd7302fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6 #2 0x00007fccd72ed472 in abort () from /lib/x86_64-linux-gnu/libc.so.6 #3 0x00007fccd75bb3a9 in _zlog_assert_failed (xref=xref@entry=0x7fccd7652380 <_xref.16>, extra=extra@entry=0x0) at ../lib/zlog.c:678 #4 0x00007fccd759b2fe in route_node_delete (node=<optimized out>) at ../lib/table.c:352 #5 0x00007fccd759b445 in route_unlock_node (node=0x0) at ../lib/table.h:258 #6 route_next (node=<optimized out>) at ../lib/table.c:436 #7 route_next (node=node@entry=0x56029d89e560) at ../lib/table.c:410 #8 0x000056029b6b6b7a in if_lookup_by_name_per_ns (ns=ns@entry=0x56029d873d90, ifname=ifname@entry=0x7fccc0029340 "PortChannel1020") at ../zebra/interface.c:312 #9 0x000056029b6b8b36 in zebra_if_dplane_ifp_handling (ctx=0x7fccc0029310) at ../zebra/interface.c:1867 #10 zebra_if_dplane_result (ctx=0x7fccc0029310) at ../zebra/interface.c:2221 #11 0x000056029b7137a9 in rib_process_dplane_results (thread=<optimized out>) at ../zebra/zebra_rib.c:4810 #12 0x00007fccd75a0e0d in thread_call (thread=thread@entry=0x7ffe8e553cc0) at ../lib/thread.c:1990 #13 0x00007fccd7559368 in frr_run (master=0x56029d65a040) at ../lib/libfrr.c:1198 #14 0x000056029b6ac317 in main (argc=9, argv=0x7ffe8e5540d8) at ../zebra/main.c:478 ``` ##### Work item tracking - Microsoft ADO **(number only)**: #### How I did it Added patch. #### How to verify it Running BGP tests. <!-- If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012. --> #### Which release branch to backport (provide reason below if selected) <!-- - Note we only backport fixes to a release branch, *not* features! - Please also provide a reason for the backporting below. - e.g. - [x] 202006 --> - [ ] 201811 - [ ] 201911 - [ ] 202006 - [ ] 202012 - [ ] 202106 - [ ] 202111 - [ ] 202205 - [ ] 202211 - [ ] 202305 #### Tested branch (Please provide the tested image version) <!-- - Please provide tested image version - e.g. - [x] 20201231.100 --> - [ ] <!-- image version 1 --> - [ ] <!-- image version 2 --> #### Description for the changelog <!-- Write a short (one line) summary that describes the changes in this pull request for inclusion in the changelog: --> <!-- Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU. --> #### Link to config_db schema for YANG module changes <!-- Provide a link to config_db schema for the table for which YANG model is defined Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md --> #### A picture of a cute animal (not mandatory but encouraged)
stephenxs
pushed a commit
that referenced
this pull request
Mar 9, 2026
…net#25643) * [build] Add build timing report and dependency analysis tools Add three scripts for build performance instrumentation: - scripts/build-timing-report.sh: Parse per-package timing from build logs (HEADER/FOOTER timestamps), generate sorted duration table, phase breakdown, parallelism timeline, and CSV export. - scripts/build-dep-graph.py: Parse rules/*.mk dependency graph, compute critical path, fan-out/fan-in bottleneck analysis, and generate DOT/JSON output for visualization. - scripts/build-resource-monitor.sh: Sample CPU, memory, disk I/O, and Docker container count during builds for resource utilization analysis. Add "make build-report" target to slave.mk that runs the timing report and dependency analysis after a build completes. Example output from a VS build on 24-core/30GB machine: - 210 packages built in 53m wall time (173m CPU) - Max concurrency: 5 (with SONIC_CONFIG_BUILD_JOBS=4) - Critical path: 14 packages deep (libnl -> libswsscommon -> utilities) - Top bottleneck: LIBSWSSCOMMON with 48 downstream dependents Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> * Address Copilot review: fix 17 bugs in build analysis scripts - Use free -m with division instead of free -g to avoid rounding (#1) - Add = and ?= to Makefile dependency regex patterns (#2, #7) - CPU calculation now uses /proc/stat delta (two reads) (#3, #14) - Fix misleading 'critical path estimate' comment (#4) - Fix parallelism timeline comment (60s not 10s) (#5) - Include after-relationship packages in fan stats (#6) - Guard disk I/O division by zero when INTERVAL<=1 (#8) - Remove unused elapsed_line variable (#9) - Remove redundant LIBSWSSCOMMON_DBG check (#10) - Remove active_make_jobs from CSV header comment (#11) - Wire up _RDEPENDS parsing to build reverse deps (#12) - Remove unnecessary 'if v' filter on rdeps JSON (#13) - Remove unused REPORT_FORMAT parameter (#15) - Add cycle detection to critical path algorithm (#16) - Add execute permission check for companion scripts (#17) Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> --------- Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com> Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
- What I did
Prevent pmon from starting ahead of syncd by adding syncd.service as "After" of pmon.service.
This PR is one of the options that solve the comments of [syncd.sh] stop pmon ahead of syncd in flows except warm reboot #7
- How I did it
During system starting, pmon isn't supposed to start ahead of syncd starting in order to avoid racing condition between syncd and pmon.
Currently it is done by killing pmon if is alive when syncd is starting. However such implementation is still risky. Consider the following flow:
To prevent that issue, ony solution is to add syncd as "After" in pmon.service, which ensure that whenever pmon starts syncd has been started.
However, dong so requires to defer starting pmon.service after syncd.service has fully started otherwise a deadlock is formed as following:
To solve that, move starting pmon.service to "wait()" so that pmon is started after syncd fully started, breaking the deadlock.
- How to verify it
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)