Skip to content

Thermal policy#2

Closed
Junchao-Mellanox wants to merge 14 commits intothermal-controlfrom
thermal-policy
Closed

Thermal policy#2
Junchao-Mellanox wants to merge 14 commits intothermal-controlfrom
thermal-policy

Conversation

@Junchao-Mellanox
Copy link
Copy Markdown
Owner

- What I did
Change thermalctld make rules to allow it perform unit test

- How I did it

- How to verify it

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

pra-moh and others added 14 commits November 26, 2019 14:11
1. thermals are not added to thermal list, cause: thermal initliazer not called
2. add fan of psu to fan list rather than to set it into _fan
1. thermals are not added to thermal list, cause: thermal initliazer not called
2. add fan of psu to fan list rather than to set it into _fan
… unsupported and treat 0.0 as N/A in high_threshold in xsfp module
Junchao-Mellanox pushed a commit that referenced this pull request Dec 14, 2020
This update brings in the following commits.

86c1108 Enable arm architecture to build in addition to amd64 (#37)
4acb2c3 fix bugs and enhance Transformer (#35)
49e5a22 ygot related enhancements and fixes (#34)
51224de Fix ietf yang search path for cvl schema builds (#32)
3c6cdb3 CVL Changes #8: 'must' and 'when' expression evaluation (#31)
dabf231 CVL Changes #7: 'leafref' evaluation (#28)
6f9535f CVL Changes #6: Customized Xpath Engine integration (#27)
5e2466b DB-Layer fixes/enhancements (#26)
9a27302 CVL Changes #4: Implementation of new CVL APIs (#22)
dbf1093 Translib support for authorization, yang versioning and Delete flag (#21)
80f369e CVL Changes #5: YParser enhancement (#23)
904ce18 CVL Changes #3: Multi-db instance support (#20)
9d24a34 CVL Changes #2:  YValidator infra changes for evaluating xpath expression (#19)
f3fc40f CVL Changes #1: Initial CVL code reorganization and common infra changes (#18)
4922601 Bulk and RPC API support in translib (#16)
1d730df RFC7895 yang module library implementation (#15)
Junchao-Mellanox pushed a commit that referenced this pull request Nov 12, 2021
Upgrade DellEMC platforms to bullseye.
Junchao-Mellanox pushed a commit that referenced this pull request Oct 25, 2022
#### Why I did it
Update sonic-host-services submodule to include below commits:
```
bc8698d Merge pull request #21 from abdosi/feature
557a110 Fix the issue where if dest port is not specified in ACL rule than for multi-asic where we create NAT rule to forward traffic from Namespace to host fail with exception.
6e45acc (master) Merge pull request #14 from abdosi/feature
4d6cad7 Merge remote-tracking branch 'upstream/master' into feature
bceb13e Install libyang to azure pipeline (#20)
82299f5 Merge pull request #13 from SuvarnaMeenakshi/cacl_fabricns
15d3bf4 Merge branch 'master' into cacl_fabricns
de54082 Merge pull request #16 from ZhaohuiS/feature/caclmgrd_external_client_warning_log
b4b368d Add warning log if destination port is not defined
d4bb96d Merge branch 'master' into cacl_fabricns
35c76cb Add unit-test and fix typo.
17d44c2 Made Changes to be Python 3.7 compatible
978afb5 Aligning Code
1fbf8fb Merge remote-tracking branch 'upstream/master' into feature
7b8c7d1 Added UT for the changes
91c4c42 Merge pull request #9 from ZhaohuiS/feature/caclmgrd_external_client
7c0b56a Add 4 test cases for external_client_acl, including single port and port range for ipv4 and ipv6
b71e507 Merge remote-tracking branch 'origin/master' into HEAD
d992dc0 Merge branch 'master' into feature/caclmgrd_external_client
bd7b172 DST_PORT is configuralbe in json config file for EXTERNAL_CLIENT_ACL
f9af7ae [CLI] Move hostname, mgmt interface/vrf config to hostcfgd (#2)
70ce6a3 Merge pull request #10 from sujinmkang/cold_reset
29be8d2 Added Support to render Feature Table using Device running metadata. Also added support to render 'has_asic_scope' field of Feature Table.
3437e35 [caclmgrd][chassis]: Add ip tables rules to accept internal docker traffic from fabric asic namespaces.
8720561 Fix and add hardware reboot cause determination tests
0dcc7fe remove the empty bracket if no hardware reboot cause minor
e47d831 fix the wrong expected result comparision
ef86b53 Fix startswith Attribute error
8a630bb fix mock patch
8543ddf update the reboot cause logic and update the unit test
53ad7cd fix the mock patch function
7c8003d fix the reboot-cause regix for test
1ba611f fix typo
25379d3 Add unit test case
a56133b Add hardware reboot cause as actual reboot cause for soft reboot failed
c7d3833 Support Restapi/gnmi control plane acls
f6ea036 caclmgrd: Don't block traffic to mgmt by default
a712fc4 Update test cases
adc058b caclmgrd: Don't block traffic to mgmt by default
06ff918 Merge pull request #7 from bluecmd/patch-1
e3e23bc ci: Rename sonic-buildimage repository
e83a858 Merge pull request #4 from kamelnetworks/acl-ip2me-test
f5a2e50 [caclmgrd]: Tests for IP2ME rules generation
```
Junchao-Mellanox pushed a commit that referenced this pull request Jan 12, 2023
…onic-net#11684)

The follwing commit is added in sonic-utilities
b034f0c (HEAD -> 202012, origin/202012) [config][muxcable] add support
to enable/disable ycable telemetry (#2… (sonic-net#2304)

The follwing commit is added in sonic-platform-daemons
978667c (HEAD -> 202012, origin/202012) [ycabled] add capability to
enable/disable telemetry (sonic-net#279) (sonic-net#280

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>

Signed-off-by: vaibhav-dahiya <vdahiya@microsoft.com>
Junchao-Mellanox pushed a commit that referenced this pull request Dec 11, 2024
…et#21095)

Adding the below fix from FRR FRRouting/frr#17297

This is to fix the following crash which is a statistical issue

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7fccd6faf7c0 (LWP 36))]
(gdb) bt
#0  0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fccd7302fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fccd72ed472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fccd75bb3a9 in _zlog_assert_failed (xref=xref@entry=0x7fccd7652380 <_xref.16>, extra=extra@entry=0x0) at ../lib/zlog.c:678
#4  0x00007fccd759b2fe in route_node_delete (node=<optimized out>) at ../lib/table.c:352
#5  0x00007fccd759b445 in route_unlock_node (node=0x0) at ../lib/table.h:258
#6  route_next (node=<optimized out>) at ../lib/table.c:436
#7  route_next (node=node@entry=0x56029d89e560) at ../lib/table.c:410
#8  0x000056029b6b6b7a in if_lookup_by_name_per_ns (ns=ns@entry=0x56029d873d90, ifname=ifname@entry=0x7fccc0029340 "PortChannel1020")
    at ../zebra/interface.c:312
#9  0x000056029b6b8b36 in zebra_if_dplane_ifp_handling (ctx=0x7fccc0029310) at ../zebra/interface.c:1867
#10 zebra_if_dplane_result (ctx=0x7fccc0029310) at ../zebra/interface.c:2221
#11 0x000056029b7137a9 in rib_process_dplane_results (thread=<optimized out>) at ../zebra/zebra_rib.c:4810
#12 0x00007fccd75a0e0d in thread_call (thread=thread@entry=0x7ffe8e553cc0) at ../lib/thread.c:1990
#13 0x00007fccd7559368 in frr_run (master=0x56029d65a040) at ../lib/libfrr.c:1198
#14 0x000056029b6ac317 in main (argc=9, argv=0x7ffe8e5540d8) at ../zebra/main.c:478
Junchao-Mellanox pushed a commit that referenced this pull request Dec 24, 2024
To fix a statistical issue. The original fix was done in FRRouting/frr#17297. However to accommodate 8.5.4 the patch in the PR was added.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7fccd6faf7c0 (LWP 36))]
(gdb) bt
#0  0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fccd7302fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fccd72ed472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fccd75bb3a9 in _zlog_assert_failed (xref=xref@entry=0x7fccd7652380 <_xref.16>, extra=extra@entry=0x0) at ../lib/zlog.c:678
#4  0x00007fccd759b2fe in route_node_delete (node=<optimized out>) at ../lib/table.c:352
#5  0x00007fccd759b445 in route_unlock_node (node=0x0) at ../lib/table.h:258
#6  route_next (node=<optimized out>) at ../lib/table.c:436
#7  route_next (node=node@entry=0x56029d89e560) at ../lib/table.c:410
#8  0x000056029b6b6b7a in if_lookup_by_name_per_ns (ns=ns@entry=0x56029d873d90, ifname=ifname@entry=0x7fccc0029340 "PortChannel1020")
    at ../zebra/interface.c:312
#9  0x000056029b6b8b36 in zebra_if_dplane_ifp_handling (ctx=0x7fccc0029310) at ../zebra/interface.c:1867
#10 zebra_if_dplane_result (ctx=0x7fccc0029310) at ../zebra/interface.c:2221
#11 0x000056029b7137a9 in rib_process_dplane_results (thread=<optimized out>) at ../zebra/zebra_rib.c:4810
#12 0x00007fccd75a0e0d in thread_call (thread=thread@entry=0x7ffe8e553cc0) at ../lib/thread.c:1990
#13 0x00007fccd7559368 in frr_run (master=0x56029d65a040) at ../lib/libfrr.c:1198
#14 0x000056029b6ac317 in main (argc=9, argv=0x7ffe8e5540d8) at ../zebra/main.c:478
Junchao-Mellanox pushed a commit that referenced this pull request Feb 26, 2025
…et#21405)

<!--
 Please make sure you've read and understood our contributing guidelines:
 https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

 failure_prs.log skip_prs.log Make sure all your commits include a signature generated with `git commit -s` **

 If this is a bug fix, make sure your description includes "fixes #xxxx", or
 "closes #xxxx" or "resolves #xxxx"

 Please provide the following information:
-->

#### Why I did it

Adding the below fix from FRR FRRouting/frr#17297

This is to fix the following crash which is a statistical issue

```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7fccd6faf7c0 (LWP 36))]
(gdb) bt
#0 0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fccd7302fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fccd72ed472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007fccd75bb3a9 in _zlog_assert_failed (xref=xref@entry=0x7fccd7652380 <_xref.16>, extra=extra@entry=0x0) at ../lib/zlog.c:678
#4 0x00007fccd759b2fe in route_node_delete (node=<optimized out>) at ../lib/table.c:352
#5 0x00007fccd759b445 in route_unlock_node (node=0x0) at ../lib/table.h:258
#6 route_next (node=<optimized out>) at ../lib/table.c:436
#7 route_next (node=node@entry=0x56029d89e560) at ../lib/table.c:410
#8 0x000056029b6b6b7a in if_lookup_by_name_per_ns (ns=ns@entry=0x56029d873d90, ifname=ifname@entry=0x7fccc0029340 "PortChannel1020")
 at ../zebra/interface.c:312
#9 0x000056029b6b8b36 in zebra_if_dplane_ifp_handling (ctx=0x7fccc0029310) at ../zebra/interface.c:1867
#10 zebra_if_dplane_result (ctx=0x7fccc0029310) at ../zebra/interface.c:2221
#11 0x000056029b7137a9 in rib_process_dplane_results (thread=<optimized out>) at ../zebra/zebra_rib.c:4810
#12 0x00007fccd75a0e0d in thread_call (thread=thread@entry=0x7ffe8e553cc0) at ../lib/thread.c:1990
#13 0x00007fccd7559368 in frr_run (master=0x56029d65a040) at ../lib/libfrr.c:1198
#14 0x000056029b6ac317 in main (argc=9, argv=0x7ffe8e5540d8) at ../zebra/main.c:478
```

##### Work item tracking
- Microsoft ADO **(number only)**:

#### How I did it
Added patch.

#### How to verify it
Running BGP tests.

<!--
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

#### Tested branch (Please provide the tested image version)

<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->

- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->

<!--
 Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
Junchao-Mellanox pushed a commit that referenced this pull request Mar 7, 2025
<!--
     Please make sure you've read and understood our contributing guidelines:
     https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

     ** Make sure all your commits include a signature generated with `git commit -s` **

     If this is a bug fix, make sure your description includes "fixes #xxxx", or
     "closes #xxxx" or "resolves #xxxx"

     Please provide the following information:
-->

#### Why I did it

Adding the below fix from FRR FRRouting/frr#17297

This is to fix the following crash which is a statistical issue

```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7fccd6faf7c0 (LWP 36))]
(gdb) bt
#0  0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fccd7302fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fccd72ed472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fccd75bb3a9 in _zlog_assert_failed (xref=xref@entry=0x7fccd7652380 <_xref.16>, extra=extra@entry=0x0) at ../lib/zlog.c:678
#4  0x00007fccd759b2fe in route_node_delete (node=<optimized out>) at ../lib/table.c:352
#5  0x00007fccd759b445 in route_unlock_node (node=0x0) at ../lib/table.h:258
#6  route_next (node=<optimized out>) at ../lib/table.c:436
#7  route_next (node=node@entry=0x56029d89e560) at ../lib/table.c:410
#8  0x000056029b6b6b7a in if_lookup_by_name_per_ns (ns=ns@entry=0x56029d873d90, ifname=ifname@entry=0x7fccc0029340 "PortChannel1020")
    at ../zebra/interface.c:312
#9  0x000056029b6b8b36 in zebra_if_dplane_ifp_handling (ctx=0x7fccc0029310) at ../zebra/interface.c:1867
#10 zebra_if_dplane_result (ctx=0x7fccc0029310) at ../zebra/interface.c:2221
#11 0x000056029b7137a9 in rib_process_dplane_results (thread=<optimized out>) at ../zebra/zebra_rib.c:4810
#12 0x00007fccd75a0e0d in thread_call (thread=thread@entry=0x7ffe8e553cc0) at ../lib/thread.c:1990
#13 0x00007fccd7559368 in frr_run (master=0x56029d65a040) at ../lib/libfrr.c:1198
#14 0x000056029b6ac317 in main (argc=9, argv=0x7ffe8e5540d8) at ../zebra/main.c:478
```

##### Work item tracking
- Microsoft ADO **(number only)**:

#### How I did it
Added patch.

#### How to verify it
Running BGP tests.

<!--
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

#### Tested branch (Please provide the tested image version)

<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->

- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->

<!--
 Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
Junchao-Mellanox pushed a commit that referenced this pull request May 7, 2025
…sue. (sonic-net#22342)

Fix TACACS config revert to old config when device reboot issue.

#### Why I did it
Fix following bug:

1. When SONiC OS upgrade, old TACACS config will save to /etc/sonic/old_config/tacacs.json
2. After device reboot, TACACS config service (https://github.com/sonic-net/sonic-buildimage/blob/master/files/build_templates/tacacs-config.service) will restore TACACS config from /etc/sonic/old_config/tacacs.json, but this file will keep no change after restore TACACS config.
3. If TACACS service changed by user, because of #2, if device reboot again, the TACACS config been reverted back to old config in /etc/sonic/old_config/tacacs.json

Note: the TACACS config does not revert immediately after reboot, it will delay 5min 30sec:
https://github.com/sonic-net/sonic-buildimage/blob/master/files/build_templates/tacacs-config.timer

##### Work item tracking
- Microsoft ADO **(number only)**:32338799

#### How I did it
Move /etc/sonic/old_config/tacacs.json to /etc/sonic/old_config/tacacs.json_backup

#### How to verify it
Pass all test case.
Manually verify with following steps:

admin@vlab-01:~$ show tacacs
TACPLUS global auth_type login
TACPLUS global timeout 5 (default)
TACPLUS global passkey testing123

TACPLUS_SERVER address 10.250.0.102
               priority 1
               tcp_port 49

admin@vlab-01:~$ echo '
> {
>     "TACPLUS": {"global": { "auth_type": "login", "passkey": "12345" } }
> }' > /etc/sonic/old_config/tacacs.json
admin@vlab-01:~$ cat /etc/sonic/old_config/tacacs.json

{
    "TACPLUS": {"global": { "auth_type": "login", "passkey": "12345" } }
}

// then reboot device and wait for 6 minutes, because the TACACS config service will delay 5min 30sec after reboot:
https://github.com/sonic-net/sonic-buildimage/blob/master/files/build_templates/tacacs-config.timer

admin@vlab-01:~$ ls /etc/sonic/old_config/tacacs.json
ls: cannot access '/etc/sonic/old_config/tacacs.json': No such file or directory
admin@vlab-01:~$ show tacacs
TACPLUS global auth_type login
TACPLUS global timeout 5 (default)
TACPLUS global passkey 12345

TACPLUS_SERVER address 10.250.0.102
               priority 1
               tcp_port 49

#### Description for the changelog
Fix TACACS config revert to old config when device reboot issue.
Junchao-Mellanox pushed a commit that referenced this pull request Mar 6, 2026
…net#25643)

* [build] Add build timing report and dependency analysis tools

Add three scripts for build performance instrumentation:

- scripts/build-timing-report.sh: Parse per-package timing from build
  logs (HEADER/FOOTER timestamps), generate sorted duration table,
  phase breakdown, parallelism timeline, and CSV export.

- scripts/build-dep-graph.py: Parse rules/*.mk dependency graph,
  compute critical path, fan-out/fan-in bottleneck analysis, and
  generate DOT/JSON output for visualization.

- scripts/build-resource-monitor.sh: Sample CPU, memory, disk I/O,
  and Docker container count during builds for resource utilization
  analysis.

Add "make build-report" target to slave.mk that runs the timing
report and dependency analysis after a build completes.

Example output from a VS build on 24-core/30GB machine:
- 210 packages built in 53m wall time (173m CPU)
- Max concurrency: 5 (with SONIC_CONFIG_BUILD_JOBS=4)
- Critical path: 14 packages deep (libnl -> libswsscommon -> utilities)
- Top bottleneck: LIBSWSSCOMMON with 48 downstream dependents

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>

* Address Copilot review: fix 17 bugs in build analysis scripts

- Use free -m with division instead of free -g to avoid rounding (#1)
- Add = and ?= to Makefile dependency regex patterns (#2, #7)
- CPU calculation now uses /proc/stat delta (two reads) (#3, #14)
- Fix misleading 'critical path estimate' comment (#4)
- Fix parallelism timeline comment (60s not 10s) (#5)
- Include after-relationship packages in fan stats (#6)
- Guard disk I/O division by zero when INTERVAL<=1 (#8)
- Remove unused elapsed_line variable (#9)
- Remove redundant LIBSWSSCOMMON_DBG check (#10)
- Remove active_make_jobs from CSV header comment (#11)
- Wire up _RDEPENDS parsing to build reverse deps (#12)
- Remove unnecessary 'if v' filter on rdeps JSON (#13)
- Remove unused REPORT_FORMAT parameter (#15)
- Add cycle detection to critical path algorithm (#16)
- Add execute permission check for companion scripts (#17)

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>

---------

Signed-off-by: Rustiqly <rustiqly@users.noreply.github.com>
Co-authored-by: Rustiqly <rustiqly@users.noreply.github.com>
Junchao-Mellanox pushed a commit that referenced this pull request Mar 31, 2026
…dating udevd rules (sonic-net#26343)

- Why I did it
On SONiC SmartSwitch platforms with DPUs, systemd-udevd crashes with SIGABRT on every reboot when DPU firmware initialization is slow. During the initramfs boot phase, a standalone systemd-udevd daemon is started to handle device discovery. If DPU firmware takes longer than the 60-second udevadm settle timeout (BlueField-3 DPUs can take 120 seconds each in the failure case when they are stuck), the initramfs cannot stop this udevd before switch_root. The stale process survives into the real system but is never chrooted into the overlayfs root, leaving it with a broken filesystem view. When dpu-udev-manager.sh writes udev rules, the stale udevd detects the change and crashes on an assertion in systemd's chase() path resolution (assert(path_is_absolute(p)) at chase.c:648), because dir_fd_is_root() returns false for a process whose root still points to the initramfs rootfs rather than the overlayfs.

This triggers a systemd issue : systemd/systemd#29559 which maintainers doesn't consider as a bug from systemd side. Raising this fix for our usecase.

Core was generated by `/usr/lib/systemd/systemd-udevd --daemon --resolve-names=never'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f29fe7f695c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007f29fe7f695c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f29fe7a1cc2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f29fe78a4ac in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007f29fea50c11 in ?? () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#4  0x00007f29feb94a8b in chase () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#5  0x00007f29feb956e2 in chase_and_opendir () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#6  0x00007f29feb9a609 in conf_files_list_strv () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#7  0x00007f29fea913e8 in config_get_stats_by_path () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#8  0x0000559f295519cf in ?? ()
#9  0x0000559f29553a77 in ?? ()
#10 0x00007f29fec36055 in ?? () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#11 0x00007f29fec3668d in sd_event_dispatch () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#12 0x00007f29fec394a8 in sd_event_run () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#13 0x00007f29fec396c7 in sd_event_loop () from /usr/lib/x86_64-linux-gnu/systemd/libsystemd-shared-257.so
#14 0x0000559f29545820 in ?? ()
#15 0x00007f29fe78bca8 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#16 0x00007f29fe78bd65 in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#17 0x0000559f29545c51 in ?? ()

- How I did it
Added a kill_stale_udevd() function to dpu-udev-manager.sh that runs before writing the udev rules. It identifies the systemd-managed udevd PID via systemctl show, then kills any other systemd-udevd --daemon process that doesn't match -- these are leftover initramfs instances. If no stale process exists (e.g. DPUs are healthy and the initramfs udevd exited cleanly), the function is a no-op.

- How to verify it
Deploy the image on a SmartSwitch with DPUs in a state where firmware initialization times out (>60s per DPU) by stopping image installation before firmware install step
Reboot the switch
Verify no new systemd-udevd coredumps in /var/core/
Verify the stale process was killed: journalctl -b 0 | grep dpu-udev-manager should show killing stale initramfs udevd PID (systemd udevd is PID )
Verify systemd-udevd.service is healthy: systemctl status systemd-udevd should show active (running)
Verify DPU udev rules were written: cat /etc/udev/rules.d/92-midplane-intf.rules should contain the DPU interface naming rules

Signed-off-by: Hemanth Kumar Tirupati <tirupatihemanthkumar@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants