Skip to content

[Mellanox] platform: Enable cache on init#7

Closed
jianyuewu wants to merge 1 commit intomasterfrom
enable_cmis_cache
Closed

[Mellanox] platform: Enable cache on init#7
jianyuewu wants to merge 1 commit intomasterfrom
enable_cmis_cache

Conversation

@jianyuewu
Copy link
Copy Markdown
Owner

@jianyuewu jianyuewu commented May 14, 2025

Depends on: sonic-net/sonic-platform-common#562

Set cache enabled flag.

Why I did it

Centralize the XCVR cache enablement logic in the Mellanox platform API so that CMIS cache is configured automatically at Platform initialization for Nvidia platform.

How I did it

In Platform.init, call CmisApi.set_cache_enabled(True) to set as cache_enabled in cmis.

How to verify it

Build and deploy the updated platform daemon package.

Which release branch to backport (provide reason below if selected)

  • 202411
  • 202412

Tested branch (Please provide the tested image version)

master.637-eafe0ccb9_Internal

Set cache enabled flag.

Signed-off-by: Jianyue Wu <jianyuew@nvidia.com>
@jianyuewu jianyuewu closed this May 21, 2025
jianyuewu pushed a commit that referenced this pull request May 21, 2025
…et#21405)

<!--
 Please make sure you've read and understood our contributing guidelines:
 https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

 failure_prs.log skip_prs.log Make sure all your commits include a signature generated with `git commit -s` **

 If this is a bug fix, make sure your description includes "fixes #xxxx", or
 "closes #xxxx" or "resolves #xxxx"

 Please provide the following information:
-->

#### Why I did it

Adding the below fix from FRR FRRouting/frr#17297

This is to fix the following crash which is a statistical issue

```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7fccd6faf7c0 (LWP 36))]
(gdb) bt
#0 0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007fccd7302fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007fccd72ed472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3 0x00007fccd75bb3a9 in _zlog_assert_failed (xref=xref@entry=0x7fccd7652380 <_xref.16>, extra=extra@entry=0x0) at ../lib/zlog.c:678
#4 0x00007fccd759b2fe in route_node_delete (node=<optimized out>) at ../lib/table.c:352
#5 0x00007fccd759b445 in route_unlock_node (node=0x0) at ../lib/table.h:258
#6 route_next (node=<optimized out>) at ../lib/table.c:436
#7 route_next (node=node@entry=0x56029d89e560) at ../lib/table.c:410
#8 0x000056029b6b6b7a in if_lookup_by_name_per_ns (ns=ns@entry=0x56029d873d90, ifname=ifname@entry=0x7fccc0029340 "PortChannel1020")
 at ../zebra/interface.c:312
#9 0x000056029b6b8b36 in zebra_if_dplane_ifp_handling (ctx=0x7fccc0029310) at ../zebra/interface.c:1867
#10 zebra_if_dplane_result (ctx=0x7fccc0029310) at ../zebra/interface.c:2221
#11 0x000056029b7137a9 in rib_process_dplane_results (thread=<optimized out>) at ../zebra/zebra_rib.c:4810
#12 0x00007fccd75a0e0d in thread_call (thread=thread@entry=0x7ffe8e553cc0) at ../lib/thread.c:1990
#13 0x00007fccd7559368 in frr_run (master=0x56029d65a040) at ../lib/libfrr.c:1198
#14 0x000056029b6ac317 in main (argc=9, argv=0x7ffe8e5540d8) at ../zebra/main.c:478
```

##### Work item tracking
- Microsoft ADO **(number only)**:

#### How I did it
Added patch.

#### How to verify it
Running BGP tests.

<!--
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

#### Tested branch (Please provide the tested image version)

<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->

- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->

<!--
 Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
jianyuewu pushed a commit that referenced this pull request May 28, 2025
…060X6-64PE-B (sonic-net#22639)

Why I did it
The sensors.conf file was referencing a non-existent NVMe PCI address (nvme-pci-0500) on the Arista-7060X6-64PE-B platform. This mismatch caused pmon#sensord to report repeated I/O errors while attempting to read sensor data for a non-existent device (nvme/#7). Updating the config to use the correct PCI address (nvme-pci-0400) resolves the issue.

Work item tracking
Microsoft ADO (number only): 32849896
How I did it
Modified sensors.conf to change the chip identifier from nvme-pci-0500 to nvme-pci-0400 to match the actual hardware PCI bus location.

How to verify it
Verified that the /dev/nvme* devices are present and functional
Confirmed correct PCI ID using lspci
$ show plat sum
Platform: x86_64-arista_7060x6_64pe_b
HwSKU: Arista-7060X6-64PE-B-C512S2
ASIC: broadcom
ASIC Count: 1
Serial Number: XXXXXXXX
Model Number: DCS-7060X6-64PE-B
Hardware Revision: 02.00
$ lspci -nn | grep -i nvme
04:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation E18 PCIe4 NVMe Controller [1987:5018] (rev 01)
Edited sensors.conf and restarted pmon (systemctl restart pmon)
Monitored logs to ensure pmon#sensord no longer reports I/O errors for nvme/#7
jianyuewu pushed a commit that referenced this pull request Aug 14, 2025
…060X6-64PE-B (sonic-net#22781)

<!--
 Please make sure you've read and understood our contributing guidelines:
 https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

 failure_prs.log Make sure all your commits include a signature generated with `git commit -s` **

 If this is a bug fix, make sure your description includes "fixes #xxxx", or
 "closes #xxxx" or "resolves #xxxx"

 Please provide the following information:
-->

#### Why I did it
The `sensors.conf` file was referencing a non-existent NVMe PCI address (`nvme-pci-0500`) on the Arista-7060X6-64PE-B platform. This mismatch caused `pmon#sensord` to report repeated I/O errors while attempting to read sensor data for a non-existent device (`nvme/#7`). Updating the config to use the correct PCI address (`nvme-pci-0400`) resolves the issue.

##### Work item tracking
- Microsoft ADO **(number only)**: 32849896

#### How I did it
Modified `sensors.conf` to change the chip identifier from `nvme-pci-0500` to `nvme-pci-0400` to match the actual hardware PCI bus location.

#### How to verify it

<!--
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->
- Verified that the `/dev/nvme*` devices are present and functional
- Confirmed correct PCI ID using `lspci`
```
$ show plat sum
Platform: x86_64-arista_7060x6_64pe_b
HwSKU: Arista-7060X6-64PE-B-C512S2
ASIC: broadcom
ASIC Count: 1
Serial Number: XXXXXXXX
Model Number: DCS-7060X6-64PE-B
Hardware Revision: 02.00
$ lspci -nn | grep -i nvme
04:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation E18 PCIe4 NVMe Controller [1987:5018] (rev 01)
```
- Edited `sensors.conf` and restarted `pmon` (`systemctl restart pmon`)
- Monitored logs to ensure `pmon#sensord` no longer reports I/O errors for `nvme/#7`

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305
- [x] 202412

#### Tested branch (Please provide the tested image version)

<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->

- [SONiC.20241211.16 ] <!-- image version 1 -->

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->
Fix `sensors.conf` NVMe chip config for Arista-7060X6-64PE-B to match actual PCI address and prevent pmon sensor read errors

<!--
 Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
jianyuewu pushed a commit that referenced this pull request Dec 18, 2025
<!--
     Please make sure you've read and understood our contributing guidelines:
     https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

     ** Make sure all your commits include a signature generated with `git commit -s` **

     If this is a bug fix, make sure your description includes "fixes #xxxx", or
     "closes #xxxx" or "resolves #xxxx"

     Please provide the following information:
-->

#### Why I did it

Adding the below fix from FRR FRRouting/frr#17297

This is to fix the following crash which is a statistical issue

```
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/usr/lib/frr/zebra -A 127.0.0.1 -s 90000000 -M dplane_fpm_nl -M snmp'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
[Current thread is 1 (Thread 0x7fccd6faf7c0 (LWP 36))]
(gdb) bt
#0  0x00007fccd7351e2c in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fccd7302fb2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007fccd72ed472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#3  0x00007fccd75bb3a9 in _zlog_assert_failed (xref=xref@entry=0x7fccd7652380 <_xref.16>, extra=extra@entry=0x0) at ../lib/zlog.c:678
#4  0x00007fccd759b2fe in route_node_delete (node=<optimized out>) at ../lib/table.c:352
#5  0x00007fccd759b445 in route_unlock_node (node=0x0) at ../lib/table.h:258
#6  route_next (node=<optimized out>) at ../lib/table.c:436
#7  route_next (node=node@entry=0x56029d89e560) at ../lib/table.c:410
#8  0x000056029b6b6b7a in if_lookup_by_name_per_ns (ns=ns@entry=0x56029d873d90, ifname=ifname@entry=0x7fccc0029340 "PortChannel1020")
    at ../zebra/interface.c:312
#9  0x000056029b6b8b36 in zebra_if_dplane_ifp_handling (ctx=0x7fccc0029310) at ../zebra/interface.c:1867
#10 zebra_if_dplane_result (ctx=0x7fccc0029310) at ../zebra/interface.c:2221
#11 0x000056029b7137a9 in rib_process_dplane_results (thread=<optimized out>) at ../zebra/zebra_rib.c:4810
#12 0x00007fccd75a0e0d in thread_call (thread=thread@entry=0x7ffe8e553cc0) at ../lib/thread.c:1990
#13 0x00007fccd7559368 in frr_run (master=0x56029d65a040) at ../lib/libfrr.c:1198
#14 0x000056029b6ac317 in main (argc=9, argv=0x7ffe8e5540d8) at ../zebra/main.c:478
```

##### Work item tracking
- Microsoft ADO **(number only)**:

#### How I did it
Added patch.

#### How to verify it
Running BGP tests.

<!--
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305

#### Tested branch (Please provide the tested image version)

<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->

- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->

<!--
 Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
jianyuewu pushed a commit that referenced this pull request Dec 18, 2025
…n sensors.conf for 7060X6-64PE-B (sonic-net#1165)

<!--
 Please make sure you've read and understood our contributing guidelines:
 https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

 failure_prs.log skip_prs.log Make sure all your commits include a signature generated with `git commit -s` **

 If this is a bug fix, make sure your description includes "fixes #xxxx", or
 "closes #xxxx" or "resolves #xxxx"

 Please provide the following information:
-->

#### Why I did it
The `sensors.conf` file was referencing a non-existent NVMe PCI address (`nvme-pci-0500`) on the Arista-7060X6-64PE-B platform. This mismatch caused `pmon#sensord` to report repeated I/O errors while attempting to read sensor data for a non-existent device (`nvme/#7`). Updating the config to use the correct PCI address (`nvme-pci-0400`) resolves the issue.

##### Work item tracking
- Microsoft ADO **(number only)**: 32849896

#### How I did it
Modified `sensors.conf` to change the chip identifier from `nvme-pci-0500` to `nvme-pci-0400` to match the actual hardware PCI bus location.

#### How to verify it

<!--
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->
- Verified that the `/dev/nvme*` devices are present and functional
- Confirmed correct PCI ID using `lspci`
```
$ show plat sum
Platform: x86_64-arista_7060x6_64pe_b
HwSKU: Arista-7060X6-64PE-B-C512S2
ASIC: broadcom
ASIC Count: 1
Serial Number: XXXXXXXX
Model Number: DCS-7060X6-64PE-B
Hardware Revision: 02.00
$ lspci -nn | grep -i nvme
04:00.0 Non-Volatile memory controller [0108]: Phison Electronics Corporation E18 PCIe4 NVMe Controller [1987:5018] (rev 01)
```
- Edited `sensors.conf` and restarted `pmon` (`systemctl restart pmon`)
- Monitored logs to ensure `pmon#sensord` no longer reports I/O errors for `nvme/#7`

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305
- [x] 202412

#### Tested branch (Please provide the tested image version)

<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->

- [SONiC.20241211.16 ] <!-- image version 1 -->

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->
Fix `sensors.conf` NVMe chip config for Arista-7060X6-64PE-B to match actual PCI address and prevent pmon sensor read errors

<!--
 Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant