[Mellanox] Validate Module Presence Using Sysfs Before Accessing EEPROM in Thermal Control Daemon#17

Closed
tshalvi wants to merge 4 commits into master from master_get_presence_enhancement

@tshalvi tshalvi commented Jun 2, 2024

Why I did it

Currently, when thermalctld tries to read the EEPROM of an unplugged module, the kernel logs the following error:
ERR kernel: [ 2446.261799] sxd_kernel: [error] Failed to get module page valid, err: -5
We need to ensure the EEPROM is not accessed when a module is not plugged in.

Work item tracking
  • Microsoft ADO (number only):

How I did it

I updated get_presence() to rely on the present/hw_present sysfs values, and I call get_presence() from the relevant thermalctld methods before they access the EEPROM.
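
A minimal sketch of the idea, not the code in this PR: presence is read from a sysfs attribute instead of probing the EEPROM. The sysfs root, the `module<N>` directory layout, the `software_controlled` switch between `present` and `hw_present`, and the helper name `read_presence_flag` are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed sysfs root, for illustration only; the real layout depends on the driver.
MODULE_SYSFS_ROOT = Path("/sys/module/sx_core/asic0")


def read_presence_flag(module_index, attr="present", root=MODULE_SYSFS_ROOT):
    """Return True if the given sysfs attribute reads as 1, False otherwise."""
    path = Path(root) / f"module{module_index}" / attr
    try:
        return int(path.read_text().strip()) == 1
    except (OSError, ValueError):
        # A missing, unreadable, or malformed attribute is treated as "not present".
        return False


def get_presence(module_index, software_controlled=False, root=MODULE_SYSFS_ROOT):
    """Decide presence from sysfs only, without touching the EEPROM.

    Assumption: software-controlled modules expose 'hw_present',
    all others expose 'present'.
    """
    attr = "hw_present" if software_controlled else "present"
    return read_presence_flag(module_index, attr, root)
```

Passing `root` explicitly lets the function be exercised against a temporary directory rather than a live switch.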

How to verify it

Unplug a module and ensure the following error does not appear:
ERR kernel: [ 2446.261799] sxd_kernel: [error] Failed to get module page valid, err: -5
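
One way to automate this check is to scan captured kernel log text for the error signature. This is only a sketch: the function name `find_eeprom_errors` is hypothetical, and the log text is assumed to have been collected separately (for example, from `dmesg` output saved after unplugging the module).

```python
import re

# Signature taken from the kernel error quoted above; the timestamp is ignored.
ERROR_PATTERN = re.compile(
    r"sxd_kernel: \[error\] Failed to get module page valid, err: -5"
)


def find_eeprom_errors(log_text):
    """Return every log line that contains the EEPROM access error."""
    return [line for line in log_text.splitlines() if ERROR_PATTERN.search(line)]
```

An empty result after unplugging a module indicates that the error did not appear in the captured text.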

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211
  • 202305

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@tshalvi tshalvi changed the title from "[Mellanox] get_presence() logic update to rely on hw_present/present sysfs values" to "[Mellanox] Validate Module Presence Using Sysfs Before Accessing EEPROM in Thermal Control Daemon" on Jun 2, 2024
0.0 if module temperature is not supported or module is under initialization
other float value if module temperature is available
"""
if not self.get_presence():

Maybe we could move the check to class ModuleThermal in thermal.py. Usually, we prefer that the caller checks get_presence().

Done
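
The caller-side check suggested in the review could look roughly like the sketch below. It is not the actual thermal.py code: `FakeSfp` is a hypothetical stand-in for the platform SFP object, the `ModuleThermal` constructor signature is assumed, and returning `None` for an absent module is an assumption.

```python
class FakeSfp:
    """Hypothetical stand-in for the platform SFP object, for illustration only."""

    def __init__(self, present, temperature):
        self._present = present
        self._temperature = temperature
        self.eeprom_reads = 0  # counts simulated EEPROM accesses

    def get_presence(self):
        return self._present

    def get_temperature(self):
        self.eeprom_reads += 1  # a real implementation would read the EEPROM here
        return self._temperature


class ModuleThermal:
    """Caller that checks presence before asking the SFP for its temperature."""

    def __init__(self, sfp):
        self.sfp = sfp

    def get_temperature(self):
        if not self.sfp.get_presence():
            # Assumed behavior: report no value and skip the EEPROM entirely.
            return None
        return self.sfp.get_temperature()
```

With this arrangement the SFP method stays a plain EEPROM read, and the decision to skip it lives in the caller.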

@tshalvi tshalvi requested a review from Junchao-Mellanox June 3, 2024 08:41
tshalvi pushed a commit that referenced this pull request Sep 5, 2024
* Update to Linux 6.1.94
* Integrate HW-MGMT 7.0040.1008 Changes (#17)
* Update DNX kernel module build
* Update kernel and saibcm-modules-dnx to versions on branch

Signed-off-by: Saikrishna Arcot <sarcot@microsoft.com>
Co-authored-by: Vivek <vivekreddykarri98@gmail.com>
tshalvi pushed a commit that referenced this pull request Nov 24, 2024
…7250E platform (sonic-net#20367)

Update sonic-platform submodule for Nokia-IXR7250E:
Fixes Nokia-ION/ndk#57

cdfbbe2 [H4-32D]Update platform modules after OC tests (Update README.md #17)
f28eff0 [H4-64D]Fix SFP+ port, eeprom, reboot-cause, thermal algorithm, add PSU input voltage check (Fix rules in Makefiles #15)
178e15a Minor watchdog change for better retention of last kick stamp
c479392 Remove rogue platform_reboot file
331abe0 Enhance watchdog script to detect fsde device hung signature
4c6b7c1 Fixed update temperature issue
5002fb7 Remove average and maximum
c620130 No PSU Master status led in IMM. No need to set it

Signed-off-by: mlok <marty.lok@nokia.com>
@tshalvi tshalvi closed this Feb 16, 2025
tshalvi pushed a commit that referenced this pull request Mar 12, 2025
…tically (sonic-net#678)

#### Why I did it
src/sonic-sairedis
```
* fcf2cd0 - (HEAD -> 202412, origin/HEAD, origin/202412) [hash] update ECMP/LAG hash VS lib with SAI_NATIVE_HASH_FIELD_IPV6_FLOW_LABEL (#17) (6 hours ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
tshalvi pushed a commit that referenced this pull request Mar 12, 2025
…omatically (sonic-net#696)

#### Why I did it
src/sonic-swss-common
```
* b750cc1 - (HEAD -> 202412, origin/HEAD, origin/202412) [code sync] Merge code from sonic-net/sonic-swss-common:202411 to 202412 (#17) (21 hours ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
tshalvi pushed a commit that referenced this pull request Mar 12, 2025
…tomatically (sonic-net#695)

#### Why I did it
src/sonic-linux-kernel
```
* b2ed221 - (HEAD -> 202412, origin/HEAD, origin/202412) [optoe] Reset page select byte to 0 before upper memory access on page 0h (sonic-net#464) (#17) (21 hours ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
tshalvi pushed a commit that referenced this pull request Aug 25, 2025
…UT so that we can get back to back Paladin ports up with Arista-7060X6-16PE-384C-O128S2 (sonic-net#1144)

<!--
 Please make sure you've read and understood our contributing guidelines:
 https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

 ** Make sure all your commits include a signature generated with `git commit -s` **

 If this is a bug fix, make sure your description includes "fixes #xxxx", or
 "closes #xxxx" or "resolves #xxxx"

 Please provide the following information:
-->

#### Why I did it

Currently, when we load HWSKU `Arista-7060X6-16PE-384C-O128S2` on two Moby devices and connect their Paladin ports back to back, the links do not come up. Getting these links up would allow us to run the tests.

##### Work item tracking
- Microsoft ADO **(number only)**:

#### How I did it

Created a new `FANOUT` HWSKU containing special lanemap and polarity configs, so that we can load `Arista-7060X6-16PE-384C-O128S2` on one Moby and `Arista-7060X6-16PE-384C-O128S2-FANOUT` on the other and get the Paladin ports up when connecting them back to back with the following setup:
```
Moby1 Moby2
HWSKU: Arista-7060X6-16PE-384C-O128S2 HWSKU: Arista-7060X6-16PE-384C-O128S2-FANOUT
#17 <-> #18
#19 <-> #20
#21 <-> #22
#23 <-> #24

#18 <-> #17
#20 <-> #19
#22 <-> #21
#24 <-> #23
```

#### How to verify it
Verified that all the Paladin ports can link up with the above setup.

<!--
If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012.
-->

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211
- [ ] 202305
- [x] msft-202412

#### Tested branch (Please provide the tested image version)

<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->

- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->
- [x] msft-202412

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->
Created `Arista-7060X6-16PE-384C-O128S2-FANOUT` based on `Arista-7060X6-16PE-384C-O128S2`, updating only the lanemap and polarity settings in the bcm config.

<!--
 Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU.
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)