Skip to content

I2C Bus Lock Prevention HLD#1907

Open
CharlieChenEC wants to merge 3 commits intosonic-net:masterfrom
CharlieChenEC:i2c_lock_prevention
Open

I2C Bus Lock Prevention HLD#1907
CharlieChenEC wants to merge 3 commits intosonic-net:masterfrom
CharlieChenEC:i2c_lock_prevention

Conversation

@CharlieChenEC
Copy link

@CharlieChenEC CharlieChenEC commented Feb 5, 2025

It is known that the i2c bus will not work if there is an i2c device unexpectedly pulls i2c's SDA signal low. A detailed discussion of the I2C bus lock problem can be found at https://community.nxp.com/t5/i-MX-Processors/I2C-reset/m-p/253182.

We have observed that some transceivers inserted on the switch work fine at first, but after a while run into the i2c bus lock problem. The only way to recover from the i2c bus lock state is to remove the faulty transceiver. However, there is no easy way to identify which transceiver is the one that is locking the i2c bus. The brute force method is to unplug the inserted transceivers one by one to find out. Once the faulty transceiver has been removed, the i2c-related error messages will no longer appear in the syslog.

The purpose of the HLD is to mitigate the aforementioned i2c problem caused by the transceiver. This HLD explains the design to monitor the i2c bus health status in the background, the way to restore the i2c bus by detecting the faulty i2c device, isolating the detected faulty device, and restoring the i2c bus to normal status.

Repo PR title State
sonic-buildimage TBD
sonic-platform-common TBD
sonic-platform-daemons TBD

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

No pipelines are associated with this pull request.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@zhangyanzhao
Copy link
Collaborator

@zhangyanzhao
Copy link
Collaborator

Reviewer is expected, please leave your comments if you want to be a reviewer. Thanks.

@zhangyanzhao
Copy link
Collaborator

Dell volunteered to be the reviewer. Thanks @jeff-yin

@zhangyanzhao zhangyanzhao requested a review from jeff-yin March 11, 2025 15:39
@zhangyanzhao
Copy link
Collaborator

@CharlieChenEC can you please also add the code PRs later by referring to #806 ? Thanks.

@zhangyanzhao
Copy link
Collaborator

Team will continue the review in Platform workgroup.

@zhangyanzhao
Copy link
Collaborator

The criteria to choose the representative i2c devices are shown below.
* In order to minimize the impact to the system performance, the chosen i2c device is seldom visited by the platform driver daemons.
* The polling interval shall be defined to a value that will not impact the system performance severely.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consequently, if a representative device must be a transceiver, the i2chealthd daemon will compete for access with the xcvrd daemon.

The same could apply to fans and thermalctld


### 6. Architecture Design

#### 6.1. I2C Healthd

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an advantage to having a separate daemon polling i2c devices? I'm concerned about contention, especially on platforms that have BMC (baseboard management controllers) that may also perform or manage i2c access.

An alternative approach is to add the i2c bus lock detection capability into the individual platform daemons that already have exclusive or almost-exclusive control over the i2c devices. Then we wouldn't run the risk of competing for bus access. For example, xcvrd for transceivers, thermalctld for temp sensors and fans, psud for PSUs, etc. These daemons can manage and sequence their access to i2c, perhaps in a multi-threaded approach. Was this approach considered? If so, please add a section explaining why it wasn't selected.


* Handling i2c bus lock
- Reset I2C muxe devices to recover the lock state.
- It is known that the thermal sensors readings are accessed through the i2c bus. The fan speed control is also done through i2c bus. When i2c bus is locked, the daemon to enforce thermal policy is not working anymore. To alleviate the risk of the hazard of overheating, this design invokes the special script to set all of the fans on the switch device to full speed as long as the i2c bus health is back to normal. Given that the i2c operations to set fan speed will not be affected by the faulty i2c device, all of the fan speed can be set as full speed running.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Special handling is needed to determine whether the i2c device is a temperature sensor or not.

Please clarify if there is special handling for other types of devices. How about the fans themselves? Transceivers, PSUs, etc.? If not, please indicate as such.

- Reset I2C muxe devices to recover the lock state.
- It is known that the thermal sensors readings are accessed through the i2c bus. The fan speed control is also done through i2c bus. When i2c bus is locked, the daemon to enforce thermal policy is not working anymore. To alleviate the risk of the hazard of overheating, this design invokes the special script to set all of the fans on the switch device to full speed as long as the i2c bus health is back to normal. Given that the i2c operations to set fan speed will not be affected by the faulty i2c device, all of the fan speed can be set as full speed running.
- Perform the algorithm to find out the faulty device and put into the isolation list. Note that the algorithm skips checking on the device that is known to be faulty. In other words, the algorithm only scans on the healthy devices according to the existing isolation list. Only on the first occurrence of the i2c bus lock, all of the i2c devices will be scanned.
- After the isolation list is generated and the i2c bus health is back to normal. Recover the execution of the daemons that need to access i2c bus.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to determine which daemons need to be recovered after the device(s) are reset? Should the daemons register that they are using a particular device? That would allow for finer-grained control of which services get affected by a lockup.
If we just blindly restart pmon docker, that would be very disruptive and it would disrupt traffic.

- Reset I2C muxe devices to recover the lock state.
- It is known that the thermal sensors readings are accessed through the i2c bus. The fan speed control is also done through i2c bus. When i2c bus is locked, the daemon to enforce thermal policy is not working anymore. To alleviate the risk of the hazard of overheating, this design invokes the special script to set all of the fans on the switch device to full speed as long as the i2c bus health is back to normal. Given that the i2c operations to set fan speed will not be affected by the faulty i2c device, all of the fan speed can be set as full speed running.
- Perform the algorithm to find out the faulty device and put into the isolation list. Note that the algorithm skips checking on the device that is known to be faulty. In other words, the algorithm only scans on the healthy devices according to the existing isolation list. Only on the first occurrence of the i2c bus lock, all of the i2c devices will be scanned.
- After the isolation list is generated and the i2c bus health is back to normal. Recover the execution of the daemons that need to access i2c bus.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, having the daemons themselves handle i2c bus lock detection and recovery might be a good approach here, to minimize disruption and allow for finer-grained recovery.


### 10. Warmboot and Fastboot Design Impact

The feature does not support warmboot and fastboot, and it has nothing to do with warmboot and fastboot.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i2chealthd would get restarted during warm-boot and fast-reboot, right?
Does the I2C_ISOLATION_LIST need to be cleared, or will the isolation list be saved (i.e., do i2c lockups survive warm/fast boots)?

Test Case 3: Unplug a faulty transceiver. The faulty transceiver shall be removed from the isolation list.

Test Case 4: Power off the device. Plug a faulty transceiver on the device and power on the device. The i2c healthd shall detect the faulty transceiver, isolate the device and recover the i2c bus.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional test cases:

  • Simulate a faulty temperature sensor. Ensure that fans go to max RPM.
  • Recover the faulty temperature sensor. Ensure that fans resume normal operation.
  • Simulate other types of faulty devices, like fans, fan trays, and PSUs.
  • Warmboot/fastboot cases if applicable (see my earlier comment).
  • Swap multiple times between a faulty and good transceiver, and ensure that i2chealthd operates as designed.
  • Simulate a working but busy i2c bus. For example, a fan whose target RPM is constantly being updated due to environmental temperature fluctuations. Will i2chealthd consider it locked up?

Copy link

@jeff-yin jeff-yin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interested in this design, as we have seen i2c bus lock issues before on some of our platforms.

|24|ERROR|Failed to remove {}({}:{}) into the isolation list.|Failed to remove Ethernet1(77-2-72-1:25-0050) into the isolation list.|1-a. First option: remove the transceiver inserted in the port mentioned in the message. 1-b. Second option: reboot the device. 2. Run 'show techsupport' and provide the dump file to the community with the reported issue."|


#### 6.5. Platform Driver

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate the platform driver part?. What if the platform uses kernel driver like i2c-mux-pca954x?.


In the script of 'show techsupport', it contains the command to access i2c. It will not issue any i2c commands to the i2c devices in the isolation list to prevent from i2c bus lock.

### 7. High-Level Design

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this design work on PDDF supported platforms?.

- Reset I2C muxe devices to recover the lock state.
- It is known that the thermal sensors readings are accessed through the i2c bus. The fan speed control is also done through i2c bus. When i2c bus is locked, the daemon to enforce thermal policy is not working anymore. To alleviate the risk of the hazard of overheating, this design invokes the special script to set all of the fans on the switch device to full speed as long as the i2c bus health is back to normal. Given that the i2c operations to set fan speed will not be affected by the faulty i2c device, all of the fan speed can be set as full speed running.
- Perform the algorithm to find out the faulty device and put into the isolation list. Note that the algorithm skips checking on the device that is known to be faulty. In other words, the algorithm only scans on the healthy devices according to the existing isolation list. Only on the first occurrence of the i2c bus lock, all of the i2c devices will be scanned.
- After the isolation list is generated and the i2c bus health is back to normal. Recover the execution of the daemons that need to access i2c bus.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please elaborate it?. For example, if a transceiver causes the I2C bus lock, will xcvrd be stopped?. Restarting xcvrd in media_settings supported platform may cause traffic disruptions.

@zhangyanzhao zhangyanzhao moved this to 🏗 In Progress in SONiC 202505 Release Apr 1, 2025
@anilpannala anilpannala moved this from 🏗 In Progress to MovedToBacklog in SONiC 202505 Release May 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: MovedToBacklog

Development

Successfully merging this pull request may close these issues.

6 participants