I2C Bus Lock Prevention HLD#1907

Open

CharlieChenEC wants to merge 3 commits intosonic-net:masterfrom

CharlieChenEC:i2c_lock_prevention

CharlieChenEC commented Feb 5, 2025 •

edited

Loading

It is known that the i2c bus will not work if there is an i2c device unexpectedly pulls i2c's SDA signal low. A detailed discussion of the I2C bus lock problem can be found at https://community.nxp.com/t5/i-MX-Processors/I2C-reset/m-p/253182.

We have observed that some transceivers inserted on the switch work fine at first, but after a while run into the i2c bus lock problem. The only way to recover from the i2c bus lock state is to remove the faulty transceiver. However, there is no easy way to identify which transceiver is the one that is locking the i2c bus. The brute force method is to unplug the inserted transceivers one by one to find out. Once the faulty transceiver has been removed, the i2c-related error messages will no longer appear in the syslog.

The purpose of the HLD is to mitigate the aforementioned i2c problem caused by the transceiver. This HLD explains the design to monitor the i2c bus health status in the background, the way to restore the i2c bus by detecting the faulty i2c device, isolating the detected faulty device, and restoring the i2c bus to normal status.

Repo	PR title	State
sonic-buildimage	TBD
sonic-platform-common	TBD
sonic-platform-daemons	TBD


          I2C Bus Lock Prevention HLD

8ba395c

Collaborator

mssonicbld commented Feb 5, 2025

/azp run

azure-pipelines bot commented Feb 5, 2025

No pipelines are associated with this pull request.


          I2C Bus Lock Prevention HLD: Revise some typos and wordings.

d4a235f

Collaborator

mssonicbld commented Mar 10, 2025

/azp run

azure-pipelines bot commented Mar 10, 2025

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.


          I2C Bus Lock Prevention HLD: Revise some typos and wordings.

4818f2e

Collaborator

mssonicbld commented Mar 11, 2025

/azp run

azure-pipelines bot commented Mar 11, 2025

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

Collaborator

zhangyanzhao commented Mar 11, 2025

community review recording https://zoom.us/rec/share/zT_4Jkb7t2whw3jdxVQhoyAe5umBlS08QgR-ESKNGeRuRGBITyHWOpLmy1yUdNhz.xHl7M4gfgCzasIQh

Collaborator

zhangyanzhao commented Mar 11, 2025

Reviewer is expected, please leave your comments if you want to be a reviewer. Thanks.

Collaborator

zhangyanzhao commented Mar 11, 2025

Dell volunteered to be the reviewer. Thanks @jeff-yin

zhangyanzhao requested a review from jeff-yin

March 11, 2025 15:39

Collaborator

zhangyanzhao commented Mar 11, 2025

@CharlieChenEC can you please also add the code PRs later by referring to #806 ? Thanks.

Collaborator

zhangyanzhao commented Mar 11, 2025

Team will continue the review in Platform workgroup.

Collaborator

zhangyanzhao commented Mar 11, 2025

https://lists.sonicfoundation.dev/g/platform-os-working-group/

jeff-yin reviewed

View reviewed changes

doc/i2c_bus_lock_prevention/i2c_bus_lock_prevention_hld.md Show resolved Hide resolved

jeff-yin reviewed

View reviewed changes

doc/i2c_bus_lock_prevention/i2c_bus_lock_prevention_hld.md

+              The criteria to choose the representative i2c devices are shown below.
+              * In order to minimize the impact to the system performance, the chosen i2c device is seldom visited by the platform driver daemons.
+              * The polling interval shall be defined to a value that will not impact the system performance severely.

jeff-yin Mar 13, 2025

Consequently, if a representative device must be a transceiver, the i2chealthd daemon will compete for access with the xcvrd daemon.

The same could apply to fans and thermalctld

jeff-yin reviewed

View reviewed changes

doc/i2c_bus_lock_prevention/i2c_bus_lock_prevention_hld.md


		### 6. Architecture Design

		#### 6.1. I2C Healthd

jeff-yin Mar 13, 2025

Is there an advantage to having a separate daemon polling i2c devices? I'm concerned about contention, especially on platforms that have BMC (baseboard management controllers) that may also perform or manage i2c access.

An alternative approach is to add the i2c bus lock detection capability into the individual platform daemons that already have exclusive or almost-exclusive control over the i2c devices. Then we wouldn't run the risk of competing for bus access. For example, xcvrd for transceivers, thermalctld for temp sensors and fans, psud for PSUs, etc. These daemons can manage and sequence their access to i2c, perhaps in a multi-threaded approach. Was this approach considered? If so, please add a section explaining why it wasn't selected.

jeff-yin reviewed

View reviewed changes

doc/i2c_bus_lock_prevention/i2c_bus_lock_prevention_hld.md

+              * Handling i2c bus lock
+                  - Reset I2C muxe devices to recover the lock state.
+                  - It is known that the thermal sensors readings are accessed through the i2c bus. The fan speed control is also done through i2c bus. When i2c bus is locked, the daemon to enforce thermal policy is not working anymore. To alleviate the risk of the hazard of overheating, this design invokes the special script to set all of the fans on the switch device to full speed as long as the i2c bus health is back to normal. Given that the i2c operations to set fan speed will not be affected by the faulty i2c device, all of the fan speed can be set as full speed running.

jeff-yin Mar 13, 2025

Special handling is needed to determine whether the i2c device is a temperature sensor or not.

Please clarify if there is special handling for other types of devices. How about the fans themselves? Transceivers, PSUs, etc.? If not, please indicate as such.

jeff-yin reviewed

View reviewed changes

doc/i2c_bus_lock_prevention/i2c_bus_lock_prevention_hld.md

+                  - Reset I2C muxe devices to recover the lock state.
+                  - It is known that the thermal sensors readings are accessed through the i2c bus. The fan speed control is also done through i2c bus. When i2c bus is locked, the daemon to enforce thermal policy is not working anymore. To alleviate the risk of the hazard of overheating, this design invokes the special script to set all of the fans on the switch device to full speed as long as the i2c bus health is back to normal. Given that the i2c operations to set fan speed will not be affected by the faulty i2c device, all of the fan speed can be set as full speed running.
+                  - Perform the algorithm to find out the faulty device and put into the isolation list. Note that the algorithm skips checking on the device that is known to be faulty. In other words, the algorithm only scans on the healthy devices according to the existing isolation list. Only on the first occurrence of the i2c bus lock, all of the i2c devices will be scanned.
+                  - After the isolation list is generated and the i2c bus health is back to normal. Recover the execution of the daemons that need to access i2c bus.

jeff-yin Mar 13, 2025

How to determine which daemons need to be recovered after the device(s) are reset? Should the daemons register that they are using a particular device? That would allow for finer-grained control of which services get affected by a lockup.
If we just blindly restart pmon docker, that would be very disruptive and it would disrupt traffic.

jeff-yin reviewed

View reviewed changes

doc/i2c_bus_lock_prevention/i2c_bus_lock_prevention_hld.md

+                  - Reset I2C muxe devices to recover the lock state.
+                  - It is known that the thermal sensors readings are accessed through the i2c bus. The fan speed control is also done through i2c bus. When i2c bus is locked, the daemon to enforce thermal policy is not working anymore. To alleviate the risk of the hazard of overheating, this design invokes the special script to set all of the fans on the switch device to full speed as long as the i2c bus health is back to normal. Given that the i2c operations to set fan speed will not be affected by the faulty i2c device, all of the fan speed can be set as full speed running.
+                  - Perform the algorithm to find out the faulty device and put into the isolation list. Note that the algorithm skips checking on the device that is known to be faulty. In other words, the algorithm only scans on the healthy devices according to the existing isolation list. Only on the first occurrence of the i2c bus lock, all of the i2c devices will be scanned.
+                  - After the isolation list is generated and the i2c bus health is back to normal. Recover the execution of the daemons that need to access i2c bus.

jeff-yin Mar 13, 2025

Again, having the daemons themselves handle i2c bus lock detection and recovery might be a good approach here, to minimize disruption and allow for finer-grained recovery.

jeff-yin reviewed

View reviewed changes

doc/i2c_bus_lock_prevention/i2c_bus_lock_prevention_hld.md


		### 10. Warmboot and Fastboot Design Impact

		The feature does not support warmboot and fastboot, and it has nothing to do with warmboot and fastboot.

jeff-yin Mar 13, 2025

i2chealthd would get restarted during warm-boot and fast-reboot, right?
Does the I2C_ISOLATION_LIST need to be cleared, or will the isolation list be saved (i.e., do i2c lockups survive warm/fast boots)?

jeff-yin reviewed

View reviewed changes

doc/i2c_bus_lock_prevention/i2c_bus_lock_prevention_hld.md

		Test Case 3: Unplug a faulty transceiver. The faulty transceiver shall be removed from the isolation list.

		Test Case 4: Power off the device. Plug a faulty transceiver on the device and power on the device. The i2c healthd shall detect the faulty transceiver, isolate the device and recover the i2c bus.

jeff-yin Mar 13, 2025

Additional test cases:

Simulate a faulty temperature sensor. Ensure that fans go to max RPM.
Recover the faulty temperature sensor. Ensure that fans resume normal operation.
Simulate other types of faulty devices, like fans, fan trays, and PSUs.
Warmboot/fastboot cases if applicable (see my earlier comment).
Swap multiple times between a faulty and good transceiver, and ensure that i2chealthd operates as designed.
Simulate a working but busy i2c bus. For example, a fan whose target RPM is constantly being updated due to environmental temperature fluctuations. Will i2chealthd consider it locked up?

jeff-yin requested changes

View reviewed changes

jeff-yin left a comment

Very interested in this design, as we have seen i2c bus lock issues before on some of our platforms.

aravindmani-1 suggested changes

View reviewed changes

doc/i2c_bus_lock_prevention/i2c_bus_lock_prevention_hld.md

		\|24\|ERROR\|Failed to remove {}({}:{}) into the isolation list.\|Failed to remove Ethernet1(77-2-72-1:25-0050) into the isolation list.\|1-a. First option: remove the transceiver inserted in the port mentioned in the message. 1-b. Second option: reboot the device. 2. Run 'show techsupport' and provide the dump file to the community with the reported issue."\|


		#### 6.5. Platform Driver

aravindmani-1 Mar 14, 2025

Could you please elaborate the platform driver part?. What if the platform uses kernel driver like i2c-mux-pca954x?.

doc/i2c_bus_lock_prevention/i2c_bus_lock_prevention_hld.md


		In the script of 'show techsupport', it contains the command to access i2c. It will not issue any i2c commands to the i2c devices in the isolation list to prevent from i2c bus lock.

		### 7. High-Level Design

aravindmani-1 Mar 14, 2025

how does this design work on PDDF supported platforms?.

doc/i2c_bus_lock_prevention/i2c_bus_lock_prevention_hld.md

+                  - Reset I2C muxe devices to recover the lock state.
+                  - It is known that the thermal sensors readings are accessed through the i2c bus. The fan speed control is also done through i2c bus. When i2c bus is locked, the daemon to enforce thermal policy is not working anymore. To alleviate the risk of the hazard of overheating, this design invokes the special script to set all of the fans on the switch device to full speed as long as the i2c bus health is back to normal. Given that the i2c operations to set fan speed will not be affected by the faulty i2c device, all of the fan speed can be set as full speed running.
+                  - Perform the algorithm to find out the faulty device and put into the isolation list. Note that the algorithm skips checking on the device that is known to be faulty. In other words, the algorithm only scans on the healthy devices according to the existing isolation list. Only on the first occurrence of the i2c bus lock, all of the i2c devices will be scanned.
+                  - After the isolation list is generated and the i2c bus health is back to normal. Recover the execution of the daemons that need to access i2c bus.

aravindmani-1 Mar 14, 2025

Could you please elaborate it?. For example, if a transceiver causes the I2C bus lock, will xcvrd be stopped?. Restarting xcvrd in media_settings supported platform may cause traffic disruptions.

zhangyanzhao added this to SONiC 202505 Release

zhangyanzhao moved this to 🏗 In Progress in SONiC 202505 Release

anilpannala moved this from 🏗 In Progress to MovedToBacklog in SONiC 202505 Release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet