Conversation
|
/azp run |
|
No pipelines are associated with this pull request. |
|
/azp run |
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
|
/azp run |
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
|
community review recording https://zoom.us/rec/share/zT_4Jkb7t2whw3jdxVQhoyAe5umBlS08QgR-ESKNGeRuRGBITyHWOpLmy1yUdNhz.xHl7M4gfgCzasIQh |
|
Reviewer is expected, please leave your comments if you want to be a reviewer. Thanks. |
|
Dell volunteered to be the reviewer. Thanks @jeff-yin |
|
@CharlieChenEC can you please also add the code PRs later by referring to #806 ? Thanks. |
|
Team will continue the review in Platform workgroup. |
| The criteria to choose the representative i2c devices are shown below. | ||
| * In order to minimize the impact to the system performance, the chosen i2c device is seldom visited by the platform driver daemons. | ||
| * The polling interval shall be defined to a value that will not impact the system performance severely. | ||
|
|
There was a problem hiding this comment.
Consequently, if a representative device must be a transceiver, the i2chealthd daemon will compete for access with the xcvrd daemon.
The same could apply to fans and thermalctld
|
|
||
| ### 6. Architecture Design | ||
|
|
||
| #### 6.1. I2C Healthd |
There was a problem hiding this comment.
Is there an advantage to having a separate daemon polling i2c devices? I'm concerned about contention, especially on platforms that have BMC (baseboard management controllers) that may also perform or manage i2c access.
An alternative approach is to add the i2c bus lock detection capability into the individual platform daemons that already have exclusive or almost-exclusive control over the i2c devices. Then we wouldn't run the risk of competing for bus access. For example, xcvrd for transceivers, thermalctld for temp sensors and fans, psud for PSUs, etc. These daemons can manage and sequence their access to i2c, perhaps in a multi-threaded approach. Was this approach considered? If so, please add a section explaining why it wasn't selected.
|
|
||
| * Handling i2c bus lock | ||
| - Reset I2C muxe devices to recover the lock state. | ||
| - It is known that the thermal sensors readings are accessed through the i2c bus. The fan speed control is also done through i2c bus. When i2c bus is locked, the daemon to enforce thermal policy is not working anymore. To alleviate the risk of the hazard of overheating, this design invokes the special script to set all of the fans on the switch device to full speed as long as the i2c bus health is back to normal. Given that the i2c operations to set fan speed will not be affected by the faulty i2c device, all of the fan speed can be set as full speed running. |
There was a problem hiding this comment.
Special handling is needed to determine whether the i2c device is a temperature sensor or not.
Please clarify if there is special handling for other types of devices. How about the fans themselves? Transceivers, PSUs, etc.? If not, please indicate as such.
| - Reset I2C muxe devices to recover the lock state. | ||
| - It is known that the thermal sensors readings are accessed through the i2c bus. The fan speed control is also done through i2c bus. When i2c bus is locked, the daemon to enforce thermal policy is not working anymore. To alleviate the risk of the hazard of overheating, this design invokes the special script to set all of the fans on the switch device to full speed as long as the i2c bus health is back to normal. Given that the i2c operations to set fan speed will not be affected by the faulty i2c device, all of the fan speed can be set as full speed running. | ||
| - Perform the algorithm to find out the faulty device and put into the isolation list. Note that the algorithm skips checking on the device that is known to be faulty. In other words, the algorithm only scans on the healthy devices according to the existing isolation list. Only on the first occurrence of the i2c bus lock, all of the i2c devices will be scanned. | ||
| - After the isolation list is generated and the i2c bus health is back to normal. Recover the execution of the daemons that need to access i2c bus. |
There was a problem hiding this comment.
How to determine which daemons need to be recovered after the device(s) are reset? Should the daemons register that they are using a particular device? That would allow for finer-grained control of which services get affected by a lockup.
If we just blindly restart pmon docker, that would be very disruptive and it would disrupt traffic.
| - Reset I2C muxe devices to recover the lock state. | ||
| - It is known that the thermal sensors readings are accessed through the i2c bus. The fan speed control is also done through i2c bus. When i2c bus is locked, the daemon to enforce thermal policy is not working anymore. To alleviate the risk of the hazard of overheating, this design invokes the special script to set all of the fans on the switch device to full speed as long as the i2c bus health is back to normal. Given that the i2c operations to set fan speed will not be affected by the faulty i2c device, all of the fan speed can be set as full speed running. | ||
| - Perform the algorithm to find out the faulty device and put into the isolation list. Note that the algorithm skips checking on the device that is known to be faulty. In other words, the algorithm only scans on the healthy devices according to the existing isolation list. Only on the first occurrence of the i2c bus lock, all of the i2c devices will be scanned. | ||
| - After the isolation list is generated and the i2c bus health is back to normal. Recover the execution of the daemons that need to access i2c bus. |
There was a problem hiding this comment.
Again, having the daemons themselves handle i2c bus lock detection and recovery might be a good approach here, to minimize disruption and allow for finer-grained recovery.
|
|
||
| ### 10. Warmboot and Fastboot Design Impact | ||
|
|
||
| The feature does not support warmboot and fastboot, and it has nothing to do with warmboot and fastboot. |
There was a problem hiding this comment.
i2chealthd would get restarted during warm-boot and fast-reboot, right?
Does the I2C_ISOLATION_LIST need to be cleared, or will the isolation list be saved (i.e., do i2c lockups survive warm/fast boots)?
| Test Case 3: Unplug a faulty transceiver. The faulty transceiver shall be removed from the isolation list. | ||
|
|
||
| Test Case 4: Power off the device. Plug a faulty transceiver on the device and power on the device. The i2c healthd shall detect the faulty transceiver, isolate the device and recover the i2c bus. | ||
|
|
There was a problem hiding this comment.
Additional test cases:
- Simulate a faulty temperature sensor. Ensure that fans go to max RPM.
- Recover the faulty temperature sensor. Ensure that fans resume normal operation.
- Simulate other types of faulty devices, like fans, fan trays, and PSUs.
- Warmboot/fastboot cases if applicable (see my earlier comment).
- Swap multiple times between a faulty and good transceiver, and ensure that
i2chealthdoperates as designed. - Simulate a working but busy i2c bus. For example, a fan whose target RPM is constantly being updated due to environmental temperature fluctuations. Will
i2chealthdconsider it locked up?
jeff-yin
left a comment
There was a problem hiding this comment.
Very interested in this design, as we have seen i2c bus lock issues before on some of our platforms.
| |24|ERROR|Failed to remove {}({}:{}) into the isolation list.|Failed to remove Ethernet1(77-2-72-1:25-0050) into the isolation list.|1-a. First option: remove the transceiver inserted in the port mentioned in the message. 1-b. Second option: reboot the device. 2. Run 'show techsupport' and provide the dump file to the community with the reported issue."| | ||
|
|
||
|
|
||
| #### 6.5. Platform Driver |
There was a problem hiding this comment.
Could you please elaborate the platform driver part?. What if the platform uses kernel driver like i2c-mux-pca954x?.
|
|
||
| In the script of 'show techsupport', it contains the command to access i2c. It will not issue any i2c commands to the i2c devices in the isolation list to prevent from i2c bus lock. | ||
|
|
||
| ### 7. High-Level Design |
There was a problem hiding this comment.
how does this design work on PDDF supported platforms?.
| - Reset I2C muxe devices to recover the lock state. | ||
| - It is known that the thermal sensors readings are accessed through the i2c bus. The fan speed control is also done through i2c bus. When i2c bus is locked, the daemon to enforce thermal policy is not working anymore. To alleviate the risk of the hazard of overheating, this design invokes the special script to set all of the fans on the switch device to full speed as long as the i2c bus health is back to normal. Given that the i2c operations to set fan speed will not be affected by the faulty i2c device, all of the fan speed can be set as full speed running. | ||
| - Perform the algorithm to find out the faulty device and put into the isolation list. Note that the algorithm skips checking on the device that is known to be faulty. In other words, the algorithm only scans on the healthy devices according to the existing isolation list. Only on the first occurrence of the i2c bus lock, all of the i2c devices will be scanned. | ||
| - After the isolation list is generated and the i2c bus health is back to normal. Recover the execution of the daemons that need to access i2c bus. |
There was a problem hiding this comment.
Could you please elaborate it?. For example, if a transceiver causes the I2C bus lock, will xcvrd be stopped?. Restarting xcvrd in media_settings supported platform may cause traffic disruptions.
It is known that the i2c bus will not work if there is an i2c device unexpectedly pulls i2c's SDA signal low. A detailed discussion of the I2C bus lock problem can be found at https://community.nxp.com/t5/i-MX-Processors/I2C-reset/m-p/253182.
We have observed that some transceivers inserted on the switch work fine at first, but after a while run into the i2c bus lock problem. The only way to recover from the i2c bus lock state is to remove the faulty transceiver. However, there is no easy way to identify which transceiver is the one that is locking the i2c bus. The brute force method is to unplug the inserted transceivers one by one to find out. Once the faulty transceiver has been removed, the i2c-related error messages will no longer appear in the syslog.
The purpose of the HLD is to mitigate the aforementioned i2c problem caused by the transceiver. This HLD explains the design to monitor the i2c bus health status in the background, the way to restore the i2c bus by detecting the faulty i2c device, isolating the detected faulty device, and restoring the i2c bus to normal status.