[chassis][midplane] Modify the chassisd to log expected/unexpected midplane connectivity messages #480
@deepak-singhal0408 @judyjoseph This PR addresses the issue of logging lost midplane connectivity. There are 3 PRs in total. Please review them. Thanks |
|
Can you provide details (schema) on the Chassis Module Reboot Info table which is introduced here? |
|
It is not clear why the Chassis module reboot info entry needs to be removed from platform-specific code. Isn't this handled entirely in SONiC common code? |
d38ebe6 to 386748a
On the Nokia platform, one of the unexpected reboots (the missing-heartbeat reboot) calls "sudo reboot". Since "sudo reboot" creates the CHASSIS_MODULE_REBOOT_INFO_TABLE entry that marks a reboot as expected, we need to remove the entry in this case. This is platform-specific behavior. |
The CHASSIS_MODULE_REBOOT_INFO_TABLE is defined as below: Example: |
|
@mlok-nokia, could you please also add UT case? |
…dplane connectivity messages Signed-off-by: mlok <[email protected]>
386748a to 918461f
Added a mechanism to read the linecard_reboot_timeout value from the platform_env.conf file. This allows each platform to use a different timeout value.
UT has been added |
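A minimal sketch of how the linecard_reboot_timeout lookup might work, assuming platform_env.conf holds simple key=value lines. The function name parse_linecard_reboot_timeout and the 180-second fallback are assumptions for illustration (the fallback mirrors the 3-minute window mentioned in the PR description); this is not the code in this PR.

```python
# Hypothetical helper: extract linecard_reboot_timeout from the text of a
# platform_env.conf file, assuming one key=value pair per line.
DEFAULT_LINECARD_REBOOT_TIMEOUT = 180  # seconds; assumed fallback (3 minutes)


def parse_linecard_reboot_timeout(conf_text, default=DEFAULT_LINECARD_REBOOT_TIMEOUT):
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() == "linecard_reboot_timeout":
            try:
                return int(value.strip())
            except ValueError:
                # Malformed value: fall back to the default.
                return default
    # Key not present: the platform uses the default timeout.
    return default
```

A platform that does not set the key, or sets it to a non-integer value, falls back to the default in this sketch.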
judyjoseph
left a comment
LGTM, do we need to define this new table here : https://github.com/sonic-net/sonic-swss-common/blob/master/common/schema.h#L440
|
@kenneth-arista could you review as well |
|
MSFT ADO: 28164958 |
… for Nokia-IXR7250E platform (#18862) This PR adds the platform-specific linecard_reboot_timeout value to platform_env.conf. It works with PR sonic-net/sonic-platform-daemons#480 and sonic-net/sonic-utilities#3292 to address issue #18540 Signed-off-by: mlok <[email protected]>
Modified the SUP chassisd check_midplane_reachability() function to use the CHASSIS_MODULE_REBOOT_INFO_TABLE data (which is set by the linecard "sudo reboot" command) to log whether a module's loss of midplane connectivity is expected or unexpected. This addresses issue sonic-net/sonic-buildimage#18540
Description
Add a new method is_module_reboot_expected() to check whether the CHASSIS_MODULE_REBOOT_INFO_TABLE|LINECARD# entry exists in CHASSIS_STATE_DB when a linecard is not reachable from the SUP. If the entry exists, the reboot is expected, and check_midplane_reachability() logs "pmon#chassisd: Expected: Module LINE-CARD1 lost midplane connectivity". If the entry doesn't exist, it logs "pmon#chassisd: Unexpected: Module LINE-CARD1 lost midplane connectivity". The CHASSIS_MODULE_REBOOT_INFO_TABLE|LINECARD# entry is created and inserted by the linecard "sudo reboot" command, which is handled in a separate PR. In other words, when a user issues a linecard reboot, "lost midplane connectivity" is expected. In any other case, such as a linecard crash or a missing-heartbeat reboot, it is unexpected.
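The expected/unexpected decision described above can be sketched as follows. This is an illustration only: a plain dict stands in for CHASSIS_STATE_DB, the key suffix is assumed to be the module name (the description writes it as LINECARD#), the entry's fields are left empty because the schema is not shown here, and midplane_loss_message is a hypothetical helper rather than a function from this PR.

```python
# Illustration only: a dict keyed by "TABLE|KEY" stands in for CHASSIS_STATE_DB.
CHASSIS_MODULE_REBOOT_INFO_TABLE = "CHASSIS_MODULE_REBOOT_INFO_TABLE"


def is_module_reboot_expected(state_db, module_name):
    # The reboot is treated as expected if the linecard's "sudo reboot"
    # left an entry for this module in the reboot-info table.
    key = "{}|{}".format(CHASSIS_MODULE_REBOOT_INFO_TABLE, module_name)
    return key in state_db


def midplane_loss_message(state_db, module_name):
    # Hypothetical helper: build the message logged when the module
    # is not reachable over the midplane.
    prefix = "Expected" if is_module_reboot_expected(state_db, module_name) else "Unexpected"
    return "{}: Module {} lost midplane connectivity".format(prefix, module_name)


# Usage: LINE-CARD1 was rebooted by a user, LINE-CARD2 was not.
state_db = {"CHASSIS_MODULE_REBOOT_INFO_TABLE|LINE-CARD1": {}}
print(midplane_loss_message(state_db, "LINE-CARD1"))
print(midplane_loss_message(state_db, "LINE-CARD2"))
```

The "pmon#chassisd:" prefix in the quoted log lines is assumed to be added by the logging path and is therefore not part of the message built here.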
Add new methods module_reboot_set_time() and is_module_reboot_system_up_expired() to check whether a linecard that went through an expected reboot has failed to come up and be detected by the SUP within 3 minutes. If so, check_midplane_reachability() logs "pmon#chassisd: Unexpected: Module LINE-CARD1 lost midplane connectivity". This gives the monitoring tool a log message on which to take further action.
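A rough sketch of the timeout check, under stated assumptions: the method names come from the description above, but the RebootTracker class, the in-memory timestamp store, and the injectable clock are hypothetical and chosen only to keep the example self-contained. The actual PR may record the time elsewhere.

```python
import time


class RebootTracker:
    """Hypothetical holder for per-module reboot start times."""

    def __init__(self, timeout=180, clock=time.time):
        self.timeout = timeout   # seconds; 180 matches the 3-minute window
        self.clock = clock       # injectable so the logic can be tested
        self._reboot_start = {}  # module name -> time the reboot was noted

    def module_reboot_set_time(self, module_name):
        # Record when the expected reboot was first observed.
        self._reboot_start[module_name] = self.clock()

    def is_module_reboot_system_up_expired(self, module_name):
        # True once the module has stayed down longer than the timeout.
        start = self._reboot_start.get(module_name)
        if start is None:
            return False
        return self.clock() - start > self.timeout
```

With this shape, the reachability check could report "Expected" while the window is still open and switch to "Unexpected" once is_module_reboot_system_up_expired() returns True.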
This PR is required and associated with the following PRs
PR sonic-net/sonic-buildimage#18805
sonic-net/sonic-utilities#3292
#480
sonic-net/sonic-buildimage#18862
Motivation and Context
This provides a proper log message indicating whether a module's "lost midplane connectivity" event is expected or not, so the monitoring tool has the information it needs to take further action. Fixes sonic-net/sonic-buildimage#18540
How Has This Been Tested?
This PR requires PR https://github.com/sonic-net/sonic-utilities/pull/3292 and to work with
Additional Information (Optional)
This PR needs to be backported to branches:
[x] 202205