[chassisd] Address the chassisd crash issue and add UT for it#573
Conversation
Signed-off-by: mlok <marty.lok@nokia.com>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@judyjoseph and @arlakshm This PR address the chassisd crash issue. I also add UT check module_db_update support slot number or slot string. Please review |
judyjoseph
left a comment
There was a problem hiding this comment.
LGTM.
@mlok-nokia can you also confirm it will not conflict with these two platform API's module.get_slot(), chassis.get_my_slot() which as per commens will return int.
|
@kenneth-arista @anamehra could you check in your platforms too if this change is ok ? |
|
@kenneth-arista, please review this change |
| module_info_dict = self._get_module_info(my_index) | ||
| if not self._is_supervisor(): | ||
| module_info_dict = self._get_module_info(my_index) | ||
| hostname_key = "{}{}".format(ModuleBase.MODULE_TYPE_LINE, int(self.my_slot) - 1) |
There was a problem hiding this comment.
What is the expectation here? This makes an assumption that line card slots must be numbers, but should that be a requirement? Should this be using my_index instead? Or should there be some logic to handle if the LC slot was non-numeric?
There was a problem hiding this comment.
Move the self._get_module_info() call inside if condition since my_slot is only use inside. Not need to call self._get_module_info(0 if it not required.
There was a problem hiding this comment.
But this is still requiring that line cards must be named as numbers, only supervisors are allowed to be alphanumeric. Is that expected and required? Why can't alphanumeric line card names also be supported since this was already found to be a limitation of supervisor naming?
There was a problem hiding this comment.
Yes. Linecard slot must be number string or number. It is required. Otherwise, we will not able construct the linecard name. With the existing implementation, Linecard slot is a number or a number string, it will work. BTW, this change just moved the _get_module_info() call into "if" section since we only use it inside the "if".
There was a problem hiding this comment.
The line card name could be constructed if the slot wasn't numeric (like the supervisor) as well, the name is still just a string. I'm not sure what else relies on line card slots being numeric though, so this should be fine for now. It may be worth looking into what changes would be required to support alphanumeric slots for line cards as well, but not required for this change.
There was a problem hiding this comment.
Agree. For the long term solution, we should introduce label attribute to support alphanumeric slot.
| module_info_dict[CHASSIS_MODULE_INFO_NAME_FIELD] = name | ||
| module_info_dict[CHASSIS_MODULE_INFO_DESC_FIELD] = str(desc) | ||
| module_info_dict[CHASSIS_MODULE_INFO_SLOT_FIELD] = str(slot) | ||
| module_info_dict[CHASSIS_MODULE_INFO_SLOT_FIELD] = slot |
There was a problem hiding this comment.
This may be an unexpected behavior change. slot can still be an integer, so module_info_dict[CHASSIS_MODULE_INFO_SLOT_FIELD] can now have varying types. Previously when slot was always int, all logic was expecting this field to be str` since it was converted here (though I'm not sure any logic relies on that type). It may be best to revert this change to maintain a consistent (and existing) type requirement.
I do see there is a conversion for one of the usages of this dict entry to str to account for this, but it may just be best to maintain the expectation that this dict entry is always str from this point, as it was previously.
There was a problem hiding this comment.
The fundamental issue is that a slot value on a chassis may not be a number, it could be a string -- such as Nokia IXRE7250 Supervisor card which is slot "A".
It is not necessary to convert slot value to a string in the module_info_dict[] here and then convert it back to a number when use it later. We can directly use it as what function get_slot() returned. For consistency, the only place is needed to be converted to a string when it is stored into a STATE_DB. This change simplfy it and no need to convert it back and forth (StringToNumber or NumberToString) when use the slot info in module_info_dict[]. This change will address this issue forever. Otherwise, another solution could be a big change.
There was a problem hiding this comment.
Sounds good. My concern was mostly that it may be unclear to developers that the type stored in this dict field can vary. Functionally it seems fine. The documentation of APIs such as get_slot and get_my_slot should be updated as @judyjoseph previously mentioned; they not need not return only an int, they may also return str.
There was a problem hiding this comment.
Sounds good. My concern was mostly that it may be unclear to developers that the type stored in this dict field can vary. Functionally it seems fine. The documentation of APIs such as
get_slotandget_my_slotshould be updated as @judyjoseph previously mentioned; they not need not return only an int, they may also return str.
Agree. For the long run, this needs to be addressed.
Yes. Based on the design, both functions should return the same type and same value. Otherwise, the existing code doesn't work. |
|
@judyjoseph, @spilkey-cisco if the comments are resolved please help approve this PR. @kenneth-arista please signoff if this change works on Arista platform. |
kenneth-arista
left a comment
There was a problem hiding this comment.
I tested this quickly and the change works for me.
…net#573) Description On Nokia platform, slot name of Supervisor is string "A" instead of a number. Using "int" to convert it could cause issue backtrace. We should use slot value to any checking without any conversion. This will fixes sonic-net/sonic-buildimage#21131 Motivation and Context Modify the _get_module_info not to convert "slot" to a string value. And also modify the code not to convert slot value to an to do any checking. Just directly use the returned value of get_slot(). Also add UT test_moduleupdater_check_slot_string() to valid it. How Has This Been Tested? Tested on 202405 branch Signed-off-by: mlok <marty.lok@nokia.com>
|
Cherry-pick PR to 202405: #580 |
Description On Nokia platform, slot name of Supervisor is string "A" instead of a number. Using "int" to convert it could cause issue backtrace. We should use slot value to any checking without any conversion. This will fixes sonic-net/sonic-buildimage#21131 Motivation and Context Modify the _get_module_info not to convert "slot" to a string value. And also modify the code not to convert slot value to an to do any checking. Just directly use the returned value of get_slot(). Also add UT test_moduleupdater_check_slot_string() to valid it. How Has This Been Tested? Tested on 202405 branch Signed-off-by: mlok <marty.lok@nokia.com>
…net#573) Description On Nokia platform, slot name of Supervisor is string "A" instead of a number. Using "int" to convert it could cause issue backtrace. We should use slot value to any checking without any conversion. This will fixes sonic-net/sonic-buildimage#21131 Motivation and Context Modify the _get_module_info not to convert "slot" to a string value. And also modify the code not to convert slot value to an to do any checking. Just directly use the returned value of get_slot(). Also add UT test_moduleupdater_check_slot_string() to valid it. How Has This Been Tested? Tested on 202405 branch Signed-off-by: mlok <marty.lok@nokia.com>
…evice is in detaching mode (#546) * Skip logging the warning, if device is in detaching mode * Add detach_info table and unittests * Fix unit tests * Increase code coverage * Remove unused header import * Fix dict get values * Increase code coverage * Increase test coverage * [SmartSwitch] Extend implementation of the DPU chassis daemon. (#563) * Addition of DPU Chassis for thermalctld (#564) * [stormond] Added new dynamic field 'last_sync_time' to STATE_DB (#535) * Added new dynamic field 'last_sync_time' that shows when STORAGE_INFO for disk was last synced to STATE_DB * Moved 'start' message to actual starting point of the daemon * Added functions for formatted and epoch time for user friendly time display * Made changes per prgeor review comments * Pivot to SysLogger for all logging * Increased log level so that they are seen in syslogs * Code coverage improvement * [lag_id] Add lagid to free_list when LC absent for 30 minutes (#542) When LC is absent for 30 minutes, the database cleanup kicks in. When LagId is released, it needs to be appended to the SYSTEM_LAG_IDS_FREE_LIST This PR works with the following 2 PRs: sonic-net/sonic-swss#3303 sonic-net/sonic-buildimage#20369 Signed-off-by: mlok <marty.lok@nokia.com> * Fixed bug in chassisd causing incorrect number of ASICs in CHASSIS_STATE_DB (#560) Fixed the bug in chassisd due to which incorrect number of ASICs were being pushed to CHASSIS_STATE_DB. * thermalctld: Add support for fans on non-CPU modules (#555) * thermalctld: Add support for fans on non-CPU modules * Add module fan to unit tests * Advanced Azure pipeline to Bookworm (#572) Description This PR advances the azure pipeline on sonic_platform_daemons from bullseye to bookworm. This fixes the issue where sonic-platform-daemons azp is having some issues due to upgrade to bookworm. See Pipelines - Run 20241210.8 logs for details. * Take non-CMIS xcvrs out of lpmode in SFF Manager (#565) Description Fix non-CMIS transceivers in down state by bringing them out of low power mode in the SFF Manager Task. This is intended to work together with the change in sonic-net/sonic-buildimage#20886. Motivation and Context Non-CMIS transceivers were not functioning correctly when put into Low Power mode. So XCVRD now brings them out of lpmode. How Has This Been Tested? Loaded an image containing this change alongside the change from sonic-net/sonic-buildimage#20886 on an Arista chassis containing a Clearwater2 linecard. Verified that without this image some interfaces were in a down state but with the image all interfaces came up as expected. * Added SmartSwitch support in chassisd and enabling chassisd (#467) Added SmartSwitch support in chassisd and enabling chassisd * [chassis][psud] Move the PSU parent information generation to the loop run function from the initialization function (#576) Description Move the PSU parent information generation to the loop run function from the initialization function Motivation and Context Fixes #575 How Has This Been Tested? Tested on Cisco chassis, the PHYSICAL_ENTITY_INFO|PSU * can be re-inserted after thermalctld restart. And monitored the stated db for memory for hours, works well: * [chassisd] Address the chassisd crash issue and add UT for it (#573) Description On Nokia platform, slot name of Supervisor is string "A" instead of a number. Using "int" to convert it could cause issue backtrace. We should use slot value to any checking without any conversion. This will fixes sonic-net/sonic-buildimage#21131 Motivation and Context Modify the _get_module_info not to convert "slot" to a string value. And also modify the code not to convert slot value to an to do any checking. Just directly use the returned value of get_slot(). Also add UT test_moduleupdater_check_slot_string() to valid it. How Has This Been Tested? Tested on 202405 branch Signed-off-by: mlok <marty.lok@nokia.com> * Fix a comment --------- Signed-off-by: mlok <marty.lok@nokia.com> Co-authored-by: Oleksandr Ivantsiv <oivantsiv@nvidia.com> Co-authored-by: Gagan Punathil Ellath <gpunathilell@nvidia.com> Co-authored-by: Ashwin Srinivasan <93744978+assrinivasan@users.noreply.github.com> Co-authored-by: Marty Y. Lok <76118573+mlok-nokia@users.noreply.github.com> Co-authored-by: Vivek Verma <137406113+vivekverma-arista@users.noreply.github.com> Co-authored-by: Patrick MacArthur <pmacarthur@arista.com> Co-authored-by: Peter Bailey <peterbailey@arista.com> Co-authored-by: rameshraghupathy <43161235+rameshraghupathy@users.noreply.github.com> Co-authored-by: Jianquan Ye <jianquanye@microsoft.com>
Description
On Nokia platform, slot name of Supervisor is string "A" instead of a number. Using "int" to convert it could cause issue backtrace. We should use slot value to any checking without any conversion. This will fixes sonic-net/sonic-buildimage#21131
Motivation and Context
Modify the _get_module_info not to convert "slot" to a string value. And also modify the code not to convert slot value to an to do any checking. Just directly use the returned value of get_slot(). Also add UT test_moduleupdater_check_slot_string() to valid it.
How Has This Been Tested?
Tested on 202405 branch
Which release branch to backport (provide reason below if selected)
Additional Information (Optional)