HLD for Shutdown and Startup Fabric module #1694
Open: mlok-nokia wants to merge 3 commits into sonic-net:master from mlok-nokia:shutdown_startup_fabric_hld
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| # PMON Enhancement -- Shutdown/Startup SFM module | ||
| # High Level Design Document | ||
| ### Rev 1.0 | ||
| - [About this Manual](#about-this-manual) | ||
| - [Scope](#scope) | ||
| - [Background](#background) | ||
| - [1 Requirements Overview](#1-requirements-overview) | ||
| * [1.1 Functional Requirements](#11-functional-requirements) | ||
| - [2 Design Details](#2-design-details) | ||
| * [2.1 Using the existing CLI command "sudo config chassis module shutdown/startup <module_name>" by modifying/enhancing the chassis_module.py](#21-using-the-existing-cli-command--sudo-config-chassis-module-shutdown-startup--module-name---by-modifying-enhancing-the-chassis-modulepy) | ||
| * [2.2 Modify the vendor specified set_admin_state() in the module.py](#22-modify-the-vendor-specified-set-admin-state---in-the-modulepy) | ||
| * [2.3 Modify the ModuleUpdater in chassisd.](#23-modify-the-moduleupdater-in-chassisd) | ||
| - [3 Impact and Test Considerations](#3-impact-and-test-considerations) | ||
| * [3.1 Impact of the PCIed and Thermal sensors](#31-impact-of-the-pcied-and-thermal-sensors) | ||
| * [3.2 Test](#32-test) | ||
| - [4 References](#4-references) | ||
|
|
||
| <small><i><a href='http://ecotrust-canada.github.io/markdown-toc/'>Table of contents generated with markdown-toc</a></i></small> | ||
|
|
||
| ###### Revision | ||
| | Rev | Date | Author | Change Description | | ||
| |:---:|:-----------:|:----------------------------------------------------------------------------------:|-----------------------------------| | ||
| | 1.0 | 05/03/2024 | Marty Lok | Initial public version | | ||
|
|
||
| # About this Manual | ||
| This document describes the requirements for resetting a System Fabric Module (SFM) on a chassis and the planned design and code changes that support this enhancement. | ||
|
|
||
| # Scope | ||
| The scope of this specification is the use of the existing module shutdown/startup commands and the modifications needed for them to support resetting an SFM module. | ||
|
|
||
| # Background | ||
| When an SFM module in a chassis must be reseated or hot swapped, a proper shutdown and startup procedure needs to be followed so that the swss/syncd processes associated with that SFM module do not crash. | ||
|
|
||
|
|
||
| # 1 Requirements Overview | ||
| ## 1.1 Functional Requirements | ||
| This section describes the requirements for using the existing CLI command to shut down and start up an SFM module on a chassis system. | ||
| 1. Use the existing CLI command "sudo config chassis module shutdown/startup <module_name>" to shut down or start up an SFM module. | ||
| 2. A module remains in the down state if the system boots with a configuration file in which that module is set to the down state. | ||
|
|
||
| # 2 Design Details | ||
| The following subsections describe the implementation changes and modifications. | ||
| ## 2.1 Using the existing CLI command "sudo config chassis module shutdown/startup <module_name>" by modifying/enhancing the chassis_module.py | ||
| 1. Define a new method fabric_module_set_admin_status() that performs the following actions: | ||
| * Derive the list of ASIC numbers (asic_list) associated with this module_name from the CHASSIS_FABRIC_ASIC_TABLE in the CHASSIS_STATE_DB. | ||
| * For the shutdown case: | ||
| - Loop over asic_list and call "systemctl stop" to stop the related swss@ and syncd@ services. | ||
|
||
| - Delete the related CHASSIS_FABRIC_ASIC_TABLE entries from the CHASSIS_STATE_DB. | ||
| - Loop over asic_list and call "systemctl start" to start the related swss@ and syncd@ services. The association between a service's ASIC number and an SFM module is platform specific. If the services are not restarted here, the asic_list associated with this module cannot be derived when the user later issues the CLI command to start up this SFM module, because the CHASSIS_FABRIC_ASIC_TABLE entries have been deleted. | ||
|
||
|
|
||
| * For the startup case: | ||
| - Loop over asic_list and call "systemctl start" to start the related swss@ and syncd@ services. | ||
|
||
| 2. Modify the existing shutdown_chassis_module() and startup_chassis_module() methods to call fabric_module_set_admin_status(). | ||
|
||
| * To avoid a race condition in which chassisd re-populates the CHASSIS_FABRIC_ASIC_TABLE entry while the shutdown command is executing, the implementation must make sure the new admin_status has been written to Redis DB before it proceeds to stop the related swss/syncd services and remove the CHASSIS_FABRIC_ASIC_TABLE entry. The get_config_module_state_timeout() function is introduced to verify that the config value has been set in Redis DB. | ||
|
||
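The shutdown and startup sequence from step 1 of section 2.1 can be sketched as follows. This is a minimal illustration, not the code in chassis_module.py: the table is modelled as a plain dict, the key layout ("asic<N>" with a "name" field), the helper get_asic_list(), the `run` callable, and the module names in the usage note are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Table = Dict[str, Dict[str, str]]


def get_asic_list(fabric_asic_table: Table, module_name: str) -> List[str]:
    """Return the ASIC numbers whose entry names module_name.

    Assumed layout: key "asic<N>", field "name" holding the module name.
    """
    asics = (
        key[len("asic"):]
        for key, fields in fabric_asic_table.items()
        if key.startswith("asic") and fields.get("name") == module_name
    )
    return sorted(asics, key=int)


def fabric_module_set_admin_status(
    fabric_asic_table: Table,
    module_name: str,
    state: str,
    run: Callable[[List[str]], None],
) -> List[str]:
    """Apply the shutdown ("down") or startup ("up") sequence for one SFM module."""
    asic_list = get_asic_list(fabric_asic_table, module_name)
    services = ("swss", "syncd")

    if state == "down":
        # 1. Stop the per-ASIC services.
        for asic in asic_list:
            for svc in services:
                run(["systemctl", "stop", f"{svc}@{asic}.service"])
        # 2. Delete the related table entries.
        for asic in asic_list:
            fabric_asic_table.pop(f"asic{asic}", None)
        # 3. Start the services again, while asic_list is still known.
        for asic in asic_list:
            for svc in services:
                run(["systemctl", "start", f"{svc}@{asic}.service"])
    elif state == "up":
        for asic in asic_list:
            for svc in services:
                run(["systemctl", "start", f"{svc}@{asic}.service"])
    return asic_list
```

Passing `run` as a parameter keeps the sketch testable without systemd; on a real system it could be something like `subprocess.check_call`, which is again an assumption rather than what the PR does.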
|
|
||
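The race-condition guard described above amounts to polling the configured value until it matches or a timeout expires. The sketch below shows that pattern only. It is not the body of get_config_module_state_timeout(): the function name wait_for_admin_status(), its parameters, and the default timeout values are hypothetical, and the DB read is passed in as a callable so that no Redis API is assumed.

```python
import time
from typing import Callable, Optional


def wait_for_admin_status(
    read_admin_status: Callable[[], Optional[str]],
    expected: str,
    timeout_s: float = 5.0,      # assumed default, not taken from the HLD
    interval_s: float = 0.5,     # assumed default, not taken from the HLD
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the configured admin_status equals `expected`.

    Returns True once the value is observed, False if the timeout expires first.
    """
    deadline = clock() + timeout_s
    while True:
        if read_admin_status() == expected:
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would proceed to stop the services and delete the table entries only when this returns True, so that chassisd already sees the module as down before the entries disappear.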
| ## 2.2 Modify the vendor specified set_admin_state() in the module.py | ||
| Modify set_admin_state() in module.py so that it can shut down or start up an SFM module. This function is already called by the existing ConfigManagerTask class, which subscribes to CHASSIS_MODULE in the CONFIG_DB, to shut down or start up a module when the user issues the CLI command "sudo config chassis module shutdown/startup <module_name>" on the Supervisor. | ||
|
|
||
| ## 2.3 Modify the ModuleUpdater in chassisd. | ||
| Modify the ModuleUpdater class in chassisd to keep an SFM module in the down state when the system boots with a configuration that contains a shutdown of that SFM module. | ||
| 1. Add a new function get_module_admin_status() that reads the admin_status from CHASSIS_CONFIG_TABLE in CONFIG_DB. | ||
| 2. Modify module_db_update() to call get_module_admin_status() to check the configured state of the module. If module_cfg_status is not set to down, populate the CHASSIS_FABRIC_ASIC_TABLE. Otherwise, skip the module even if the SFM module is present. This mechanism prevents the event from being triggered in swss.sh when admin_status is set to the down state. | ||
|
|
||
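The gating rule in step 2 of section 2.3 can be illustrated with a small, self-contained function. It is a sketch under assumptions, not the chassisd code: populate_fabric_asic_table() is a hypothetical name, the configured states are passed in as a dict rather than read from CONFIG_DB, and a module with no config entry is treated as "not set to down".

```python
from typing import Dict, Iterable, Tuple


def populate_fabric_asic_table(
    present_asics: Iterable[Tuple[str, str]],
    admin_status_by_module: Dict[str, str],
) -> Dict[str, Dict[str, str]]:
    """Build table entries only for ASICs whose module is not configured down.

    present_asics yields (asic_key, module_name) pairs for ASICs that are
    physically present; the entry layout is assumed for illustration.
    """
    table: Dict[str, Dict[str, str]] = {}
    for asic_key, module_name in present_asics:
        # Skip the module when its configured admin_status is "down",
        # even though the hardware is present.
        if admin_status_by_module.get(module_name, "up") == "down":
            continue
        table[asic_key] = {"name": module_name}
    return table
```

With this rule a module that was shut down before a reboot never gets an entry, so nothing waiting on that entry is triggered for it.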
|
|
||
| # 3 Impact and Test Considerations | ||
| ## 3.1 Impact of the PCIed and Thermal sensors | ||
| For PCIed, based on the investigation, the current design of the Fabric module shutdown has no impact on the checking of the basic PCI components, but it may affect the checking of a Fabric module that has been shut down. PCIed checking should skip the slot of a Fabric module that has been shut down, and a Fabric module should only be added to the checking list when its power is on. This should be handled in the vendor-specific code. If the list of PCI components is built during system startup, it must be updated or rebuilt to exclude a Fabric module when that module is shut down, or the slot must be skipped dynamically during checking. | ||
|
|
||
| The thermal sensors of the Fabric card should also be handled by the vendor-specific code. If a module is shut down, the vendor sonic-platform thermal query should not return any entry for that particular slot. | ||
|
|
||
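The "dynamically skip that slot" option for PCIed, and the matching rule for thermal queries, both reduce to filtering a device list by the set of powered-off slots. The sketch below shows that filter in isolation. The device dict layout, the "slot" field, and the function name are hypothetical; the real check lives in vendor-specific platform code.

```python
from typing import Dict, List, Set


def devices_to_check(
    devices: List[Dict[str, str]],
    powered_off_slots: Set[str],
) -> List[Dict[str, str]]:
    """Return the devices that should still be checked.

    Devices without a "slot" field (basic components) are always kept;
    devices on a slot that has been shut down are skipped.
    """
    return [dev for dev in devices if dev.get("slot") not in powered_off_slots]
```

The same filter could be applied to a list of thermal sensors so that no entry is returned for a slot that is shut down.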
| ## 3.2 Test | ||
| Unit tests (UTs) are added to simulate the Fabric module shutdown and startup. | ||
|
|
||
| # 4 References | ||
| -TBD | ||
|
|
||
|
|
||