[Smartswitch] Add module specific pcie attach/detach functions for smartswitch platforms#557
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
@vvolam @rameshraghupathy Please review |
|
/azpw run |
|
/AzurePipelines run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
rameshraghupathy
left a comment
There was a problem hiding this comment.
@vvolam @gpunathilell Adding @bmridul to review
There was a problem hiding this comment.
Pull Request Overview
This PR introduces new PCI attach/detach functions and related state database operations for smartswitch platforms, along with comprehensive tests for these new functionalities.
- Added new methods to ModuleBase for handling PCI removal and reattach operations via platform.json.
- Introduced a file-based locking mechanism for PCI operations and state database updates.
- Extended test coverage for PCI-related functionalities.
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/module_base_test.py | Added tests for PCI bus retrieval, state DB entry, locking & PCI operations |
| sonic_platform_base/module_base.py | Added functions for PCI removal/reattach and file-based PCI locking |
sonic_platform_base/module_base.py
Outdated
| self.state_db_connector.hset(PCIE_DETACH_INFO_TABLE_KEY, "bus_info", pcie_string) | ||
| self.state_db_connector.hset(PCIE_DETACH_INFO_TABLE_KEY, "dpu_state", operation) | ||
| except Exception as e: | ||
| sys.stderr.write("Failed to write pcie bus infoto state database: {}\n".format(str(e))) |
There was a problem hiding this comment.
There is a spelling error in the error message ('infoto' should be 'info to'). Please correct the typo.
| sys.stderr.write("Failed to write pcie bus infoto state database: {}\n".format(str(e))) | |
| sys.stderr.write("Failed to write pcie bus info to state database: {}\n".format(str(e))) |
sonic_platform_base/module_base.py
Outdated
| except Exception as e: | ||
| sys.stderr.write("Failed to write pcie bus infoto state database: {}\n".format(str(e))) | ||
|
|
||
| def pci_reattach_from_platform_json(self): |
There was a problem hiding this comment.
@vvolam @gpunathilell This class is a module specific class and we are trying to rescan not only all the modules but also other PCI devices on the box which are not even considered as modules. Can you move this to your module.py ?
There was a problem hiding this comment.
@rameshraghupathy Correct me if I am wrong, but there is no other way of performing the rescan (unless the DPUs are under the same parent bus, but doing it in the current method would work even if the DPUs are not under the same bus, so the current implementation is the most generic)
https://github.com/torvalds/linux/blob/7a13c14ee59d4f6c5f4277a86516cbc73a1383a8/Documentation/ABI/testing/sysfs-bus-pci#L74
If that doesn't work then we should be implementing a platform specific pcie_reattach() function.
There was a problem hiding this comment.
As discussed in the HLD meeting, we will look into moving the rescan for the whole PCIE tree to chassis instead of module
There was a problem hiding this comment.
@rameshraghupathy I had a discussion with Vasundhara, we can move the implementation for the rescan to chassis, but we would have to call both chassis rescan and module rescan separately each time (since the pci bus removal is module specific). So the plan is to remove the independent implementation currently present pci_reattach_from_platform_json and pci_removal_from_platform_json. So pci_detach() and pci_rescan() have to be implemented by vendor.
There was a problem hiding this comment.
@vvolam @rameshraghupathy I have removed the relevant pcie removal code, Please re review
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
vvolam
left a comment
There was a problem hiding this comment.
LGTM, other than these minor query/comment
| """ | ||
| # Device type definition. Note, this is a constant. | ||
| DEVICE_TYPE = "module" | ||
| PCI_OPERATION_LOCK_FILE_PATH = "/var/lock/{}_pci.lock" |
There was a problem hiding this comment.
Query: Is this file separate for each module?
There was a problem hiding this comment.
the lock should be valid across all modules, so we are using a generic lock file
There was a problem hiding this comment.
As discussed offline, could you check if this lock is still required?
There was a problem hiding this comment.
Clarification, File lock is required, and the lock is applicable per module, just to prevent reattach of the module while we are removing the module
| try: | ||
| bus_info_list = self.get_pci_bus_info() | ||
| with self._pci_operation_lock(): | ||
| for bus in bus_info_list: |
There was a problem hiding this comment.
should we do the attach first and then remove the entry just to be safe?
There was a problem hiding this comment.
Order is changed
sonic_platform_base/module_base.py
Outdated
| if self.pci_bus_info: | ||
| return self.pci_bus_info | ||
| try: | ||
| with open("/usr/share/sonic/platform/platform.json", 'r') as f: |
There was a problem hiding this comment.
shouldn't this be "/usr/share/sonic/device/{platform}/platform.json" ?
There was a problem hiding this comment.
Function removed
sonic_platform_base/module_base.py
Outdated
| """ | ||
| raise NotImplementedError | ||
|
|
||
| def get_pci_bus_from_platform_json(self): |
There was a problem hiding this comment.
This function is not used anywhere in this file. This looks like an utility function. Shouldn't this be under utilities?
There was a problem hiding this comment.
Unused function, this is removed
sonic_platform_base/module_base.py
Outdated
| with self._pci_operation_lock(): | ||
| for bus in bus_info_list: | ||
| self.pci_entry_state_db(bus, PCIE_OPERATION_ATTACHING) | ||
| return self.pci_reattach() |
There was a problem hiding this comment.
Is the intent to attach one module at a time?
There was a problem hiding this comment.
In case of our platform, we have multiple buses, so the state db entry is added in series, but the reattach function is called only once
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| import swsscommon | ||
| PCIE_DETACH_INFO_TABLE_KEY = PCIE_DETACH_INFO_TABLE+"|"+pcie_string | ||
| if not self.state_db_connector: | ||
| self.state_db_connector = swsscommon.swsscommon.DBConnector("STATE_DB", 0) |
There was a problem hiding this comment.
Check if state_db connection is successful?
| self.pci_entry_state_db(bus, PCIE_OPERATION_DETACHING) | ||
| return self.pci_detach() | ||
| except Exception: | ||
| return False |
There was a problem hiding this comment.
Can you add an error message in case of failure?
| self.pci_entry_state_db(bus, PCIE_OPERATION_ATTACHING) | ||
| return return_value | ||
| except Exception: | ||
| return False |
There was a problem hiding this comment.
Can you add an error message in case of failure?
| os.system("service sensord restart") | ||
|
|
||
| return True | ||
| except Exception as e: |
There was a problem hiding this comment.
Can you add an error message in case of failure?
| os.system("service sensord restart") | ||
|
|
||
| return True | ||
| except Exception as e: |
There was a problem hiding this comment.
Can you add an error message in case of failure?
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
sonic_platform_base/module_base.py
Outdated
| def handle_pci_removal(self): | ||
| """ | ||
| Handles PCI device removal by updating state database and detaching device. | ||
| If pci_detach is not implemented, falls back to platform.json based removal. |
There was a problem hiding this comment.
This comment needs to be fixed as we are not having any fallback mechanism now.
sonic_platform_base/module_base.py
Outdated
| def handle_pci_rescan(self): | ||
| """ | ||
| Handles PCI device rescan by updating state database and reattaching device. | ||
| If pci_reattach is not implemented, falls back to platform.json based rescan. |
There was a problem hiding this comment.
Same comment, fallback comment needs to be fixed.
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Cherry-pick PR to 202505: #576 |
Description
As there could be platforms which have multiple PCIE devices per dpu, the module_base implementation is handling the PCIE removal and attachment along with adding the entry details in the PCIE table, so that the pcie daemon which is running will ignore the errors generated from the DPUs,
Motivation and Context
This was done because there are multiple PCIE devices in some platforms and only one in others, we need to have a platform independent method for removal and attach algorithms along with state db entries
How Has This Been Tested?
Additional Information (Optional)