[Smartswitch] Add module specific pcie attach/detach functions for smartswitch platforms #557

Merged
judyjoseph merged 15 commits into sonic-net:master from gpunathilell:pcie_changes
May 28, 2025

Conversation

@gpunathilell
Contributor

Description

Some platforms have multiple PCIe devices per DPU, so the module_base implementation handles PCIe removal and attachment and also records the entry details in the PCIe table. This lets the running PCIe daemon ignore the errors generated by the DPUs.

Motivation and Context

Some platforms have multiple PCIe devices per DPU and others have only one, so we need a platform-independent method for removal and attachment, along with the corresponding state DB entries.
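
As a rough illustration of the state DB bookkeeping described here, the sketch below records one entry per PCIe bus before a detach. The key layout (table name, "|", bus string) and the "bus_info" and "dpu_state" fields follow the snippets quoted later in this review; the table name value, the operation strings, the bus addresses, and the `FakeStateDb` class are assumptions made only so the example runs without a real STATE_DB.

```python
# Assumed values for illustration; the real constants are not shown in this thread.
PCIE_DETACH_INFO_TABLE = "PCIE_DETACH_INFO"
PCIE_OPERATION_DETACHING = "detaching"


class FakeStateDb:
    """Hypothetical in-memory stand-in for a STATE_DB hash store."""

    def __init__(self):
        self.tables = {}

    def hset(self, key, field, value):
        # Create the hash on first use, then set one field.
        self.tables.setdefault(key, {})[field] = value


def record_pci_operation(db, bus, operation):
    """Write the bus address and the ongoing operation under a per-bus key."""
    key = PCIE_DETACH_INFO_TABLE + "|" + bus
    db.hset(key, "bus_info", bus)
    db.hset(key, "dpu_state", operation)
    return key


db = FakeStateDb()
# A DPU with two PCIe devices gets two entries, one per bus (addresses are made up).
for bus in ["0000:08:00.0", "0000:09:00.0"]:
    record_pci_operation(db, bus, PCIE_OPERATION_DETACHING)
```

A daemon that watches PCIe errors could then check whether a failing bus has such an entry before reporting it, which is the behavior the description aims for.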

How Has This Been Tested?

Additional Information (Optional)

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@gpunathilell
Contributor Author

@vvolam @rameshraghupathy Please review

@gpunathilell
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@rameshraghupathy left a comment

@vvolam @gpunathilell Adding @bmridul to review

@vvolam requested a review from Copilot on April 24, 2025 20:28
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces new PCI attach/detach functions and related state database operations for smartswitch platforms, along with comprehensive tests for these new functionalities.

  • Added new methods to ModuleBase for handling PCI removal and reattach operations via platform.json.
  • Introduced a file-based locking mechanism for PCI operations and state database updates.
  • Extended test coverage for PCI-related functionalities.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File | Description
tests/module_base_test.py | Added tests for PCI bus retrieval, state DB entry, locking & PCI operations
sonic_platform_base/module_base.py | Added functions for PCI removal/reattach and file-based PCI locking

self.state_db_connector.hset(PCIE_DETACH_INFO_TABLE_KEY, "bus_info", pcie_string)
self.state_db_connector.hset(PCIE_DETACH_INFO_TABLE_KEY, "dpu_state", operation)
except Exception as e:
sys.stderr.write("Failed to write pcie bus infoto state database: {}\n".format(str(e)))

Copilot AI Apr 24, 2025

There is a spelling error in the error message ('infoto' should be 'info to'). Please correct the typo.

Suggested change
- sys.stderr.write("Failed to write pcie bus infoto state database: {}\n".format(str(e)))
+ sys.stderr.write("Failed to write pcie bus info to state database: {}\n".format(str(e)))

Contributor Author

Fixed

except Exception as e:
sys.stderr.write("Failed to write pcie bus infoto state database: {}\n".format(str(e)))

def pci_reattach_from_platform_json(self):
Contributor

@rameshraghupathy Apr 30, 2025

@vvolam @gpunathilell This class is module-specific, yet here we rescan not only all the modules but also other PCI devices on the box that are not even considered modules. Can you move this to your module.py?

Contributor Author

@rameshraghupathy Correct me if I am wrong, but there is no other way to perform the rescan unless the DPUs are under the same parent bus. The current method works even when the DPUs are not under the same bus, so it is the most generic implementation.
https://github.com/torvalds/linux/blob/7a13c14ee59d4f6c5f4277a86516cbc73a1383a8/Documentation/ABI/testing/sysfs-bus-pci#L74
If that doesn't work then we should be implementing a platform specific pcie_reattach() function.
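
For readers unfamiliar with the sysfs interface linked above, here is a minimal sketch of a per-device removal followed by a system-wide rescan. The `/sys/bus/pci/rescan` and `/sys/bus/pci/devices/<address>/remove` attributes come from the kernel's sysfs-bus-pci ABI; the function names, the `sysfs_root` parameter, and the bus address are hypothetical, and the dry run writes into a scratch directory so that no real device is touched.

```python
import os
import tempfile


def pci_remove_device(bus_address, sysfs_root="/sys"):
    """Ask the kernel to remove a single PCI device, e.g. '0000:08:00.0'."""
    path = os.path.join(sysfs_root, "bus", "pci", "devices", bus_address, "remove")
    with open(path, "w") as f:
        f.write("1")


def pci_rescan_all(sysfs_root="/sys"):
    """Ask the kernel to rescan every PCI bus, not only one module's bus."""
    path = os.path.join(sysfs_root, "bus", "pci", "rescan")
    with open(path, "w") as f:
        f.write("1")


# Dry run against a scratch directory that mimics the sysfs layout.
fake_sysfs = tempfile.mkdtemp()
os.makedirs(os.path.join(fake_sysfs, "bus", "pci", "devices", "0000:08:00.0"))
pci_remove_device("0000:08:00.0", sysfs_root=fake_sysfs)
pci_rescan_all(sysfs_root=fake_sysfs)
```

The asymmetry is the point under discussion: removal targets one device, while this rescan covers the whole PCI tree, including devices that are not modules.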

Contributor Author

As discussed in the HLD meeting, we will look into moving the rescan for the whole PCIe tree to chassis instead of module.

Contributor Author

@rameshraghupathy I had a discussion with Vasundhara. We can move the implementation for the rescan to chassis, but we would then have to call both the chassis rescan and the module rescan separately each time, since the PCI bus removal is module specific. So the plan is to remove the independent implementations currently present, pci_reattach_from_platform_json and pci_removal_from_platform_json. That means pci_detach() and pci_rescan() have to be implemented by the vendor.
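
The resulting division of labor can be sketched as follows: the base class only declares the hooks, and each vendor supplies the actual detach and reattach logic. The hook names `pci_detach` and `pci_reattach` are taken from the snippets quoted elsewhere in this review; `ModuleBaseSketch`, `VendorModule`, and the `attached` flag are hypothetical stand-ins, not the real classes.

```python
class ModuleBaseSketch:
    """Simplified stand-in for the base class; it only declares the hooks."""

    def pci_detach(self):
        # No generic fallback: the vendor must override this.
        raise NotImplementedError

    def pci_reattach(self):
        raise NotImplementedError


class VendorModule(ModuleBaseSketch):
    """Hypothetical vendor module that tracks attachment with a flag."""

    def __init__(self):
        self.attached = True

    def pci_detach(self):
        # A real implementation would remove the module's PCI devices here.
        self.attached = False
        return True

    def pci_reattach(self):
        # A real implementation would trigger the platform's rescan here.
        self.attached = True
        return True


module = VendorModule()
detach_ok = module.pci_detach()
reattach_ok = module.pci_reattach()
```

Calling either hook on the base class raises NotImplementedError, which makes a missing vendor implementation fail loudly instead of silently falling back to a platform.json-based path.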

Contributor Author

@vvolam @rameshraghupathy I have removed the relevant PCIe removal code. Please re-review.

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@vvolam left a comment

LGTM, other than these minor queries/comments.

"""
# Device type definition. Note, this is a constant.
DEVICE_TYPE = "module"
PCI_OPERATION_LOCK_FILE_PATH = "/var/lock/{}_pci.lock"
Contributor

Query: Is this file separate for each module?

Contributor Author

The lock should be valid across all modules, so we are using a generic lock file.

Contributor

As discussed offline, could you check if this lock is still required?

Contributor Author

Clarification: the file lock is required, and it applies per module, to prevent the module from being reattached while it is being removed.
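
One way such a per-module file lock could work is sketched below with `fcntl.flock`. The path template mirrors the `PCI_OPERATION_LOCK_FILE_PATH` constant quoted above, but the directory is swapped for a scratch one, and formatting the template with a module name ("DPU0", "DPU1") is an assumption. The helper names `pci_operation_lock` and `try_lock` are hypothetical, and the real `_pci_operation_lock` may be implemented differently.

```python
import contextlib
import fcntl
import os
import tempfile

# Scratch directory instead of /var/lock so the sketch runs without root.
LOCK_DIR = tempfile.mkdtemp()
PCI_OPERATION_LOCK_FILE_PATH = os.path.join(LOCK_DIR, "{}_pci.lock")


@contextlib.contextmanager
def pci_operation_lock(module_name):
    """Hold an exclusive advisory lock for one module's PCI operations."""
    path = PCI_OPERATION_LOCK_FILE_PATH.format(module_name)
    with open(path, "w") as lock_file:
        fcntl.flock(lock_file.fileno(), fcntl.LOCK_EX)
        try:
            yield path
        finally:
            fcntl.flock(lock_file.fileno(), fcntl.LOCK_UN)


def try_lock(module_name):
    """Return True if the module's lock can be taken right now without blocking."""
    path = PCI_OPERATION_LOCK_FILE_PATH.format(module_name)
    with open(path, "w") as lock_file:
        try:
            fcntl.flock(lock_file.fileno(), fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False
        fcntl.flock(lock_file.fileno(), fcntl.LOCK_UN)
        return True


with pci_operation_lock("DPU0"):
    # While DPU0 is being removed, a reattach of DPU0 cannot take the lock...
    same_module_available = try_lock("DPU0")
    # ...but another module uses a different lock file and is unaffected.
    other_module_available = try_lock("DPU1")
available_after_release = try_lock("DPU0")
```

Because the lock file name includes the module, removal and reattach of the same module are serialized while operations on different modules can still proceed independently.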

try:
    bus_info_list = self.get_pci_bus_info()
    with self._pci_operation_lock():
        for bus in bus_info_list:
Contributor

Should we do the attach first and then remove the entry, just to be safe?

Contributor Author

Order is changed

if self.pci_bus_info:
    return self.pci_bus_info
try:
    with open("/usr/share/sonic/platform/platform.json", 'r') as f:
Contributor

Shouldn't this be "/usr/share/sonic/device/{platform}/platform.json"?

Contributor Author

Function removed

"""
raise NotImplementedError

def get_pci_bus_from_platform_json(self):
Contributor

This function is not used anywhere in this file. It looks like a utility function. Shouldn't it be under utilities?

Contributor Author

Unused function, this is removed

with self._pci_operation_lock():
    for bus in bus_info_list:
        self.pci_entry_state_db(bus, PCIE_OPERATION_ATTACHING)
    return self.pci_reattach()
Contributor

Is the intent to attach one module at a time?

Contributor Author

On our platform a module has multiple buses, so the state DB entries are added one after another, but the reattach function is called only once.
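
The flow in the quoted snippet (one state DB entry per bus, then a single reattach call) can be illustrated with a small self-contained sketch. The method names follow the quoted code; the class itself, the list used in place of STATE_DB, the "attaching" string, and the bus addresses are assumptions for illustration, and the lock is left out to keep the example short.

```python
import sys


class MultiBusModuleSketch:
    """Hypothetical module with several PCI buses; not the real ModuleBase."""

    def __init__(self, buses):
        self.buses = list(buses)
        self.state_entries = []   # (bus, state) pairs standing in for STATE_DB
        self.reattach_calls = 0

    def get_pci_bus_info(self):
        return self.buses

    def pci_entry_state_db(self, bus, operation):
        self.state_entries.append((bus, operation))

    def pci_reattach(self):
        # Vendor-specific in practice; here it only counts invocations.
        self.reattach_calls += 1
        return True

    def handle_pci_rescan(self):
        try:
            # One state entry per bus, written in series...
            for bus in self.get_pci_bus_info():
                self.pci_entry_state_db(bus, "attaching")
            # ...followed by a single reattach for the whole module.
            return self.pci_reattach()
        except Exception as e:
            sys.stderr.write("PCI rescan failed: {}\n".format(str(e)))
            return False


module = MultiBusModuleSketch(["0000:08:00.0", "0000:09:00.0"])
result = module.handle_pci_rescan()
```

After the call, the module has two recorded entries and `reattach_calls` is 1, which matches the author's answer: entries are per bus, the reattach is per module.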

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

import swsscommon
PCIE_DETACH_INFO_TABLE_KEY = PCIE_DETACH_INFO_TABLE+"|"+pcie_string
if not self.state_db_connector:
    self.state_db_connector = swsscommon.swsscommon.DBConnector("STATE_DB", 0)
Contributor

Check if state_db connection is successful?

self.pci_entry_state_db(bus, PCIE_OPERATION_DETACHING)
return self.pci_detach()
except Exception:
return False
Contributor

Can you add an error message in case of failure?

Contributor Author

Done

self.pci_entry_state_db(bus, PCIE_OPERATION_ATTACHING)
return return_value
except Exception:
return False
Contributor

Can you add an error message in case of failure?

Contributor Author

Done

os.system("service sensord restart")

return True
except Exception as e:
Contributor

Can you add an error message in case of failure?

Contributor Author

Done

os.system("service sensord restart")

return True
except Exception as e:
Contributor

Can you add an error message in case of failure?

Contributor Author

Done

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

def handle_pci_removal(self):
    """
    Handles PCI device removal by updating state database and detaching device.
    If pci_detach is not implemented, falls back to platform.json based removal.
Contributor

This comment needs to be fixed as we are not having any fallback mechanism now.

Contributor Author

Done

def handle_pci_rescan(self):
    """
    Handles PCI device rescan by updating state database and reattaching device.
    If pci_reattach is not implemented, falls back to platform.json based rescan.
Contributor

Same comment, fallback comment needs to be fixed.

Contributor Author

Done

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

Cherry-pick PR to 202505: #576

9 participants