PCIe Monitor service #634
Merged
Commits (9):
b38bb4c PCIe Monitor service (sujinmkang)
9ad1955 Update pcie monitoring service hld (sujinmkang)
d92c009 Move the update_state db to pcieutil so that it can be updated (sujinmkang)
090d854 review comments (sujinmkang)
66cc9a5 Update with rename and state db (sujinmkang)
70a152f review comments (sujinmkang)
e5defda update the image link (sujinmkang)
d35db84 fix the retry number of pcie rescan. (sujinmkang)
4e93b8b review comment (sujinmkang)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,111 @@ | ||
| # SONiC PCIe Monitoring services HLD # | ||
|
|
||
| ### Rev 0.2 ### | ||
|
|
||
| ### Revision | ||
| | Rev | Date | Author | Change Description | | ||
| |:---:|:-----------:|:------------------:|------------------------------------------------| | ||
| | 0.1 | | Sujin Kang | Initial version | | ||
| | 0.2 | | Sujin Kang | Add rescan for pcie device missing during boot | | ||
| | | | | Add pcied to PMON for runtime monitoring | | ||
|
|
||
| ## About This Manual ## | ||
|
|
||
| This document describes how to monitor platform PCIe devices on SONiC and raise alerts for problems on PCIe buses and devices, using the `pcie-mon` systemd service and the `pcied` daemon in the PMON container. | ||
|
|
||
|
|
||
| ## 1. PCIe Monitor service design ## | ||
|
|
||
| The new PCIe Monitor service uses the PcieUtil utility to check the current status of PCIe devices and buses, and raises an alert if any device is missing or any error occurs while communicating on the PCIe buses. | ||
|
|
||
| PCIe device monitoring is split across two services: `pcie-mon.service`, a systemd service that monitors PCIe devices at boot time, and `pcied`, a daemon in the PMON container that monitors them at runtime. | ||
|
|
||
| First, `pcie-mon.service` checks the PCIe device enumeration status, triggers a PCI device rescan if any device is missing, and reports missing devices to the parties that depend on device enumeration, for example the kernel_bde driver and platform drivers. | ||
|
|
||
| Second, `pcied` in PMON performs the periodic PCIe device check at runtime. | ||
|
|
||
| Both pcie-mon.service and pcied will update the state db with the PCIe device status whenever it changes. | ||
|
|
||
| ### 1.1 Access the PCIe devices and buses from platform ### | ||
|
|
||
| PCIe device information can be read from files under sysfs (e.g. `/sys/bus/pci/devices/0000:01:00.1`). Different vendors may expose it under different folders; these folders need to be mounted into the platform container so that pcied can access them. | ||
|
|
||
| To simplify the implementation and reduce execution time, pcie-mon.service uses `pcieutil`, the PCIe diagnostic tool. `pcieutil` is built on the `PcieUtil` class in `sonic_platform_base.sonic_pcie`. | ||
|
|
||
| 1. `pcieutil` gets the platform-specific PCIe device information, checks the PCIe device and bus status with `PcieUtil.get_pcie_check`, and updates the STATE_DB based on the results. | ||
|
|
||
| 2. `PcieUtil` provides three APIs: `load_config_file` to get the expected PCIe device list and information, `get_pcie_device` to get the current PCIe device information, and `get_pcie_check` to check whether any PCIe device is missing or any PCIe bus error has occurred. | ||
|
|
||
|  | ||
|
|
||
| ### 1.2 PCIe device configuration file ### | ||
|
|
||
| PcieUtil needs to get the expected PCIe device information to check the PCIe device status periodically, which is different for each platform/hardware sku. | ||
|
|
||
| Each vendor needs to generate the PCIe device configuration file, named `pcie.yaml`, and place it under device/<platform>/<hardware_skus>/plugins. | ||
|
|
||
| Example) Location: `device/celestica/x86_64-cel_seastone-r0/plugins/pcie.yaml` | ||
|
|
||
| ``` | ||
| ... | ||
| - bus: '01' | ||
|   dev: '00' | ||
|   fn: '0' | ||
|   id: b960 | ||
|   name: 'Ethernet controller: Broadcom Limited Broadcom BCM56960 Switch ASIC' | ||
| - bus: '01' | ||
|   dev: '00' | ||
|   fn: '1' | ||
|   id: b960 | ||
|   name: 'Ethernet controller: Broadcom Limited Broadcom BCM56960 Switch ASIC' | ||
| ``` | ||
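As a rough illustration of how such entries relate to the sysfs paths mentioned in section 1.1, the sketch below turns one parsed entry into a `domain:bus:dev.fn` address. The helper name `pci_address` is hypothetical, and the `0000` domain is an assumption taken from the sysfs example path, since the configuration file itself carries no domain field.

```python
def pci_address(entry, domain="0000"):
    """Build a sysfs-style PCI address from one pcie.yaml entry.

    `domain` defaults to "0000" (an assumption; the config has no domain field).
    """
    return "{}:{}:{}.{}".format(domain, entry["bus"], entry["dev"], entry["fn"])


# The two entries from the example above, as they would look once parsed.
expected_devices = [
    {"bus": "01", "dev": "00", "fn": "0", "id": "b960",
     "name": "Ethernet controller: Broadcom Limited Broadcom BCM56960 Switch ASIC"},
    {"bus": "01", "dev": "00", "fn": "1", "id": "b960",
     "name": "Ethernet controller: Broadcom Limited Broadcom BCM56960 Switch ASIC"},
]

addresses = [pci_address(entry) for entry in expected_devices]
```

Each address can then be joined with `/sys/bus/pci/devices/` to locate the device directory.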
|
|
||
| ### 1.3 PCIe device status check ### | ||
|
|
||
|
|
||
| The default PCIe device check function, `get_pcie_check`, is implemented in the PcieUtil class at sonic_platform_base/sonic_pcie/pcie_common.py. | ||
| It loads the PCIe device configuration file and compares its entries with the devices enumerated in the platform sysfs device tree under /sys/bus/pci/devices/. | ||
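The comparison described above might look roughly like the following. This is a minimal sketch and not the code in `pcie_common.py`: the function name `check_devices`, the `sysfs_root` parameter, the `0000` domain and the `Passed`/`Failed` result strings are all assumptions made for illustration.

```python
import os


def check_devices(expected, sysfs_root="/sys/bus/pci/devices", domain="0000"):
    """Compare expected config entries with the devices enumerated in sysfs.

    Returns one dict per expected entry, with a hypothetical 'result' field
    set to 'Passed' if the device directory exists and 'Failed' otherwise.
    """
    results = []
    for entry in expected:
        address = "{}:{}:{}.{}".format(domain, entry["bus"], entry["dev"], entry["fn"])
        present = os.path.isdir(os.path.join(sysfs_root, address))
        item = dict(entry)  # copy so the caller's config is not modified
        item["address"] = address
        item["result"] = "Passed" if present else "Failed"
        results.append(item)
    return results
```

Taking `sysfs_root` as a parameter keeps the check testable against a temporary directory and leaves room for vendors that expose device information elsewhere.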
|
|
||
| Here we define a common platform API in class `PcieBase`: | ||
|
|
||
|     @abc.abstractmethod | ||
|     def get_pcie_check(self, timeout=0): | ||
|         """ | ||
|         Check Pcie device with config file | ||
|         Returns: | ||
|             A list including pcie device and test result info | ||
|         """ | ||
|         return [] | ||
|
|
||
| Each vendor needs to implement this function in its `PcieBase` plugin if the vendor has any additional PCIe health check method. | ||
|
|
||
| PcieUtil calls this API to check the PCIe device status. The following example code shows how this API will be called: | ||
|
|
||
|     while True: | ||
|         status, device_dict = platform_pcieutil.get_pcie_check() | ||
|         if status: | ||
|             # items() works on both Python 2 and Python 3 | ||
|             for key, value in device_dict.items(): | ||
|                 print("Device on PCIe bus: %s was %s" % (key, value)) | ||
|
|
||
| ### 1.4 PCIe Monitor Service `pcie-mon.service` flow ### | ||
|
|
||
| pcie-mon.service is started by systemd during boot. After rc.local.service has completed, it spawns a thread that checks the PCIe device status and rescans the PCI devices if any device is missing. It updates the state db with the PCIe device status during the `pcieutil pcie-check` call, so that dependent services, containers or kernel drivers can be started or stopped based on the status. | ||
|
|
||
| The detailed flow is shown in the chart below: | ||
|  | ||
|
|
||
|
|
||
| ### 1.5 PCIe daemon `pcied` flow ### | ||
|
|
||
| pcied is started by the PMON container and keeps monitoring the PCIe device status at runtime. It checks the PCIe device status every minute and updates the state db each time the status is checked. | ||
|
|
||
| The detailed flow is shown in the chart below: | ||
|  | ||
|
|
||
|
|
||
| < TBA > | ||
| ## Open Questions ## | ||
|
|
||
| 1. The current PcieUtil is limited to checking PCIe device availability based on the configuration. | ||
| Can we also add a PCIe communication error status check using AER detection, either in the get_pcie_check() API or as a separate API? | ||
| For reference, see existing plugins such as the one for collectd (https://wiki.opnfv.org/display/fastpath/PCIe+Advanced+Error+Reporting+Plugin). | ||