[Mellanox] Fix MST service hang when DPUs are powered off#26549
[Mellanox] Fix MST service hang when DPUs are powered off#26549LuminaScript wants to merge 1 commit intosonic-net:masterfrom
Conversation
Bug: Runtime DPU power-off via dpuctl left stale MST PCI device entries that caused mst status and mlxfwmanager to hang for ~50 seconds. Fix: Move MST lifecycle into firmware upgrade and DPU FPGA upgrade paths. Previously MST was globally started via a systemd service. Signed-off-by: Yizhen Zhang <evazha@nvidia.com>
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Pull request overview
This PR addresses Mellanox/NVIDIA SmartSwitch hangs caused by stale MST PCI entries when DPUs are powered off, by moving MST start/stop lifecycle management into the specific firmware/FPGA upgrade code paths instead of relying on a globally-started systemd service.
Changes:
- Add MST start/stop handling to the Mellanox firmware upgrade coordinator (with an option to continue if MST start fails).
- Update BlueField installer firmware upgrade invocation to use the new “ignore MST start failure” path.
- Update Mellanox platform API DPU FPGA upgrade flow and adjust unit tests accordingly.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| platform/nvidia-bluefield/installer/install.sh.j2 | Switch installer-time firmware manager invocation to use the new MST-handling behavior. |
| platform/mellanox/mlnx-platform-api/tests/test_component.py | Update component tests to reflect new MST device discovery/error behavior. |
| platform/mellanox/mlnx-platform-api/sonic_platform/component.py | Change MST device discovery and add an MST lifecycle context for DPU FPGA updates. |
| platform/mellanox/fw-manager/tests/test_main.py | Update CLI test expectations for the new upgrade handler signature/flag. |
| platform/mellanox/fw-manager/tests/test_firmware_upgrade.py | Patch MST start/stop in coordinator tests to avoid invoking real MST in unit tests. |
| platform/mellanox/fw-manager/tests/test_firmware_coordinator.py | Patch MST start/stop in coordinator tests for queue-processing upgrade scenarios. |
| platform/mellanox/fw-manager/mellanox_fw_manager/main.py | Add --ignore-mst-start-failure CLI flag and plumb it into the coordinator. |
| platform/mellanox/fw-manager/mellanox_fw_manager/firmware_coordinator.py | Implement MST start/stop around the upgrade workflow, with optional “ignore start failure” behavior. |
| platform/mellanox/files/mlnx-fw-manager.service | Remove MST start/stop from systemd unit and add a boot-type ExecCondition. |
Comments suppressed due to low confidence (1)
platform/mellanox/fw-manager/mellanox_fw_manager/firmware_coordinator.py:52
FirmwareCoordinator.__init__()introduces theignore_mst_start_failureparameter, but the docstring’s Args list does not mention it. Please update the docstring to document the new parameter and its behavior so callers understand the new API surface.
def __init__(self, verbose: bool = False, from_image: bool = False, clear_semaphore: bool = False,
ignore_mst_start_failure: bool = False):
"""
Initialize the firmware coordinator.
Args:
verbose: Enable verbose logging
from_image: If True, upgrade from next SONiC image (matches shell script --upgrade)
clear_semaphore: Clear hardware semaphore before upgrade
"""
| output = None | ||
| try: | ||
| output = subprocess.check_output(['/usr/bin/asic_detect/asic_detect.sh', '-p']).decode('utf-8').strip() | ||
| except subprocess.CalledProcessError as e: | ||
| raise RuntimeError("Failed to get {} mst device: {}".format(self.name, str(e))) |
There was a problem hiding this comment.
__get_mst_device() can return an empty string (e.g., when asic_detect.sh -p outputs nothing). That empty value is then used as the --dev argument for cpldupdate, which will fail with a less clear error and makes it hard to distinguish “no MST device” from other failures. Consider treating empty output as an error (raise RuntimeError with a clear message) and also handle OSError/FileNotFoundError from check_output so _install_firmware() can fail gracefully instead of potentially crashing the caller.
| output = None | |
| try: | |
| output = subprocess.check_output(['/usr/bin/asic_detect/asic_detect.sh', '-p']).decode('utf-8').strip() | |
| except subprocess.CalledProcessError as e: | |
| raise RuntimeError("Failed to get {} mst device: {}".format(self.name, str(e))) | |
| try: | |
| output = subprocess.check_output( | |
| ['/usr/bin/asic_detect/asic_detect.sh', '-p'] | |
| ).decode('utf-8').strip() | |
| except subprocess.CalledProcessError as e: | |
| raise RuntimeError("Failed to get {} mst device: {}".format(self.name, str(e))) | |
| except OSError as e: | |
| raise RuntimeError("Failed to execute mst device detection for {}: {}".format(self.name, str(e))) | |
| if not output: | |
| raise RuntimeError("Failed to get {} mst device: empty output from asic_detect.sh".format(self.name)) |
| @contextlib.contextmanager | ||
| def _mst_context(self): | ||
| try: | ||
| subprocess.check_call(['/usr/bin/mst', 'start'], universal_newlines=True) |
There was a problem hiding this comment.
ComponenetFPGADPU._mst_context() starts MST without --with_i2cdev, while the firmware manager path starts MST with --with_i2cdev (see mellanox_fw_manager/firmware_coordinator.py). If i2cdev support is required for these updates, this inconsistency can cause failures. Consider using the same mst start --with_i2cdev invocation here (or document why the plain mst start is sufficient for DPU FPGA upgrades).
| subprocess.check_call(['/usr/bin/mst', 'start'], universal_newlines=True) | |
| subprocess.check_call(['/usr/bin/mst', 'start', '--with_i2cdev'], universal_newlines=True) |
| ExecCondition=/bin/grep -qv '\<SONIC_BOOT_TYPE=fastfast\>' /proc/cmdline | ||
| ExecStart=/usr/local/bin/mlnx-fw-manager --clear-semaphore --verbose | ||
| ExecStop=/usr/bin/mst stop | ||
| TimeoutSec=300 |
There was a problem hiding this comment.
TimeoutSec=300 may be too low for firmware upgrades on multi-ASIC systems. The firmware coordinator uses a 600s-per-ASIC join timeout, so systemd can terminate mlnx-fw-manager before it finishes, potentially leaving the system mid-upgrade (and MST cleanup in the Python finally block won’t run if the process is SIGKILLed). Consider increasing TimeoutSec or setting it based on expected worst-case upgrade time.
| TimeoutSec=300 | |
| TimeoutSec=infinity |
Bug: Runtime DPU power-off via dpuctl left stale MST PCI device entries that caused mst status and mlxfwmanager to hang for ~50 seconds. This issue was fixed on 202511, now apply the same fix to 202606.
Fix: Move MST lifecycle into firmware upgrade and DPU FPGA upgrade paths only. Previously MST was globally started via a systemd service.
Validation
Why I did it
Work item tracking
How I did it
How to verify it
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)