[Mellanox] Fix MST service hang when DPUs are powered off #26549
base: master
| Diff line change |
|---|
| @@ -1,6 +1,6 @@ |
|   # |
|   # SPDX-FileCopyrightText: NVIDIA CORPORATION & AFFILIATES |
| - # Copyright (c) 2019-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| + # Copyright (c) 2019-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
|   # Apache-2.0 |
|   # |
|   # Licensed under the Apache License, Version 2.0 (the "License"); |
| @@ -23,6 +23,7 @@ |
|   try: |
| +     import contextlib |
|       import os |
|       import io |
|       import re |
| @@ -763,26 +764,21 @@ def __init__(self, idx): |
|           self.image_ext_name = self.COMPONENT_FIRMWARE_EXTENSION |
|       def __get_mst_device(self): |
| -         if not os.path.exists(self.MST_DEVICE_PATH): |
| -             print("ERROR: mst driver is not loaded") |
| -             return None |
| -         pattern = os.path.join(self.MST_DEVICE_PATH, self.MST_DEVICE_PATTERN) |
| -         mst_dev_list = glob.glob(pattern) |
| -         if not mst_dev_list or len(mst_dev_list) != 1: |
| -             devices = str(os.listdir(self.MST_DEVICE_PATH)) |
| -             print("ERROR: Failed to get mst device: pattern={}, devices={}".format(pattern, devices)) |
| -             return None |
| -         return mst_dev_list[0] |
| +         output = None |
| +         try: |
| +             output = subprocess.check_output(['/usr/bin/asic_detect/asic_detect.sh', '-p']).decode('utf-8').strip() |
| +         except subprocess.CalledProcessError as e: |
| +             raise RuntimeError("Failed to get {} mst device: {}".format(self.name, str(e))) |

Comment on lines +767 to +771

| Suggested change |
|---|
| -         output = None |
| -         try: |
| -             output = subprocess.check_output(['/usr/bin/asic_detect/asic_detect.sh', '-p']).decode('utf-8').strip() |
| -         except subprocess.CalledProcessError as e: |
| -             raise RuntimeError("Failed to get {} mst device: {}".format(self.name, str(e))) |
| +         try: |
| +             output = subprocess.check_output( |
| +                 ['/usr/bin/asic_detect/asic_detect.sh', '-p'] |
| +             ).decode('utf-8').strip() |
| +         except subprocess.CalledProcessError as e: |
| +             raise RuntimeError("Failed to get {} mst device: {}".format(self.name, str(e))) |
| +         except OSError as e: |
| +             raise RuntimeError("Failed to execute mst device detection for {}: {}".format(self.name, str(e))) |
| +         if not output: |
| +             raise RuntimeError("Failed to get {} mst device: empty output from asic_detect.sh".format(self.name)) |
Copilot AI, Apr 3, 2026
ComponenetFPGADPU._mst_context() starts MST without --with_i2cdev, while the firmware manager path starts MST with --with_i2cdev (see mellanox_fw_manager/firmware_coordinator.py). If i2cdev support is required for these updates, this inconsistency can cause failures. Consider using the same mst start --with_i2cdev invocation here (or document why the plain mst start is sufficient for DPU FPGA upgrades).

| Suggested change |
|---|
| - subprocess.check_call(['/usr/bin/mst', 'start'], universal_newlines=True) |
| + subprocess.check_call(['/usr/bin/mst', 'start', '--with_i2cdev'], universal_newlines=True) |
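
For context, a start/stop wrapper of the kind this comment refers to could look roughly like the sketch below. It is not the PR's `_mst_context()` implementation: the name `mst_session`, the injectable `run` parameter, and the `mst stop` command are assumptions made so the example is self-contained and can run without MST installed.

```python
import contextlib
import subprocess


@contextlib.contextmanager
def mst_session(start_cmd, stop_cmd, run=subprocess.check_call):
    """Start MST, yield to the caller, and always attempt to stop it.

    `run` is a hypothetical injection point; by default the commands are
    executed with subprocess.check_call.
    """
    run(start_cmd)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions raised inside the with-block.
        run(stop_cmd)


# Dry run with a recording function instead of real commands.
calls = []
with mst_session(['/usr/bin/mst', 'start', '--with_i2cdev'],
                 ['/usr/bin/mst', 'stop'],  # assumed stop command
                 run=calls.append):
    calls.append('upgrade')
```

Passing the start command in as an argument keeps the `--with_i2cdev` decision in one place, so both upgrade paths can share the same invocation.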
TimeoutSec=300 may be too low for firmware upgrades on multi-ASIC systems. The firmware coordinator uses a 600s-per-ASIC join timeout, so systemd can terminate mlnx-fw-manager before it finishes, potentially leaving the system mid-upgrade (and MST cleanup in the Python finally block won't run if the process is SIGKILLed). Consider increasing TimeoutSec or setting it based on expected worst-case upgrade time.
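
The arithmetic behind this concern can be sketched as follows. The example assumes the per-ASIC joins happen one after another, so their timeouts add up in the worst case; the function name and the 60-second margin are hypothetical and not taken from the PR.

```python
def worst_case_timeout_sec(num_asics, per_asic_join_timeout=600, margin=60):
    """Upper bound on upgrade time if every per-ASIC join hits its timeout.

    Assumes joins are performed sequentially; `margin` is a hypothetical
    allowance for startup and cleanup.
    """
    if num_asics < 1:
        raise ValueError("num_asics must be at least 1")
    return num_asics * per_asic_join_timeout + margin


single_asic = worst_case_timeout_sec(1)
four_asics = worst_case_timeout_sec(4)
```

Under these assumptions even a single ASIC gives 660 seconds, which already exceeds TimeoutSec=300, and four ASICs give 2460 seconds.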