
Replace per port SFP presence check with global command in check_interface_status #21074

Closed
mihirpat1 wants to merge 2 commits into sonic-net:master from mihirpat1:global_sfp_presence_check

Conversation

@mihirpat1
Contributor

Description of PR

The test_restart_swss test case is intermittently failing with a "Not all interface information are detected within 300 seconds" error on some HWSKUs.

This PR optimizes the SFP (transceiver) presence checking logic in interface_utils.py by replacing individual per-port commands with a single global command that retrieves presence status for all interfaces at once.

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework (new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

Fix an intermittent failure in the test_restart_swss test case.

How did you do it?

The issue seems to stem from a significant delay between capturing the link status of a port and actually checking it. In the check_interface_status function (https://github.com/sonic-net/sonic-mgmt/blob/c7d26dbee7a15e97ccb209b8462d06f11c050d5c/tests/common/platform/interface_utils.py#L88C5-L88C27), the process is:

  1. Run show interface description to get the current status.
  2. Loop through each port to:
    • Confirm both admin and oper status are up.
    • Run show interface transceiver presence EthernetXX for each port; this takes ~1 s per port.

On HWSKUs with many ports (e.g., 512 logical ports), the oper status of the last few ports can appear down simply because too much time has passed since the initial status capture. Worse, by the time the loop reaches those ports, the 300 s timeout

pytest_assert(wait_until(interface_wait_time, 20, 0, check_interface_information, dut, ...)

may already be exceeded, preventing any retries.
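The interaction above can be sketched with a minimal wait_until-style loop. This is an illustration only, not the actual sonic-mgmt helper; the (timeout, interval, delay, condition) signature is assumed from the call site. The point it shows: if a single invocation of the condition function takes longer than the whole timeout, the loop exits after that one attempt and never gets to retry.

```python
import time

# Minimal sketch of a wait_until-style retry loop; the (timeout, interval,
# delay, condition) signature is assumed from the call site above and is
# not the actual sonic-mgmt implementation.
def wait_until(timeout, interval, delay, condition, *args):
    time.sleep(delay)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition(*args):
            return True
        time.sleep(interval)
    return False

calls = []

def slow_check():
    # Simulates a check that itself takes longer than the whole timeout,
    # e.g. looping a ~1 s per-port CLI command over hundreds of ports.
    calls.append(1)
    time.sleep(0.3)
    return False

# With a 0.2 s timeout, the single 0.3 s check consumes the entire budget:
# wait_until returns False after exactly one attempt, with no retry.
result = wait_until(0.2, 0.05, 0, slow_check)
print(result, len(calls))
```

Scaled up, a ~1 s per-port check over 512 ports costs roughly 512 s per attempt, which alone exceeds the 300 s budget.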

To fix this, a single global show interfaces transceiver presence command is executed instead of a per-port command for every interface.
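A minimal sketch of the replacement approach (illustrative only, not the exact PR diff; the parser assumes the tabular output format of show interfaces transceiver presence): run the global command once and parse its output into a dict, so per-port presence checks become dictionary lookups instead of separate ~1 s CLI calls.

```python
# Illustrative sketch, not the exact PR change: parse one global
# "show interfaces transceiver presence" output into a dict so each
# port's presence is a lookup instead of a separate CLI invocation.
def parse_transceiver_presence(cli_output):
    """Map interface name -> True if the transceiver is reported Present."""
    presence = {}
    for line in cli_output.splitlines():
        fields = line.split()
        # Expect rows like "Ethernet0  Present"; skip headers/separators.
        if fields and fields[0].startswith("Ethernet"):
            presence[fields[0]] = " ".join(fields[1:]) == "Present"
    return presence

sample = (
    "Port         Presence\n"
    "-----------  -----------\n"
    "Ethernet0    Present\n"
    "Ethernet4    Not present\n"
)
presence = parse_transceiver_presence(sample)
print(presence)
```

The per-port loop then only reads from this dict, so its runtime no longer grows with one CLI round-trip per interface.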

How did you verify/test it?

Ran the failing test case and verified that it passes with the fix.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

ADO - 34399249

@mihirpat1 mihirpat1 requested a review from Copilot October 24, 2025 01:45
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull Request Overview

This PR optimizes the SFP transceiver presence checking in test_restart_swss to fix intermittent test failures on HWSKUs with many ports. The previous implementation checked each port individually, causing cumulative delays that exceeded the 300-second timeout on systems with 512+ logical ports.

Key changes:

  • Replaced per-port show interface transceiver presence <port> commands with a single global command
  • Eliminated the ~1 second delay per port that was causing timeout issues on large port configurations


@mssonicbld
Collaborator

/azp run

@mihirpat1 mihirpat1 requested a review from Copilot October 24, 2025 02:05
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.



@mihirpat1 mihirpat1 force-pushed the global_sfp_presence_check branch from c501111 to 4793b75 Compare October 24, 2025 19:36
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mihirpat1 mihirpat1 marked this pull request as ready for review October 24, 2025 19:57
@mihirpat1 mihirpat1 force-pushed the global_sfp_presence_check branch from 4793b75 to 8b1ab35 Compare October 27, 2025 13:43
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

@az-pz az-pz left a comment

LGTM.

@mihirpat1
Contributor Author

/azp run

@azure-pipelines

Commenter does not have sufficient privileges for PR 21074 in repo sonic-net/sonic-mgmt

@mihirpat1
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mihirpat1 mihirpat1 force-pushed the global_sfp_presence_check branch from 8b1ab35 to 429a38f Compare October 31, 2025 23:27
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@prgeor
Contributor

prgeor commented Nov 5, 2025

@mihirpat1 can you run this test again? It seems this fixed the issue: https://github.com/sonic-net/sonic-mgmt/pull/21055/files

@mihirpat1
Contributor Author

A similar fix has been merged via #21055; hence, closing this PR.

@mihirpat1 mihirpat1 closed this Nov 6, 2025


5 participants