
Reduce continuous link flap test runtime by sampling 32 interfaces per iteration with completeness level #22173

Merged
yxieca merged 3 commits into sonic-net:master from PriyanshTratiya:fix/high-runtime-cont-link-flap
Feb 19, 2026

Conversation

@PriyanshTratiya
Contributor

@PriyanshTratiya PriyanshTratiya commented Jan 29, 2026

Description of PR

Summary:
Reduce test_cont_link_flap runtime by flapping a randomly sampled subset (up to 32) of DUT ports and corresponding peer (fanout) ports per iteration, instead of iterating over all connected ports.

Type of change

  • Bug fix
  • Testbed and Framework (new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • [x] Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

The continuous link flap test can take a long time on devices/testbeds with many connected ports because it flaps every eligible interface on both DUT and peer across 3 iterations. During msft runs the test failed after 3 hours, and during nvidia runs it failed after 9 hours. The premise is that if the test passes for 32 randomly sampled interfaces across 3 iterations, it will pass overall. This PR aims to keep coverage representative while significantly lowering overall test execution time.

How did you do it?

  • Added a helper get_random_candidates(...) that:
    • builds the full candidate list (admin up + present in connection graph),
    • randomly samples up to 32 candidates,
    • logs the selected ports/candidates for traceability.
  • Updated the DUT flap loop to call port_toggle(..., ports=selected_ports, wait_after_ports_up=30, ...) so only the sampled ports are flapped each iteration.
  • Updated the peer flap loop to only toggle links for the sampled (dut_port, fanout, fanout_port) tuples.
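For illustration, the sampling logic in that helper can be sketched as follows. The function name matches the PR's helper, but the signature here is simplified: it takes a prebuilt candidate list, whereas the real helper derives the candidates from duthost/fanouthosts facts.

```python
import logging
import random

def get_random_candidates(candidate_ports, num_ports=32):
    """Simplified sketch: sample up to num_ports from the eligible
    candidates (admin up + present in the connection graph).  The real
    helper builds candidate_ports itself from duthost/fanouthosts."""
    candidates = list(candidate_ports)
    if len(candidates) <= num_ports:
        selected = candidates
    else:
        selected = random.sample(candidates, num_ports)
    # Log the selection for traceability, as the PR description notes.
    logging.info("Selected %d of %d candidate ports: %s",
                 len(selected), len(candidates), sorted(selected))
    return selected
```

When the testbed has 32 or fewer eligible ports, the helper degenerates to flapping all of them, so small topologies keep the original behavior.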

How did you verify/test it?

  • Ran the updated tests/platform_tests/link_flap/test_cont_link_flap.py::test_cont_link_flap and confirmed:
    • the test executes successfully,
    • only the sampled ports are toggled (validated via logs),
    • runtime is reduced compared to flapping all ports.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@PriyanshTratiya PriyanshTratiya added the Request for 202511 branch Request to backport a change to 202511 branch label Jan 30, 2026
@PriyanshTratiya PriyanshTratiya changed the title randomly sampling 32 int and reuse of code to drive down test runtime drive down continuous link flap test run time by randomly sampling 32 int in each iteration for DUT and peer Feb 3, 2026
"""

@staticmethod
def get_random_candidates(duthost, fanouthosts, num_ports=32):
Contributor

This PR is mainly intended to reduce the overall execution time of the test cases, but in some scenarios we still need to cover all ports, so we should not change the original testing workflow.

Contributor Author

Hi Liping, that's true. To address this, we were thinking of bucketing the ports into 32 buckets and selecting one port from each bucket based on something like day of week, so that nightly runs rotate through all ports over time and surface any port hardware issues. To cover all ports in a single run, we were also considering a new test case that performs a batch port shutdown/startup in a single (parallel) command, since flapping 512 ports sequentially takes too long. We put out a similar PR for the test_lldp_syncd test module (#22145) and want to replicate the same approach here. We are iterating on it.

logging.info("%d Iteration flap all interfaces one by one on DUT", iteration + 1)
port_toggle(duthost, tbinfo, watch=True)
logging.info("%d Iteration flap randomly sampled interfaces one by one on DUT", iteration + 1)
_, selected_ports = self.get_random_candidates(duthost, fanouthosts, num_ports=32)
Contributor

Could we use an input parameter, for example based on the value of --completeness_level, to determine the number of target test ports, such as testing all ports, half of the ports, a quarter of the ports, etc.?

Collaborator

@r12f r12f Feb 5, 2026

Hi Liping, we could not use the completeness level, because we need all ports to be alive in the test run. Trimming the topology to a smaller set is not a good option for us.

Contributor

Hi @r12f I’m not sure I fully understand what you mean by “we could not use the completeness level”.
The --completeness_level option should already be available, and we can use it to make num_ports more flexible without trimming the topology or changing the original workflow.

Contributor Author

@PriyanshTratiya PriyanshTratiya Feb 6, 2026

completeness_level may be a good idea if more specific control is required, but for this PR we were thinking of a clear split of behaviors into two tests:

  1. Keep test_cont_link_flap fast by flapping up to 32 ports when total ports > 32 (and keep existing behavior when ports <= 32). The 32 are selected deterministically with rotation (e.g., bucketized + week-of-year) so all ports get covered over time across runs.
  2. Add a separate “all ports” test that flaps all ports using a bulk/parallel shutdown/no-shutdown approach, so we still cover every port without the long sequential runtime.

So we can implement point 1 with completeness_level (e.g., thorough=all, basic=32), but we were proposing two explicit test cases to keep the intent and runtime characteristics obvious and avoid changing the original “all ports” sequential workflow unless the user explicitly runs the all-ports test.
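The rotation idea in point 1 might look roughly like the following. The bucket slicing and week-of-year rotation here are an illustrative reading of the proposal, not the merged implementation:

```python
import datetime

def rotating_sample(ports, num_buckets=32, week=None):
    """Split the sorted port list into num_buckets buckets and pick one
    port per bucket, rotating the pick by ISO week-of-year so every
    port is eventually covered across nightly runs."""
    ports = sorted(ports)
    if len(ports) <= num_buckets:
        # Small topologies keep the existing all-ports behavior.
        return ports
    if week is None:
        week = datetime.date.today().isocalendar()[1]
    buckets = [ports[i::num_buckets] for i in range(num_buckets)]
    return [bucket[week % len(bucket)] for bucket in buckets]
```

With this scheme, any run of ceil(len(ports) / num_buckets) consecutive weeks covers every port at least once, which is the deterministic coverage guarantee the comment describes.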

Collaborator

Hi Liping, our tests require all ports to be available to expose problems; trimming the topology disables ports and hides them. We have a reduced topo, which does the same thing as trimming, but it has hidden multiple issues from us, so we need the full topo here for testing.

Contributor

Hi Riff, I may be misunderstanding your point, but completeness_level is only used to determine the number of target ports. The rest of the logic can keep the current change; it should not trim the topology or alter the test sequence.

Collaborator

Hi Liping, I checked the completeness level again; it is a different kind of setting, and changing it would affect more test cases. Also, the problem here is not reducing ports for "debug": it simply does not make sense to run so many ports even in normal cases.

Contributor

Hi Riff, thanks for your information.
completeness_level is a global setting for nightly tests and should be one of ["debug", "basic", "confident", "thorough"].
The debug level is intended for engineering testing, while thorough is for full test coverage.
I agree that running too many ports in nightly tests doesn’t make sense. We should reduce the number of target ports in nightly runs to shorten the test time.
Therefore, I suggest using completeness_level to dynamically determine the number of target test ports.
We can follow the solution implemented in the following PR
#5846
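As a rough illustration of that suggestion, a level-to-port-count mapping could look like the following. The per-level counts are assumptions chosen for illustration, not what #5846 implements:

```python
def num_target_ports(completeness_level, total_ports):
    """Map the global completeness_level to a target port count
    (illustrative values only)."""
    mapping = {
        "debug": min(4, total_ports),        # quick engineering check
        "basic": min(32, total_ports),       # fast nightly sampling
        "confident": max(total_ports // 2, min(32, total_ports)),
        "thorough": total_ports,             # full all-ports coverage
    }
    # Fall back to the basic count for unknown levels.
    return mapping.get(completeness_level, min(32, total_ports))
```

The key property is that "thorough" preserves the original all-ports workflow while the lower levels bound runtime.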

Contributor Author

Hi, we just merged a common utility, and I'm testing an update on top of it in my dev branch that uses completeness_level=thorough to preserve the original behavior. For nightly tests and any other automated run not explicitly set to thorough, we can default to confident, which uses the day-of-week sampling (the same approach as #22145). That gives us both: the original test behavior is preserved under thorough, and nightly runs are faster under the default confident level. As soon as it's tested, I'll push a commit for review.

Collaborator

@r12f r12f Feb 19, 2026

Hi @lipxu, again: we cannot run all the tests using a lower completeness level. We have been hitting issues that require a large number of ports, so the global setting is not going to work for us. It might be more useful for CI, but not for our nightly runs yet.

@PriyanshTratiya PriyanshTratiya force-pushed the fix/high-runtime-cont-link-flap branch from 4df975b to f77002f on February 13, 2026 01:01
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@PriyanshTratiya PriyanshTratiya changed the title drive down continuous link flap test run time by randomly sampling 32 int in each iteration for DUT and peer Reduce continuous link flap test runtime by sampling 32 interfaces per iteration with completeness level Feb 13, 2026
Collaborator

@yxieca yxieca left a comment

LGTM: clean integration with the completeness_level framework. DoW-based bucket sampling gives good coverage rotation, and thorough preserves the original all-ports behavior.

@yxieca yxieca merged commit a40da69 into sonic-net:master Feb 19, 2026
20 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Feb 19, 2026
…r iteration with completeness level (sonic-net#22173)

Why: The continuous link flap test can take a long time on devices/testbeds
with many connected ports because it flaps every eligible interface on both
DUT and peer across 3 iterations. During msft runs it failed after 3 hours
and during nvidia runs it failed after 9 hours.

How:
- Added a helper get_random_candidates(...) that builds the full candidate
  list (admin up + present in connection graph), randomly samples up to 32
  candidates, and logs the selected ports for traceability.
- Updated the DUT flap loop to call port_toggle(..., ports=selected_ports,
  wait_after_ports_up=30, ...) so only the sampled ports are flapped each
  iteration.
- Updated the peer flap loop to only toggle links for the sampled
  (dut_port, fanout, fanout_port) tuples.

Tested: Ran test_cont_link_flap and confirmed the test executes
successfully, only the sampled ports are toggled (validated via logs),
and runtime is reduced compared to flapping all ports.

Signed-off-by: Priyansh Tratiya <[email protected]>
Signed-off-by: mssonicbld <[email protected]>
@mssonicbld
Collaborator

Cherry-pick PR to 202511: #22480

anilal-amd pushed a commit to anilal-amd/anilal-forked-sonic-mgmt that referenced this pull request Feb 19, 2026
mssonicbld pushed a commit that referenced this pull request Feb 20, 2026
aronovic pushed a commit to aronovic/sonic-mgmt that referenced this pull request Mar 3, 2026
ravaliyel pushed a commit to ravaliyel/sonic-mgmt that referenced this pull request Mar 12, 2026
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Mar 17, 2026
vrajeshe pushed a commit to vrajeshe/sonic-mgmt that referenced this pull request Mar 23, 2026


6 participants