Reduce continuous link flap test runtime by sampling 32 interfaces per iteration with completeness level #22173
Conversation
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
"""

@staticmethod
def get_random_candidates(duthost, fanouthosts, num_ports=32):
This PR is mainly intended to reduce the overall execution time of the test cases, but in some scenarios we still need to cover all ports, so we should not change the original testing workflow.
Hi Liping, that's true. To address this we were thinking of splitting the ports into 32 buckets and selecting one port from each bucket based on something like day of week, so nightly runs would rotate through and eventually cover all ports, exposing any per-port hardware issues. To cover all ports in a single run, we were also thinking of adding a new test case that shuts down and starts up ports in a batch with a single command (in parallel), since flapping 512 ports sequentially takes too long. We put out a similar PR for the test_lldp_syncd test module (#22145) and want to replicate the same approach here. We are iterating on it.
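The bucketed day-of-week selection described in this comment could be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code; the name `pick_rotating_sample` and the handling of uneven bucket sizes are assumptions.

```python
# Hypothetical sketch of the idea above: partition the sorted port list
# into roughly num_buckets buckets and pick one port per bucket,
# rotating the pick by day of week so successive nightly runs select
# different ports from each bucket.
import datetime


def pick_rotating_sample(ports, num_buckets=32, day=None):
    """Deterministically sample about num_buckets ports, rotating daily."""
    if day is None:
        day = datetime.date.today().weekday()  # 0 (Mon) .. 6 (Sun)
    ports = sorted(ports)
    if len(ports) <= num_buckets:
        return ports  # few ports: keep the existing all-ports behavior
    bucket_size = -(-len(ports) // num_buckets)  # ceiling division
    buckets = [ports[i:i + bucket_size]
               for i in range(0, len(ports), bucket_size)]
    # Rotate the index within each bucket by day of week, so over a
    # week every port in a bucket of size <= 7 gets selected once.
    return [bucket[day % len(bucket)] for bucket in buckets]
```

Note that with 512 ports and 32 buckets each bucket holds 16 ports, so day-of-week rotation alone covers only 7 of the 16; a week-of-year component (mentioned later in this thread) would be needed for full coverage over time.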
logging.info("%d Iteration flap all interfaces one by one on DUT", iteration + 1)
port_toggle(duthost, tbinfo, watch=True)
logging.info("%d Iteration flap randomly sampled interfaces one by one on DUT", iteration + 1)
_, selected_ports = self.get_random_candidates(duthost, fanouthosts, num_ports=32)
Could we use an input parameter (for example, based on the value of --completeness_level) to determine the number of target test ports, such as testing all ports, half of the ports, a quarter of the ports, etc.?
Hi Liping, we could not use the completeness level, because we need all ports to be alive in the test run. Trimming the topology to a smaller set is not a good option for us.
Hi @r12f I’m not sure I fully understand what you mean by “we could not use the completeness level”.
The --completeness_level option should already be available, and we can use it to make num_ports more flexible without trimming the topology or changing the original workflow.
completeness_level may be a good idea if more specific control is required, but for this PR we were thinking of a clear split of behaviors into two tests:
- Keep test_cont_link_flap fast by flapping up to 32 ports when total ports > 32 (and keep existing behavior when ports <= 32). The 32 are selected deterministically with rotation (e.g., bucketized + week-of-year) so all ports get covered over time across runs.
- Add a separate “all ports” test that flaps all ports using a bulk/parallel shutdown/no-shutdown approach, so we still cover every port without the long sequential runtime.
So we can implement point 1 with completeness_level (e.g., thorough=all, basic=32), but we were proposing two explicit test cases to keep the intent and runtime characteristics obvious and avoid changing the original “all ports” sequential workflow unless the user explicitly runs the all-ports test.
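The thorough=all / basic=32 split mentioned here could look roughly like this. This is an illustrative sketch: the level names are the sonic-mgmt values quoted in this thread, while the `LEVEL_TO_CAP` / `num_target_ports` names and the specific cap for debug are assumptions.

```python
# Illustrative mapping from completeness level to the number of target
# ports. Level names ("debug", "basic", "confident", "thorough") are
# the sonic-mgmt values discussed in this thread; the caps chosen here
# are assumptions for the sake of the example.
LEVEL_TO_CAP = {
    "debug": 4,        # assumed tiny smoke sample for engineering runs
    "basic": 32,       # the 32-port sample proposed in this PR
    "confident": 32,   # proposed nightly default, selected with rotation
    "thorough": None,  # no cap: preserve the original all-ports run
}


def num_target_ports(total_ports, level="confident"):
    """Return how many ports to flap for a given completeness level."""
    cap = LEVEL_TO_CAP.get(level, 32)  # unknown level: fall back to 32
    return total_ports if cap is None else min(total_ports, cap)
```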
Hi Liping, our tests require all ports to be available to expose the problems. Trimming the topology disables ports and hides the problem. We have a reduced topo, which does the same thing as trimming, and it has hidden multiple issues from us, so we need the full topo here for testing.
Hi Riff, I may be misunderstanding your point, but completeness_level is only used to determine the number of target ports. The rest of the logic can keep the current change; it should not trim the topology or alter the test sequence.
Hi Liping, I checked the completeness level again; it operates at a different level and will affect more cases. Also, the problem here is not reducing the ports for "debug"; it simply does not make sense to run so many ports even in normal cases.
Hi Riff, thanks for your information.
completeness_level is a global setting for nightly tests and should be one of ["debug", "basic", "confident", "thorough"].
The debug level is intended for engineering testing, while thorough is for full test coverage.
I agree that running too many ports in nightly tests doesn’t make sense. We should reduce the number of target ports in nightly runs to shorten the test time.
Therefore, I suggest using completeness_level to dynamically determine the number of target test ports.
We can follow the solution implemented in PR #5846.
Hi, we just merged a common utility, and I'm testing an update on top of it in my dev branch that uses completeness_level=thorough to preserve the original behavior. For nightly tests and any other automated runs, unless thorough is specified, we can default to confident, which uses the day-of-week sampling (same approach as #22145). That gives us both: the original test behavior is preserved, and nightly runs get the performance improvement since confident is always the default. As soon as it's tested, I'll push a commit for review.
Hi @lipxu, again: we cannot run all the tests using a lower completeness level; we have been hitting issues that require a large number of ports to trigger, so the global setting is not going to work for us. It might be more useful for CI, but not for our nightly runs yet.
Signed-off-by: Priyansh Tratiya <[email protected]>
Force-pushed from 4df975b to f77002f
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
yxieca left a comment
LGTM — clean integration with completeness_level framework. DoW-based bucket sampling gives good coverage rotation, and thorough preserves the original all-ports behavior.
…r iteration with completeness level (sonic-net#22173)

Why: The continuous link flap test can take a long time on devices/testbeds with many connected ports because it flaps every eligible interface on both DUT and peer across 3 iterations. During msft runs it failed after 3 hours and during nvidia runs it failed after 9 hours.

How:
- Added a helper get_random_candidates(...) that builds the full candidate list (admin up + present in connection graph), randomly samples up to 32 candidates, and logs the selected ports for traceability.
- Updated the DUT flap loop to call port_toggle(..., ports=selected_ports, wait_after_ports_up=30, ...) so only the sampled ports are flapped each iteration.
- Updated the peer flap loop to only toggle links for the sampled (dut_port, fanout, fanout_port) tuples.

Tested: Ran test_cont_link_flap and confirmed the test executes successfully, only the sampled ports are toggled (validated via logs), and runtime is reduced compared to flapping all ports.

Signed-off-by: Priyansh Tratiya <[email protected]>
Signed-off-by: mssonicbld <[email protected]>
Cherry-pick PR to 202511: #22480
Description of PR

Summary:
Reduce test_cont_link_flap runtime by flapping a randomly sampled subset (up to 32) of DUT ports and corresponding peer (fanout) ports per iteration, instead of iterating over all connected ports.

Type of change

Back port request

Approach

What is the motivation for this PR?

The continuous link flap test can take a long time on devices/testbeds with many connected ports because it flaps every eligible interface on both DUT and peer across 3 iterations. During msft runs it failed after 3 hours and during nvidia runs it failed after 9 hours. The idea is that if it passes for 32 random interfaces across 3 iterations, it will pass overall. This PR aims to keep coverage representative while significantly lowering overall test execution time.

How did you do it?

- Added a helper get_random_candidates(...) that builds the full candidate list (admin up + present in connection graph), randomly samples up to 32 candidates, and logs the selected ports for traceability.
- Updated the DUT flap loop to call port_toggle(..., ports=selected_ports, wait_after_ports_up=30, ...) so only the sampled ports are flapped each iteration.
- Updated the peer flap loop to only toggle links for the sampled (dut_port, fanout, fanout_port) tuples.

How did you verify/test it?

Ran tests/platform_tests/link_flap/test_cont_link_flap.py::test_cont_link_flap and confirmed the test executes successfully, only the sampled ports are toggled (validated via logs), and runtime is reduced compared to flapping all ports.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation
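The sampling step described above can be sketched as follows. This is a simplified stand-in: the real helper takes duthost/fanouthosts and walks the connection graph, whereas this version takes a prebuilt candidate list; only the random-sampling and logging behavior mirrors what the PR describes.

```python
# Simplified sketch of the sampling helper described above. The
# candidate-list construction (admin up + present in connection graph)
# is replaced here by a plain list of (dut_port, fanout, fanout_port)
# tuples passed in by the caller.
import logging
import random


def get_random_candidates(candidates, num_ports=32):
    """Sample up to num_ports (dut_port, fanout, fanout_port) tuples."""
    if len(candidates) <= num_ports:
        # Small testbeds keep the existing all-ports behavior.
        selected = list(candidates)
    else:
        selected = random.sample(candidates, num_ports)
    selected_ports = [dut_port for dut_port, _, _ in selected]
    # Log the selection so a failing run can be traced to exact ports.
    logging.info("Selected ports for flap: %s", selected_ports)
    return selected, selected_ports
```

The DUT loop would then pass `selected_ports` to port_toggle, and the peer loop would iterate only over the sampled tuples.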