Reduce continuous link flap test runtime by sampling 32 interfaces per iteration with completeness level #22173
Conversation
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
"""

@staticmethod
def get_random_candidates(duthost, fanouthosts, num_ports=32):
This PR is mainly intended to reduce the overall execution time of the test cases, but in some scenarios we still need to cover all ports, so we should not change the original testing workflow.
Hi Liping, that's true. To address this we were thinking of splitting the ports into 32 buckets and selecting one port from each bucket based on something like day of week, so nightly runs would rotate through and eventually cover all ports, exposing any per-port hardware issues. To cover all ports in a single run, we were also thinking of adding a new test case that shuts down and starts up ports in a batch with a single command (in parallel), since flapping 512 ports sequentially takes too long. We put out a similar PR for the test_lldp_syncd test module (#22145) and want to replicate the same approach here. We are iterating on it.
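The bucketed day-of-week selection described in this comment could be sketched roughly as follows. This is a hypothetical illustration, not the PR's actual code; the name `pick_rotating_sample` and the handling of uneven bucket sizes are assumptions.

```python
# Hypothetical sketch of the idea above: partition the sorted port list
# into roughly num_buckets buckets and pick one port per bucket,
# rotating the pick by day of week so successive nightly runs select
# different ports from each bucket.
import datetime


def pick_rotating_sample(ports, num_buckets=32, day=None):
    """Deterministically sample about num_buckets ports, rotating daily."""
    if day is None:
        day = datetime.date.today().weekday()  # 0 (Mon) .. 6 (Sun)
    ports = sorted(ports)
    if len(ports) <= num_buckets:
        return ports  # few ports: keep the existing all-ports behavior
    bucket_size = -(-len(ports) // num_buckets)  # ceiling division
    buckets = [ports[i:i + bucket_size]
               for i in range(0, len(ports), bucket_size)]
    # Rotate the index within each bucket by day of week, so over a
    # week every port in a bucket of size <= 7 gets selected once.
    return [bucket[day % len(bucket)] for bucket in buckets]
```

Note that with 512 ports and 32 buckets each bucket holds 16 ports, so day-of-week rotation alone covers only 7 of the 16; a week-of-year component (mentioned later in this thread) would be needed for full coverage over time.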
logging.info("%d Iteration flap all interfaces one by one on DUT", iteration + 1)
port_toggle(duthost, tbinfo, watch=True)
logging.info("%d Iteration flap randomly sampled interfaces one by one on DUT", iteration + 1)
_, selected_ports = self.get_random_candidates(duthost, fanouthosts, num_ports=32)
Could we use an input parameter (for example, based on the value of --completeness_level) to determine the number of target test ports, such as testing all ports, half of the ports, a quarter of the ports, etc.?
Hi Liping, we could not use the completeness level, because we need all ports to be alive in the test run. Trimming the topology to a smaller set is not a good option for us.
Hi @r12f I’m not sure I fully understand what you mean by “we could not use the completeness level”.
The --completeness_level option should already be available, and we can use it to make num_ports more flexible without trimming the topology or changing the original workflow.
completeness_level may be a good idea if more specific control is required, but for this PR we were thinking of a clear split of behaviors into two tests:
- Keep test_cont_link_flap fast by flapping up to 32 ports when total ports > 32 (and keep existing behavior when ports <= 32). The 32 are selected deterministically with rotation (e.g., bucketized + week-of-year) so all ports get covered over time across runs.
- Add a separate “all ports” test that flaps all ports using a bulk/parallel shutdown/no-shutdown approach, so we still cover every port without the long sequential runtime.
So we can implement point 1 with completeness_level (e.g., thorough=all, basic=32), but we were proposing two explicit test cases to keep the intent and runtime characteristics obvious and avoid changing the original “all ports” sequential workflow unless the user explicitly runs the all-ports test.
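The thorough=all / basic=32 split mentioned here could look roughly like this. This is an illustrative sketch: the level names are the sonic-mgmt values quoted in this thread, while the `LEVEL_TO_CAP` / `num_target_ports` names and the specific cap for debug are assumptions.

```python
# Illustrative mapping from completeness level to the number of target
# ports. Level names ("debug", "basic", "confident", "thorough") are
# the sonic-mgmt values discussed in this thread; the caps chosen here
# are assumptions for the sake of the example.
LEVEL_TO_CAP = {
    "debug": 4,        # assumed tiny smoke sample for engineering runs
    "basic": 32,       # the 32-port sample proposed in this PR
    "confident": 32,   # proposed nightly default, selected with rotation
    "thorough": None,  # no cap: preserve the original all-ports run
}


def num_target_ports(total_ports, level="confident"):
    """Return how many ports to flap for a given completeness level."""
    cap = LEVEL_TO_CAP.get(level, 32)  # unknown level: fall back to 32
    return total_ports if cap is None else min(total_ports, cap)
```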
Hi Liping, our tests require all ports to be available to expose the problems. Trimming the topology disables ports and hides the problem. We have a reduced topo, which does the same thing as trimming, and it has hidden multiple issues from us, so we need the full topo here for testing.
Hi Riff, I may be misunderstanding your point, but completeness_level is only used to determine the number of target ports. The rest of the logic can keep the current change; it should not trim the topology or alter the test sequence.
Hi Liping, I checked the completeness level again; it operates at a different level and will affect more cases. Also, the problem here is not reducing the ports for "debug"; it simply does not make sense to run so many ports even in normal cases.
Hi Riff, thanks for your information.
completeness_level is a global setting for nightly tests and should be one of ["debug", "basic", "confident", "thorough"].
The debug level is intended for engineering testing, while thorough is for full test coverage.
I agree that running too many ports in nightly tests doesn’t make sense. We should reduce the number of target ports in nightly runs to shorten the test time.
Therefore, I suggest using completeness_level to dynamically determine the number of target test ports.
We can follow the solution implemented in PR #5846.
Hi, we just merged a common utility, and I'm testing an update on top of it in my dev branch that uses completeness_level=thorough to preserve the original behavior. For nightly tests and any other automated runs, unless thorough is specified, we can default to confident, which uses the day-of-week sampling (same approach as #22145). That gives us both: the original test behavior is preserved, and nightly runs get the performance improvement since confident is always the default. As soon as it's tested, I'll push a commit for review.
Hi @lipxu, again: we cannot run all the tests using a lower completeness level; we have been hitting issues that require a large number of ports to trigger, so the global setting is not going to work for us. It might be more useful for CI, but not for our nightly runs yet.
Signed-off-by: Priyansh Tratiya <[email protected]>
Force-pushed from 4df975b to f77002f
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
yxieca left a comment
LGTM — clean integration with completeness_level framework. DoW-based bucket sampling gives good coverage rotation, and thorough preserves the original all-ports behavior.
…r iteration with completeness level (sonic-net#22173)

Why: The continuous link flap test can take a long time on devices/testbeds with many connected ports because it flaps every eligible interface on both DUT and peer across 3 iterations. During msft runs it failed after 3 hours and during nvidia runs it failed after 9 hours.

How:
- Added a helper get_random_candidates(...) that builds the full candidate list (admin up + present in connection graph), randomly samples up to 32 candidates, and logs the selected ports for traceability.
- Updated the DUT flap loop to call port_toggle(..., ports=selected_ports, wait_after_ports_up=30, ...) so only the sampled ports are flapped each iteration.
- Updated the peer flap loop to only toggle links for the sampled (dut_port, fanout, fanout_port) tuples.

Tested: Ran test_cont_link_flap and confirmed the test executes successfully, only the sampled ports are toggled (validated via logs), and runtime is reduced compared to flapping all ports.

Signed-off-by: Priyansh Tratiya <[email protected]>
Signed-off-by: mssonicbld <[email protected]>
Cherry-pick PR to 202511: #22480
Description of PR

Summary:
Reduce test_cont_link_flap runtime by flapping a randomly sampled subset (up to 32) of DUT ports and corresponding peer (fanout) ports per iteration, instead of iterating over all connected ports.

Type of change

Back port request

Approach

What is the motivation for this PR?

The continuous link flap test can take a long time on devices/testbeds with many connected ports because it flaps every eligible interface on both DUT and peer across 3 iterations. During msft runs it failed after 3 hours and during nvidia runs it failed after 9 hours. The idea is that if it passes for 32 random interfaces across 3 iterations, it will pass overall. This PR aims to keep coverage representative while significantly lowering overall test execution time.

How did you do it?

- Added a helper get_random_candidates(...) that builds the full candidate list (admin up + present in connection graph), randomly samples up to 32 candidates, and logs the selected ports for traceability.
- Updated the DUT flap loop to call port_toggle(..., ports=selected_ports, wait_after_ports_up=30, ...) so only the sampled ports are flapped each iteration.
- Updated the peer flap loop to only toggle links for the sampled (dut_port, fanout, fanout_port) tuples.

How did you verify/test it?

Ran tests/platform_tests/link_flap/test_cont_link_flap.py::test_cont_link_flap and confirmed the test executes successfully, only the sampled ports are toggled (validated via logs), and runtime is reduced compared to flapping all ports.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation
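The sampling step described above can be sketched as follows. This is a simplified stand-in: the real helper takes duthost/fanouthosts and walks the connection graph, whereas this version takes a prebuilt candidate list; only the random-sampling and logging behavior mirrors what the PR describes.

```python
# Simplified sketch of the sampling helper described above. The
# candidate-list construction (admin up + present in connection graph)
# is replaced here by a plain list of (dut_port, fanout, fanout_port)
# tuples passed in by the caller.
import logging
import random


def get_random_candidates(candidates, num_ports=32):
    """Sample up to num_ports (dut_port, fanout, fanout_port) tuples."""
    if len(candidates) <= num_ports:
        # Small testbeds keep the existing all-ports behavior.
        selected = list(candidates)
    else:
        selected = random.sample(candidates, num_ports)
    selected_ports = [dut_port for dut_port, _, _ in selected]
    # Log the selection so a failing run can be traced to exact ports.
    logging.info("Selected ports for flap: %s", selected_ports)
    return selected, selected_ports
```

The DUT loop would then pass `selected_ports` to port_toggle, and the peer loop would iterate only over the sampled tuples.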