test_qos_sai: prevent cascading failures after fixture error#23283
darius-nexthop wants to merge 2 commits into sonic-net:master from
Conversation
When the testParameter fixture setup fails, all subsequent tests in that parameter set are currently reported as ERROR, which obscures the root cause and tanks the pass rate. Add pytest hooks to detect fixture setup failure in test_qos_sai parameter sets and skip the remaining tests in that parameter set as SKIPPED. This preserves a clear ERROR for the root cause while avoiding cascading failures and reducing the blast radius in the test report. Signed-off-by: Darius Grassi <[email protected]>
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
/azp run
Signed-off-by: Darius Grassi <[email protected]>
99c00a7 to 28652f4
/azp run
StormLiangMS
left a comment
@darius-nexthop —
Template: ✅ OK
DCO: ✅ signed
CI: ✅ all green
Code Review:
[Important] Module-level _fixture_failures dict persists across test runs
conftest.py:178:
_fixture_failures = {}
This module-level dict is never cleared between pytest sessions. If conftest.py is imported once and pytest runs multiple times (e.g., in a long-lived worker process, or with pytest-xdist), stale entries from a previous run will cause tests to be skipped in subsequent runs even if the fixture is now healthy.
Suggestion: Clear the dict at session start using a pytest_sessionstart hook:
def pytest_sessionstart(session):
    _fixture_failures.clear()
[Important] Parameter set extraction is fragile
params = test_name.split('[')[1].rstrip(']')
param_set = params.split('-')[0]
This assumes the first hyphen-delimited token is the fixture parameter set. But test names can be like testQosSaiDscpEcn[single_asic-ptf_mode] or testParameter[multi-asic], where multi-asic itself contains a hyphen: "multi-asic".split('-')[0] → "multi", losing the actual parameter set.
Suggestion: Use a more robust extraction — either match known parameter set names explicitly, or use the full parameterization string as the key.
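A sketch of the second option the reviewer suggests, using the full parameterization string as the key. The function name extract_param_key is illustrative:

```python
def extract_param_key(test_name):
    """Return the full bracketed parameterization, or None if absent."""
    if "[" not in test_name:
        return None
    # Keep everything inside the brackets intact, rather than split('-')[0].
    return test_name.split("[", 1)[1].rstrip("]")


# "multi-asic" survives whole instead of being truncated to "multi":
extract_param_key("testParameter[multi-asic]")                 # -> "multi-asic"
extract_param_key("testQosSaiDscpEcn[single_asic-ptf_mode]")   # -> "single_asic-ptf_mode"
```

Keying the failure map on the whole string is slightly coarser than matching known parameter set names, but it cannot mis-split a hyphenated name.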
[Minor] testParameter exclusion is hardcoded
if 'testParameter' in item.name:
    return
If the test is renamed or another "seed" test is added, this breaks. Consider using a pytest marker like @pytest.mark.fixture_seed instead.
Overall: The concept is sound — cascading fixture failures waste test time. The implementation needs the session-start cleanup and more robust parameter parsing.
StormLiangMS
left a comment
@darius-nexthop — The module-level _fixture_failures = {} dict is never cleared between pytest sessions. In long-lived workers or xdist, stale entries will incorrectly skip tests in subsequent runs.
Please add a pytest_sessionstart hook to clear it:
def pytest_sessionstart(session):
    _fixture_failures.clear()
Also, the parameter set extraction (split("-")[0]) breaks on names like multi-asic. Please use a more robust parsing approach.
Description of PR
Summary:
Prevent cascading ERROR statuses in tests/qos/test_qos_sai.py when the testParameter fixture setup fails, by skipping remaining tests in the affected parameter set.
Fixes #23282
Type of change
Back port request
Approach
What is the motivation for this PR?
A single infrastructure/fixture failure in testParameter currently causes all subsequent tests in the same parameter set to report as ERROR, which obscures the true root cause and significantly inflates the reported failure count for QoS SAI runs.
How did you do it?
Added pytest hooks (pytest_runtest_makereport and pytest_runtest_setup) that detect fixture setup failures per parameter set and track them in a small global map; once a parameter set is known-bad, subsequent tests in that set are short-circuited and reported as SKIPPED instead of re-running the broken setup and cascading more ERRORs.
How did you verify/test it?
Ran tests/qos/test_qos_sai.py on a setup where testParameter reliably fails during fixture setup and confirmed that only the first testParameter for each parameter set reports ERROR, while all remaining tests in those parameter sets are reported as SKIPPED, reducing the output from ~130+ cascading ERRORs to a small, fixed set of ERRORs plus clean SKIPPED statuses.
Any platform specific information?
N/A
Supported testbed topology if it's a new test case?
Documentation