[warm upgrade] Catch the regression before it becomes a problem - mine and police the protocol convergence timings#23114
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Hi @ravaliyel, does the 202511 branch need this change as well? |
Code Review🔴 Critical Issues1. raise SystemExit(f"{label} threshold exceeded! Failing pipeline.")
pytest.fail(f"{label} threshold exceeded: {val:.2f}s > {avg:.2f}s + {wiggle:.2f}s")or append to 2. JSON example keys don't match code logic The code looks up 3. Missing if lacp_avg is None or bgp_avg is None:
logging.warning(...)
# falls through to checks loop — should return [] hereAfter logging the warning for missing thresholds, execution falls through to the checks loop. The loop won't fail (because ✅ Positive
|
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Hi @StormLiangMS, thank you for the review and suggestions. I have made the changes accordingly to use pytest.fail, added return and changed the hwsku json to use LACP and BGP. Please review, thank you |
Hi @wsycqyz, yes this change will be applied for 202511 branch as well. Thank you |
Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>
Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>
6657e7d to
7dd150d
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
yxieca
left a comment
There was a problem hiding this comment.
Code Review
Issues
1. pytest.fail() inside the loop short-circuits on first failure
for label, val, avg, wiggle in checks:
if val is not None and avg is not None and val > (avg + wiggle):
gating_failures.append(...)
pytest.fail(...) # stops here, never checks remaining items
return gating_failures # dead code if pytest.fail() is reachedThe function builds a gating_failures list and returns it, but pytest.fail() raises immediately, so you'll never see more than one failure and the return value is never used by the caller. Either:
- Remove
pytest.fail()from the loop and let the caller handle the list, or - Collect all failures first, then call
pytest.fail()once at the end with a combined message
2. Return value is unused by the caller
In device_utils.py:
controlplane_gating(gating_input) # return value discardedIf the intent is to fail via pytest.fail(), the return value is pointless. If the intent is to return failures for the caller to handle, then pytest.fail() shouldn't be inside the function. Pick one pattern.
3. Hardcoded 10s wiggle room may not suit all platforms
LACP_WIGGLE_ROOM = 10.0 and BGP_WIGGLE_ROOM = 10.0 are constants. For platforms with very fast recovery (e.g., AVG=20s), 10s is a 50% margin. For slow ones (AVG=210s), it's <5%. Consider making this a percentage, or moving it into the JSON thresholds per-HwSKU.
4. P95 and MAX thresholds in JSON are unused
The JSON stores AVG, P95, and MAX but only AVG is used. If P95/MAX aren't planned for use, they add confusion. If they are planned, worth noting in a comment.
5. f-strings require Python 3.6+
The codebase generally uses .format() for broader compatibility. Not a blocker but worth noting for consistency.
Minor
import pytestin a utility module creates a hard dependency — if this module is ever imported outside pytest context it will fail- The function name
controlplane_gatingmatches the module name, which can cause confusion with imports
@ravaliyel , comment 1 and 2 seem valid. Can you fix them? I think it is better to |
Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Hi @yxieca, thank you for the review. I have addressed the comments 1 and 2, modified the code to collect all the gating failures, return and append it to the existing verification_errors list. @vaibhavhd mentioned that comments 3 and 4 can be addressed in the future PRs as extensions to this gating logic. Thank you. |
@vaibhavhd I have implemented and updated the logic to collect all failures and the fail with a combined message. Added the test results to the description. Please review, thank you. |
…e and police the protocol convergence timings (sonic-net#23114) This PR refactors and improves the control plane session recovery gating logic during SONiC image upgrades. The logic is now modular, data-driven, and privacy-compliant, with all thresholds managed in a dedicated JSON file. Debug and legacy code are removed, and a minimal dummy data example is provided for public documentation. How did you do it? Centralized all control plane session recovery gating logic in a helper file Created a structured JSON file for thresholds, supporting multiple HwSKUs and version pairs Removed all debug and legacy code Added logging for unknown HwSKUs and missing thresholds Provided a minimal dummy JSON for public PRs How did you verify/test it? Unit and integration tested on multiple HwSKUs and version pairs Validated JSON structure and gating logic with both real and dummy data Confirmed correct logging and fallback behavior for unknown/missing data Test result: CI pipeline ID - 1066532 Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>
Description of PR
This PR refactors and improves the control plane session recovery gating logic during SONiC image upgrades. The logic is now modular, data-driven, and privacy-compliant, with all thresholds managed in a dedicated JSON file. Debug and legacy code are removed, and a minimal dummy data example is provided for public documentation.
Summary:
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
To provide maintainable, robust, and privacy-compliant gating for control plane session recovery times, and to simplify updates and future extensibility.
How did you do it?
How did you verify/test it?
Test result: CI pipeline ID - 1066532