[warm upgrade] Catch the regression before it becomes a problem - mine and police the protocol convergence timings by ravaliyel · Pull Request #23114 · sonic-net/sonic-mgmt

ravaliyel · 2026-03-19T03:10:32Z

Description of PR

This PR refactors and improves the control plane session recovery gating logic during SONiC image upgrades. The logic is now modular, data-driven, and privacy-compliant, with all thresholds managed in a dedicated JSON file. Debug and legacy code are removed, and a minimal dummy data example is provided for public documentation.

Summary:

- Modularizes control plane session recovery gating logic into a dedicated helper (controlplane_gating.py)
- Introduces a structured, per-HwSKU, per-version JSON thresholds file
- Removes all debug and legacy code from the codebase
- Adds robust handling and logging for unknown HwSKUs or missing thresholds
- Provides a privacy-safe dummy JSON for documentation and public PRs

Fixes # (issue)

Type of change

Back port request

Approach

What is the motivation for this PR?

To provide maintainable, robust, and privacy-compliant gating for control plane session recovery times, and to simplify updates and future extensibility.

How did you do it?

- Centralized all control plane session recovery gating logic in a helper file
- Created a structured JSON file for thresholds, supporting multiple HwSKUs and version pairs
- Removed all debug and legacy code
- Added logging for unknown HwSKUs and missing thresholds
- Provided a minimal dummy JSON for public PRs

How did you verify/test it?

- Unit and integration tested on multiple HwSKUs and version pairs
- Validated JSON structure and gating logic with both real and dummy data
- Confirmed correct logging and fallback behavior for unknown/missing data

Test result: CI pipeline ID - 1066532

```
2026-03-20T21:48:12.1062715Z INFO     root:controlplane_gating.py:86 Gating failure: LACP session recovery 220s exceeded allowed threshold (AVG + wiggle room): 30.00s + 10.00s for Arista-HwSKU 202505->202511
2026-03-20T21:48:12.1066143Z INFO     root:controlplane_gating.py:86 Gating failure: BGP session recovery 220s exceeded allowed threshold (AVG + wiggle room): 30.00s + 10.00s for Arista-HwSKU 202505->202511
2026-03-20T21:48:12.1069915Z ERROR    tests.common.fixtures.advanced_reboot:advanced_reboot.py:690 Post reboot verification failed. List of failures: LACP session recovery 220s exceeded allowed threshold (AVG + wiggle room): 30.00s + 10.00s for Arista-HwSKU 202505->202511
2026-03-20T21:48:12.1073646Z BGP session recovery 220s exceeded allowed threshold (AVG + wiggle room): 30.00s + 10.00s for Arista-HwSKU 202505->202511
2
2026-03-20T21:48:49.5327730Z E       [('test_upgrade_path[202505-to-202511-warm-strtk5-DUT]None', ['LACP session recovery 220s exceeded allowed threshold (AVG + wiggle room): 30.00s + 10.00s for Arista-HwSKU 202505->202511', 'BGP session recovery 220s exceeded allowed threshold (AVG + wiggle room): 30.00s + 10.00s for Arista-
```

#### Any platform specific information?
No platform-specific logic; all gating is data-driven.

#### Supported testbed topology if it's a new test case?
Not applicable.

### Documentation

- Updated documentation to include the new JSON format and gating logic
- Provided a privacy-safe dummy JSON example for public reference

mssonicbld · 2026-03-19T03:10:40Z

/azp run

azure-pipelines · 2026-03-19T03:10:54Z

Azure Pipelines successfully started running 1 pipeline(s).

wsycqyz · 2026-03-19T04:47:08Z

Hi @ravaliyel, does the 202511 branch need this change as well?

StormLiangMS · 2026-03-19T14:07:53Z

Code Review

🔴 Critical Issues

1. raise SystemExit(...) will kill the entire pytest process (line 70)

raise SystemExit(f"{label} threshold exceeded! Failing pipeline.")

SystemExit terminates the whole pytest run — no cleanup, no other tests, no report generation. This should be:

pytest.fail(f"{label} threshold exceeded: {val:.2f}s > {avg:.2f}s + {wiggle:.2f}s")

or append to gating_failures and return the list, letting the caller decide how to fail.

2. JSON example keys don't match code logic

The code looks up thresholds[hwsku][base][target]["LACP"]["AVG"] and ["BGP"]["AVG"], but hwsku_session_thresholds.json uses "protocol_1" and "protocol_2". Anyone using the example as a template will get silent "No thresholds found" warnings. The example should use "LACP" and "BGP" keys.
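For illustration, a minimal sketch of the nested shape that this lookup path implies. The HwSKU name and version pair mirror the CI log in the description; every number is a dummy value, and the helper `lookup_avg` is a hypothetical name, not the PR's code.

```python
import json

# Hypothetical thresholds document. Keys follow the lookup path
# thresholds[hwsku][base][target][protocol]["AVG"]; all numbers are dummy values.
EXAMPLE_JSON = """
{
  "Arista-HwSKU": {
    "202505": {
      "202511": {
        "LACP": {"AVG": 30.0, "P95": 45.0, "MAX": 60.0},
        "BGP":  {"AVG": 30.0, "P95": 45.0, "MAX": 60.0}
      }
    }
  }
}
"""


def lookup_avg(thresholds, hwsku, base, target, protocol):
    """Return the AVG threshold, or None if any level of the path is missing."""
    node = thresholds
    for key in (hwsku, base, target, protocol):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node.get("AVG") if isinstance(node, dict) else None


thresholds = json.loads(EXAMPLE_JSON)
lacp_avg = lookup_avg(thresholds, "Arista-HwSKU", "202505", "202511", "LACP")
# A template key such as "protocol_1" would not be found by a lookup for "LACP".
missing = lookup_avg(thresholds, "Arista-HwSKU", "202505", "202511", "protocol_1")
```

With keys named "LACP" and "BGP" the lookup resolves; with placeholder keys it returns None, which is the silent-warning case described above.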

3. Missing return [] after warning on lines 55-58

if lacp_avg is None or bgp_avg is None:
    logging.warning(...)
    # falls through to checks loop — should return [] here

After the warning for missing thresholds is logged, execution falls through to the checks loop. Nothing fails there, because avg is not None evaluates to False and short-circuits the and condition, but the control flow is confusing. Add return [] after the warning.
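A minimal sketch of the suggested guard, assuming the checks loop can be represented by a callable. The names `gate_or_skip` and `run_checks` are hypothetical and only stand in for the real function body.

```python
import logging


def gate_or_skip(lacp_avg, bgp_avg, run_checks):
    """Skip gating (return []) when either AVG threshold is missing.

    `run_checks` is a hypothetical callable standing in for the checks loop.
    """
    if lacp_avg is None or bgp_avg is None:
        logging.warning("No thresholds found; skipping control plane gating")
        return []  # explicit early exit instead of falling through
    return run_checks()


skipped = gate_or_skip(None, 30.0, lambda: ["would not run"])
checked = gate_or_skip(30.0, 30.0, lambda: ["ran"])
```

The early return makes the "no thresholds" path visibly separate from the path that evaluates the checks.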

✅ Positive

- Clean modular design with data-driven thresholds
- Good use of _extract_version() regex for flexible version parsing
- Proper logging for unknown HwSKUs

mssonicbld · 2026-03-20T19:57:24Z

/azp run

azure-pipelines · 2026-03-20T19:57:37Z

Azure Pipelines successfully started running 1 pipeline(s).

ravaliyel · 2026-03-20T20:23:59Z

Hi @StormLiangMS, thank you for the review and suggestions. I have made the changes: the code now uses pytest.fail, a return was added after the missing-threshold warning, and the HwSKU JSON now uses LACP and BGP keys. Please review, thank you.

ravaliyel · 2026-03-20T20:25:41Z

> Hi @ravaliyel, does the 202511 branch need this change as well?

Hi @wsycqyz, yes this change will be applied for 202511 branch as well. Thank you

Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>

mssonicbld · 2026-03-20T20:26:37Z

/azp run

azure-pipelines · 2026-03-20T20:26:50Z

Azure Pipelines successfully started running 1 pipeline(s).

yxieca

Code Review

Issues

1. pytest.fail() inside the loop short-circuits on first failure

for label, val, avg, wiggle in checks:
    if val is not None and avg is not None and val > (avg + wiggle):
        gating_failures.append(...)
        pytest.fail(...)  # stops here, never checks remaining items
return gating_failures  # dead code if pytest.fail() is reached

The function builds a gating_failures list and returns it, but pytest.fail() raises immediately, so you'll never see more than one failure and the return value is never used by the caller. Either:

Remove pytest.fail() from the loop and let the caller handle the list, or
Collect all failures first, then call pytest.fail() once at the end with a combined message
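The second option can be sketched as follows. This is an illustration under assumptions, not the merged code: the function name `collect_gating_failures` is hypothetical, the message wording is modeled on the CI log in the description, and the final call to pytest.fail() is left to the caller so the helper stays free of a pytest import.

```python
def collect_gating_failures(checks):
    """Return a message for every check whose value exceeds avg + wiggle.

    `checks` is an iterable of (label, val, avg, wiggle) tuples, as in the
    loop quoted above. No exception is raised here, so every item is checked.
    """
    failures = []
    for label, val, avg, wiggle in checks:
        if val is not None and avg is not None and val > (avg + wiggle):
            failures.append(
                "{} session recovery {}s exceeded allowed threshold "
                "(AVG + wiggle room): {:.2f}s + {:.2f}s".format(label, val, avg, wiggle)
            )
    return failures


# Dummy inputs: two protocols over the limit, one within it, one with no threshold.
checks = [
    ("LACP", 220, 30.0, 10.0),
    ("BGP", 220, 30.0, 10.0),
    ("OTHER", 5, 30.0, 10.0),
    ("UNKNOWN", 500, None, 10.0),
]
failures = collect_gating_failures(checks)
combined = "; ".join(failures)  # a caller could pass this to pytest.fail() once
```

Both the LACP and BGP failures end up in the combined message, which is the behavior the short-circuiting loop could not provide.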

2. Return value is unused by the caller

In device_utils.py:

controlplane_gating(gating_input)  # return value discarded

If the intent is to fail via pytest.fail(), the return value is pointless. If the intent is to return failures for the caller to handle, then pytest.fail() shouldn't be inside the function. Pick one pattern.
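As a sketch of the "return failures for the caller to handle" pattern, the caller keeps the return value and merges it into its own error list. Everything here is hypothetical: `controlplane_gating_stub` stands in for the real helper, and `verify_post_reboot` and `verification_errors` are illustrative names for the caller and its list.

```python
def controlplane_gating_stub(gating_input):
    """Hypothetical stand-in for the gating helper: returns failure strings."""
    return ["{} session recovery exceeded threshold".format(p) for p in gating_input]


def verify_post_reboot(gating_input, verification_errors):
    """Caller-side handling: keep the return value and merge it into the error list."""
    failures = controlplane_gating_stub(gating_input)
    verification_errors.extend(failures)
    return verification_errors


errors = verify_post_reboot(["LACP", "BGP"], ["existing check failed"])
```

With this shape the helper never decides how the test fails; the caller reports gating failures alongside its other verification errors.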

3. Hardcoded 10s wiggle room may not suit all platforms

LACP_WIGGLE_ROOM = 10.0 and BGP_WIGGLE_ROOM = 10.0 are constants. For platforms with very fast recovery (e.g., AVG=20s), 10s is a 50% margin. For slow ones (AVG=210s), it's <5%. Consider making this a percentage, or moving it into the JSON thresholds per-HwSKU.
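The arithmetic behind this point, with a sketch of a percentage-based alternative. The function names and the `wiggle_pct` parameter are assumptions for illustration; the PR itself uses the fixed 10 s constants.

```python
def margin_pct(avg, wiggle_abs=10.0):
    """Express a fixed margin (seconds) as a percentage of the AVG threshold."""
    return 100.0 * wiggle_abs / avg


def allowed_limit(avg, wiggle_pct=None, wiggle_abs=10.0):
    """Upper bound for a recovery time: avg plus a percentage or a fixed margin."""
    if wiggle_pct is not None:
        return avg * (1.0 + wiggle_pct / 100.0)
    return avg + wiggle_abs


fast = margin_pct(20.0)    # fixed 10 s on a 20 s average
slow = margin_pct(210.0)   # fixed 10 s on a 210 s average
fast_limit = allowed_limit(20.0, wiggle_pct=25)
slow_limit = allowed_limit(210.0)  # falls back to the fixed 10 s margin
```

A percentage keeps the relative tolerance constant across platforms, at the cost of a very tight absolute margin on fast platforms; storing the margin per HwSKU in the JSON avoids choosing one rule for all.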

4. P95 and MAX thresholds in JSON are unused

The JSON stores AVG, P95, and MAX but only AVG is used. If P95/MAX aren't planned for use, they add confusion. If they are planned, worth noting in a comment.

5. f-strings require Python 3.6+

The codebase generally uses .format() for broader compatibility. Not a blocker but worth noting for consistency.

Minor

- import pytest in a utility module creates a hard dependency: if this module is ever imported in an environment where pytest is not installed, the import will fail
- The function name controlplane_gating matches the module name, which can cause confusion with imports

vaibhavhd · 2026-03-20T21:47:47Z

> Collect all failures first, then call pytest.fail() once at the end with a combined message

@ravaliyel, comments 1 and 2 seem valid. Can you fix them? I think it is better to collect all failures first, then call pytest.fail() once at the end with a combined message.

tests/common/platform/hwsku_session_thresholds.json

Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>

mssonicbld · 2026-03-20T23:08:43Z

/azp run

azure-pipelines · 2026-03-20T23:08:56Z

Azure Pipelines successfully started running 1 pipeline(s).

ravaliyel · 2026-03-20T23:17:25Z

Hi @yxieca, thank you for the review. I have addressed comments 1 and 2: the code now collects all gating failures and returns them, and the caller appends them to the existing verification_errors list.
I have also removed the f-strings and the pytest dependency from the gating logic.

@vaibhavhd mentioned that comments 3 and 4 can be addressed in future PRs as extensions to this gating logic. Thank you.

ravaliyel · 2026-03-20T23:35:13Z

> Collect all failures first, then call pytest.fail() once at the end with a combined message
>
> @ravaliyel, comments 1 and 2 seem valid. Can you fix them? I think it is better to collect all failures first, then call pytest.fail() once at the end with a combined message.

@vaibhavhd I have updated the logic to collect all failures and then fail with a combined message. I have also added the test results to the description. Please review, thank you.

github-actions bot requested review from judyjoseph, yutongzhang-microsoft and yxieca March 19, 2026 03:10

Copilot AI mentioned this pull request Mar 19, 2026

Review human-authored PRs opened in the past 24 hours #23134

Draft

ravaliyel added 2 commits March 20, 2026 13:26

Adding control plane gating logic

863966b

Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>

PR changes

7dd150d

Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>

ravaliyel force-pushed the ryeluri/control-plane-gating branch from 6657e7d to 7dd150d Compare March 20, 2026 20:26

yxieca requested a review from vaibhavhd March 20, 2026 20:37

yxieca reviewed Mar 20, 2026

vaibhavhd changed the title ~~Adding control plane gating logic~~ [warm upgrade] Catch the regression before it becomes a problem - mine and police the protocol convergence timings Mar 20, 2026

vaibhavhd reviewed Mar 20, 2026

tests/common/platform/hwsku_session_thresholds.json

PR changes

d92a878

Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>

github-actions bot requested a review from yxieca March 20, 2026 23:08

vaibhavhd approved these changes Mar 23, 2026

vaibhavhd merged commit 8c54bb1 into sonic-net:master Mar 23, 2026
17 checks passed
