
[warm upgrade] Catch the regression before it becomes a problem - mine and police the protocol convergence timings #23114

Merged
vaibhavhd merged 3 commits into sonic-net:master from ravaliyel:ryeluri/control-plane-gating
Mar 23, 2026

Conversation

@ravaliyel
Contributor

@ravaliyel ravaliyel commented Mar 19, 2026

Description of PR

This PR refactors and improves the control plane session recovery gating logic during SONiC image upgrades. The logic is now modular, data-driven, and privacy-compliant, with all thresholds managed in a dedicated JSON file. Debug and legacy code are removed, and a minimal dummy data example is provided for public documentation.

Summary:

  • Modularizes control plane session recovery gating logic into a dedicated helper (controlplane_gating.py)
  • Introduces a structured, per-HwSKU, per-version JSON thresholds file
  • Removes all debug and legacy code from the codebase
  • Adds robust handling and logging for unknown HwSKUs or missing thresholds
  • Provides a privacy-safe dummy JSON for documentation and public PRs

Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework (new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

To provide maintainable, robust, and privacy-compliant gating for control plane session recovery times, and to simplify updates and future extensibility.

How did you do it?

  • Centralized all control plane session recovery gating logic in a helper file
  • Created a structured JSON file for thresholds, supporting multiple HwSKUs and version pairs
  • Removed all debug and legacy code
  • Added logging for unknown HwSKUs and missing thresholds
  • Provided a minimal dummy JSON for public PRs
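
As a rough illustration of the per-HwSKU, per-version layout described above, the sketch below uses made-up numbers, a placeholder HwSKU name, and a hypothetical `lookup_avg` helper. The nesting (HwSKU, base version, target version, protocol, statistic) follows the lookup path quoted in the review further down; the actual file contents and helper names in the PR may differ.

```python
import json
import logging

# Dummy thresholds: every number and the HwSKU name are placeholders.
DUMMY_THRESHOLDS_JSON = """
{
  "Example-HwSKU": {
    "202505": {
      "202511": {
        "LACP": {"AVG": 30.0, "P95": 45.0, "MAX": 60.0},
        "BGP":  {"AVG": 30.0, "P95": 45.0, "MAX": 60.0}
      }
    }
  }
}
"""


def lookup_avg(thresholds, hwsku, base, target, protocol):
    """Return the AVG threshold in seconds, or None if any level is missing."""
    try:
        return thresholds[hwsku][base][target][protocol]["AVG"]
    except KeyError:
        # Unknown HwSKU, version pair, or protocol: log and let the caller skip gating.
        logging.warning("No %s threshold for %s %s->%s", protocol, hwsku, base, target)
        return None


thresholds = json.loads(DUMMY_THRESHOLDS_JSON)
known = lookup_avg(thresholds, "Example-HwSKU", "202505", "202511", "LACP")
unknown = lookup_avg(thresholds, "Unknown-HwSKU", "202505", "202511", "LACP")
```

Returning `None` for a missing entry, rather than raising, is one way to get the "log and fall back" behavior the description mentions.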

How did you verify/test it?

  • Unit and integration tested on multiple HwSKUs and version pairs
  • Validated JSON structure and gating logic with both real and dummy data
  • Confirmed correct logging and fallback behavior for unknown/missing data

Test result: CI pipeline ID - 1066532

2026-03-20T21:48:12.1062715Z INFO     root:controlplane_gating.py:86 Gating failure: LACP session recovery 220s exceeded allowed threshold (AVG + wiggle room): 30.00s + 10.00s for Arista-HwSKU 202505->202511
2026-03-20T21:48:12.1066143Z INFO     root:controlplane_gating.py:86 Gating failure: BGP session recovery 220s exceeded allowed threshold (AVG + wiggle room): 30.00s + 10.00s for Arista-HwSKU 202505->202511
2026-03-20T21:48:12.1069915Z ERROR    tests.common.fixtures.advanced_reboot:advanced_reboot.py:690 Post reboot verification failed. List of failures: LACP session recovery 220s exceeded allowed threshold (AVG + wiggle room): 30.00s + 10.00s for Arista-HwSKU 202505->202511
2026-03-20T21:48:12.1073646Z BGP session recovery 220s exceeded allowed threshold (AVG + wiggle room): 30.00s + 10.00s for Arista-HwSKU 202505->202511
2
2026-03-20T21:48:49.5327730Z E       [('test_upgrade_path[202505-to-202511-warm-strtk5-DUT]None', ['LACP session recovery 220s exceeded allowed threshold (AVG + wiggle room): 30.00s + 10.00s for Arista-HwSKU 202505->202511', 'BGP session recovery 220s exceeded allowed threshold (AVG + wiggle room): 30.00s + 10.00s for Arista-

Any platform specific information?

No platform-specific logic; all gating is data-driven.

Supported testbed topology if it's a new test case?

Not applicable.

Documentation

  • Updated documentation to include the new JSON format and gating logic
  • Provided a privacy-safe dummy JSON example for public reference

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@wsycqyz
Contributor

wsycqyz commented Mar 19, 2026

Hi @ravaliyel, does the 202511 branch need this change as well?

@StormLiangMS
Collaborator

Code Review

🔴 Critical Issues

1. raise SystemExit(...) will kill the entire pytest process (line 70)

raise SystemExit(f"{label} threshold exceeded! Failing pipeline.")

SystemExit terminates the whole pytest run — no cleanup, no other tests, no report generation. This should be:

pytest.fail(f"{label} threshold exceeded: {val:.2f}s > {avg:.2f}s + {wiggle:.2f}s")

or append to gating_failures and return the list, letting the caller decide how to fail.
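
The difference matters because `SystemExit` derives from `BaseException` rather than `Exception`, so an ordinary `except Exception` handler does not intercept it. The sketch below is a standalone illustration, not the PR's code; `gate_with_exit`, `gate_with_list`, and `run_guarded` are hypothetical names.

```python
def gate_with_exit(value, limit):
    """Hypothetical gate that aborts the interpreter on a breach."""
    if value > limit:
        raise SystemExit("threshold exceeded")


def gate_with_list(value, limit):
    """Hypothetical gate that reports the breach and lets the caller decide."""
    failures = []
    if value > limit:
        failures.append("recovery {:.2f}s exceeded {:.2f}s".format(value, limit))
    return failures


def run_guarded(func, *args):
    """Mimic a caller that only guards against ordinary exceptions."""
    try:
        return ("returned", func(*args))
    except Exception as exc:  # SystemExit is not an Exception, so it passes through
        return ("caught", exc)


try:
    outcome = run_guarded(gate_with_exit, 220.0, 40.0)
except SystemExit as exc:
    outcome = ("escaped", str(exc))

list_outcome = run_guarded(gate_with_list, 220.0, 40.0)
```

With the list-returning variant the caller keeps control and can still run cleanup or further checks.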

2. JSON example keys don't match code logic

The code looks up thresholds[hwsku][base][target]["LACP"]["AVG"] and ["BGP"]["AVG"], but hwsku_session_thresholds.json uses "protocol_1" and "protocol_2". Anyone using the example as a template will get silent "No thresholds found" warnings. The example should use "LACP" and "BGP" keys.

3. Missing return [] after warning on lines 55-58

if lacp_avg is None or bgp_avg is None:
    logging.warning(...)
    # falls through to checks loop — should return [] here

After logging the warning for missing thresholds, execution falls through to the checks loop. The loop won't record a failure (the avg is not None test is false, so the and condition short-circuits), but the control flow is confusing. Add return [] after the warning.
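
A minimal sketch of the suggested guard follows. The function name, its arguments, and the wiggle-room default are assumptions for illustration; only the `None` check and the early `return []` reflect the review comment.

```python
import logging


def check_recovery(lacp_time, bgp_time, lacp_avg, bgp_avg, wiggle=10.0):
    """Return a list of failure strings; empty when nothing can or does fail."""
    if lacp_avg is None or bgp_avg is None:
        logging.warning("No thresholds found; skipping control plane gating")
        return []  # explicit exit instead of falling through to the loop

    failures = []
    for label, val, avg in (("LACP", lacp_time, lacp_avg), ("BGP", bgp_time, bgp_avg)):
        if val is not None and val > avg + wiggle:
            failures.append("{} session recovery {}s exceeded {:.2f}s + {:.2f}s".format(
                label, val, avg, wiggle))
    return failures


skipped = check_recovery(220, 220, None, 30.0)   # thresholds missing: gating skipped
checked = check_recovery(220, 25, 30.0, 30.0)    # only LACP breaches 30s + 10s
```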

✅ Positive

  • Clean modular design with data-driven thresholds
  • Good use of _extract_version() regex for flexible version parsing
  • Proper logging for unknown HwSKUs

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@ravaliyel
Contributor Author

Hi @StormLiangMS, thank you for the review and suggestions. I have made the changes accordingly: the code now uses pytest.fail, returns after the missing-threshold warning, and the HwSKU JSON uses LACP and BGP keys. Please review. Thank you.

@ravaliyel
Contributor Author

> Hi @ravaliyel, does the 202511 branch need this change as well?

Hi @wsycqyz, yes, this change will be applied to the 202511 branch as well. Thank you.

Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>
Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>
@ravaliyel ravaliyel force-pushed the ryeluri/control-plane-gating branch from 6657e7d to 7dd150d Compare March 20, 2026 20:26
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yxieca yxieca requested a review from vaibhavhd March 20, 2026 20:37
Collaborator

@yxieca yxieca left a comment

Code Review

Issues

1. pytest.fail() inside the loop short-circuits on first failure

for label, val, avg, wiggle in checks:
    if val is not None and avg is not None and val > (avg + wiggle):
        gating_failures.append(...)
        pytest.fail(...)  # stops here, never checks remaining items
return gating_failures  # dead code if pytest.fail() is reached

The function builds a gating_failures list and returns it, but pytest.fail() raises immediately, so you'll never see more than one failure and the return value is never used by the caller. Either:

  • Remove pytest.fail() from the loop and let the caller handle the list, or
  • Collect all failures first, then call pytest.fail() once at the end with a combined message
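
The second option might look like the sketch below. To keep the example self-contained, a plain `AssertionError` stands in for `pytest.fail()`; `collect_failures` and `fail_once` are hypothetical names, and the tuple layout of `checks` mirrors the loop quoted above.

```python
def collect_failures(checks):
    """Evaluate every (label, val, avg, wiggle) tuple before deciding anything."""
    failures = []
    for label, val, avg, wiggle in checks:
        if val is not None and avg is not None and val > (avg + wiggle):
            failures.append("{} session recovery {}s exceeded {:.2f}s + {:.2f}s".format(
                label, val, avg, wiggle))
    return failures


def fail_once(failures):
    """Raise a single error carrying all failures (stand-in for pytest.fail)."""
    if failures:
        raise AssertionError("Gating failures: " + "; ".join(failures))


# Illustrative inputs: both protocols breach the 30s + 10s limit.
checks = [("LACP", 220, 30.0, 10.0), ("BGP", 220, 30.0, 10.0)]
failures = collect_failures(checks)
try:
    fail_once(failures)
    message = None
except AssertionError as exc:
    message = str(exc)
```

Because the raise happens after the loop, both the LACP and BGP breaches end up in one message.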

2. Return value is unused by the caller

In device_utils.py:

controlplane_gating(gating_input)  # return value discarded

If the intent is to fail via pytest.fail(), the return value is pointless. If the intent is to return failures for the caller to handle, then pytest.fail() shouldn't be inside the function. Pick one pattern.

3. Hardcoded 10s wiggle room may not suit all platforms

LACP_WIGGLE_ROOM = 10.0 and BGP_WIGGLE_ROOM = 10.0 are constants. For platforms with very fast recovery (e.g., AVG=20s), 10s is a 50% margin. For slow ones (AVG=210s), it's <5%. Consider making this a percentage, or moving it into the JSON thresholds per-HwSKU.
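
The arithmetic behind this comment, and one possible percentage-based alternative, can be sketched as follows. The 20% figure and the function names are illustrative assumptions, not values from the PR.

```python
FIXED_WIGGLE_S = 10.0  # the fixed constant discussed in this comment


def relative_margin(avg_s, wiggle_s=FIXED_WIGGLE_S):
    """Fixed wiggle room expressed as a fraction of the average."""
    return wiggle_s / avg_s


def percent_limit(avg_s, percent=20.0):
    """Hypothetical alternative: allow the average plus a percentage of it."""
    return avg_s * (1.0 + percent / 100.0)


fast_margin = relative_margin(20.0)    # 0.5, i.e. 50% of a 20s average
slow_margin = relative_margin(210.0)   # roughly 0.048, i.e. under 5%
fast_limit = percent_limit(20.0)       # about 24s instead of 30s
slow_limit = percent_limit(210.0)      # about 252s instead of 220s
```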

4. P95 and MAX thresholds in JSON are unused

The JSON stores AVG, P95, and MAX but only AVG is used. If P95/MAX aren't planned for use, they add confusion. If they are planned, worth noting in a comment.

5. f-strings require Python 3.6+

The codebase generally uses .format() for broader compatibility. Not a blocker but worth noting for consistency.

Minor

  • import pytest in a utility module creates a hard dependency: if this module is ever imported in an environment where pytest is not installed, the import will fail
  • The function name controlplane_gating matches the module name, which can cause confusion with imports

@vaibhavhd vaibhavhd changed the title Adding control plane gating logic [warm upgrade] Catch the regression before it becomes a problem - mine and police the protocol convergence timings Mar 20, 2026
@vaibhavhd
Contributor

>   • Collect all failures first, then call pytest.fail() once at the end with a combined message

@ravaliyel, comments 1 and 2 seem valid. Can you fix them? I think it is better to collect all failures first, then call pytest.fail() once at the end with a combined message.

Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@github-actions github-actions bot requested a review from yxieca March 20, 2026 23:08
@ravaliyel
Contributor Author

Hi @yxieca, thank you for the review. I have addressed comments 1 and 2: the code now collects all the gating failures, returns them, and appends them to the existing verification_errors list.
I have also removed the f-string and pytest dependencies from the gating logic.

@vaibhavhd mentioned that comments 3 and 4 can be addressed in future PRs as extensions to this gating logic. Thank you.
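
The resulting flow, as described in this comment, might be sketched like this. `verification_errors` is named in the comment; `controlplane_gating` here is a simplified stand-in with an assumed input shape, and the message text imitates the log lines in the PR description.

```python
def controlplane_gating(gating_input):
    """Simplified stand-in: return failure strings and never raise."""
    failures = []
    wiggle = gating_input["wiggle"]
    for label in ("LACP", "BGP"):
        val = gating_input.get(label)
        avg = gating_input["avg"].get(label)
        if val is not None and avg is not None and val > avg + wiggle:
            # str.format() instead of an f-string, matching the comment above.
            failures.append(
                "{} session recovery {}s exceeded allowed threshold "
                "(AVG + wiggle room): {:.2f}s + {:.2f}s".format(label, val, avg, wiggle))
    return failures


verification_errors = ["existing post-reboot error"]  # placeholder entry
gating_input = {  # assumed shape, for illustration only
    "LACP": 220,
    "BGP": 25,
    "avg": {"LACP": 30.0, "BGP": 30.0},
    "wiggle": 10.0,
}
verification_errors.extend(controlplane_gating(gating_input))
```

The gating helper stays free of any test-framework import, and the caller decides how the accumulated errors fail the test.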

@ravaliyel
Contributor Author

>   • Collect all failures first, then call pytest.fail() once at the end with a combined message
>
> @ravaliyel, comments 1 and 2 seem valid. Can you fix them? I think it is better to collect all failures first, then call pytest.fail() once at the end with a combined message.

@vaibhavhd I have updated the logic to collect all failures and then fail with a combined message. I have also added the test results to the description. Please review, thank you.

@vaibhavhd vaibhavhd merged commit 8c54bb1 into sonic-net:master Mar 23, 2026
17 checks passed
ravaliyel added a commit to ravaliyel/sonic-mgmt that referenced this pull request Mar 27, 2026
…e and police the protocol convergence timings (sonic-net#23114)

Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>
6 participants