
[action] [PR:21523] Feature/route programming data #974

Merged
mssonicbld merged 1 commit into Azure:202412 from mssonicbld:cherry/msft-202412/21523
Jan 23, 2026

Conversation

@mssonicbld
Collaborator

Description of PR

Summary:
Fixes # (issue)
This PR adds test cases and supporting utilities to measure route programming time under high‑scale BGP IPv6 scenarios, building on the refactoring in PR #21335. It focuses on quantifying how long it takes for routes to be fully programmed after BGP/connection events (e.g., convergence, admin flaps), and on verifying that the measured route programming (RP) times stay within expected limits and remain comparable to the convergence time.

Key Points:

  • Introduces new tests that:
    • Trigger BGP route updates (via scale, convergence, and flap scenarios) and
    • Measure end‑to‑end route programming time from control‑plane event to data‑plane readiness.
  • Adds helper logic to:
    • Capture timestamps and event counts around BGP state/route changes and data‑plane verification,
    • Collect and aggregate per‑iteration route programming statistics,
    • Expose pass/fail criteria based on configured thresholds.
  • Reuses the generalized connection and flap mechanisms introduced in PR #21335 to generate realistic route programming events at scale.
  • Enhances logging so that per‑iteration programming times are visible for debugging and trend analysis.
  • Keeps tests parameter‑driven to allow different scales, iteration counts, and thresholds without duplicating logic.
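The end‑to‑end measurement described in the key points above can be sketched as follows. This is a minimal illustration with hypothetical helper names, not the actual test code (which lives in `tests/bgp/test_ipv6_bgp_scale.py`); the callbacks stand in for the real trigger and data‑plane verification logic:

```python
import time


def measure_route_programming_time(trigger_event, verify_programmed,
                                   poll_interval=1.0, timeout=300.0):
    """Return the delta (seconds) between a control-plane action and
    data-plane readiness, i.e. routes fully programmed and forwarding.

    trigger_event:     callable that performs the control-plane action
                       (e.g. admin no-shut, route announcement).
    verify_programmed: callable returning True once the data plane
                       confirms all routes are programmed.
    """
    start = time.monotonic()          # monotonic clock for interval timing
    trigger_event()
    deadline = start + timeout
    while time.monotonic() < deadline:
        if verify_programmed():
            return time.monotonic() - start
        time.sleep(poll_interval)
    raise TimeoutError("routes not fully programmed within %.0fs" % timeout)
```

In the real tests the verification callback would poll the data plane (e.g., traffic checks) rather than return immediately; the timestamp‑delta structure is the same.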

Type of change

  • [ ] Bug fix
  • [ ] Testbed and Framework (new/improvement)
  • [ ] New Test case
    • [ ] Skipped for non-supported platforms
  • [x] Test case improvement

Back port request

  • [ ] 202205
  • [ ] 202305
  • [ ] 202311
  • [ ] 202405
  • [ ] 202411
  • [ ] 202505

Approach

What is the motivation for this PR?

After adding flexible convergence and flap coverage, we also need quantitative visibility into route programming performance. Specifically:

  • How long does it take for routes to be fully programmed after:
    • Initial BGP convergence under scale,
    • BGP admin flaps or port flaps,
    • Large‑scale route updates?
  • How many NextHopGroup and Route events are generated during these transitions?
  • Do these programming times remain within agreed targets across iterations and scales?
  • Can we detect regressions or platform‑specific performance issues in route programming latency?

How did you do it?

  • Added new route‑programming‑time test flows (in tests/bgp/test_ipv6_bgp_scale.py) that:
    • Use existing mechanisms to:
      • Establish BGP sessions and advertise a high scale of IPv6 routes.
      • Trigger events (e.g., convergence, admin flap cycles) using the refactored connection APIs.
    • Around each event, record:
      • The time when the control‑plane action is taken (e.g., admin shut/no‑shut, route announcement),
      • The time when verification confirms that routes are fully programmed and forwarding is working.
    • Compute route programming time as the delta between these timestamps.
    • Record the number of NextHopGroup and Route events.
  • Implemented helper functions to:
    • Configure thresholds for acceptable route programming time,
    • Aggregate and report per‑iteration statistics,
    • Mark the test as failed if the measured times exceed configured limits.
  • Reused logging and verification utilities from the refactored BGP test infrastructure so that:
    • The tests share as much code as possible with the convergence and flap tests,
    • Route programming time measurements can be correlated with existing convergence logs.
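The threshold and aggregation helpers described above can be sketched as follows. This is an illustrative outline under assumed names (`check_programming_times` and its return shape are not from the PR); the real helpers also feed the existing logging infrastructure:

```python
from statistics import mean


def check_programming_times(times, max_allowed):
    """Aggregate per-iteration route programming times and apply a
    pass/fail threshold.

    times:       list of measured programming times (seconds), one per
                 iteration.
    max_allowed: configured upper limit for an acceptable time.

    Returns (passed, stats): passed is False if any iteration exceeded
    the limit; stats carries the per-run summary for logging.
    """
    stats = {
        "iterations": len(times),
        "min": min(times),
        "max": max(times),
        "avg": mean(times),
    }
    over_limit = [t for t in times if t > max_allowed]
    return not over_limit, stats
```

A test flow would collect one measurement per iteration, call this helper once at the end, log `stats` for trend analysis, and fail the test when `passed` is False.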

How did you verify/test it?

  • Executed the new route programming time tests in the same environment used for the high‑scale BGP and flap tests:
    • Topology: t0-isolated-d2u510s2
    • Platform: Broadcom Arista-7060X6-64PE-B-C512S2
  • Verified that:
    • Route programming times are collected and logged for each iteration,
    • Measured values are stable and within expected thresholds under high scale,
    • No unexpected failures or long‑tail outliers appear during repeated runs.

Any platform specific information?

  • Verified on Broadcom Arista-7060X6-64PE-B-C512S2.
  • The test logic itself is platform‑agnostic, but measured route programming times are naturally platform and scale dependent.

Supported testbed topology if it's a new test case?

Documentation

@mssonicbld
Collaborator Author

Original PR: sonic-net/sonic-mgmt#21523

@mssonicbld
Collaborator Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld mssonicbld merged commit 51b0525 into Azure:202412 Jan 23, 2026
10 of 14 checks passed
