
Fix BGP scale test reliability: route validation, pipeline drain, and… #23209

Open

rejithomas-arista wants to merge 1 commit into sonic-net:master from rejithomas-arista:master-fix-bgp-scale

Conversation

@rejithomas-arista

Description of PR

Summary:
Fix three reliability issues in test_ipv6_bgp_scale and announce_routes that cause false failures on large-scale topologies.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

BGP scale tests fail intermittently on large topologies due to three independent issues:

  1. Route count mismatch: fib_t1_lag() overwrites topo_routes[k] using = for BGP_SCALE_T1S topologies, discarding shared ECMP routes from the earlier block
  2. False packet loss: calculate_downtime() uses a fixed 10s sleep before reading rx counters, but the nn_agent pipeline can take 12-17s to drain on loaded systems
  3. Silent traffic thread crash: send_packets() doesn't handle nanomsg timeout exceptions. Under CPU starvation (50+ load on 12 cores with 1024 exaBGP processes), nn_agent can't drain its socket fast enough, so sends time out and the traffic thread dies silently
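A minimal illustration of issue 1, assuming topo_routes is a dict of per-topology route dicts keyed by address family. The key names and route values below are made up for the demonstration; the real fib_t1_lag() structures may differ.

```python
# Hypothetical data shapes mirroring the PR description, not the actual test code.
topo_routes = {"t1-lag": {"IPV4": ["10.0.0.0/24"]}}  # shared ECMP route from the earlier block
routes_v4 = ["10.0.1.0/24"]                          # routes produced by the main loop
k = "t1-lag"

# Buggy: '=' overwrites the entry, discarding the shared ECMP route
topo_routes[k]["IPV4"] = routes_v4
assert topo_routes[k]["IPV4"] == ["10.0.1.0/24"]

# Fixed: append, preserving both sets of routes
topo_routes = {"t1-lag": {"IPV4": ["10.0.0.0/24"]}}
topo_routes[k]["IPV4"] = topo_routes[k].get("IPV4", []) + routes_v4
assert topo_routes[k]["IPV4"] == ["10.0.0.0/24", "10.0.1.0/24"]
```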

How did you do it?

  1. Changed topo_routes[k][IPV4] = routes_v4 to topo_routes[k][IPV4] = topo_routes[k].get(IPV4, []) + routes_v4 to append instead of overwrite (same for IPv6)
  2. Replaced fixed time.sleep(MASK_COUNTER_WAIT_TIME) with wait_for_rx_quiescence() that polls until rx counters stabilize for 10 consecutive seconds
  3. Added _send_with_retry() wrapper that catches nanomsg timeout and retries up to 10 times with exponential backoff (0.1s base, capped at 1.6s). Also logs retry failure count.
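As a sketch of the quiescence poll in item 2 — the function name wait_for_rx_quiescence comes from the PR, but its signature, the read_rx_counters callable, and the injectable sleep parameter are assumptions made here for illustration:

```python
import time

def wait_for_rx_quiescence(read_rx_counters, stable_secs=10, timeout=60,
                           poll_interval=1, sleep=time.sleep):
    """Poll rx counters until unchanged for stable_secs consecutive seconds.

    Returns the stable counter snapshot, or raises TimeoutError if the
    pipeline never drains within timeout seconds. (Sketch only; the real
    helper's interface may differ.)
    """
    last = read_rx_counters()
    stable = 0
    waited = 0
    while waited < timeout:
        sleep(poll_interval)
        waited += poll_interval
        current = read_rx_counters()
        if current == last:
            stable += poll_interval
            if stable >= stable_secs:
                return current
        else:
            # Counters still moving: reset the stability window
            stable = 0
            last = current
    raise TimeoutError("rx counters never stabilized within %ds" % timeout)
```

Unlike the fixed 10s sleep, this bounds the wait by actual pipeline behavior, so a 12-17s drain no longer gets misread as packet loss.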

How did you verify/test it?

Ran BGP scale tests on VPP t1-lag topology (32 peers, 512 ports). Without the fixes: intermittent failures with truncated TX counts and incorrect downtime calculations. With the fixes: consistent passes.

Any platform specific information?

Issue 1 (route overwrite) affects any platform using BGP_SCALE_T1S topologies. Issues 2 and 3 are more pronounced on VPP and other platforms with slower route programming or high CPU load.

Supported testbed topology if it's a new test case?

N/A — existing test fixes. Tested on t1-lag.

Documentation

N/A

… send retry

Three fixes for test_ipv6_bgp_scale and related BGP scale tests:

1. Fix announce_routes topo_routes overwrite: In fib_t1_lag(), for
   BGP_SCALE_T1S topologies, the main loop overwrites topo_routes[k]
   using '=' which discards shared ECMP routes from the BGP_SCALE_T1S
   block. Use append instead to match routes_to_change behavior.

2. Fix calculate_downtime pipeline drain: Replace fixed 10s sleep with
   wait_for_rx_quiescence() that polls until counters stabilize. Also
   remove incorrect [:-1] slice on rx values that drops a valid port.

3. Fix send_packets nanomsg timeout: Add _send_with_retry() to handle
   NNError timeout during CPU starvation on large topologies (512 ports,
   1024 exaBGP processes). Retry up to 10 times with exponential backoff.
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

