
Fix BGP scale test reliability: route validation, pipeline drain, and… #23209

Open

rejithomas-arista wants to merge 1 commit into sonic-net:master from rejithomas-arista:master-fix-bgp-scale

Conversation

@rejithomas-arista

Description of PR

Summary:
Fix three reliability issues in test_ipv6_bgp_scale and announce_routes that cause false failures on large-scale topologies.

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

BGP scale tests fail intermittently on large topologies due to three independent issues:

  1. Route count mismatch: fib_t1_lag() overwrites topo_routes[k] using = for BGP_SCALE_T1S topologies, discarding shared ECMP routes from the earlier block
  2. False packet loss: calculate_downtime() uses a fixed 10s sleep before reading rx counters, but the nn_agent pipeline can take 12-17s to drain on loaded systems
  3. Silent traffic thread crash: send_packets() doesn't handle nanomsg timeout exceptions. Under CPU starvation (50+ load on 12 cores with 1024 exaBGP processes), nn_agent can't drain its socket fast enough, so sends time out and the traffic thread dies silently
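A minimal illustration of issue 1, assuming topo_routes is a dict of per-topology route dicts keyed by address family. The key names and route values below are made up for the demonstration; the real fib_t1_lag() structures may differ.

```python
# Hypothetical data shapes mirroring the PR description, not the actual test code.
topo_routes = {"t1-lag": {"IPV4": ["10.0.0.0/24"]}}  # shared ECMP route from the earlier block
routes_v4 = ["10.0.1.0/24"]                          # routes produced by the main loop
k = "t1-lag"

# Buggy: '=' overwrites the entry, discarding the shared ECMP route
topo_routes[k]["IPV4"] = routes_v4
assert topo_routes[k]["IPV4"] == ["10.0.1.0/24"]

# Fixed: append, preserving both sets of routes
topo_routes = {"t1-lag": {"IPV4": ["10.0.0.0/24"]}}
topo_routes[k]["IPV4"] = topo_routes[k].get("IPV4", []) + routes_v4
assert topo_routes[k]["IPV4"] == ["10.0.0.0/24", "10.0.1.0/24"]
```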

How did you do it?

  1. Changed topo_routes[k][IPV4] = routes_v4 to topo_routes[k][IPV4] = topo_routes[k].get(IPV4, []) + routes_v4 to append instead of overwrite (same for IPv6)
  2. Replaced fixed time.sleep(MASK_COUNTER_WAIT_TIME) with wait_for_rx_quiescence() that polls until rx counters stabilize for 10 consecutive seconds
  3. Added _send_with_retry() wrapper that catches nanomsg timeout and retries up to 10 times with exponential backoff (0.1s base, capped at 1.6s). Also logs retry failure count.
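As a sketch of the quiescence poll in item 2 — the function name wait_for_rx_quiescence comes from the PR, but its signature, the read_rx_counters callable, and the injectable sleep parameter are assumptions made here for illustration:

```python
import time

def wait_for_rx_quiescence(read_rx_counters, stable_secs=10, timeout=60,
                           poll_interval=1, sleep=time.sleep):
    """Poll rx counters until unchanged for stable_secs consecutive seconds.

    Returns the stable counter snapshot, or raises TimeoutError if the
    pipeline never drains within timeout seconds. (Sketch only; the real
    helper's interface may differ.)
    """
    last = read_rx_counters()
    stable = 0
    waited = 0
    while waited < timeout:
        sleep(poll_interval)
        waited += poll_interval
        current = read_rx_counters()
        if current == last:
            stable += poll_interval
            if stable >= stable_secs:
                return current
        else:
            # Counters still moving: reset the stability window
            stable = 0
            last = current
    raise TimeoutError("rx counters never stabilized within %ds" % timeout)
```

Unlike the fixed 10s sleep, this bounds the wait by actual pipeline behavior, so a 12-17s drain no longer gets misread as packet loss.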

How did you verify/test it?

Ran BGP scale tests on VPP t1-lag topology (32 peers, 512 ports). Without the fixes: intermittent failures with truncated TX counts and incorrect downtime calculations. With the fixes: consistent passes.

Any platform specific information?

Issue 1 (route overwrite) affects any platform using BGP_SCALE_T1S topologies. Issues 2 and 3 are more pronounced on VPP and other platforms with slower route programming or high CPU load.

Supported testbed topology if it's a new test case?

N/A — existing test fixes. Tested on t1-lag.

Documentation

N/A

… send retry

Three fixes for test_ipv6_bgp_scale and related BGP scale tests:

1. Fix announce_routes topo_routes overwrite: In fib_t1_lag(), for
   BGP_SCALE_T1S topologies, the main loop overwrites topo_routes[k]
   using '=' which discards shared ECMP routes from the BGP_SCALE_T1S
   block. Use append instead to match routes_to_change behavior.

2. Fix calculate_downtime pipeline drain: Replace fixed 10s sleep with
   wait_for_rx_quiescence() that polls until counters stabilize. Also
   remove incorrect [:-1] slice on rx values that drops a valid port.

3. Fix send_packets nanomsg timeout: Add _send_with_retry() to handle
   NNError timeout during CPU starvation on large topologies (512 ports,
   1024 exaBGP processes). Retry up to 10 times with exponential backoff.
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

