Fix BGP scale test reliability: route validation, pipeline drain, and… #23209
Open
rejithomas-arista wants to merge 1 commit into sonic-net:master
… send retry

Three fixes for test_ipv6_bgp_scale and related BGP scale tests:

1. Fix announce_routes topo_routes overwrite: In fib_t1_lag(), for BGP_SCALE_T1S topologies, the main loop overwrites topo_routes[k] using '=', which discards shared ECMP routes from the BGP_SCALE_T1S block. Use append instead to match routes_to_change behavior.
2. Fix calculate_downtime pipeline drain: Replace fixed 10s sleep with wait_for_rx_quiescence() that polls until counters stabilize. Also remove incorrect [:-1] slice on rx values that drops a valid port.
3. Fix send_packets nanomsg timeout: Add _send_with_retry() to handle NNError timeout during CPU starvation on large topologies (512 ports, 1024 exaBGP processes). Retry up to 10 times with exponential backoff.
Collaborator
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
Description of PR
Summary:
Fix three reliability issues in test_ipv6_bgp_scale and announce_routes that cause false failures on large-scale topologies.
Type of change
Back port request
Approach
What is the motivation for this PR?
BGP scale tests fail intermittently on large topologies due to three independent issues:
1. fib_t1_lag() overwrites topo_routes[k] using '=' for BGP_SCALE_T1S topologies, discarding shared ECMP routes from the earlier block.
2. calculate_downtime() uses a fixed 10s sleep before reading rx counters, but the nn_agent pipeline can take 12-17s to drain on loaded systems.
3. send_packets() doesn't handle nanomsg timeout exceptions: under CPU starvation (50+ load on 12 cores with 1024 exaBGP processes), nn_agent can't drain its socket fast enough.
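To illustrate issue 1, here is a minimal sketch of how plain assignment drops routes stored earlier, and how the append form keeps them. This is not the actual fib_t1_lag() code: the IPV4 key value, the helper names, and the sample prefixes are assumptions made up for the example.

```python
IPV4 = "ipv4"  # hypothetical stand-in for the real constant


def add_routes_overwrite(topo_routes, k, routes_v4):
    # Old behavior: '=' replaces whatever was stored for this key.
    topo_routes.setdefault(k, {})[IPV4] = routes_v4


def add_routes_append(topo_routes, k, routes_v4):
    # New behavior: keep earlier routes and append the new ones.
    entry = topo_routes.setdefault(k, {})
    entry[IPV4] = entry.get(IPV4, []) + routes_v4


shared_ecmp = ["10.0.0.0/24"]  # added by an earlier block
per_peer = ["10.1.0.0/24"]     # added by the main loop

old = {"peer0": {IPV4: list(shared_ecmp)}}
add_routes_overwrite(old, "peer0", per_peer)

new = {"peer0": {IPV4: list(shared_ecmp)}}
add_routes_append(new, "peer0", per_peer)

print(old["peer0"][IPV4])  # shared ECMP route is gone
print(new["peer0"][IPV4])  # both routes are present
```

With the overwrite form, any validation that expects the shared ECMP prefixes would see them as missing even though they were announced.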
How did you do it?
1. Changed the assignment topo_routes[k][IPV4] = routes_v4 to use topo_routes[k].get(IPV4, []) + routes_v4, so new routes are appended instead of overwriting earlier ones (same for IPv6).
2. Replaced time.sleep(MASK_COUNTER_WAIT_TIME) with wait_for_rx_quiescence(), which polls until rx counters stabilize for 10 consecutive seconds.
3. Added a _send_with_retry() wrapper that catches the nanomsg timeout and retries up to 10 times with exponential backoff (0.1s base, capped at 1.6s). It also logs the retry failure count.
How did you verify/test it?
Ran BGP scale tests on VPP t1-lag topology (32 peers, 512 ports). Without fixes: intermittent failures with truncated TX counts and incorrect downtime calculations. With fixes: consistent pass.
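For reviewers who want a feel for the quiescence approach that replaces the fixed sleep, the sketch below polls a counter reader until the value stops changing for a set window. It is an illustration only: the function name wait_until_stable, its parameters, and the timeout behavior are assumptions, not the signature of wait_for_rx_quiescence() in this PR.

```python
import time


def wait_until_stable(read_counters, stable_secs=10, timeout=60,
                      poll_interval=1, clock=time.monotonic,
                      sleep=time.sleep):
    """Poll read_counters() until its value is unchanged for stable_secs.

    Returns True once the counters have been stable long enough,
    False if the timeout expires first. (Hypothetical helper.)
    """
    deadline = clock() + timeout
    last = read_counters()
    stable_since = clock()
    while clock() < deadline:
        if clock() - stable_since >= stable_secs:
            return True
        sleep(poll_interval)
        current = read_counters()
        if current != last:
            # Packets are still draining: restart the stability window.
            last = current
            stable_since = clock()
    return clock() - stable_since >= stable_secs
```

Unlike a fixed sleep, the wait here scales with how long the pipeline actually takes to drain, so a 12-17s drain no longer cuts the counter read short, while a fast system still returns after one stability window.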
Any platform specific information?
Issue 1 (route overwrite) affects any platform using BGP_SCALE_T1S topologies. Issues 2 and 3 are more pronounced on VPP and other platforms with slower route programming or high CPU load.
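As a rough sketch of the retry schedule described for issue 3 (0.1s base, doubling, capped at 1.6s, up to 10 retries), the example below wraps a send callable with exponential backoff. It does not import nanomsg: the built-in TimeoutError stands in for the NNError timeout, and the names send_with_retry and backoff_delays are hypothetical. It also assumes "up to 10 times" means one initial attempt plus up to 10 retries.

```python
import time


def backoff_delays(retries=10, base=0.1, cap=1.6):
    """Delay before each retry: base doubles each time, clamped at cap."""
    return [min(base * (2 ** i), cap) for i in range(retries)]


def send_with_retry(send, payload, retries=10, base=0.1, cap=1.6,
                    retry_on=(TimeoutError,), sleep=time.sleep):
    """Call send(payload); on a retry_on exception, back off and retry.

    Re-raises the last exception once the retry budget is exhausted.
    (Hypothetical helper; TimeoutError is a stand-in for the real error.)
    """
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return send(payload)
        except retry_on:
            if attempt == retries:
                raise
            sleep(delays[attempt])


print(backoff_delays())
```

Under these assumptions the worst-case added wait per packet batch is the sum of the schedule, about 12.7s, which bounds how long a starved receiver is given to catch up before the failure is surfaced.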
Supported testbed topology if it's a new test case?
N/A — existing test fixes. Tested on t1-lag.
Documentation
N/A