[bgp/agg]: Add BGP aggregate address test cases for Config Persistence and Recovery #23347

Open
shixizhang wants to merge 2 commits into sonic-net:master from shixizhang:addbgptest-pr

Conversation

@shixizhang

Description of PR

Summary:
Add new test file test_bgp_aggregate_address_resilience.py (Test Group 5) that validates BGP aggregate-address configuration persistence and recovery across various disruption scenarios. These 5 new test cases verify that aggregate address configuration written via GCU survives BGP container restarts, config reloads, cold reboots, warm reboots, and BBR state transitions.

New test cases:

  • TC 5.1 test_aggregate_persists_bgp_container_restart: Aggregate config survives BGP container restart; CONFIG_DB + STATE_DB + FRR are consistent after recovery.
  • TC 5.2 test_aggregate_persists_config_reload: Aggregate config (with summary-only=true) survives config save + config reload.
  • TC 5.3 test_aggregate_persists_config_save_and_reboot: IPv6 aggregate config survives config save + cold reboot.
  • TC 5.4 test_aggregate_bbr_required_inactive_persists_bgp_restart: BBR-required aggregate stays inactive after BGP restart when BBR is disabled; activates once BBR is enabled.
  • TC 5.5 test_aggregate_persists_warm_reboot: Aggregate config survives warm reboot.
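
All five cases follow the same skeleton: write the aggregate via GCU, verify CONFIG_DB, disrupt the DUT, wait for recovery, verify the full stack, and clean up in a `finally` block. A minimal sketch of that shared shape (every helper name here is a hypothetical stand-in for the real fixtures in the test file, not the PR's actual code):

```python
def run_persistence_case(apply_cfg, verify_config_db, disrupt,
                         wait_recovered, verify_full_stack, cleanup):
    """Generic skeleton shared by TC 5.1-5.5 (all callables hypothetical)."""
    apply_cfg()                 # GCU write is synchronous
    verify_config_db()          # pre-disruption: CONFIG_DB only
    try:
        disrupt()               # container restart / reload / reboot / BBR flip
        assert wait_recovered(), "DUT did not recover in time"
        verify_full_stack()     # post-disruption: CONFIG_DB + STATE_DB + FRR
    finally:
        cleanup()               # graceful fallback to checkpoint rollback
```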

Type of change

  • Bug fix
  • Testbed and Framework (new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

Existing BGP aggregate-address tests cover configuration validation and route propagation behavior, but there are no tests verifying that aggregate address configuration persists across operational disruptions such as BGP container restarts, config reloads, and device reboots. This PR fills that gap by adding resilience tests that validate CONFIG_DB, STATE_DB, and FRR running-config consistency after each disruption type.

How did you do it?

  • Created test_bgp_aggregate_address_resilience.py reusing existing helpers and fixtures from test_bgp_aggregate_address.py (AggregateCfg, gcu_add_aggregate, gcu_remove_aggregate, verify_bgp_aggregate_consistence, verify_bgp_aggregate_cleanup, dump_db, and the setup_teardown checkpoint/rollback fixture).
  • Added a bgp_neighbors fixture to discover BGP neighbor IPs for session-state polling after disruptions.
  • Pre-disruption verification checks only CONFIG_DB (the GCU write is synchronous). Post-disruption verification checks the full stack (CONFIG_DB + STATE_DB + FRR) once bgpcfgd has re-processed the config.
  • Added wait_for_aggregate_state() helper to handle the asynchronous bgpcfgd STATE_DB population after disruptions.
  • All test cases include proper cleanup in finally blocks with graceful fallback to checkpoint rollback.
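
The `wait_for_aggregate_state()` helper boils down to a bounded poll; a sketch under the assumption that the STATE_DB check is passed in as a callable (the actual helper takes SONiC fixtures rather than a plain function):

```python
import time

def wait_for_aggregate_state(check_state_db, timeout=60.0, interval=5.0):
    """Poll until check_state_db() reports the aggregate entry, or time out.

    check_state_db is a hypothetical callable returning True once bgpcfgd
    has repopulated STATE_DB after the disruption.
    """
    deadline = time.monotonic() + timeout
    while True:
        if check_state_db():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```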

How did you verify/test it?

Ran all test cases on a physical m1-48 testbed with Arista EOS neighbors.

Any platform specific information?

No platform-specific dependencies. Tests use GCU for configuration and standard SONiC reboot/reload utilities, which are platform-agnostic.

Supported testbed topology if it's a new test case?

t1, m1 (declared via @pytest.mark.topology("t1", "m1"))

Documentation

Aligned with BGP-Aggregate-Address test plan

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

shixizhang enabled auto-merge (squash) March 26, 2026 11:47
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Shixi Zhang <[email protected]>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Replace safe_reboot=True with explicit service and BGP session recovery,
matching the pattern from test_bgp_session.py. safe_reboot=True calls
wait_critical_processes(), which requires every process in every container
to be healthy; this fails on VS when fpmsyncd crashes during warm reboot.

The lighter approach:
  1. reboot() handles SSH reconnect and warmboot finalizer
  2. Explicit critical_services_fully_started wait (480s, 120s initial delay)
  3. Explicit BGP session wait

This avoids the hard wait_critical_processes check while still validating
that the DUT recovers enough for aggregate address verification.
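
The three steps amount to an ordered sequence of bounded waits rather than one monolithic health check. A generic sketch of that staging (the check callables are hypothetical placeholders for the critical-services and BGP-session polls; the real test uses the 480s/120s budgets quoted above):

```python
import time

def wait_stages(stages, initial_delay=0.0):
    """Run ordered recovery checks after a reboot, e.g.:
      1. critical services fully started
      2. BGP sessions established
    Each stage is (name, check_fn, timeout_s, poll_s).
    Returns the name of the first stage to time out, or None on success."""
    time.sleep(initial_delay)
    for name, check, timeout, poll in stages:
        deadline = time.monotonic() + timeout
        while not check():
            if time.monotonic() >= deadline:
                return name
            time.sleep(poll)
    return None
```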

Co-authored-by: Copilot <[email protected]>
Signed-off-by: Shixi Zhang <[email protected]>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
