Wait for ASIC_DB confirmation after disabling LAG members #22849

Open
yxieca wants to merge 6 commits into sonic-net:master from yxieca:fix/lag-member-await-config

Conversation

@yxieca
Collaborator

@yxieca yxieca commented Mar 10, 2026

Description of PR

Summary:
After swssconfig disables LAG members, the command returns before orchagent/syncd finishes applying the configuration to ASIC_DB. This race causes intermittent test failures in test_lag_member_forwarding_packets because traffic verification runs before hardware actually disables the LAG members (~280ms gap observed).

Added a wait_until poll on ASIC_DB to confirm all LAG members have SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=true before proceeding with traffic tests.

Fixes #17095

Type of change

  • Bug fix
  • Testbed and Framework (new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Approach

What is the motivation for this PR?

swssconfig returns immediately after writing to APP_DB, but orchagent has not yet processed the request and applied it via syncd to ASIC_DB/hardware. The test sends traffic immediately after swssconfig returns, hitting a race window where LAG members are still active. This causes test_lag_member_forwarding_packets to fail intermittently on hardware platforms.

How did you do it?

Added a wait_until(10, 0.5, 0, ...) poll after swssconfig that checks ASIC_DB for SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=true on all LAG member entries. This is the definitive signal that orchagent→syncd has fully applied the disable to hardware. The 0.5s poll interval with 10s timeout is generous — real-world logs show the gap is ~280ms.
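The pattern described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code: `wait_until` is a local stand-in for sonic-mgmt's `tests.common.utilities.wait_until` (same `(timeout, interval, delay, condition, *args)` calling convention), and `all_lag_members_egress_disabled` plus the dict standing in for ASIC_DB are hypothetical.

```python
import time

def wait_until(timeout, interval, delay, condition, *args):
    # Minimal stand-in for sonic-mgmt's tests.common.utilities.wait_until:
    # after an initial delay, poll `condition` every `interval` seconds
    # until it returns True or `timeout` seconds elapse.
    time.sleep(delay)
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition(*args):
            return True
        time.sleep(interval)
    return False

def all_lag_members_egress_disabled(asic_db, member_oids):
    # Hypothetical checker: every tracked LAG member entry in ASIC_DB
    # must report SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE == "true".
    for oid in member_oids:
        key = "ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER:" + oid
        attrs = asic_db.get(key, {})
        if attrs.get("SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE") != "true":
            return False
    return True

# Simulated ASIC_DB snapshot after orchagent/syncd applied the disable.
asic_db = {
    "ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b0621":
        {"SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE": "true"},
    "ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b0622":
        {"SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE": "true"},
}
members = ["oid:0x1b0621", "oid:0x1b0622"]
assert wait_until(10, 0.5, 0, all_lag_members_egress_disabled, asic_db, members)
```

On a real DUT the condition function would read ASIC_DB via redis rather than a dict; the control flow (poll until the egress-disable attribute is observed, fail the test on timeout) is the point.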

How did you verify/test it?

  • Code review of the race condition timing from issue logs
  • flake8 lint pass (max-line-length=120)
  • The fix uses the same wait_until pattern used extensively throughout sonic-mgmt

Any platform specific information?

The VS (virtual switch) platform stores EGRESS_DISABLE in ASIC_DB but does not enforce it in the kernel dataplane — the existing VS skip block runs after this new wait, so VS tests still skip traffic verification as before.

Supported testbed topology if it's a new test case?

N/A (bug fix)

Documentation

N/A

After swssconfig disables LAG members, the command returns before
orchagent/syncd finishes applying the configuration to ASIC_DB.
This race condition causes subsequent traffic verification to fail
intermittently because packets are still forwarded through the LAG
members that haven't been disabled yet in hardware.

Add a wait_until poll on ASIC_DB to confirm all LAG members have
SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE set to true before proceeding
with traffic tests.

Fixes sonic-net#17095

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@yxieca
Collaborator Author

yxieca commented Mar 10, 2026

This PR was raised by an AI agent on behalf of Ying Xie.

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

VS SAI does not populate SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE in
ASIC_DB, causing the wait_until check to timeout. Move the VS
early-return before the ASIC_DB poll since VS doesn't enforce
LAG member disable in the dataplane anyway.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yxieca
Collaborator Author

yxieca commented Mar 10, 2026

Local KVM Test Results (commit 0a01830)

Test: pc/test_lag_member_forwarding.py on KVM T0 (vms-kvm-t0, vlab-01)

1 passed in 239.26s (0:03:59)

Fix in this commit: Moved VS early-return before the ASIC_DB wait — VS SAI does not populate SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE in ASIC_DB, so the wait_until check was timing out. On real hardware, the ASIC_DB poll confirms config is applied before traffic verification.

The previous check collected all LAG members across all LAGs and
verified EGRESS_DISABLE=true on all of them. This would always fail
when other LAGs exist that were not disabled.

Fix: look up the SAI OID for the specific PortChannel under test via
COUNTERS_LAG_NAME_MAP, then filter LAG_MEMBER entries to only those
belonging to that LAG.

Also move the ASIC_DB check before the VS skip so it runs on VS too
(VS SAI does populate ASIC_DB, it just doesn't enforce disable in
the dataplane).

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
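The per-LAG filtering this commit describes can be sketched as below. This is an illustrative stand-in, not the PR's code: `members_of_lag` is a hypothetical helper, and on a real switch `lag_oid` would come from a `COUNTERS_LAG_NAME_MAP` lookup for the PortChannel under test rather than being hard-coded.

```python
def members_of_lag(asic_db, lag_oid):
    # Hypothetical filter: keep only LAG_MEMBER entries whose
    # SAI_LAG_MEMBER_ATTR_LAG_ID matches the target PortChannel's SAI OID,
    # so members of unrelated LAGs are not checked.
    prefix = "ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER:"
    return [key for key, attrs in asic_db.items()
            if key.startswith(prefix)
            and attrs.get("SAI_LAG_MEMBER_ATTR_LAG_ID") == lag_oid]

# Simulated ASIC_DB with two LAGs; only 0x2a01's members should match.
asic_db = {
    "ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b01":
        {"SAI_LAG_MEMBER_ATTR_LAG_ID": "oid:0x2a01"},
    "ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b02":
        {"SAI_LAG_MEMBER_ATTR_LAG_ID": "oid:0x2a01"},
    "ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b03":
        {"SAI_LAG_MEMBER_ATTR_LAG_ID": "oid:0x2a02"},  # other LAG, must be ignored
}
assert len(members_of_lag(asic_db, "oid:0x2a01")) == 2
```

Without the `SAI_LAG_MEMBER_ATTR_LAG_ID` filter, the third entry above would be included and the EGRESS_DISABLE check would fail whenever another LAG stays enabled, which is exactly the bug the commit describes.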
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

The fallback checked all LAG members across all LAGs, which is the
bug we are fixing. Fail explicitly instead.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Instead of blindly picking the first two PortChannels, iterate and
find a pair where all BGP neighbors (v4 and v6) are established.
Skip the test if two suitable PortChannels cannot be found.

This avoids failures on testbeds where some PortChannels have
permanently idle IPv6 sessions.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
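The selection logic in this commit can be sketched as follows. This is a hedged illustration: `pick_established_portchannels`, the neighbor lists, and the state strings are hypothetical stand-ins for what the test would read from the DUT's BGP facts.

```python
def pick_established_portchannels(portchannels, bgp_state, count=2):
    # Hypothetical selection: walk the PortChannels in order and keep the
    # first `count` whose every BGP neighbor (v4 and v6) is Established.
    # Return None if not enough suitable PortChannels exist, so the
    # caller can skip the test instead of failing on idle sessions.
    chosen = []
    for pc, neighbors in portchannels.items():
        if all(bgp_state.get(n) == "Established" for n in neighbors):
            chosen.append(pc)
        if len(chosen) == count:
            return chosen
    return None

# Simulated testbed: PortChannel102 has a permanently idle IPv6 session.
portchannels = {
    "PortChannel101": ["10.0.0.57", "fc00::72"],
    "PortChannel102": ["10.0.0.59", "fc00::76"],
    "PortChannel103": ["10.0.0.61", "fc00::7a"],
}
bgp_state = {
    "10.0.0.57": "Established", "fc00::72": "Established",
    "10.0.0.59": "Established", "fc00::76": "Idle",
    "10.0.0.61": "Established", "fc00::7a": "Established",
}
assert pick_established_portchannels(portchannels, bgp_state) == \
    ["PortChannel101", "PortChannel103"]
```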
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

COUNTERS_LAG_NAME_MAP may not have entries for all PortChannels on
converged-peer testbeds. Instead, look up the SAI OIDs of the member
ports via COUNTERS_PORT_NAME_MAP and match them against
SAI_LAG_MEMBER_ATTR_PORT_ID in ASIC_DB. This works regardless of
whether the LAG itself has a counter entry.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
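The port-OID based matching this commit switches to can be sketched as below. Again a hypothetical stand-in: on a real switch `port_name_map` would be read from `COUNTERS_PORT_NAME_MAP` in COUNTERS_DB, and `lag_member_keys_for_ports` is an illustrative helper, not the PR's function.

```python
def lag_member_keys_for_ports(asic_db, port_name_map, port_names):
    # Hypothetical lookup: translate member port names to SAI port OIDs
    # (COUNTERS_PORT_NAME_MAP on a real DUT), then return the ASIC_DB
    # LAG_MEMBER keys whose SAI_LAG_MEMBER_ATTR_PORT_ID is one of those
    # OIDs. This avoids depending on COUNTERS_LAG_NAME_MAP, which may
    # lack entries on converged-peer testbeds.
    port_oids = {port_name_map[p] for p in port_names}
    prefix = "ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER:"
    return [key for key, attrs in asic_db.items()
            if key.startswith(prefix)
            and attrs.get("SAI_LAG_MEMBER_ATTR_PORT_ID") in port_oids]

# Simulated maps: Ethernet8 belongs to the LAG under test.
port_name_map = {"Ethernet8": "oid:0x100a", "Ethernet12": "oid:0x100b"}
asic_db = {
    "ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b01":
        {"SAI_LAG_MEMBER_ATTR_PORT_ID": "oid:0x100a"},
    "ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b02":
        {"SAI_LAG_MEMBER_ATTR_PORT_ID": "oid:0x100c"},  # port of another LAG
}
assert lag_member_keys_for_ports(asic_db, port_name_map, ["Ethernet8"]) == \
    ["ASIC_STATE:SAI_OBJECT_TYPE_LAG_MEMBER:oid:0x1b01"]
```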
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@yxieca-admin

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


Development

Successfully merging this pull request may close these issues.

[Test Gap][LAG][t0-64] Applying lag members configuration is not awaited
