Wait for ASIC_DB confirmation after disabling LAG members#22849
Open
yxieca wants to merge 6 commits intosonic-net:masterfrom
Open
Wait for ASIC_DB confirmation after disabling LAG members#22849yxieca wants to merge 6 commits intosonic-net:masterfrom
yxieca wants to merge 6 commits intosonic-net:masterfrom
Conversation
After swssconfig disables LAG members, the command returns before orchagent/syncd finishes applying the configuration to ASIC_DB. This race condition causes subsequent traffic verification to fail intermittently because packets are still forwarded through the LAG members that haven't been disabled yet in hardware. Add a wait_until poll on ASIC_DB to confirm all LAG members have SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE set to true before proceeding with traffic tests. Fixes sonic-net#17095 Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Collaborator
Author
|
This PR was raised by an AI agent on behalf of Ying Xie. |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
12 tasks
VS SAI does not populate SAI_LAG_MEMBER_ATTR_EGRESS_DISABLE in ASIC_DB, causing the wait_until check to timeout. Move the VS early-return before the ASIC_DB poll since VS doesn't enforce LAG member disable in the dataplane anyway. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
Author
Local KVM Test Results (commit 0a01830)Test: Fix in this commit: Moved VS early-return before the ASIC_DB wait — VS SAI does not populate |
The previous check collected all LAG members across all LAGs and verified EGRESS_DISABLE=true on all of them. This would always fail when other LAGs exist that were not disabled. Fix: look up the SAI OID for the specific PortChannel under test via COUNTERS_LAG_NAME_MAP, then filter LAG_MEMBER entries to only those belonging to that LAG. Also move the ASIC_DB check before the VS skip so it runs on VS too (VS SAI does populate ASIC_DB, it just doesn't enforce disable in the dataplane). Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
The fallback checked all LAG members across all LAGs, which is the bug we are fixing. Fail explicitly instead. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Instead of blindly picking the first two PortChannels, iterate and find a pair where all BGP neighbors (v4 and v6) are established. Skip the test if two suitable PortChannels cannot be found. This avoids failures on testbeds where some PortChannels have permanently idle IPv6 sessions. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
COUNTERS_LAG_NAME_MAP may not have entries for all PortChannels on converged-peer testbeds. Instead, look up the SAI OIDs of the member ports via COUNTERS_PORT_NAME_MAP and match them against SAI_LAG_MEMBER_ATTR_PORT_ID in ASIC_DB. This works regardless of whether the LAG itself has a counter entry. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description of PR
Summary:
After
swssconfigdisables LAG members, the command returns before orchagent/syncd finishes applying the configuration to ASIC_DB. This race causes intermittent test failures intest_lag_member_forwarding_packetsbecause traffic verification runs before hardware actually disables the LAG members (~280ms gap observed).Added a
wait_untilpoll on ASIC_DB to confirm all LAG members haveSAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=truebefore proceeding with traffic tests.Fixes #17095
Type of change
Back port request
Approach
What is the motivation for this PR?
swssconfigreturns immediately after writing to APP_DB, but orchagent has not yet processed the request and applied it via syncd to ASIC_DB/hardware. The test sends traffic immediately afterswssconfigreturns, hitting a race window where LAG members are still active. This causestest_lag_member_forwarding_packetsto fail intermittently on hardware platforms.How did you do it?
Added a
wait_until(10, 0.5, 0, ...)poll afterswssconfigthat checks ASIC_DB forSAI_LAG_MEMBER_ATTR_EGRESS_DISABLE=trueon all LAG member entries. This is the definitive signal that orchagent→syncd has fully applied the disable to hardware. The 0.5s poll interval with 10s timeout is generous — real-world logs show the gap is ~280ms.How did you verify/test it?
wait_untilpattern used extensively throughout sonic-mgmtAny platform specific information?
The VS (virtual switch) platform stores EGRESS_DISABLE in ASIC_DB but does not enforce it in the kernel dataplane — the existing VS skip block runs after this new wait, so VS tests still skip traffic verification as before.
Supported testbed topology if it's a new test case?
N/A (bug fix)
Documentation
N/A