retry the chassis db cleanup operations #24219

Merged
rlhui merged 1 commit into master from dev/arlakshm/retry_cleanup on Oct 16, 2025

@arlakshm
Contributor

@arlakshm arlakshm commented Oct 7, 2025

Why I did it

When load_minigraph is run or the configuration is reloaded on a linecard, interface-config.service restarts, which causes the midplane interface to flap. If swss.sh on the linecard is deleting state from chassis_db while the interface flaps, some entries may be removed successfully while others are left behind. For example, cleanup of SYSTEM_NEIGHBOR or SYSTEM_INTF may fail while cleanup of SYSTEM_LAG succeeds. This can lead to inconsistent LAG IDs for the remote LC.

Work item tracking
  • Microsoft ADO 35454463

How I did it

Add retry logic to the chassis_db cleanup in the swss.sh script.
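
For illustration only, a bounded retry around a single command can be sketched in bash as below. This is not the code merged in this PR: the function name `retry_cmd`, its arguments, and the attempt and delay values are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper, NOT the swss.sh change from this PR.
# Usage: retry_cmd <max_attempts> <delay_seconds> <command> [args...]
# Runs the command until it exits 0 or the attempt budget is used up.
retry_cmd() {
    local max_attempts=$1
    local delay=$2
    shift 2
    local attempt=1
    until "$@"; do
        if (( attempt >= max_attempts )); then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$(( attempt + 1 ))
        sleep "$delay"
    done
    return 0
}
```

A caller would wrap each cleanup step, for example `retry_cmd 5 1 some_cleanup_step`, where `some_cleanup_step` stands in for whatever command performs the deletion. The wrapped command has to report failure through its exit status for the retry to trigger.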

How to verify it

Run a test that performs load_minigraph on all the linecards, then check the logs for remove-LAG failures for LAGs on the remote LC.

Which release branch to backport (provide reason below if selected)

  • 202205
  • 202211
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@arlakshm arlakshm marked this pull request as ready for review October 7, 2025 23:32
@arlakshm arlakshm requested a review from lguohan as a code owner October 7, 2025 23:32
@rlhui rlhui added the P0 Priority of the issue label Oct 9, 2025
@arlakshm
Contributor Author

@saksarav-nokia @ysmanman, @mlok-nokia, @arista-nwolfe, please help review...

@saksarav-nokia
Contributor

saksarav-nokia commented Oct 14, 2025

@arlakshm, the code changes LGTM. However, I am unable to understand when we could hit this issue. If the IMM can't reach CHASSIS_APP_DB, then the following code waits until it is reachable, right?
    until [[ $($SONIC_DB_CLI CHASSIS_APP_DB PING | grep -c True) -gt 0 ]]; do
        sleep 1
    done

Also, we fixed something in this area a while ago with PR https://github.com/sonic-net/sonic-buildimage/pull/18756

Contributor

@deepak-singhal0408 deepak-singhal0408 left a comment

LGTM.

@arlakshm
Contributor Author

Hi @saksarav-nokia, during load_minigraph the networking service is restarted in parallel, so the midplane interface can flap after this check has already passed: until [[ $($SONIC_DB_CLI CHASSIS_APP_DB PING | grep -c True) -gt 0 ]]; do sleep 1; done. In that case some of the cleanup in chassis_db is not done. This PR fixes that case.
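
To make the race concrete: a reachability check that passes once says nothing about the operations that follow it. One way to handle this, sketched below under assumptions, is to go back to the reachability wait whenever a cleanup step fails and then rerun the steps. The function `run_steps_with_recheck` and its arguments are hypothetical and are not taken from swss.sh; the steps are assumed to be idempotent, since they are rerun from the beginning.

```bash
#!/usr/bin/env bash
# Hypothetical sketch, NOT the swss.sh change from this PR.
# Usage: run_steps_with_recheck <max_rounds> <wait_fn> <step> [step...]
# Each round first waits for the DB via <wait_fn>, then runs every step in
# order. If any step fails, the round is abandoned and a new round starts
# with a fresh reachability wait.
run_steps_with_recheck() {
    local max_rounds=$1
    local wait_fn=$2
    shift 2
    local round step failed
    for (( round = 1; round <= max_rounds; round++ )); do
        "$wait_fn" || continue          # DB not reachable, try next round
        failed=0
        for step in "$@"; do
            "$step" || { failed=1; break; }
        done
        (( failed == 0 )) && return 0   # every step succeeded in this round
    done
    return 1                            # budget exhausted with steps failing
}
```

Here `<wait_fn>` could be a function wrapping the `until ... PING ... done` loop quoted above, and each step a function performing one part of the cleanup.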

@mssonicbld
Collaborator

Cherry-pick PR to msft-202405: Azure/sonic-buildimage-msft#1730

FengPan-Frank pushed a commit to FengPan-Frank/sonic-buildimage that referenced this pull request Dec 4, 2025
When running load_minigraph or reloading configuration on the linecards, the interface-config.service restarts, which causes the midplane interface to flap. If swss.sh on the linecards deletes state from chassis_db, some states may not be cleaned up correctly, while others are successfully removed. For example, cleanup for SYSTEM_NEIGHBOR or SYSTEM_INTF may fail, but SYSTEM_LAG cleanup might succeed. This can lead to inconsistent lag IDs for the remote LC.

Signed-off-by: Arvindsrinivasan Lakshmi Narasimhan <arlakshm@microsoft.com>
Signed-off-by: Feng Pan <fenpan@microsoft.com>