Skip to content

[teamsyncd] remove m_stateLagTablePreserved#3782

Closed
stepanblyschak wants to merge 10 commits intosonic-net:masterfrom
stepanblyschak:lag-state-immediate-set
Closed

[teamsyncd] remove m_stateLagTablePreserved#3782
stepanblyschak wants to merge 10 commits intosonic-net:masterfrom
stepanblyschak:lag-state-immediate-set

Conversation

@stepanblyschak
Copy link
Copy Markdown
Contributor

@stepanblyschak stepanblyschak commented Jul 23, 2025

Once LAG is created in the kernel set its state to STATE_DB immediately. STATE_DB flag triggers kernel LAG IP configuration, which does not need to be delayed.

Only LAG experience this behaviour, any other type of interface gets kernel IP configuration as soon as netdev is created.

What I did

Remove m_stateLagTablePreserved.

Why I did it

To not delay kernel LAG IP configuration.

Before:

2025 Jul 22 11:49:57.368571 mtvr-bison-133 NOTICE teamd#teamsyncd: :- checkWarmStart: teamsyncd doing warm start, restore count 1
2025 Jul 22 11:49:57.369866 mtvr-bison-133 INFO teamd#teamsyncd: :- onMsg:  nlmsg type:16 key:PortChannel0001 admin:1 oper:1 ifindex:4 type:team
2025 Jul 22 11:51:07.531605 mtvr-bison-133 DEBUG swss#intfmgrd: :- doIntfAddrTask: Interface is not ready, skipping PortChannel0001
2025 Jul 22 11:51:08.685359 mtvr-bison-133 NOTICE teamd#teamsyncd: :- applyState: Applying state
2025 Jul 22 11:51:10.611619 mtvr-bison-133 DEBUG swss#intfmgrd: :- exec: /sbin/ip address "add" "30.0.0.1/24" broadcast "30.0.0.255" dev "PortChannel0001" :  : Exited with rc=0

After:

2025 Jul 23 10:03:32.445081 mtvr-bison-133 NOTICE teamd#teamsyncd: :- checkWarmStart: teamsyncd doing warm start, restore count 1
2025 Jul 23 10:03:32.445879 mtvr-bison-133 INFO teamd#teamsyncd: :- onMsg:  nlmsg type:16 key:PortChannel0001 admin:1 oper:1 ifindex:4 type:team
2025 Jul 23 10:03:37.237107 mtvr-bison-133 DEBUG swss#intfmgrd: :- exec: /sbin/ip address "add" "30.0.0.1/24" broadcast "30.0.0.255" dev "PortChannel0001" :  : Exited with rc=0
2025 Jul 23 10:04:44.369749 mtvr-bison-133 NOTICE teamd#teamsyncd: :- applyState: Applying state

How I verified it

Ran following tests on Mellanox-SN4600C-C64 (master.0-e7331e3cb_Internal):

  • warm-reboot
  • warm-reboot-sad
  • warm-reboot-multi-sad-lag
  • warm-reboot-sad-lag-member
  • fast-reboot

Merge after - sonic-net/sonic-buildimage#24174

Details if related

Once LAG is created in the kernel set its state to STATE_DB immediately.
STATE_DB flag triggers kernel LAG IP configuration, which does not need
to be delayed.

Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
Signed-off-by: Stepan Blyschak <stepanb@nvidia.com>
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak stepanblyschak changed the title Lag state immediate set [teamsyncd] remove m_stateLagTablePreserved Jul 23, 2025
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@prsunny
Copy link
Copy Markdown
Collaborator

prsunny commented Jul 23, 2025

@stepanblyschak , can you point to the PR that added this feature. IIRC, this was done for some WB corner cases and seems to be an impactful change. @vaibhavhd for viz

@vaibhavhd
Copy link
Copy Markdown
Contributor

@stepanblyschak is the lack of this change the reason that portchannels are coming as OPER DOWN after warm-reboot to 202505?

We are trying to triage that issue, and we have reasons to suspect that the regression started after this PR #3563

Issue - sonic-net/sonic-buildimage#23347

@stepanblyschak
Copy link
Copy Markdown
Contributor Author

@prsunny Since day 1 - #725

What is that corner case?

I ran all relevant tests I know:

  • warm-reboot
  • warm-reboot-sad
  • warm-reboot-multi-sad-lag
  • warm-reboot-sad-lag-member
  • fast-reboot

Sad path LAG cases should cover teamsyncd reconciliation flow.
If you know tests cases I missed please let me know I will run.

@stepanblyschak
Copy link
Copy Markdown
Contributor Author

@stepanblyschak is the lack of this change the reason that portchannels are coming as OPER DOWN after warm-reboot to 202505?

We are trying to triage that issue, and we have reasons to suspect that the regression started after this PR #3563

Issue - sonic-net/sonic-buildimage#23347

@vaibhavhd no it's different, I see what's the issue, working on it.

@stepanblyschak
Copy link
Copy Markdown
Contributor Author

/azpw run

@mssonicbld
Copy link
Copy Markdown
Collaborator

/AzurePipelines run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@prsunny prsunny requested a review from saiarcot895 August 11, 2025 21:49
@stepanblyschak
Copy link
Copy Markdown
Contributor Author

During continuous testing of warm-reboot it was found that this delay is intended to postpone establishment of BGP sessions with upstream devices. Otherwise, if we have downstream BGP sessions which are not yet established and the device sends incomplete routing table and then sends EOIU marker to upstream device, causing upstream device to delete routes. Ideally, could be solved by postponing EOIU from FRR to not delay kernel config.

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak stepanblyschak marked this pull request as draft August 13, 2025 08:11
@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@stepanblyschak
Copy link
Copy Markdown
Contributor Author

PR in draft due to FRR issue found causing the above

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@saiarcot895
Copy link
Copy Markdown
Contributor

During continuous testing of warm-reboot it was found that this delay is intended to postpone establishment of BGP sessions with upstream devices. Otherwise, if we have downstream BGP sessions which are not yet established and the device sends incomplete routing table and then sends EOIU marker to upstream device, causing upstream device to delete routes. Ideally, could be solved by postponing EOIU from FRR to not delay kernel config.

Wouldn't this mean that if we remove this delay, and the upstream LAGs have an IP address assigned first, and BGP gets established first with upstream devices, then that'll result in a deletion of many routes?

@stepanblyschak
Copy link
Copy Markdown
Contributor Author

@saiarcot895 Issue described in the comment you refer to is solved by sonic-net/sonic-buildimage#24174.
Once that PR is in, this PR will be set Ready for review

@liat-grozovik
Copy link
Copy Markdown
Collaborator

@yxieca please help to assign someone to review the suggested improvements in fastboot. this is req for SKUs which have more and more ports but will definitely help with existing SKUs as well.
The change is expected to be part of 202511 thus need your review with prio.

@yxieca
Copy link
Copy Markdown
Contributor

yxieca commented Nov 9, 2025

@stepanblyschak , @liat-grozovik , Both Vaibhav and Saikrishna have already engaged in reviewing. However, this PR is still in DRAFT mode.

What can make it ready for review?

@stepanblyschak
Copy link
Copy Markdown
Contributor Author

@yxieca This PR depends on a merge of a bug fix in FRR sonic-net/sonic-buildimage#24174

@mssonicbld
Copy link
Copy Markdown
Collaborator

/azp run

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants