Added LAG member check on addLagMember() (#2464) #2665

Closed
yenlu-keith wants to merge 1 commit into sonic-net:202205 from yenlu-keith:local-swss-202205

@yenlu-keith

What I did
Added a check to addLagMember() that verifies the new LAG member still exists in the kernel.
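
The diff itself is not reproduced in this conversation, so the following is only a minimal sketch of what such a guard could look like, not the code from this PR. It assumes the kernel lookup is done with `if_nametoindex()`; the helper names `netdevExistsInKernel` and `addLagMemberIfPresent` are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <net/if.h>
#include <string>

// Hypothetical helper: true if the kernel currently has a netdev with this name.
// if_nametoindex() returns 0 when no such interface exists.
bool netdevExistsInKernel(const std::string& ifname)
{
    return if_nametoindex(ifname.c_str()) != 0;
}

// Hypothetical guard: run the real "add member" action only when the netdev is
// still present. Returns false (and does nothing) if the device is gone.
template <typename AddFn>
bool addLagMemberIfPresent(const std::string& member, AddFn&& doAdd)
{
    if (!netdevExistsInKernel(member))
    {
        return false; // e.g. host interface removed while syncd restarts
    }
    doAdd(member);
    return true;
}
```

A caller would pass the actual enslave step as `doAdd` and treat a `false` return as "skip for now and retry on the next notification".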

Why I did it
In the syncd container autorestart scenario, when syncd exits, the host interfaces (tun/tap netdevs) go to the DOWN state and are then removed.

Because of the validation shown below, teammgr receives the notification about the port state change (the information is updated in the state DB and a pubsub message is sent), but the port state record is not removed from the state DB when the port is deleted:

sonic-swss/portsyncd/linksync.cpp
Line 210 in 7cc035f
if (master && nlmsg_type == RTM_DELLINK)
As a result, on the port state change notification, isPortStateOk() succeeds and TeamMgr::addLagMember() is executed even though the host interface has actually been removed.

The operation is expected to be ignored if the port is already enslaved:

sonic-swss/cfgmgr/teammgr.cpp
Line 721 in 7cc035f
if (isPortEnslaved(member))
The check fails since the port has already been removed:

sonic-swss/cfgmgr/teammgr.cpp
Line 412 in 7cc035f
return lstat(path.c_str(), &buf) == 0;
Consequently, the TeamMgr::addLagMember() logic is executed and fails:

Jun 21 11:47:12.265955 cab18-7-dut INFO teamd#/supervisord: teammgrd Cannot find device "Ethernet0"
Jun 21 11:47:12.294550 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message received: "NoSuchDev"
Jun 21 11:47:12.294550 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message content: "No such device."
Jun 21 11:47:12.294550 cab18-7-dut INFO teamd#/supervisord: teammgrd command call failed (Invalid argument)
Jun 21 11:47:12.322497 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message received: "NoSuchDev"
Jun 21 11:47:12.322497 cab18-7-dut INFO teamd#/supervisord: teammgrd libteamdctl: cli_usock_process_msg: usock: Error message content: "No such device."
Jun 21 11:47:12.322497 cab18-7-dut INFO teamd#/supervisord: teammgrd command call failed (Invalid argument)
Jun 21 11:47:12.328844 cab18-7-dut ERR teamd#teammgrd: :- checkPortIffUp: Failed to get port Ethernet0 flags
Jun 21 11:47:12.328844 cab18-7-dut ERR teamd#teammgrd: :- addLagMember: Failed to add Ethernet0 to port channel PortChannel102
The issue became reproducible after #2233.
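
To illustrate the ambiguity described above: an `lstat()`-based enslaved check returns false both when the port exists but has no master and when the port no longer exists at all. The sketch below separates the two cases. It assumes the `lstat()` target is a sysfs path of the form `/sys/class/net/<port>/master` (the path is not visible in the quoted line), and the names `PortLinkState` and `queryPortLinkState` are hypothetical. The sysfs root is a parameter only so the logic can be exercised against a scratch directory.

```cpp
#include <cassert>
#include <string>
#include <sys/stat.h>

// Hypothetical three-way result instead of a single bool.
enum class PortLinkState { Absent, NotEnslaved, Enslaved };

// Same primitive as the quoted line: the path exists iff lstat() succeeds.
static bool pathExists(const std::string& path)
{
    struct stat buf;
    return lstat(path.c_str(), &buf) == 0;
}

// Assumption: <root>/<port> exists for every netdev, and <root>/<port>/master
// exists only while the netdev is enslaved.
PortLinkState queryPortLinkState(const std::string& port,
                                 const std::string& sysfsNetRoot = "/sys/class/net")
{
    const std::string dev = sysfsNetRoot + "/" + port;
    if (!pathExists(dev))
    {
        return PortLinkState::Absent; // device removed: adding it cannot succeed
    }
    return pathExists(dev + "/master") ? PortLinkState::Enslaved
                                       : PortLinkState::NotEnslaved;
}
```

With a result like this, only `NotEnslaved` would proceed to the add step; `Absent` corresponds to the "Cannot find device" failure in the log above.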

How I verified it

autorestart/test_container_autorestart.py -k 'syncd'

*[teammgr] Added LAG member check into addLagMember()
@prsunny
Collaborator

prsunny commented Feb 15, 2023

Cherry-picking PR #2464

@prsunny
Collaborator

prsunny commented Feb 15, 2023

@yenlu-keith, I've added labels to the original PR. The branch owner will get back to us if there are any conflicts that require a manual PR. Until then, we can hold off on this PR.

@prsunny
Collaborator

prsunny commented Feb 22, 2023

@yenlu-keith, closing this PR as it is already included as part of cherry-picking the original PR. Please check the labels.

prsunny closed this Feb 22, 2023