[libteam]: Reimplement Warm-Reboot procedure by pavel-shirshov · Pull Request #2999 · sonic-net/sonic-buildimage

pavel-shirshov · 2019-06-12T21:59:32Z

New implementation of teamd support of Warm-Reboot procedure

During the manual testing of the previous Warm-Reboot procedure implementation for teamd we found, that teamd restores state incorrectly, if one of the ports was put in OPER DOWN state during the procedure.
To fix that I redesigned the procedure completely:

When we prepare the system for the WR procedure, we save LACP PDU for every LAG member port (as before), and also additional information about current LAG:

number of LAG members
interface names of the LAG members
operational states of the LAG members

When we start the system in WR state, we read the saved LAG information, and this information allows us to restore the state correctly.

if LAG state before the reboot was down, we disable the WR mode immediately, we have to start in the normal mode
if LAG state before the reboot was up, we start with enabled LAG interface, to don't disrupt the dataplane and start restoring state of LAG members
when SONiC adds LAG interfaces, teamd runs lacp_update_carrier() function. This function is used to calculate LAG interface operational state. We keep the state up, while we're in WR mode. Every time when lacp_update_carrier() is executed, we check operational state of LAG member interfaces. If it's up, we read the LACP PDU from files, if it's not up we wait.
After we execute lacp_update_carrier() once, we start timer. If we weren't able to restore the state for more than 3 seconds, we stop WR mode and start working as usual.
As soon as we read all LAG member interfaces state, we can disable WR-mode and run teamd as usual

The WR start logic was completely moved to lacp_update_carrier().
I've added a lot of debug messages for WR mode, which will allow us to find issues easily.

I've rearranged libteam patches in the series, to make WR patch last. It will allow us to change WR behaviour more easily

jipanyang · 2019-06-12T22:59:20Z

"If one of the ports was put in OPER DOWN state during the procedure, teamd restores state incorrectly."
In this case, could we just remove the save LACP PDU upon port down event, so it goes to normal lacp processing directly after warm reboot?

Even if it is restored to wrong lacp state, it was in OPER DOWN state, I assume it won't cause issue and could recover shortly? Or some negative effect has been observed?

pavel-shirshov · 2019-06-12T23:24:25Z

@jipanyang Previously libteam was stuck in UP state, and couldn't go down.
Currently this behaviour was fixed.

lguohan · 2019-06-13T05:28:27Z

what is the best way to review the patches?

pavel-shirshov · 2019-06-13T18:14:30Z

@lguohan
All small patches were regenerated, so no big changes, easy to review.

The WR patch:
Clone libteam into two directories.
In the first directory apply all patches
In the second directory apply all patches except the WR patch.
Compare the directories with some diff tool.

yxieca · 2019-06-14T01:43:40Z

the patches are applied individually in commits. one repo is enough. gitk to look at the last commit if you use gui. Or git show to show the last commit in command line.

I tested this change. It works so far. I'll put my vote in tomorrow morning when more iterations of continuous reboot is done. So far about 20 iterations of continuous warm reboot. Things are looking good.

jipanyang · 2019-06-14T05:43:00Z

For all processes within teamd docker, we have an ultimate goal of supporting unplanned warm restart (recovery). The concern I have with this change is that it might have eliminated that possibility.

To fix the incorrect UP state issue, would removing the lacpdu file upon member port down event be sufficient?

pavel-shirshov · 2019-06-14T20:39:48Z

@jipanyang Can you please elaborate on this?
What does it mean "goal of supporting unplanned warm restart (recovery)"
What is the "the incorrect UP state issue"?

if you remove the files, teamd will start in normal mode (PortChannel interface down).

pavel-shirshov · 2019-06-14T20:57:17Z

@jipanyang I've pushed the change and I'm going to made the same change for the master.
Let's discuss your requirements. I can implement a special case for you.

jipanyang · 2019-06-14T22:38:11Z

@pavel-shirshov I was not able to understand the reason of "Previously libteam was stuck in UP state, and couldn't go down". After checking the old patch again, it looks to me the warm_start_carrier_timer check in lacp_update_carrier() is wrong:

+	#define WARM_START_CARRIER_TIMEOUT 3
+	/* wait three seconds until disable warm_start_carrier mode */
+	if (lacp->ctx->warm_start_carrier &&
+	    lacp->warm_start_carrier_timer >= (time(NULL) + WARM_START_CARRIER_TIMEOUT)) {
+		lacp->ctx->warm_start_carrier = false;
+		lacp->warm_start_carrier_timer = 0;
+	}

Should it be like below ?

+	    time(NULL)  >= (lacp->warm_start_carrier_timer + WARM_START_CARRIER_TIMEOUT)) {

Since this PR is quite a big change compared with the original patch, I'd like to check what is absolutely necessary.

As to unplanned warm restart (failure recovery) for teamd, the idea is that we could make the saved lacpdu persistent whenever lacp port reaches PORT_STATE_CURRENT, thus should any processes crashed inside teamd, an automatic docker level warm restart could get the system back to normal. We'll have more time to root cause the crash without affecting the service. In the new implementation, more data than lacpdu are saved, are they necessary or just good to have?

arkadiyshapiro · 2019-06-14T22:45:14Z

@pavel-shirshov - can you or someone else comment on whether existing warm reboot community tests are expected to still work on any vendor platform? Or will this break everything now and vendors have to adapt? It would be great to get some heads up on this...

pavel-shirshov · 2019-06-15T00:29:21Z

@arkadiyshapiro Everything will work as before. No changes is required in the current tests.

pavel-shirshov · 2019-06-15T00:31:09Z

@jipanyang Old patch cause the following
If the LAG port was down before restart (one port down, one port up, min-links==2), after WR we had LAG port is UP, despite one LAG member port is down and the LAG was stuck in this state. You can ask Qi for details.

You're right, the previous timer code was wrong.

If I understand you well you want to have a way to restore teamd behavior when teamd crashed by mimicking WR?
Current patch has LACP PDU dumps as previous patch
But I added another file: It is saved into the same directory. The filename is LAG name. This is a text file It contains:

1 if LAG was carrier up, 0 if LAG was carried down
number of active LAG member ports
name of LAG member port#1
1 if the LAG member port#1 was enabled, 0 if it wasnt'
name of LAG member port#2
1 if the LAG member port#2 was enabled, 0 if it wasnt'

and so on
So for example: File with name PortChannel1 contains

1
2
Ethernet0
1
Erhernet1
1

Means that PortChannel1 was up before restart. The LAG has two member ports: Ethernet0 and Ethernet1. Both ports were enabled before restart.

jipanyang · 2019-06-16T19:34:25Z

@pavel-shirshov thanks for the explanation.

[libteam]: Reimplement Warm-Reboot procedure

4d48c78

pavel-shirshov added Bug 🐛 Enhancement ➕ labels Jun 12, 2019

pavel-shirshov requested review from jipanyang, lguohan and yxieca June 12, 2019 21:59

pavel-shirshov self-assigned this Jun 12, 2019

Address internal comments

e1f395d

lguohan approved these changes Jun 14, 2019

View reviewed changes

pavel-shirshov merged commit f71c665 into sonic-net:201811 Jun 14, 2019

pavel-shirshov deleted the pavelsh/libteam_wr_rewrite branch June 14, 2019 20:56

pavel-shirshov mentioned this pull request Jun 14, 2019

[libteam]: Reimplement Warm-Reboot procedure #3016

Merged

Conversation

pavel-shirshov commented Jun 12, 2019

Uh oh!

jipanyang commented Jun 12, 2019

Uh oh!

pavel-shirshov commented Jun 12, 2019

Uh oh!

lguohan commented Jun 13, 2019

Uh oh!

pavel-shirshov commented Jun 13, 2019

Uh oh!

yxieca commented Jun 14, 2019

Uh oh!

jipanyang commented Jun 14, 2019

Uh oh!

pavel-shirshov commented Jun 14, 2019

Uh oh!

pavel-shirshov commented Jun 14, 2019

Uh oh!

jipanyang commented Jun 14, 2019

Uh oh!

arkadiyshapiro commented Jun 14, 2019

Uh oh!

pavel-shirshov commented Jun 15, 2019

Uh oh!

pavel-shirshov commented Jun 15, 2019 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jipanyang commented Jun 16, 2019

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

pavel-shirshov commented Jun 15, 2019 •

edited

Loading