Skip to content

[libteam]: Reimplement Warm-Reboot procedure#2999

Merged
pavel-shirshov merged 2 commits intosonic-net:201811from
pavel-shirshov:pavelsh/libteam_wr_rewrite
Jun 14, 2019
Merged

[libteam]: Reimplement Warm-Reboot procedure#2999
pavel-shirshov merged 2 commits intosonic-net:201811from
pavel-shirshov:pavelsh/libteam_wr_rewrite

Conversation

@pavel-shirshov
Copy link
Contributor

New implementation of teamd support of Warm-Reboot procedure

During the manual testing of the previous Warm-Reboot procedure implementation for teamd we found, that teamd restores state incorrectly, if one of the ports was put in OPER DOWN state during the procedure.
To fix that I redesigned the procedure completely:

  1. When we prepare the system for the WR procedure, we save LACP PDU for every LAG member port (as before), and also additional information about current LAG:
  • number of LAG members
  • interface names of the LAG members
  • operational states of the LAG members
  1. When we start the system in WR state, we read the saved LAG information, and this information allows us to restore the state correctly.
  • if LAG state before the reboot was down, we disable the WR mode immediately, we have to start in the normal mode
  • if LAG state before the reboot was up, we start with enabled LAG interface, to don't disrupt the dataplane and start restoring state of LAG members
  • when SONiC adds LAG interfaces, teamd runs lacp_update_carrier() function. This function is used to calculate LAG interface operational state. We keep the state up, while we're in WR mode. Every time when lacp_update_carrier() is executed, we check operational state of LAG member interfaces. If it's up, we read the LACP PDU from files, if it's not up we wait.
  • After we execute lacp_update_carrier() once, we start timer. If we weren't able to restore the state for more than 3 seconds, we stop WR mode and start working as usual.
  • As soon as we read all LAG member interfaces state, we can disable WR-mode and run teamd as usual

The WR start logic was completely moved to lacp_update_carrier().
I've added a lot of debug messages for WR mode, which will allow us to find issues easily.

I've rearranged libteam patches in the series, to make WR patch last. It will allow us to change WR behaviour more easily

@jipanyang
Copy link
Collaborator

"If one of the ports was put in OPER DOWN state during the procedure, teamd restores state incorrectly."
In this case, could we just remove the save LACP PDU upon port down event, so it goes to normal lacp processing directly after warm reboot?

Even if it is restored to wrong lacp state, it was in OPER DOWN state, I assume it won't cause issue and could recover shortly? Or some negative effect has been observed?

@pavel-shirshov
Copy link
Contributor Author

@jipanyang Previously libteam was stuck in UP state, and couldn't go down.
Currently this behaviour was fixed.

@lguohan
Copy link
Collaborator

lguohan commented Jun 13, 2019

what is the best way to review the patches?

@pavel-shirshov
Copy link
Contributor Author

@lguohan
All small patches were regenerated, so no big changes, easy to review.

The WR patch:
Clone libteam into two directories.
In the first directory apply all patches
In the second directory apply all patches except the WR patch.
Compare the directories with some diff tool.

@yxieca
Copy link
Contributor

yxieca commented Jun 14, 2019

the patches are applied individually in commits. one repo is enough. gitk to look at the last commit if you use gui. Or git show to show the last commit in command line.

I tested this change. It works so far. I'll put my vote in tomorrow morning when more iterations of continuous reboot is done. So far about 20 iterations of continuous warm reboot. Things are looking good.

@jipanyang
Copy link
Collaborator

For all processes within teamd docker, we have an ultimate goal of supporting unplanned warm restart (recovery). The concern I have with this change is that it might have eliminated that possibility.

To fix the incorrect UP state issue, would removing the lacpdu file upon member port down event be sufficient?

@pavel-shirshov
Copy link
Contributor Author

@jipanyang Can you please elaborate on this?
What does it mean "goal of supporting unplanned warm restart (recovery)"
What is the "the incorrect UP state issue"?

if you remove the files, teamd will start in normal mode (PortChannel interface down).

@pavel-shirshov pavel-shirshov merged commit f71c665 into sonic-net:201811 Jun 14, 2019
@pavel-shirshov pavel-shirshov deleted the pavelsh/libteam_wr_rewrite branch June 14, 2019 20:56
@pavel-shirshov
Copy link
Contributor Author

@jipanyang I've pushed the change and I'm going to made the same change for the master.
Let's discuss your requirements. I can implement a special case for you.

@jipanyang
Copy link
Collaborator

@pavel-shirshov I was not able to understand the reason of "Previously libteam was stuck in UP state, and couldn't go down". After checking the old patch again, it looks to me the warm_start_carrier_timer check in lacp_update_carrier() is wrong:

+	#define WARM_START_CARRIER_TIMEOUT 3
+	/* wait three seconds until disable warm_start_carrier mode */
+	if (lacp->ctx->warm_start_carrier &&
+	    lacp->warm_start_carrier_timer >= (time(NULL) + WARM_START_CARRIER_TIMEOUT)) {
+		lacp->ctx->warm_start_carrier = false;
+		lacp->warm_start_carrier_timer = 0;
+	}

Should it be like below ?

+	    time(NULL)  >= (lacp->warm_start_carrier_timer + WARM_START_CARRIER_TIMEOUT)) {

Since this PR is quite a big change compared with the original patch, I'd like to check what is absolutely necessary.

As to unplanned warm restart (failure recovery) for teamd, the idea is that we could make the saved lacpdu persistent whenever lacp port reaches PORT_STATE_CURRENT, thus should any processes crashed inside teamd, an automatic docker level warm restart could get the system back to normal. We'll have more time to root cause the crash without affecting the service. In the new implementation, more data than lacpdu are saved, are they necessary or just good to have?

@arkadiyshapiro
Copy link

@pavel-shirshov - can you or someone else comment on whether existing warm reboot community tests are expected to still work on any vendor platform? Or will this break everything now and vendors have to adapt? It would be great to get some heads up on this...

@pavel-shirshov
Copy link
Contributor Author

@arkadiyshapiro Everything will work as before. No changes is required in the current tests.

@pavel-shirshov
Copy link
Contributor Author

pavel-shirshov commented Jun 15, 2019

@jipanyang Old patch cause the following
If the LAG port was down before restart (one port down, one port up, min-links==2), after WR we had LAG port is UP, despite one LAG member port is down and the LAG was stuck in this state. You can ask Qi for details.

You're right, the previous timer code was wrong.

If I understand you well you want to have a way to restore teamd behavior when teamd crashed by mimicking WR?
Current patch has LACP PDU dumps as previous patch
But I added another file: It is saved into the same directory. The filename is LAG name. This is a text file It contains:

1 if LAG was carrier up, 0 if LAG was carried down
number of active LAG member ports
name of LAG member port#1
1 if the LAG member port#1 was enabled, 0 if it wasnt'
name of LAG member port#2
1 if the LAG member port#2 was enabled, 0 if it wasnt'

and so on
So for example: File with name PortChannel1 contains

1
2
Ethernet0
1
Erhernet1
1

Means that PortChannel1 was up before restart. The LAG has two member ports: Ethernet0 and Ethernet1. Both ports were enabled before restart.

@jipanyang
Copy link
Collaborator

@pavel-shirshov thanks for the explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants