[libteam]: Reimplement Warm-Reboot procedure #2999
pavel-shirshov merged 2 commits into sonic-net:201811
|
"If one of the ports was put in OPER DOWN state during the procedure, teamd restores state incorrectly." Even if the port is restored to the wrong LACP state, it was in OPER DOWN state, so I assume it won't cause an issue and will recover shortly? Or has some negative effect been observed? |
|
@jipanyang Previously libteam was stuck in UP state, and couldn't go down. |
|
what is the best way to review the patches? |
|
@lguohan The WR patch: |
|
The patches are applied individually in commits, so one repo is enough. Use gitk to look at the last commit if you use a GUI, or git show to show the last commit on the command line. I tested this change. It works so far. I'll put my vote in tomorrow morning when more iterations of continuous reboot are done. So far, about 20 iterations of continuous warm reboot have run. Things are looking good. |
|
For all processes within teamd docker, we have an ultimate goal of supporting unplanned warm restart (recovery). The concern I have with this change is that it might have eliminated that possibility. To fix the incorrect UP state issue, would removing the lacpdu file upon member port down event be sufficient? |
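The alternative raised in this comment, dropping the saved LACPDU when a member port goes down, could look roughly like the sketch below. It is only an illustration under stated assumptions: `drop_saved_lacpdu()` and `on_member_port_oper_change()` are hypothetical names, the file path is supplied by the caller, and none of this is taken from the actual libteam patch. It assumes a POSIX system (`unlink()`).

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: delete the saved LACPDU file for one member port.
 * Returns 0 if the file is gone afterwards (removed now, or never existed),
 * -1 on any other error. */
int drop_saved_lacpdu(const char *path)
{
    assert(path != NULL);
    if (unlink(path) == 0)
        return 0;
    /* A missing file is not an error: there is simply nothing to restore. */
    return errno == ENOENT ? 0 : -1;
}

/* Hypothetical hook: called when the operational state of a member port
 * changes. Only a transition to "down" invalidates the saved state. */
int on_member_port_oper_change(int oper_up, const char *path)
{
    if (oper_up)
        return 0; /* port is up: keep whatever was saved */
    return drop_saved_lacpdu(path);
}
```

With the file removed, a later start has no saved LACPDU for that port and would have to negotiate from scratch for it.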
|
@jipanyang Can you please elaborate on this? If you remove the files, teamd will start in normal mode (PortChannel interface down). |
|
@jipanyang I've pushed the change and I'm going to make the same change for master. |
|
@pavel-shirshov I was not able to understand the reason for "Previously libteam was stuck in UP state, and couldn't go down". After checking the old patch again, it looks to me like the warm_start_carrier_timer check in lacp_update_carrier() is wrong: Should it be like below? Since this PR is quite a big change compared with the original patch, I'd like to check what is absolutely necessary. As to unplanned warm restart (failure recovery) for teamd, the idea is that we could make the saved lacpdu persistent whenever an lacp port reaches PORT_STATE_CURRENT, so that if any process crashes inside teamd, an automatic docker-level warm restart could get the system back to normal. We would then have more time to root-cause the crash without affecting the service. In the new implementation, more data than the lacpdu is saved. Is it necessary or just good to have? |
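The "persist the LACPDU whenever the port reaches PORT_STATE_CURRENT" idea can be sketched as follows. This is a minimal illustration, not the code from the patch: `save_lacpdu()`, `load_lacpdu()`, `on_port_state_change()` and the enum values other than `PORT_STATE_CURRENT` are hypothetical, and the file location is left to the caller. The write goes to a temporary file that is then renamed, so a crash in the middle of saving does not leave a half-written file behind.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* PORT_STATE_CURRENT is the name used in the discussion; the other
 * enumerators are placeholders for this sketch. */
enum port_state { PORT_STATE_DISABLED, PORT_STATE_EXPIRED, PORT_STATE_CURRENT };

/* Hypothetical helper: write len bytes to path via "<path>.tmp" + rename().
 * Returns 0 on success, -1 on failure. */
int save_lacpdu(const char *path, const void *buf, size_t len)
{
    char tmp[512];
    FILE *f;
    int n;

    assert(path != NULL && buf != NULL);
    n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1; /* path too long for the temporary name */
    f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    if (fwrite(buf, 1, len, f) != len) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    if (fclose(f) != 0 || rename(tmp, path) != 0) {
        remove(tmp);
        return -1;
    }
    return 0;
}

/* Hypothetical helper: read back at most cap bytes.
 * Returns the number of bytes read, or -1 if no saved file exists. */
long load_lacpdu(const char *path, void *buf, size_t cap)
{
    FILE *f = fopen(path, "rb");
    size_t got;

    if (f == NULL)
        return -1;
    got = fread(buf, 1, cap, f);
    fclose(f);
    return (long)got;
}

/* Hypothetical hook: persist the last received LACPDU only when the port
 * has just reached PORT_STATE_CURRENT. */
int on_port_state_change(enum port_state new_state, const char *path,
                         const void *last_lacpdu, size_t len)
{
    if (new_state != PORT_STATE_CURRENT)
        return 0; /* nothing to persist in other states */
    return save_lacpdu(path, last_lacpdu, len);
}
```

A restarted process would call `load_lacpdu()` for each member port and fall back to a normal start when it returns -1. Surviving a power loss, as opposed to a process crash, would additionally need the data flushed to stable storage, which this sketch does not do.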
|
@pavel-shirshov - can you or someone else comment on whether existing warm reboot community tests are expected to still work on any vendor platform? Or will this break everything now and vendors have to adapt? It would be great to get some heads up on this... |
|
@arkadiyshapiro Everything will work as before. No changes are required in the current tests. |
|
@jipanyang The old patch causes the following: You're right, the previous timer code was wrong. If I understand you correctly, you want a way to restore teamd behavior after teamd has crashed by mimicking WR? and so on. This means that PortChannel1 was up before restart. The LAG has two member ports: Ethernet0 and Ethernet1. Both ports were enabled before restart. |
|
@pavel-shirshov thanks for the explanation. |
New implementation of teamd support of Warm-Reboot procedure
During manual testing of the previous Warm-Reboot procedure implementation for teamd, we found that teamd restores state incorrectly if one of the ports was put in OPER DOWN state during the procedure.
To fix that, I redesigned the procedure completely:
The WR start logic was completely moved to lacp_update_carrier().
I've added a lot of debug messages for WR mode, which will allow us to find issues easily.
I've rearranged the libteam patches in the series to make the WR patch last. This will allow us to change WR behaviour more easily.