[neighsyncd] increase neighbor syncd restore timeout to 110 seconds #745
yxieca merged 3 commits into sonic-net:master from
Conversation
Neighbor syncd is restoring important information for teamd and BGP. Our timeout should not be shorter than the downstream service's. Signed-off-by: Ying Xie <[email protected]>
neighsyncd/restore_neighbors.py (Outdated)

- TIME_OUT = 60
+ # It would be good to keep that time consistent routing reconciliation time-out, here
+ # we are upstream, we shouldn't give up before down stream.
+ TIME_OUT = 120
@yxieca In the normal case, if the interface wasn't down or unconfigured during the warm reboot, the neighbors should be restored right after the interfaces are ready (a few seconds later), and that should usually be faster than routing reconciliation, since the protocol takes time to converge.
The timer here is just to make sure we don't get stuck if the interface state changed (went down) during the warm reboot. So we could wait as long as the routing stack timer is configured.
I think this timer should be slightly smaller than the routing timer, so that when routing reconciliation is done, the reconciled neighbors are already in place at the SONiC orchagent. Setting this value the same as or higher than the routing timer means the neighbors could be reconciled after routes are reconciled; that would work, just not optimally in the orchagent processing pipeline.
Since you are changing the value here, do you think you can read the routing timer (or, if it is not configured, use the default) and set this value accordingly?
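The derivation being suggested here can be sketched as follows. This is a minimal sketch, not code from the PR: `neighbor_restore_timeout` and the 10-second safety margin are invented for illustration; only the 110-second default comes from the PR.

```python
DEFAULT_TIME_OUT = 110  # seconds; fallback when no routing timer is configured
MARGIN = 10             # hypothetical safety margin so neighbors finish first

def neighbor_restore_timeout(routing_timer):
    """Derive the neighbor restore timeout from the routing timer.

    Stay slightly below the routing reconciliation timer so the
    reconciled neighbors are already in place at orchagent; fall back
    to the default when the timer is unset or invalid (<= 0).
    """
    if routing_timer > 0:
        return max(routing_timer - MARGIN, 1)
    return DEFAULT_TIME_OUT
```

With a 120-second routing timer this yields 110 seconds, matching the default chosen in the PR.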
Resolved other conversations. Leaving this one for final confirmation.
@zhenggen-xu I made change according to our last conversation. Please take a look again.
Thanks,
Ying
Try to get the BGP timeout and use it for the restore-neighbor timeout. When unavailable, use the default of 110 seconds. Signed-off-by: Ying Xie <[email protected]>
neighsyncd/neighsync.h (Outdated)

  * This should not happen, if happens, system is in a unknown state, we should exit.
  */
- #define RESTORE_NEIGH_WAIT_TIME_OUT 70
+ #define RESTORE_NEIGH_WAIT_TIME_OUT 180
> 180

Could you comment on what the range of valid values is? This one is also magic to me. #Closed
This is also Zhenggen's request. Essentially, this value only needs to be bigger than the restore_neighbors.py timeout. But since we are reading the potential DB settings, he wants this value to be something big, like 180 seconds.
neighsyncd/restore_neighbors.py (Outdated)

- TIME_OUT = 60
+ # It would be good to keep that time consistent routing reconciliation time-out, here
+ # we are upstream, we shouldn't give up before down stream.
+ TIME_OUT = 110
> 110

The same. Are they the same thing? #Closed
The BGP default timeout is 120 seconds; Zhenggen wants this value to be slightly smaller than the BGP timeout. He also requested checking whether the BGP timeout is set in the state DB. I added the code below.
neighsyncd/restore_neighbors.py (Outdated)

timeout = warmstart.getWarmStartTimer("bgp", "bgp")
if timeout <= 0:
    timeout = TIME_OUT
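The quoted snippet depends on `swsscommon`'s warm-start helper, which needs a running Redis state DB. A self-contained sketch of the same fallback logic, with `get_warm_start_timer` standing in for `warmstart.getWarmStartTimer("bgp", "bgp")` (a stub invented here, not part of the PR):

```python
TIME_OUT = 110  # default restore timeout in seconds


def get_warm_start_timer():
    """Stand-in for warmstart.getWarmStartTimer("bgp", "bgp").

    Returns the configured BGP warm-restart timer in seconds, or 0
    when no timer is configured. Hard-coded to 0 in this sketch to
    model the unconfigured case.
    """
    return 0


# Use the configured BGP timer when valid; otherwise fall back to the
# default, exactly as the PR's snippet does.
timeout = get_warm_start_timer()
if timeout <= 0:
    timeout = TIME_OUT
```

In the unconfigured case this leaves `timeout` at the 110-second default.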
The neighbor timeout should not be driven by the BGP timeout: if we force neighbor restore to time out, we could end up with unresolved entries, and once we do reconciliation we will have to delete those neighbors in the ASIC and disrupt the data plane.
- restore_neighbors.py timeout set to 110 seconds, due to an observed requirement of greater than 70 seconds.
- neighbor syncd timeout set to 120 seconds (longer than 110 seconds).

Signed-off-by: Ying Xie <[email protected]>
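The agreed values form a simple ordering: the observed interface-ready requirement must be below the restore script's timeout, which in turn must be below neighsyncd's wait. A quick sanity check of that invariant; the constant names are invented for this sketch:

```python
OBSERVED_REQUIREMENT = 70        # seconds an interface was observed to need
RESTORE_NEIGHBORS_TIMEOUT = 110  # TIME_OUT in restore_neighbors.py
NEIGHSYNCD_WAIT_TIMEOUT = 120    # RESTORE_NEIGH_WAIT_TIME_OUT in neighsync.h

# neighsyncd must outlast the restore script, which must outlast the
# slowest observed interface bring-up.
assert OBSERVED_REQUIREMENT < RESTORE_NEIGHBORS_TIMEOUT < NEIGHSYNCD_WAIT_TIMEOUT
```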
Updated the PR according to the discussion in the warm reboot meeting.
retest this please

Unblocked. Please fix the vs test.
…745)

* [neighsyncd] increase neighbor syncd restore timeout to 120 seconds

  Neighbor syncd is restoring important information for teamd and BGP. Our timeout should not be shorter than the downstream service's.

  Signed-off-by: Ying Xie <[email protected]>

* [restore_neighbor] improve restore neighbor timeouts

  Try to get the BGP timeout and use it for the restore-neighbor timeout. When unavailable, use the default of 110 seconds.

  Signed-off-by: Ying Xie <[email protected]>

* Set default values according to group discussion result

  - restore_neighbors.py timeout at 110 seconds, due to an observed requirement of greater than 70 seconds.
  - neighbor syncd timeout at 120 seconds (longer than 110 seconds).

  Signed-off-by: Ying Xie <[email protected]>
Made to 201811 branch on 2/7/2019
What I did
Neighbor syncd is restoring important information for teamd and BGP. Therefore, our timeout should not be shorter than the downstream service's.
Why I did it
If some interface becomes ready too slowly, the restore_neighbors script will time out and leave some neighbors unrestored. SONiC will then have to relearn those neighbors and re-create the neighbor entries, causing data plane disruption.
How I verified it
Without the change, when the VLAN interface becomes ready 20 seconds late, the test observes data plane disruption. With the change, there is no data plane disruption.