[neighsyncd] increase neighbor syncd restore timeout to 110 seconds #745
yxieca merged 3 commits into sonic-net:master from
Conversation
Neighbor syncd is restoring important information for teamd and BGP. Our timeout should not be shorter than the downstream service's. Signed-off-by: Ying Xie <[email protected]>
neighsyncd/restore_neighbors.py (Outdated)

- TIME_OUT = 60
+ # It would be good to keep that time consistent routing reconciliation time-out, here
+ # we are upstream, we shouldn't give up before down stream.
+ TIME_OUT = 120
@yxieca In the normal case, if the interface wasn't down or unconfigured during the warm reboot, the neighbors should be restored right after the interfaces are ready (a few seconds later), and that should usually be faster than routing reconciliation, since the protocol takes time to converge.
The timer here is just to make sure we don't get stuck if the interface state changed (went down) during the warm reboot. So we could wait as long as the routing stack timer is configured.
I think this timer should be slightly smaller than the routing timer, so that when routing reconciliation is done, the reconciled neighbors are already in place at the SONiC orchagent. Setting this value the same as or higher than the routing timer means the neighbors could be reconciled after routes are reconciled; that would work, just not optimally in the orchagent processing pipeline.
Since you are changing the value here, do you think you can read the routing timer (or, if it is not configured, use the default) and set this value accordingly?
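The derivation being suggested here can be sketched as follows. This is a minimal sketch, not code from the PR: `neighbor_restore_timeout` and the 10-second safety margin are invented for illustration; only the 110-second default comes from the PR.

```python
DEFAULT_TIME_OUT = 110  # seconds; fallback when no routing timer is configured
MARGIN = 10             # hypothetical safety margin so neighbors finish first

def neighbor_restore_timeout(routing_timer):
    """Derive the neighbor restore timeout from the routing timer.

    Stay slightly below the routing reconciliation timer so the
    reconciled neighbors are already in place at orchagent; fall back
    to the default when the timer is unset or invalid (<= 0).
    """
    if routing_timer > 0:
        return max(routing_timer - MARGIN, 1)
    return DEFAULT_TIME_OUT
```

With a 120-second routing timer this yields 110 seconds, matching the default chosen in the PR.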
Resolved other conversations. Leaving this one for final confirmation.
@zhenggen-xu I made change according to our last conversation. Please take a look again.
Thanks,
Ying
Try to get the BGP timeout and use it for the restore-neighbor timeout. When unavailable, use the default of 110 seconds. Signed-off-by: Ying Xie <[email protected]>
neighsyncd/neighsync.h (Outdated)

  * This should not happen, if happens, system is in a unknown state, we should exit.
  */
- #define RESTORE_NEIGH_WAIT_TIME_OUT 70
+ #define RESTORE_NEIGH_WAIT_TIME_OUT 180
> 180

Could you comment on what the range of valid values is? This one is also magic to me. #Closed
This is also Zhenggen's request. Essentially, this value only needs to be bigger than the restore_neighbors.py timeout. But since we are reading the potential DB settings, he wants this value to be something big, like 180 seconds.
neighsyncd/restore_neighbors.py (Outdated)

- TIME_OUT = 60
+ # It would be good to keep that time consistent routing reconciliation time-out, here
+ # we are upstream, we shouldn't give up before down stream.
+ TIME_OUT = 110
> 110

The same. Are they the same thing? #Closed
The BGP default timeout is 120 seconds; Zhenggen wants this value to be slightly smaller than the BGP timeout. He also requested checking whether the BGP timeout is set in the state DB. I added the code below.
neighsyncd/restore_neighbors.py (Outdated)

timeout = warmstart.getWarmStartTimer("bgp", "bgp")
if timeout <= 0:
    timeout = TIME_OUT
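The quoted snippet depends on `swsscommon`'s warm-start helper, which needs a running Redis state DB. A self-contained sketch of the same fallback logic, with `get_warm_start_timer` standing in for `warmstart.getWarmStartTimer("bgp", "bgp")` (a stub invented here, not part of the PR):

```python
TIME_OUT = 110  # default restore timeout in seconds


def get_warm_start_timer():
    """Stand-in for warmstart.getWarmStartTimer("bgp", "bgp").

    Returns the configured BGP warm-restart timer in seconds, or 0
    when no timer is configured. Hard-coded to 0 in this sketch to
    model the unconfigured case.
    """
    return 0


# Use the configured BGP timer when valid; otherwise fall back to the
# default, exactly as the PR's snippet does.
timeout = get_warm_start_timer()
if timeout <= 0:
    timeout = TIME_OUT
```

In the unconfigured case this leaves `timeout` at the 110-second default.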
The neighbor timeout should not be driven by the BGP timeout: if we force neighbor restore to time out, we could end up with unresolved entries, and once we do reconciliation we will have to delete those neighbors in the ASIC and disrupt the data plane.
- restore_neighbors.py timeout set to 110 seconds, due to an observed requirement of greater than 70 seconds.
- neighbor syncd timeout set to 120 seconds (longer than 110 seconds).

Signed-off-by: Ying Xie <[email protected]>
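The agreed values form a simple ordering: the observed interface-ready requirement must be below the restore script's timeout, which in turn must be below neighsyncd's wait. A quick sanity check of that invariant; the constant names are invented for this sketch:

```python
OBSERVED_REQUIREMENT = 70        # seconds an interface was observed to need
RESTORE_NEIGHBORS_TIMEOUT = 110  # TIME_OUT in restore_neighbors.py
NEIGHSYNCD_WAIT_TIMEOUT = 120    # RESTORE_NEIGH_WAIT_TIME_OUT in neighsync.h

# neighsyncd must outlast the restore script, which must outlast the
# slowest observed interface bring-up.
assert OBSERVED_REQUIREMENT < RESTORE_NEIGHBORS_TIMEOUT < NEIGHSYNCD_WAIT_TIMEOUT
```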
Updated the PR according to the discussion in the warm reboot meeting.
retest this please

Unblocked. Please fix the vs test.
…745)

* [neighsyncd] increase neighbor syncd restore timeout to 120 seconds

  Neighbor syncd is restoring important information for teamd and BGP. Our timeout should not be shorter than the downstream service's.

  Signed-off-by: Ying Xie <[email protected]>

* [restore_neighbor] improve restore neighbor timeouts

  Try to get the BGP timeout and use it for the restore-neighbor timeout. When unavailable, use the default of 110 seconds.

  Signed-off-by: Ying Xie <[email protected]>

* Set default values according to group discussion result

  - restore_neighbors.py timeout at 110 seconds, due to an observed requirement of greater than 70 seconds.
  - neighbor syncd timeout at 120 seconds (longer than 110 seconds).

  Signed-off-by: Ying Xie <[email protected]>
Made to 201811 branch on 2/7/2019
What I did
Neighbor syncd is restoring important information for teamd and BGP. Therefore, our timeout should not be shorter than the downstream service's.
Why I did it
If some interface becomes ready too slowly, the restore_neighbors script will time out and leave some neighbors unrestored. SONiC will then have to relearn those neighbors and re-create the neighbor entries, causing data plane disruption.
How I verified it
Without the change, when the VLAN interface becomes ready 20 seconds late, the test observes data plane disruption. With the change, there is no data plane disruption.