Skip to content

[neighsyncd] increase neighbor syncd restore timeout to 110 seconds#745

Merged
yxieca merged 3 commits intosonic-net:masterfrom
yxieca:neigh
Feb 7, 2019
Merged

[neighsyncd] increase neighbor syncd restore timeout to 110 seconds#745
yxieca merged 3 commits intosonic-net:masterfrom
yxieca:neigh

Conversation

@yxieca
Copy link
Copy Markdown
Contributor

@yxieca yxieca commented Jan 10, 2019

What I did
Neighbor syncd is restoring important information for teamd and BGP. therefore, our timeout should not be shorter than the down stream service.

Why I did it
If some interface becomes ready too slow, restore_neighbors script will timeout and leave some neighbors not restored. That will cause sonic to relearn these neighbors and re-create these neighbor entries. And thus that cause data plane disruption.

How I verified it
Without the change, when vlan interface becomes ready 20 seconds later, data plane disruption is noticed by test. With the change, there is no data plane disruption.

Neighbor syncd is restoring important information for teamd and BGP.
our timeout should not be shorter than the down stream service.

Signed-off-by: Ying Xie <[email protected]>
@yxieca yxieca requested a review from zhenggen-xu January 10, 2019 23:54
TIME_OUT = 60
# It would be good to keep that time consistent routing reconciliation time-out, here
# we are upstream, we shouldn't give up before down stream.
TIME_OUT = 120
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yxieca In normal case, if the interface wasn't down or unconfigured during the warm reboot, the neighbor should be restored right after the interfaces are ready ( a few seconds later), and it should usually be faster than the routing reconciliation since protocol takes time to converge.

The timer here is just to make sure we don't stuck if interfaces state was changed (down) during the warm reboot. So we could wait as much as routing stack timer configured.

I think this timer should be slightly smaller than the routing timer, so when the routing reconciliation is done, it already has the reconciled neighbors in place at SONiC orchagent. Set this value same or higher than routing timer means the neighbor could be reconciled after routing is reconciled, it should work just not optimal in the processing pipe of orchagent.

Since you are changing the value here, do you think you can read the routing timer or if not configured use the default one, then set this accordingly here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolved other conversations. Leaving this one for final confirmation.

@zhenggen-xu I made change according to our last conversation. Please take a look again.

Thanks,
Ying

Try to get the bgp timeout and use it for restoring neighbor timeout.
When unavailable, use default 110 seconds.

Signed-off-by: Ying Xie <[email protected]>
* This should not happen, if happens, system is in a unknown state, we should exit.
*/
#define RESTORE_NEIGH_WAIT_TIME_OUT 70
#define RESTORE_NEIGH_WAIT_TIME_OUT 180
Copy link
Copy Markdown
Contributor

@qiluo-msft qiluo-msft Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

180 [](start = 36, length = 3)

Could you comment what is the range for valid value? This one is also magic to me. #Closed

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also Zhenggen's request. Essentially, this value only need to be bigger than the restore_neighbor.py timeout. But since we are reading the potential DB settings, so he wants this value to be something big, like 180 seconds.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated comments.

TIME_OUT = 60
# It would be good to keep that time consistent routing reconciliation time-out, here
# we are upstream, we shouldn't give up before down stream.
TIME_OUT = 110
Copy link
Copy Markdown
Contributor

@qiluo-msft qiluo-msft Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

110 [](start = 11, length = 3)

The same. Are they the same thing? #Closed

Copy link
Copy Markdown
Contributor Author

@yxieca yxieca Feb 6, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The BGP default timeout is 120 seconds, Zhenggen want this value to be slightly smaller than the BGP timeout. And he also requested to check if the bgp timeout is set in the state DB. I added the code below.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated comments.

timeout = warmstart.getWarmStartTimer("bgp", "bgp")
if timeout <= 0:
timeout = TIME_OUT

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the neighbor timeout should not be driven by bgp timeout as if we force neighbor restore to timeout, then we could end up with unresolved entries and once we do reconciliation, we will have delete the neighbor in the asic and disrupt the data plane.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

- restore_neighbors.py timeout at 110 seconds due to observed requirement
  of greater than 70 seconds.
- neighbor syncd timeout at 120 seconds (longer than 110 seconds).

Signed-off-by: Ying Xie <[email protected]>
@yxieca
Copy link
Copy Markdown
Contributor Author

yxieca commented Feb 6, 2019

Updated the PR according to the discussion in the warm reboot meeting:

  • Restore_neighbors.py should timeout at 110 seconds.
  • Neighbor syncd should timeout waiting for restore neighbors for 120 seconds (longer than 110 seconds).
  • restore_neighbors.py should not read BGP reconciliation timeout.

@yxieca
Copy link
Copy Markdown
Contributor Author

yxieca commented Feb 7, 2019

retest this please

@yxieca yxieca changed the title [neighsyncd] increase neighbor syncd restore timeout to 120 seconds [neighsyncd] increase neighbor syncd restore timeout to 110 seconds Feb 7, 2019
Copy link
Copy Markdown
Contributor

@jipanyang jipanyang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good to me.

@qiluo-msft
Copy link
Copy Markdown
Contributor

Unblocked. Please fix the vs test.

@yxieca yxieca merged commit d680ce2 into sonic-net:master Feb 7, 2019
@yxieca yxieca deleted the neigh branch February 7, 2019 22:41
yxieca added a commit that referenced this pull request Feb 7, 2019
…745)

* [neighsyncd] increase neighbor syncd restore timeout to 120 seconds

Neighbor syncd is restoring important information for teamd and BGP.
our timeout should not be shorter than the down stream service.

Signed-off-by: Ying Xie <[email protected]>

* [restore_neighbor] improve restore neighbor timeouts

Try to get the bgp timeout and use it for restoring neighbor timeout.
When unavailable, use default 110 seconds.

Signed-off-by: Ying Xie <[email protected]>

* Set default values according group discussion result

- restore_neighbors.py timeout at 110 seconds due to observed requirement
  of greater than 70 seconds.
- neighbor syncd timeout at 120 seconds (longer than 110 seconds).

Signed-off-by: Ying Xie <[email protected]>
@yxieca
Copy link
Copy Markdown
Contributor Author

yxieca commented Feb 7, 2019

Made to 201811 branch on 2/7/2019

EdenGri pushed a commit to EdenGri/sonic-swss that referenced this pull request Feb 28, 2022
Add utility to translate text output of syseeprom dump into JSON, for
consumption by new management (1.0) command.

Signed-off-by: Howard Persh <[email protected]>
Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025
…onic-net#745)

* [neighsyncd] increase neighbor syncd restore timeout to 120 seconds

Neighbor syncd is restoring important information for teamd and BGP.
our timeout should not be shorter than the down stream service.

Signed-off-by: Ying Xie <[email protected]>

* [restore_neighbor] improve restore neighbor timeouts

Try to get the bgp timeout and use it for restoring neighbor timeout.
When unavailable, use default 110 seconds.

Signed-off-by: Ying Xie <[email protected]>

* Set default values according group discussion result

- restore_neighbors.py timeout at 110 seconds due to observed requirement
  of greater than 70 seconds.
- neighbor syncd timeout at 120 seconds (longer than 110 seconds).

Signed-off-by: Ying Xie <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants