Warm reboot: Add basic schema for warm start in configDB and stateDB.#553
Warm reboot: Add basic schema for warm start in configDB and stateDB.#553lguohan merged 5 commits intosonic-net:masterfrom
Conversation
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
doc/swss-schema.md
Outdated
| key = WARM_RESTART_TABLE:process_name ; process_name is unique process identifier | ||
| restart_count = 1*10DIGIT ; a number between 0 and 2147483647, | ||
| ; count of warm start times. | ||
| state = "init" / "restored" / "synced" |
There was a problem hiding this comment.
can you add definition for each phase?
There was a problem hiding this comment.
restored: restore the previous
synced-> reconciled-> reconcile the previous view and new view
doc/swss-schema.md
Outdated
| ;Status: work in progress | ||
|
|
||
| key = WARM_RESTART:name ; name is the name of SONiC docker or "system" for global configuration | ||
| enable = "true" / "false" ; default value as false |
There was a problem hiding this comment.
timer to determine when to do the reconciliation.
There was a problem hiding this comment.
I thought we had agreed to remove this docker-specific knob to turn on/off this feature, and make use of a system-wide knob instead, right?
There was a problem hiding this comment.
I think the agreement is to keep both docker-specific knob and system-wide knob.
The docker-specific knob will mainly be used by developer.
There was a problem hiding this comment.
I agree to have both knobs which give convenience . Can we be more specific in the document how system wide knob work on conjunction with docker knob?
There was a problem hiding this comment.
Agree, I like the idea of keeping a docker-specific knob for tshooting purposes.
There was a problem hiding this comment.
The expected behavior:
<1> If system wide knob is enabled, then docker level knob will be ignored.
There is no point doing system warm reboot while specific docker warm restart is disabled.
<2> If system wide knob is disabled, docker level knob will be checked.
The use case is that we may want to docker level upgrade for dockers like bgp and swss.
Will put more info in the schema comment section.
doc/swss-schema.md
Outdated
|
|
||
| ### Configuration files | ||
| --------------------------------------------- | ||
| ### WARM\_RESTART\_TABLE |
There was a problem hiding this comment.
should be put into state db.
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
|
@zhenggen-xu @rodnymolina Please help review it. |
doc/swss-schema.md
Outdated
|
|
||
| key = WARM_RESTART:name ; name is the name of SONiC docker or "system" for global configuration | ||
| enable = "true" / "false" ; default value as false | ||
| timer = 1*4DIGIT ; timer to determine when reconciliation is done, in seconds. |
There was a problem hiding this comment.
add comment that "not all docker will need the reconciliation timer, for example if BGP docker use GR then as soon as the EOR message is received, the reconciliation is finished."
There was a problem hiding this comment.
A couple of points here...
-
The bgp-restartability solution that i've been working on does not rely on EOR functionality -- neither FRR nor Quagga are capable today of generating this message from BGP down to Zebra.
-
Even if we had an EOR message coming to fpmSyncd, we would still need a timer to make sure that the reconciliation process is always executed after a reasonable amount of time has elapsed -- the EOR message could get lost or not being generated due to a bug or diverse reasons.
There was a problem hiding this comment.
Btw, as discussed, i'm fine with this docker/feature-specific timer.
There was a problem hiding this comment.
neighborsyncd will need the timer as well, this timer is not necessarily docker specific, but more process specific.
BTW: Should we keep the table key/fields definition in some file, so we know what exactly the tables are supported?
There was a problem hiding this comment.
If the timer is more process specific, some update here is necessary since the key here is either "system" or docker name. For now, we don't want to have support for process level warm restart knob(?) .
I guess this swss-schema.md file is a good place to keep the key/fields definitions for all the DBs?
There was a problem hiding this comment.
Maybe a little bit flexibility will help:
timer_name = 1*255VCHAR ; timer_name is the name of warm start timer for the whole system or the specific docker,
; Ex. "neighbor_timer", "bgp_timer".
; There could be more than one timer in a docker and the system.
timer_duration = 1*4DIGIT ; timer duration, in seconds.
There was a problem hiding this comment.
Looks good to me.
I think we need put the table key/fields definitions in the code (e,g, .h file) as well, so it is synced by all code. This can be done later.
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
|
|
||
| enable = "true" / "false" ; Default value as false. | ||
| ; If "system" warm start knob is true, docker level knob will be ignored. | ||
| ; If "system" warm start knob is false, docker level knob takes effect. |
There was a problem hiding this comment.
docker knob always override system level knob, that was the use cases we dicussed.
There was a problem hiding this comment.
The tricky point is the default value is false. If docker knob always override system level knob, when warm restart is enabled at system level, each docker has to be enabled too.
What is the use case of having warm restart enabled at system level, while disabled for certain docker?
There was a problem hiding this comment.
if docker knob does not have this attribute, then use system level.
the use case was rodney's, snmp docker snmp warm reboot is available but experimental, so in production he wants to explicitly disable it although system level is enabled.
There was a problem hiding this comment.
@rodnymolina for snmp case, probably it should alway do cold boot as other dockers like lldp, dhcp_relay until its warm reboot support is stable enough.
If the existence of certain attribute is treated as one condition, then it becomes tri-state variable: true/false/not-exist, to me, it seems error prone. We don't have such case in sonic yet(?).
There was a problem hiding this comment.
In general, more specific knobs take precedence over global configuration. I think we can use this principle here as well.
If we don't like the not-exist case, we need this knob for every docker. For the time-being, we can define the knobs for dockers which have code to support warm reboot, for the ones do not support warm reboot, they do not look at the knobs anyways.
There was a problem hiding this comment.
If docker knob always takes precedence over system knob, the not-exist case is asked implicitly, otherwise why system knob is needed if it gets preempted all the time. That means the default value for enable is not-exist.
The knobs are also used to control whether to flush DB.
Ex. based on the current schema, if system knob or one docker knob enabled, we keep the appDB and some other DBs data.
But if docker knob always takes precedence, and it is disabled, should the DB be flushed or not. If DB is not flushed, the specific docker will have to handle the scenarios of DB with existing data though it is supposed to do a cold start. If DB is flushed, how about other dockers?
Better not to support the case of system knob enabled, while docker knob disabled and takes effect. That is one of the reasons for If "system" warm start knob is true, docker level knob will be ignored.
There was a problem hiding this comment.
maybe we should just simplify it by putting it into DEVICE_METADATA:localhost
warm_restart@: none, system, or a list of individual docker names
for most of use cases, it is system or none. if a user/developer wants to customized, he will specify a list of docker names.
warm_restart_timer_bgp: time_duration
warm_restart_timer_neighbor: time_duration
There was a problem hiding this comment.
Quite a few un-documented configurations have been put under DEVICE_METADATA.
If the same set of warm restart configurations are used, squashing them into DEVICE_METADATA:localhost doesn't seem to bring any benefit other than to make the configuration more obfuscated. Also more parameters might be needed for warm restart.
There was a problem hiding this comment.
@jipanyang As talked, this "system" knob was meant for whole system warm reboot, not a global setting of docker restart, that would lead to a different discussion.
Could you please help to provide (the current) work flow of using these knobs during the shutdown and bringing up of the dockers/system? We can then think whether the whole work flow make sense and keeps everybody on the same page.
|
|
||
| enable = "true" / "false" ; Default value as false. | ||
| ; If "system" warm start knob is true, docker level knob will be ignored. | ||
| ; If "system" warm start knob is false, docker level knob takes effect. |
There was a problem hiding this comment.
After discussion with @zhenggen-xu , a little more clarification about the use cases for system knob and docker knobs.
<1> Upon system reboot, the system enable knob will be checked. If enabled, database data will be preserved, if not, database will be flushed.
No need to check docker level knobs in this case since the whole system is being rebooted .
<2> Upon docker service start, first to check system knob, if enabled, docker warm start should be performed, otherwise system warm reboot will be ruined. If system knob disabled, while docker knob enabled, this is likely an individual docker warm restart request.
Within each application, when the system knob or docker knob enabled, we do further check on the actual warm start state ( restart_count), if no warm start state data available, the database has been flushed, do cold start. Otherwise warm start.
The exact logic has been implemented in WarmStart::checkWarmStart():
https://github.com/Azure/sonic-swss/pull/547/files
There was a problem hiding this comment.
I am fine with this logic.
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
…sonic-net#553) * Add basic schema for warm start schema in configDB and application DB. Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com> * Move warm start table for process stats to state DB Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com> * Add reconciliation timer entry in configDB warm restart table. Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com> * Update warm restart timer schema Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com> * There might be more than one timer at system level or individual docker Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
Signed-off-by: Jipan Yang jipan.yang@alibaba-inc.com
What I did
Basic schema for warm start
Configuration: Mainly to decide whether to flush the DB data and some timer setting for arp/route sync.
AppDB: Store basic warm restore and sync up state and restart count for each process.
Why I did it
How I verified it
Details if related