Warm reboot: Support vlanmgrd process warm restart #550

lguohan merged 7 commits into sonic-net:master from jipanyang:warm_reboot_collab_3_vlanmgrd on Aug 16, 2018

@jipanyang
Contributor

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>

What I did
Add support for vlanmgrd process warm restart.

Why I did it

How I verified it

Check warm restart count of the swss processes

root@sonic:/home/admin# redis-cli  
127.0.0.1:6379> keys WAR*
1) "WARM_RESTART_TABLE:portsyncd"
2) "WARM_RESTART_TABLE:orchagent"
3) "WARM_RESTART_TABLE:vlanmgrd"
4) "WARM_RESTART_TABLE:neighsyncd"

127.0.0.1:6379> hgetall  "WARM_RESTART_TABLE:orchagent"
1) "restart_count"
2) "1"
3) "state_restored"
4) "true"
127.0.0.1:6379> hgetall "WARM_RESTART_TABLE:vlanmgrd"
1) "restart_count"
2) "1"

Kill the vlanmgrd process and start it again

root@sonic:/# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.2  0.2  56476 16852 ?        Ss+  01:41   0:00 /usr/bin/python /usr/bin/supervisord
root        50  0.0  0.0 258684  3204 ?        Sl   01:41   0:00 /usr/sbin/rsyslogd -n
root        64  2.4  0.2 192820 16956 ?        Sl   01:41   0:03 /usr/bin/orchagent -d /var/log/swss -b 8192 -m 00:05:64:30:73:c0
root        89  0.0  0.0  99864  3248 ?        Sl   01:41   0:00 /usr/bin/intfsyncd
root        92  0.0  0.0  99896  3788 ?        Sl   01:41   0:00 /usr/bin/neighsyncd
root       113  0.0  0.0  99876  3840 ?        Sl   01:41   0:00 /usr/bin/vlanmgrd
root       120  0.0  0.0  99820  4036 ?        Sl   01:41   0:00 /usr/bin/intfmgrd
root       131  0.0  0.0  99852  3760 ?        Sl   01:41   0:00 /usr/bin/buffermgrd -l /usr/share/sonic/hwsku/pg_profile_lookup.ini
root       146  0.0  0.0  20048  2884 ?        S    01:41   0:00 bash -c /usr/bin/arp_update; sleep 300
root       157  0.0  0.0   4236   712 ?        S    01:41   0:00 sleep 300
root       423  0.0  0.0  20244  3268 ?        Ss   01:43   0:00 /bin/bash
root       565  1.6  0.0  99804  3936 ?        Sl   01:43   0:00 /usr/bin/portsyncd -p /usr/share/sonic/hwsku/port_config.ini
root       585  0.0  0.0  17504  2180 ?        R+   01:43   0:00 ps aux
root@sonic:/# 
root@sonic:/# 
root@sonic:/# 
root@sonic:/# 
root@sonic:/# 
root@sonic:/# pkill -x vlanmgrd
root@sonic:/# supervisorctl start vlanmgrd
vlanmgrd: started
root@sonic:/# 
root@sonic:/# ps -x 
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss+    0:01 /usr/bin/python /usr/bin/supervisord
   50 ?        Sl     0:00 /usr/sbin/rsyslogd -n
   64 ?        Sl     0:05 /usr/bin/orchagent -d /var/log/swss -b 8192 -m 00:05:64:30:73:c0
   89 ?        Sl     0:00 /usr/bin/intfsyncd
   92 ?        Sl     0:00 /usr/bin/neighsyncd
  120 ?        Sl     0:00 /usr/bin/intfmgrd
  131 ?        Sl     0:00 /usr/bin/buffermgrd -l /usr/share/sonic/hwsku/pg_profile_lookup.ini
  423 ?        Ss     0:00 /bin/bash
  565 ?        Sl     0:33 /usr/bin/portsyncd -p /usr/share/sonic/hwsku/port_config.ini
 6539 ?        Sl     0:00 /usr/bin/vlanmgrd
 6965 ?        S      0:00 bash -c /usr/bin/arp_update; sleep 300
 6976 ?        S      0:00 sleep 300
 7304 ?        R+     0:00 ps -x

No traffic loss was observed. Checking the restart count again shows that the vlanmgrd restart_count was incremented by 1, while the orchagent restart_count is unchanged.

127.0.0.1:6379> 
127.0.0.1:6379> hgetall "WARM_RESTART_TABLE:vlanmgrd"
1) "restart_count"
2) "2"
127.0.0.1:6379> hgetall  "WARM_RESTART_TABLE:orchagent"
1) "restart_count"
2) "1"
3) "state_restored"
4) "true"
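
The before/after comparison above can also be scripted. The following is a minimal sketch and not part of this PR: it parses the numbered field/value list that `redis-cli` prints for `hgetall` and compares `restart_count` across two snapshots. The helper names `parse_hgetall` and `restart_count_delta` are hypothetical, and the sample strings are copied from the output shown above.

```python
import re


def parse_hgetall(output):
    """Turn redis-cli HGETALL output (numbered, quoted lines) into a dict."""
    values = []
    for line in output.splitlines():
        match = re.match(r'\s*\d+\)\s+"(.*)"\s*$', line)
        if match:
            values.append(match.group(1))
    # HGETALL alternates field, value, field, value, ...
    return dict(zip(values[0::2], values[1::2]))


def restart_count_delta(before, after):
    """Return how much restart_count grew between two snapshots."""
    return int(after["restart_count"]) - int(before["restart_count"])


# Snapshots of WARM_RESTART_TABLE:vlanmgrd before and after the restart.
before = parse_hgetall('1) "restart_count"\n2) "1"\n')
after = parse_hgetall('1) "restart_count"\n2) "2"\n')
print(restart_count_delta(before, after))
```

A delta of 1 for vlanmgrd together with a delta of 0 for orchagent matches what the manual check reports.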

Details if related

Has dependency on

sonic-net/sonic-swss-common#211
#547

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
@lguohan
Contributor

lguohan commented Jul 27, 2018

Based on our discussion, we need a VS test for the warm reboot workflow. Please add one.

{
    // Don't reset vlan aware bridge upon swss docker warm restart.
    SWSS_LOG_INFO("vlanmgrd warm start, skipping bridge create");
    return;

Contributor

I am not sure this approach is bulletproof.

What if we later change the vlan_filtering option, or enable more options for the bridge? An older version may not have an option that the new vlanmgrd enables, and the warm reboot would miss it.

I think the right approach is still to remove everything and add it back. This is mainly control plane, so we can still avoid data plane disruption.

Contributor Author

Removing the VLAN doesn't affect the data plane directly, but the BGP docker and BGP will be affected, causing route flapping.

If the vlan_filtering option changes, though that is unlikely for now, the easier way is probably to handle it explicitly.

Contributor

I am not particularly worried about vlan_filtering. I am more worried about future bridge attributes, such as options to disable unknown multicast or unknown unicast.

For the BGP docker, I think we can do a docker pause.

https://docs.docker.com/engine/reference/commandline/pause/

Contributor Author

Deletion and creation of the bridge and VLAN are done in the Linux kernel, and zebra listens for those events. Pausing the docker in this case may trigger unknown side effects, since we don't know the exact timing of the netlink messages. It also makes interface handling more complex.

For disabling unknown multicast and unknown unicast, we should probably add configuration options for them. That was in the original VLAN trunk pull request; we may bring it back and refine the change later.

Contributor

My point is that you never know what you are going to add to the bridge in the future. Therefore, we should create exactly the same bridge as we create in cold boot.

To ensure that, the cleanest approach is to remove and recreate, so that we share the same code path as the cold boot.

If this can cause control plane disruption, we should then stop the BGP container and do BGP graceful restart (GR).



Contributor Author

Besides system-level warm reboot, we want to support docker warm restart. We intend to use the same code path for cold boot and warm boot whenever possible.

In this case, the code is in the constructor phase. I don't see why a new option has to be put here instead of being a configuration option. Also, we use docker to separate the services, so stopping another docker while operating on one docker doesn't seem that clean.
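
To make the two positions in this thread concrete, here is a rough sketch. It is not the vlanmgrd code: `bridge_init_commands` is a hypothetical helper, the `recreate_on_warm` flag does not exist in the PR, and the command strings are illustrative assumptions rather than the exact commands vlanmgrd runs.

```python
def bridge_init_commands(warm_start, recreate_on_warm=False):
    """Return the shell commands a manager would run at start-up.

    Illustrative only: the command strings below are assumptions.
    """
    cold_boot = [
        "ip link del Bridge",
        "ip link add Bridge up type bridge",
        "ip link set Bridge type bridge vlan_filtering 1",
    ]
    if warm_start and not recreate_on_warm:
        # Approach taken in this PR: leave the existing bridge untouched.
        return []
    # Cold boot, or the reviewer's suggestion to reuse the cold boot path.
    return cold_boot


print(bridge_init_commands(warm_start=True))
print(bridge_init_commands(warm_start=True, recreate_on_warm=True))
```

Skipping the commands avoids the kernel delete/create events that zebra listens for, at the cost of not picking up bridge options added in a newer version. Recreating the bridge guarantees the same state as cold boot, at the cost of those events.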

@jipanyang
Contributor Author

For VS testing, I was thinking of adding tests after the basic framework is ready. It looks like we could add some basic tests, such as verifying bridge VLAN info and restart count, and expand them later with more verification. Will work on it.

@jipanyang
Contributor Author

jipan@sonic-build-2:~/warm_reboot/sonic-buildimage/src/sonic-swss/tests$ sudo pytest -v --dvsname=vs test_warm_reboot.py
======================================================================= test session starts =======================================================================
platform linux2 -- Python 2.7.12, pytest-3.3.0, py-1.5.4, pluggy-0.6.0 -- /usr/bin/python
cachedir: .cache
rootdir: /home/jipan/warm_reboot/sonic-buildimage/src/sonic-swss/tests, inifile:
collected 1 item

test_warm_reboot.py::test_VlanMgrdWarmRestart PASSED [100%]

==================================================================== 1 passed in 39.69 seconds ====================================================================

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
jipanyang force-pushed the warm_reboot_collab_3_vlanmgrd branch from adb8c08 to c146887 on July 31, 2018 18:46
jipanyang changed the title from "Support vlanmgrd process warm restart" to "Warm reboot: Support vlanmgrd process warm restart" on Aug 1, 2018
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
@lguohan
Contributor

lguohan commented Aug 3, 2018

retest this please

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
@jipanyang
Contributor Author

jipanyang commented Aug 9, 2018

Any reason why it failed on the Jenkins server? It works for me locally, and CFG_WARM_RESTART_TABLE_NAME has been defined in swss-common schema.h.

        # enable warm restart
        # TODO: use cfg command to config it
        create_entry_tbl(
            conf_db,
>           swsscommon.CFG_WARM_RESTART_TABLE_NAME, "swss",
            [
                ("enable", "true"),
            ]
        )
E       AttributeError: 'module' object has no attribute 'CFG_WARM_RESTART_TABLE_NAME'

test_warm_reboot.py:108: AttributeError
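
The `AttributeError` above says that the `swsscommon` Python module loaded on the Jenkins server did not expose `CFG_WARM_RESTART_TABLE_NAME`. One way for a test to tolerate such a mismatch is to fall back to a literal table name. This is a minimal sketch under stated assumptions: `table_name` is a hypothetical helper, the literal `"WARM_RESTART"` is an assumed value used for illustration, and `SimpleNamespace` stands in for the real module so the example runs on its own.

```python
from types import SimpleNamespace


def table_name(module, attr, fallback):
    """Use the module's constant when present, else a hard-coded fallback."""
    return getattr(module, attr, fallback)


# Stand-ins for a swsscommon build without and with the constant.
old_module = SimpleNamespace()
new_module = SimpleNamespace(CFG_WARM_RESTART_TABLE_NAME="WARM_RESTART")

# "WARM_RESTART" is an assumed value, used here only for illustration.
print(table_name(old_module, "CFG_WARM_RESTART_TABLE_NAME", "WARM_RESTART"))
print(table_name(new_module, "CFG_WARM_RESTART_TABLE_NAME", "WARM_RESTART"))
```

A hard-coded fallback can drift from the schema, which is a reason to drop it once the test environment exposes the constant.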

…s PR

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
…ME to avoid jenkins environment error

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
jipanyang force-pushed the warm_reboot_collab_3_vlanmgrd branch from 37a4a5f to 9b83c1e on August 13, 2018 03:30
Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>
lguohan merged commit 76f258b into sonic-net:master on Aug 16, 2018
jipanyang deleted the warm_reboot_collab_3_vlanmgrd branch on February 9, 2019 02:32
oleksandrivantsiv pushed a commit to oleksandrivantsiv/sonic-swss that referenced this pull request Mar 1, 2023
…ic-net#550)

* [sairedis] Add SkipRecordAttrContainer class

* [sairedis] Start using SkipRecordAttrContainer class
Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025
* Support vlanmgrd process warm restart

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>

* [VS]: add test case for vlanmgrd warm restart

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>

* Adapt to the new warm reboot schema

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>

* Update warm_restart common functions

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>

* warm_restart common functions already available, remove them from this PR

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>

* Use fixed CFG_WARM_RESTART_TABLE_NAME and STATE_WARM_RESTART_TABLE_NAME  to avoid jenkins environment error

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>

* Remove hardcoded names for warm restart config table and state table

Signed-off-by: Jipan Yang <jipan.yang@alibaba-inc.com>

3 participants