Post Fast reboot static route configured is missing#6225
Post Fast reboot static route configured is missing#6225sudhanshukumar22 wants to merge 1 commit intosonic-net:masterfrom
Conversation
855e573 to
7ba63c0
Compare
|
@pavel-shirshov , @lguohan - This one has been inactive for a week - how do we move towards merge pls? |
7ba63c0 to
bb30c50
Compare
|
Hi all, |
|
retest vsimage please |
|
does it also happen on the normal reboot? from the description it seems it can happen on normal reboot as well. |
|
@shi-su , we have solve a race condition between zebra and staticd in #6478 , @sudhanshukumar22, can you look at that pr and see if we can the same approach to solve the problem. |
@lguohan , yes, it is a basic testcase and will happen with normal reboot as well. here, we have to create the static route, save the FRR config and reboot. After reboot, the static route present in FRR config will not be replayed. |
bb30c50 to
0925d7b
Compare
|
can you look the other people and use the same approach |
There was a problem hiding this comment.
this is unrelated change, should be move to a different pr.
There was a problem hiding this comment.
OK, I will move it to another PR. This change was done to support split mode along with separate mode. The default is still separate mode.
There was a problem hiding this comment.
unrelated changes. can you pull to a different pr?
There was a problem hiding this comment.
@shi-su , i think the problem you discovered is a generic problem. i think it is better that you pull you script as a delicated script and wait for zebra socket available, and then both bgpd, staticd or other daemons shoud wait for that script to exit.
There was a problem hiding this comment.
Agree. I will rewrite my script as a separate one to wait for zebra to be available and then let the daemons wait for it to exit similar to the code changes here.
There was a problem hiding this comment.
I drafted a PR to pull the script #6519. It took some ideas in this PR, thanks to @sudhanshukumar22. Maybe we can stick to that script. There are some other changes in this PR, maybe they can be rebased onto #6519.
There was a problem hiding this comment.
I will review that code.
There was a problem hiding this comment.
Since @shi-su has done the changes for wait script, I will remove this. I will put my changes on top of them.
|
@sudhanshukumar22, can you take a look at #6519 ? |
dockers/docker-fpm-frr/zsocket.sh
Outdated
There was a problem hiding this comment.
do we really need to wait this long?
There was a problem hiding this comment.
I had put max wait. It will not matter since the script will exit as soon as zebra socket is available.
0925d7b to
938c035
Compare
dockers/docker-fpm-frr/Dockerfile.j2
Outdated
There was a problem hiding this comment.
Is net-tools still necessary here?
There was a problem hiding this comment.
I have one doubt. I am deleting the /etc/sonic/config_db.json and rebooting the system. After reboot, I expect that config db and vtysh config should be empty. But I see that we still have BGP entries in redis DB. e.g.",
router bgp 65100
neighbor 10.0.0.1 remote-as 65200"
Can you suggest where from these entries are coming after reboot. It is expected that when we delete config_db.json and rebooting, the redis DB should also become empty.
There was a problem hiding this comment.
Could you check if /etc/sonic/minigraph.xml exists? I think if config_db.json does not exist, it will try to recover the configuration from minigraph.
There was a problem hiding this comment.
I have checked this file. It is not present in /etc/sonic, but at some other place. But even after deleting the file and deleting /etc/sonic/config_db.json and reboot, I still see that all BGP configurations are retained. I don't know from where this config is getting replayed. I am using default mode (separated).
There was a problem hiding this comment.
Is there anything (e.g., config_db.json, minigraph.xml) in /host/old_config?
dockers/docker-fpm-frr/Dockerfile.j2
Outdated
There was a problem hiding this comment.
This line is already added a few lines later.
There was a problem hiding this comment.
Could you please help clarify why these changes are necessary on this topic? What will happen without these changes?
src/sonic-bgpcfgd/bgpcfgd/main.py
Outdated
There was a problem hiding this comment.
Do we really need to wait for this long? Is there a scenario that the 20-second is not enough?
938c035 to
07fc2d8
Compare
07fc2d8 to
5cd97d0
Compare
- Why I did it
Post Fast reboot static route configured is missing.
Log:
Prior To fast reboot:
2020-11-03 07:48:28,726 T0000: INFO [D1-H3] FCMD: show ip route
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] Codes: K - kernel route, C - connected, S - static, R - RIP,
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] F - PBR, f - OpenFabric,
2020-11-03 07:48:28,828 T0000: INFO [D1-H3] > - selected route, * - FIB route, q - queued route, r - rejected route
2020-11-03 07:48:28,829 T0000: INFO [D1-H3]
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] K>* 0.0.0.0/0 [0/202] via 10.59.128.1, eth0, 00:02:04
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] C>* 10.10.10.0/24 is directly connected, Ethernet120, 00:01:34
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] C>* 10.59.128.0/20 is directly connected, eth0, 00:02:04
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] S>* 20.20.20.0/24 [1/0] via 10.10.10.2, Ethernet120, 00:01:23
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] C>* 192.168.12.0/24 is directly connected, Ethernet121, 00:00:29
2020-11-03 07:48:28,829 T0000: INFO [D1-H3] sonic#
Post Fast Reboot:
2020-11-03 07:51:40,228 T0000: INFO [D1-H3] FCMD: show ip route
2020-11-03 07:51:40,329 T0000: INFO [D1-H3] Codes: K - kernel route, C - connected, S - static, R - RIP,
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] T - Table, v - VNC, V - VNC-Direct, A - Babel, D - SHARP,
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] F - PBR, f - OpenFabric,
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] > - selected route, * - FIB route, q - queued route, r - rejected route
2020-11-03 07:51:40,330 T0000: INFO [D1-H3]
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] K>* 0.0.0.0/0 [0/202] via 10.59.128.1, eth0, 00:02:08
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] C>* 10.10.10.0/24 is directly connected, Ethernet120, 00:00:58
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] C>* 10.59.128.0/20 is directly connected, eth0, 00:02:08
2020-11-03 07:51:40,330 T0000: INFO [D1-H3] C>* 192.168.12.0/24 is directly connected, Ethernet121, 00:00:58
2020-11-03 07:51:40,331 T0000: INFO [D1-H3] sonic#
- How I did it
Allow zebra socket to come up first before static daemon can start. This is because if static daemon comes up before or at the same time as zebra, zebra misses the static route add message from static daemon. This causes static route to be not present in zebra. This is easily seen during reboot (after config save in FRR).
- How to verify it
- Which release branch to backport (provide reason below if selected)
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)