Fixes problems with database - swss - syncd synchronization. #110
…race condition with database flush/set
[Service]
User={{ sonicadmin_user }}
# Wait for redis server start before database clean by checking the server listening port 6379
ExecStartPre=/bin/bash -c "while true; do if [ -n \"$(netstat -l | grep 6379)\" ]; then break; fi; sleep 1; done"
How about using "nc -z -w 5 127.0.0.1 6379" to check if the port is open? There could be a port 36379 that matches your criteria.
@lguohan netstat output looks as follows on the box:
admin@switch2:~$ netstat -l
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 localhost:6379 *:* LISTEN
So I guess it's better to check for ":6379". Then it will cover all the cases.
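To illustrate the concern raised in this thread, here is a small sketch comparing a bare `6379` match with a port-anchored match. The sample lines are hypothetical (the 36379 listener is made up for the example) and only mimic the netstat layout shown above.

```shell
#!/bin/bash
# Hypothetical netstat-style lines; the 36379 listener is invented for illustration.
sample='tcp 0 0 localhost:6379 *:* LISTEN
tcp 0 0 localhost:36379 *:* LISTEN'

# Bare substring match: also counts the 36379 line.
loose=$(printf '%s\n' "$sample" | grep -c '6379')

# Anchored match: a colon before the port and a non-digit (or end of line) after it.
strict=$(printf '%s\n' "$sample" | grep -Ec ':6379([^0-9]|$)')

echo "loose=$loose strict=$strict"
```

The bare pattern counts both lines, while the anchored pattern counts only the Redis listener.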
Redis CLI has a ping command.
okay, let's check with PING
On my box there are a lot more ports. acsadmin@CCPSCH01030BBLF:~$ sudo netstat -l
On mine too. I've just posted a part of the log.
-Requires=database.service
-After=database.service
+Requires=database.service swss.service
+After=database.service swss.service
syncd does not depend on swss, and syncd should start before swss.
@vitaliy-senchyshyn @lguohan
swss depends on syncd and swss starts after syncd
swss.service is clearing the database
OK, I will test this.
picked this change from sonic-mgmt repo. sonic-net/sonic-mgmt#110
### Description of PR
Summary: Fixes # (issue)
On t2 topo we observed the following failure: `Failed: Not all routes flushed from nexthop 10.0.0.25 on asic 0 on cmp210-4` in tests:
```
pc/test_po_update.py::test_po_update::test_po_update_io_no_loss
pc/test_po_voq.py::test_po_voq::test_voq_po_member_update
```
Increasing the timeout of the check_no_routes_from_nexthop call resolves these issues.
### Type of change
- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
- [ ] Skipped for non-supported platforms
- [ ] Test case improvement
### Back port request
- [ ] 202012
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [x] 202405
- [ ] 202411
### Approach
#### What is the motivation for this PR?
#### How did you do it?
#### How did you verify/test it?
#### Any platform specific information?
#### Supported testbed topology if it's a new test case?
### Documentation
fc5d424 Jing Zhang Fri Aug 12 14:39:59 2022 -0700 [202012] Cherry-pick flaky unit test fixes (sonic-net#115)
faceb93 Jing Zhang Thu Aug 11 10:03:05 2022 -0700 Backoff mux probing for server down scenario (sonic-net#106)
86ddd95 Jing Zhang Fri Aug 12 14:21:37 2022 -0700 Fix race condition caused by strand wrap method (sonic-net#104) (sonic-net#110)
f68a03e Jing Zhang Thu Aug 11 15:31:22 2022 -0700 [lgtm]: add uuid-dev to lgtm prepare (sonic-net#112)
sign-off: Jing Zhang zhangjing@microsoft.com
This PR fixes problems with database - swss - syncd synchronization.
There are two problems:
Feb 2 12:56:23 switch2 INFO docker[798]: Could not connect to Redis at 127.0.0.1:6379: Connection refused
Feb 2 12:56:23 switch2 INFO docker[798]: Could not connect to Redis at 127.0.0.1:6379: Connection refused
Feb 2 12:56:23 switch2 NOTICE systemd[1]: swss.service: control process exited, code=exited status=1
Feb 2 12:56:23 switch2 ERR systemd[1]: Failed to start switch state service container.
Feb 2 12:56:23 switch2 NOTICE systemd[1]: Unit swss.service entered failed state.
To solve this, a bash loop is added to swss.service that checks whether the redis server is up using the redis-cli ping command. If it is not, the loop sleeps for a second before the next try.
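The wait loop described above could look roughly like the sketch below. It is not the exact command from the unit file: the function name `wait_for_redis`, the overridable probe command, and the optional retry limit are assumptions added here so the loop can be tried without a running Redis. The only Redis behavior relied on is that `redis-cli ping` prints `PONG` when the server answers.

```shell
#!/bin/bash
# Hypothetical helper (not from the PR): block until the probe command prints PONG.
#   $1  probe command, defaults to "redis-cli ping"
#   $2  maximum number of attempts, 0 (default) means retry forever
wait_for_redis() {
    local probe="${1:-redis-cli ping}"
    local max_tries="${2:-0}"
    local tries=0

    # $probe is intentionally unquoted so that "redis-cli ping" splits into words.
    until [ "$($probe 2>/dev/null)" = "PONG" ]; do
        tries=$((tries + 1))
        if [ "$max_tries" -gt 0 ] && [ "$tries" -ge "$max_tries" ]; then
            return 1    # gave up: server never answered
        fi
        sleep 1         # wait a second before the next try
    done
    return 0
}
```

Calling `wait_for_redis` with no arguments retries forever, which matches the behavior described above. Passing a limit, for example `wait_for_redis "redis-cli ping" 30`, would let the caller fail the unit instead of hanging if Redis never comes up.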
As a solution, syncd.service is made dependent on swss.service and is started only after swss.service has started.
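One way to check that the new ordering took effect on a box is to inspect the `After=` property that systemd reports for syncd.service. The helper below is a hypothetical sketch (the name `has_after_dep` is not from the PR); it only parses an `After=...` line from standard input, so it can be fed from `systemctl show -p After syncd.service`.

```shell
#!/bin/bash
# Hypothetical helper (not from the PR): succeed if unit $1 appears in an
# "After=unit1 unit2 ..." line read from stdin.
has_after_dep() {
    local dep="$1"
    # Split the After= line into one token per line, then look for an exact match.
    grep -E '^After=' | tr ' =' '\n\n' | grep -qx "$dep"
}

# Usage on a live system (requires systemd):
#   systemctl show -p After syncd.service | has_after_dep swss.service && echo "ordered after swss"
```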