[fast-reboot]: New version of fast reboot test#204
[fast-reboot]: New version of fast reboot test#204pavel-shirshov merged 11 commits intosonic-net:masterfrom pavel-shirshov:pavelsh/fr_update
Conversation
| # if LACP is inactive on VMs and interfaces goes down test failes | ||
| # if BGP graceful restart timeout is not equal 120 seconds the test fails | ||
| # if BGP graceful restart is not enabled on DUT the test fails | ||
| # If BGP graceful restart timeout is less than 15 seconds the test fails |
There was a problem hiding this comment.
If BGP graceful restart timeout is less than 15 seconds the test fails [](start = 5, length = 70)
this is contradictory with the previous statement.
"if BGP graceful restart timeout is not equal 120 seconds the test fails"
There was a problem hiding this comment.
previous statement was about static,configured value of graceful restart, this one about dynamic value
| # 1. Check that DUT is stable. That means that "pings" work in both directions: from T1 to servers and from servers to T1. | ||
| # 2. If DUT is stable the test starts continiously pinging DUT in both directions. | ||
| # 3. The test runs '/usr/bin/fast-reboot' on DUT remotely. The ssh key supposed to be uploaded by ansible before the test | ||
| # 3. As soon as it sees that ping starts failuring in one of directions the test registers a start of dataplace disruption |
There was a problem hiding this comment.
3 [](start = 2, length = 1)
step 4.
| def parse_lacp(self, output): | ||
| return output.find('Bundled') != -1 | ||
|
|
||
| def parse_bgp_neig_once(self, output): |
There was a problem hiding this comment.
neig? can we use the full name?
|
|
||
| return stdout, stderr, return_code | ||
|
|
||
| def ssh_job(self, ip, queue): |
There was a problem hiding this comment.
ssh_job [](start = 8, length = 7)
can we make the name more meanful? like peer_state_check
|
|
||
| def ping_iteration(self): | ||
| return self.pingFromServers() > 0 and self.pingFromUpperTier() > 0 | ||
| nr_from_servers = self.pingFromServers() |
There was a problem hiding this comment.
it is for number
| q.put('go') | ||
| if success and is_stop or not success and not is_stop: | ||
| self.log("Success", True) | ||
| self.log("Base state", True) |
There was a problem hiding this comment.
Base state [](start = 26, length = 10)
what is base state and what is changed state?
There was a problem hiding this comment.
We're looking for a change in this function. So "Base state" it's a state which is expected before the change, "Changed" state it's a state which is expected after change.
| self.log("Wait until ASIC stops") | ||
| self.timeout(self.task_timeout, "DUT hasn't stopped in %d seconds" % self.task_timeout) | ||
| no_routing_start = self.check_stop() | ||
| no_routing_start, upper_replies = self.check_stop() |
There was a problem hiding this comment.
check_stop [](start = 47, length = 10)
rename to check_forwarding_stop?
| self.log("ASIC was stopped, Waiting until it's up. Stop time: %s" % str(no_routing_start)) | ||
| self.timeout(self.task_timeout, "DUT hasn't started to work for %d seconds" % self.task_timeout) | ||
| no_routing_stop = self.check_start() | ||
| no_routing_stop, _ = self.check_start() |
There was a problem hiding this comment.
check_start [](start = 34, length = 11)
rename to check_forwarding_resume?
|
|
||
| self.assertTrue(no_routing_stop - no_routing_start < self.limit, "Downtime must be less then %s seconds" % self.test_params['fast_reboot_limit']) | ||
| self.log("Reboot time was %s" % str(no_routing_stop - self.reboot_start)) | ||
| self.log("Number replies when control plane was down: %d Expected: %d" % (no_cp_replies, self.nr_vl_pkts)) |
There was a problem hiding this comment.
what is this reply? can you explain?
|
|
||
| return [('lo', lo_route, lo_valid, lo_prefix, lo_nexthop), ('vlan', vlan_route, vlan_valid, vlan_prefix, vlan_nexthop)] | ||
|
|
||
| def test_all_client_cases(self, output): |
There was a problem hiding this comment.
test_all_client_cases [](start = 8, length = 21)
rename to check_all_peer_status?
| self.fails['dut'].add("Downtime must be less then %s seconds" % self.test_params['fast_reboot_limit']) | ||
| if no_routing_stop - self.reboot_start > datetime.timedelta(seconds=self.test_params['graceful_limit']): | ||
| self.fails['dut'].add("Fast-reboot cycle must be less then graceful limit %s seconds" % self.test_params['graceful_limit']) | ||
| if no_cp_replies < 0.95 * self.nr_vl_pkts: |
There was a problem hiding this comment.
if no_cp_replies < 0.95 * self.nr_vl_pkts: [](start = 5, length = 45)
why 0.95? when control plane is down, the data plane should be 100% up in our case, right?
There was a problem hiding this comment.
In case fanout switches, Linux kernel, NIC drivers, ptf itself don't lose any packets which is not always true. So I made it 95% which work for me.
| if len(non_zero) > 1 and non_zero[-1] < non_zero[-2]: | ||
| return non_zero[-2] | ||
| else: | ||
| return non_zero[-1] |
There was a problem hiding this comment.
maybe it is more robust just to find out the 90% of the data for non zero. Meaning 90% percentage of time, the you have the reachability to the n host, and the n should be equal to nr_vl_pkts.
There was a problem hiding this comment.
Honestly It sounds more complicated to me.
| vlan_ip_range = self.test_params['vlan_ip_range'] | ||
|
|
||
| _, mask = vlan_ip_range.split('/') | ||
| n_hosts = min(2**(32 - int(mask)) - 3, self.max_nr_vl_pkts) |
There was a problem hiding this comment.
n_hosts = min(2**(32 - int(mask)) - 3, self.max_nr_vl_pkts) [](start = 8, length = 59)
i think it is better to simply the test, and we just define the number of hosts under the vlan and send 1 packet per host every iteration.
sending multiple packet per host does not seem neccessary and it complicates the data interpretation later.
There was a problem hiding this comment.
My code calculates the minimum number of hosts which could be used with current size of Vlan network. Our preference is to check against 1000 vlan hosts. But if size of vlan network is less then /22, we can't test for 1000 vlan hosts. So I take maximum possible value for the vlan network size..
Also. I send only one packet per host.
|
|
||
| gr_active, timer = other['bgp_neig'] | ||
| # wnen it's False, it's ok, wnen it's True, check that inactivity timer not less then 15 seconds | ||
| if gr_active and datetime.datetime.strptime(timer, '%H:%M:%S') < datetime.datetime(1900, 1, 1, second = 15): |
There was a problem hiding this comment.
15 [](start = 116, length = 2)
make this as an parameter?
| if no_routing_stop - no_routing_start > self.limit: | ||
| self.fails['dut'].add("Downtime must be less then %s seconds" % self.test_params['fast_reboot_limit']) | ||
| if no_routing_stop - self.reboot_start > datetime.timedelta(seconds=self.test_params['graceful_limit']): | ||
| self.fails['dut'].add("Fast-reboot cycle must be less then graceful limit %s seconds" % self.test_params['graceful_limit']) |
There was a problem hiding this comment.
then [](start = 66, length = 4)
than
| self.log("ASIC was stopped, Waiting until it's up. Stop time: %s" % str(no_routing_start)) | ||
| self.timeout(self.task_timeout, "DUT hasn't started to work for %d seconds" % self.task_timeout) | ||
| no_routing_stop = self.check_start() | ||
| no_routing_stop, _ = self.check_start() |
There was a problem hiding this comment.
I wonder what should be the final exit criteria for fast-reboot? whether we need to check if all peer bgp session are up (exit from graceful restart mode)?
There was a problem hiding this comment.
I defined the exit criteria as working dataplane. I can check that all bgp session are up at the end of the test.
* New version of fast reboot test
…lure (sonic-net#204) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Fixes # (issue) This PR fixes issue sonic-net#9052, where '_test_default_route_with_bgp_flap_' fails with "_Default route nexthops doesn't match the testbed topology_" issue. The count of the next hops is not matching with the upstream neighbor as per the topology (reading data from config facts). ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] Test case(new/improvement) ### Back port request - [ ] 202012 - [ ] 202205 - [ ] 202305 - [ ] 202311 - [x] 202405 ### Approach #### What is the motivation for this PR? - '_test_default_route_with_bgp_flap_' fails with "_Default route nexthops doesn't match the testbed topology_" issue. - The count of the next hops is not matching with the upstream neighbor as per the topology. #### How did you do it? - Instead of skipping T3 neighbors of AZNGHub, use the address family and next hops information to find out the list of upstream neighbors. #### How did you verify/test it? - Ran the above-mentioned test case on a T2 chassis and made sure the test passed without any issues. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? -->  
* 664f0e2 2021-07-14 | [xrcvd]: Removed undefined symbol 'sfp_status_helper' (sonic-net#204) (HEAD, origin/202012) [Prince George] * 1b2d016 2021-06-16 | [CI] sonic-config-engine now depends on SONiC YANG packages (sonic-net#194) [Joe LeVeque] * 1cf5996 2021-07-14 | Introduce mgmtinit delay after transceiver module insertion (sonic-net#201) [Prince George] Signed-off-by: Guohan Lu <[email protected]>
… submodule head (sonic-net#12494) utilities: * 02eb899e 2022-07-12 | [config/load_mgmt_config] Support load IPv6 mgmt IP (sonic-net#2206) (sonic-net#2246) (sonic-net#2256) (HEAD -> 201811, github/201811) [Jing Kan] platform-daemon: * fc288cc 2022-05-05 | Mem leak caused by Xcvrd in Send-Q of REDIS-DB socket connection (sonic-net#260) (HEAD -> 201811, github/201811) [Prince George] platform-common: * edb062b 2022-02-09 | [sonic_sfp] Interpret sff 'int' element =0 as valid value (sonic-net#261) (HEAD -> 201811, github/201811) [Prince George] kernel: * 9d2d1a1 2022-02-11 | [201811] Increase log_buf_len size to 1M (HEAD -> 201811, github/201811) [Sujin Kang] * b34a213 2022-02-10 | [201811] Increase log_buf_len size to 1M [Samuel Angebault] * c4684cb 2022-02-11 | [201811] Apply kernel patches to fix emmc unreliability [Sujin Kang] * df68771 2022-02-03 | [201811] Apply kernel patches to fix emmc unreliability [Samuel Angebault] * f21cb06 2021-03-26 | [ci]: remove 201811 suffix (sonic-net#204) [lguohan] * 5439b2a 2021-02-08 | [dni_dps460] Add attributes to retrieve PMBus status command codes (sonic-net#194) [Arun Saravanan Balachandran] Signed-off-by: Ying Xie <[email protected]> Signed-off-by: Ying Xie <[email protected]>
Included commits in sonic-py-swsssdk ``` 63c75c1 2021-03-14 | Workaround Mellanox default vlan has no SAI_VLAN_ATTR_VLAN_ID attribute (sonic-net#103) [Qi Luo] ``` Included commits in sonic-snmpagent ``` a8c6e36 2021-03-15 | Implement rfc4363 FdbUpdater for lag inside vlan (sonic-net#204) [Qi Luo] ```
Changes: