Skip to content

[fast-reboot]: New version of fast reboot test#204

Merged
pavel-shirshov merged 11 commits intosonic-net:masterfrom
pavel-shirshov:pavelsh/fr_update
Jul 27, 2017
Merged

[fast-reboot]: New version of fast reboot test#204
pavel-shirshov merged 11 commits intosonic-net:masterfrom
pavel-shirshov:pavelsh/fr_update

Conversation

@pavel-shirshov
Copy link
Contributor

@pavel-shirshov pavel-shirshov commented Jun 30, 2017

Changes:

  1. "Pings" are changed. They are sent in two direction (as previous):
    • From one of vlan interfaces to T1 devices. The source ip, source interface, and destination IP are chosen randomly from valid choices. Number of packets are 100
    • From all of portchannel ports to all of vlan ports. The source ip, source interface, and destination IP are chosen sequentially from valid choices.
  2. Currently we have 500 distinct destination vlan addresses. Our target to have 1000 of them.
  3. The test measures the dataplane disruption time (as previous), but the disruption time is configurable now
  4. The test measures the control plane disruption time.
  5. The test measures number of packets, which goes in both directions when there're no control plane (sometimes control plane route our packets instead of dataplane)
  6. When the test starts fast-reboot procedure it connects to all VMs (which emulates T1) and starts fetching statuses of BGP and LACP protocols. If BGP routes goes away from VMs, or LACP interfaces goes down the test fails.
  7. The test checks that DUT has BGP graceful restart enabled and graceful restart timeout is equal to 120 seconds.
  8. The test checks that DUT made fast-reboot until the BGP graceful restart timeout doesn't expired
  9. The test checks that VMs has a right next-hop interface (PortChannel), not Management1

# if LACP is inactive on VMs and interfaces goes down test failes
# if BGP graceful restart timeout is not equal 120 seconds the test fails
# if BGP graceful restart is not enabled on DUT the test fails
# If BGP graceful restart timeout is less than 15 seconds the test fails
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If BGP graceful restart timeout is less than 15 seconds the test fails [](start = 5, length = 70)

this is contradictory with the previous statement.

"if BGP graceful restart timeout is not equal 120 seconds the test fails"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previous statement was about static,configured value of graceful restart, this one about dynamic value

# 1. Check that DUT is stable. That means that "pings" work in both directions: from T1 to servers and from servers to T1.
# 2. If DUT is stable the test starts continiously pinging DUT in both directions.
# 3. The test runs '/usr/bin/fast-reboot' on DUT remotely. The ssh key supposed to be uploaded by ansible before the test
# 3. As soon as it sees that ping starts failuring in one of directions the test registers a start of dataplace disruption
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 [](start = 2, length = 1)

step 4.

def parse_lacp(self, output):
return output.find('Bundled') != -1

def parse_bgp_neig_once(self, output):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neig? can we use the full name?


return stdout, stderr, return_code

def ssh_job(self, ip, queue):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ssh_job [](start = 8, length = 7)

can we make the name more meanful? like peer_state_check


def ping_iteration(self):
return self.pingFromServers() > 0 and self.pingFromUpperTier() > 0
nr_from_servers = self.pingFromServers()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is nr?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is for number

q.put('go')
if success and is_stop or not success and not is_stop:
self.log("Success", True)
self.log("Base state", True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Base state [](start = 26, length = 10)

what is base state and what is changed state?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're looking for a change in this function. So "Base state" it's a state which is expected before the change, "Changed" state it's a state which is expected after change.

self.log("Wait until ASIC stops")
self.timeout(self.task_timeout, "DUT hasn't stopped in %d seconds" % self.task_timeout)
no_routing_start = self.check_stop()
no_routing_start, upper_replies = self.check_stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_stop [](start = 47, length = 10)

rename to check_forwarding_stop?

self.log("ASIC was stopped, Waiting until it's up. Stop time: %s" % str(no_routing_start))
self.timeout(self.task_timeout, "DUT hasn't started to work for %d seconds" % self.task_timeout)
no_routing_stop = self.check_start()
no_routing_stop, _ = self.check_start()
Copy link
Contributor

@lguohan lguohan Jul 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

check_start [](start = 34, length = 11)

rename to check_forwarding_resume?


self.assertTrue(no_routing_stop - no_routing_start < self.limit, "Downtime must be less then %s seconds" % self.test_params['fast_reboot_limit'])
self.log("Reboot time was %s" % str(no_routing_stop - self.reboot_start))
self.log("Number replies when control plane was down: %d Expected: %d" % (no_cp_replies, self.nr_vl_pkts))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this reply? can you explain?


return [('lo', lo_route, lo_valid, lo_prefix, lo_nexthop), ('vlan', vlan_route, vlan_valid, vlan_prefix, vlan_nexthop)]

def test_all_client_cases(self, output):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_all_client_cases [](start = 8, length = 21)

rename to check_all_peer_status?

self.fails['dut'].add("Downtime must be less then %s seconds" % self.test_params['fast_reboot_limit'])
if no_routing_stop - self.reboot_start > datetime.timedelta(seconds=self.test_params['graceful_limit']):
self.fails['dut'].add("Fast-reboot cycle must be less then graceful limit %s seconds" % self.test_params['graceful_limit'])
if no_cp_replies < 0.95 * self.nr_vl_pkts:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if no_cp_replies < 0.95 * self.nr_vl_pkts: [](start = 5, length = 45)

why 0.95? when control plane is down, the data plane should be 100% up in our case, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case fanout switches, Linux kernel, NIC drivers, ptf itself don't lose any packets which is not always true. So I made it 95% which work for me.

if len(non_zero) > 1 and non_zero[-1] < non_zero[-2]:
return non_zero[-2]
else:
return non_zero[-1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it is more robust just to find out the 90% of the data for non zero. Meaning 90% percentage of time, the you have the reachability to the n host, and the n should be equal to nr_vl_pkts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly It sounds more complicated to me.

vlan_ip_range = self.test_params['vlan_ip_range']

_, mask = vlan_ip_range.split('/')
n_hosts = min(2**(32 - int(mask)) - 3, self.max_nr_vl_pkts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n_hosts = min(2**(32 - int(mask)) - 3, self.max_nr_vl_pkts) [](start = 8, length = 59)

i think it is better to simply the test, and we just define the number of hosts under the vlan and send 1 packet per host every iteration.

sending multiple packet per host does not seem neccessary and it complicates the data interpretation later.

Copy link
Contributor Author

@pavel-shirshov pavel-shirshov Jul 6, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My code calculates the minimum number of hosts which could be used with current size of Vlan network. Our preference is to check against 1000 vlan hosts. But if size of vlan network is less then /22, we can't test for 1000 vlan hosts. So I take maximum possible value for the vlan network size..
Also. I send only one packet per host.


gr_active, timer = other['bgp_neig']
# wnen it's False, it's ok, wnen it's True, check that inactivity timer not less then 15 seconds
if gr_active and datetime.datetime.strptime(timer, '%H:%M:%S') < datetime.datetime(1900, 1, 1, second = 15):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

15 [](start = 116, length = 2)

make this as an parameter?

if no_routing_stop - no_routing_start > self.limit:
self.fails['dut'].add("Downtime must be less then %s seconds" % self.test_params['fast_reboot_limit'])
if no_routing_stop - self.reboot_start > datetime.timedelta(seconds=self.test_params['graceful_limit']):
self.fails['dut'].add("Fast-reboot cycle must be less then graceful limit %s seconds" % self.test_params['graceful_limit'])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then [](start = 66, length = 4)

than

self.log("ASIC was stopped, Waiting until it's up. Stop time: %s" % str(no_routing_start))
self.timeout(self.task_timeout, "DUT hasn't started to work for %d seconds" % self.task_timeout)
no_routing_stop = self.check_start()
no_routing_stop, _ = self.check_start()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what should be the final exit criteria for fast-reboot? whether we need to check if all peer bgp session are up (exit from graceful restart mode)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I defined the exit criteria as working dataplane. I can check that all bgp session are up at the end of the test.

Copy link
Contributor

@lguohan lguohan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as comments

@pavel-shirshov pavel-shirshov merged commit c88c151 into sonic-net:master Jul 27, 2017
@pavel-shirshov pavel-shirshov deleted the pavelsh/fr_update branch July 27, 2017 21:56
lguohan pushed a commit that referenced this pull request Aug 3, 2017
auspham pushed a commit to auspham/sonic-mgmt that referenced this pull request May 30, 2025
…lure (sonic-net#204)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
Fixes # (issue)
This PR fixes issue sonic-net#9052, where '_test_default_route_with_bgp_flap_' fails with "_Default route nexthops doesn't match the testbed topology_" issue.

The count of the next hops is not matching with the upstream neighbor as per the topology (reading data from config facts).

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] Test case(new/improvement)

### Back port request
- [ ] 202012
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [x] 202405

### Approach
#### What is the motivation for this PR?
- '_test_default_route_with_bgp_flap_' fails with "_Default route nexthops doesn't match the testbed topology_" issue.
- The count of the next hops is not matching with the upstream neighbor as per the topology.

#### How did you do it?
- Instead of skipping T3 neighbors of AZNGHub, use the address family and next hops information to find out the list of upstream neighbors.

#### How did you verify/test it?
- Ran the above-mentioned test case on a T2 chassis and made sure the test passed without any issues.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->
![image](https://github.com/user-attachments/assets/03516f51-63f0-41bb-be9e-eb9ae2c18d92)

![image](https://github.com/user-attachments/assets/0932f1fb-0754-4949-b50d-33a8007d878d)
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
* 664f0e2 2021-07-14 | [xrcvd]: Removed undefined symbol 'sfp_status_helper' (sonic-net#204) (HEAD, origin/202012) [Prince George]
* 1b2d016 2021-06-16 | [CI] sonic-config-engine now depends on SONiC YANG packages (sonic-net#194) [Joe LeVeque]
* 1cf5996 2021-07-14 | Introduce mgmtinit delay after transceiver module insertion (sonic-net#201) [Prince George]

Signed-off-by: Guohan Lu <[email protected]>
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
… submodule head (sonic-net#12494)

utilities:
* 02eb899e 2022-07-12 | [config/load_mgmt_config] Support load IPv6 mgmt IP (sonic-net#2206) (sonic-net#2246) (sonic-net#2256) (HEAD -> 201811, github/201811) [Jing Kan]

platform-daemon:
* fc288cc 2022-05-05 | Mem leak caused by Xcvrd in Send-Q of REDIS-DB socket connection (sonic-net#260) (HEAD -> 201811, github/201811) [Prince George]

platform-common:
* edb062b 2022-02-09 | [sonic_sfp] Interpret sff 'int' element =0 as valid value (sonic-net#261) (HEAD -> 201811, github/201811) [Prince George]

kernel:
* 9d2d1a1 2022-02-11 | [201811] Increase log_buf_len size to 1M (HEAD -> 201811, github/201811) [Sujin Kang]
* b34a213 2022-02-10 | [201811] Increase log_buf_len size to 1M [Samuel Angebault]
* c4684cb 2022-02-11 | [201811] Apply kernel patches to fix emmc unreliability [Sujin Kang]
* df68771 2022-02-03 | [201811] Apply kernel patches to fix emmc unreliability [Samuel Angebault]
* f21cb06 2021-03-26 | [ci]: remove 201811 suffix (sonic-net#204) [lguohan]
* 5439b2a 2021-02-08 | [dni_dps460] Add attributes to retrieve PMBus status command codes (sonic-net#194) [Arun Saravanan Balachandran]

Signed-off-by: Ying Xie <[email protected]>

Signed-off-by: Ying Xie <[email protected]>
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
Included commits in sonic-py-swsssdk
```
63c75c1 2021-03-14 | Workaround Mellanox default vlan has no SAI_VLAN_ATTR_VLAN_ID attribute (sonic-net#103) [Qi Luo]
```

Included commits in sonic-snmpagent
```
a8c6e36 2021-03-15 | Implement rfc4363 FdbUpdater for lag inside vlan (sonic-net#204) [Qi Luo]
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants