
Fix python crash in case data plane never stop on fast-reboot #893

Merged
pavel-shirshov merged 5 commits into sonic-net:master from pyadvichuk:pyadvychuk/fast-reboot-fix
Jun 8, 2019

Conversation

@pyadvichuk

If for some reason the data plane never stops on fast reboot, the script will
fail because of some undefined variables.

Description of PR

  • Do not crash in case the data plane never stops on fast-reboot;
  • Insert a 1ms inter-packet gap, as on some platforms we have seen that swss was not fast enough to insert all the FDB entries.

I have tested the CT with ReloadTest::max_nr_vl_pkts == 1000 and verified that it passed.

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • [ ] Testbed and Framework (new/improvement)
  • [ ] Test case (new/improvement)

Approach

How did you do it?

Changed the Python script.

How did you verify/test it?

Ran the fast-reboot CT and verified that it passes even with ReloadTest::max_nr_vl_pkts == 1000.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@yxieca please review and merge this

@qiluo-msft
Contributor

Insert a 1ms inter-packet gap, as on some platforms we have seen that swss was not fast enough to insert all the FDB entries.

Should we treat this as a bug on 'some' platform?

@pyadvichuk
Author

Insert a 1ms inter-packet gap, as on some platforms we have seen that swss was not fast enough to insert all the FDB entries.

Should we treat this as a bug on 'some' platform?

The problem appeared even on platforms with a rich CPU (>= 4 cores), and from my understanding it is more related to swss performance. Actually, this is mentioned in the comments, which is why I pointed out that I have tested even with ReloadTest::max_nr_vl_pkts == 1000:

        self.max_nr_vl_pkts = 500 # FIXME: should be 1000.
                                  # But ptf is not fast enough + swss is slow for FDB and ARP entries insertions

Also, the issue I've faced from time to time is very similar to the one mentioned in the test case code:

    def wait_dut_to_warm_up(self):
        # When the DUT is freshly rebooted, it appears that it needs to warm
        # up towards PTF docker. In practice, I've seen this warm up taking
        # up to ~70 seconds.

In my investigation, it seems that packets sent in a loop get stacked inside one of the system buffers due to heavy virtualization (python-vm) and are then pushed out to the network as one big burst. By inserting a 1ms delay we just reschedule the process queue and push packets to the network one by one.
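To illustrate the idea, here is a minimal sketch with hypothetical names (the actual test calls PTF's testutils.send_packet inside its send loop):

```python
import time

def send_with_gap(packets, send_fn, gap_s=0.001):
    """Send packets one at a time with a small inter-packet gap.

    The 1 ms sleep yields the scheduler between sends, so packets are
    flushed to the network individually instead of accumulating in a
    system buffer and leaving as one large burst.
    """
    sent = 0
    for pkt in packets:
        time.sleep(gap_s)  # reschedule before each send
        send_fn(pkt)       # in the real test: testutils.send_packet(self, pkt)
        sent += 1
    return sent
```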

@yxieca
Collaborator

yxieca commented Apr 30, 2019

Can you retry with the latest master branch with #890? That change addressed some IO conflicts by setting proper filters. Please see if you still need to add the sleep with that change.

If you still have to add the sleep in the end, please add a configuration entry to disable the sleep by default. On your testbed, you can pass in a parameter to enable the sleep.

@pyadvichuk
Author

Can you retry with the latest master branch with #890? That change addressed some IO conflicts by setting proper filters. Please see if you still need to add the sleep with that change.

If you still have to add the sleep in the end, please add a configuration entry to disable the sleep by default. On your testbed, you can pass in a parameter to enable the sleep.

On our setup, the reachability_watcher thread crashed during fast-reboot on the first iteration with #890.
By the way, the send_in_background thread also defines the inter-packet gap, but without any conditions:

            for entry in packets_list:
                time.sleep(interval)
                testutils.send_packet(self, *entry)

    if no_routing_stop != datetime.datetime.min:
        self.log("Reboot time was %s" % str(no_routing_stop - self.reboot_start))
    else:
        self.log("Reboot time was minimal")
Contributor

I'd put here: Failed to record reboot time.

Author

I'm afraid the Failed keyword would confuse a watcher. Another reason to keep the message: it is possible to use an external regexp to parse CT results, and this way we will not break the regex.

Anyway, the situation just shows a particular dataplane behavior, not an issue.
Let me know if you insist on the change.

Contributor

I think it's impossible for us to measure data plane disruption time with such fine granularity.
Also, you're using datetime.min here as a marker of timeout while waiting for dataplane start/stop.
So I would suggest not using this marker value for anything.
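A hedged sketch of that suggestion (function and argument names are hypothetical): carry an explicit flag alongside the timestamps, so datetime.min never doubles as a "timeout happened" marker in arithmetic or comparisons.

```python
import datetime

def reboot_time_message(no_routing_stop, reboot_start, dataplane_stopped):
    """Format the reboot-time log line without relying on a sentinel.

    `dataplane_stopped` is an explicit flag; datetime.min is never
    subtracted from or compared against real timestamps.
    """
    if dataplane_stopped:
        return "Reboot time was %s" % (no_routing_stop - reboot_start)
    return "Reboot time was minimal"
```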

Author

Just added one more variable indicating that the dataplane never stopped.

Author

I'd suggest improving this PR:

  1. Remove time.sleep. We want to send packets as fast as possible.
  2. If the goal of this PR is to fix the crash, let's do that: we need to define upper_replies in case of a timeout exception.
    Also I think as soon as we have a timeout exception, we should fail the test.

  1. I have removed time.sleep.
  2. I introduced additional logic (variable 'routing_always') to be able to print the "Reboot time was" message correctly.
  3. I don't agree with failing the test in case of a 'timeout exception', as it just indicates that the dataplane never went down. Nothing criminal from the fast-reboot case perspective.
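The crash fix itself can be sketched as follows (an assumed shape; all names other than routing_always are hypothetical): initialize the results before waiting, and convert a dataplane-wait timeout into a flag instead of leaving variables undefined.

```python
import datetime

class DataplaneTimeout(Exception):
    """Raised when the data plane never stops within the wait window."""

def wait_for_dataplane_stop(wait_fn):
    # Pre-define results so a timeout cannot leave them undefined.
    no_routing_start = datetime.datetime.min
    no_routing_stop = datetime.datetime.min
    routing_always = False
    try:
        no_routing_start, no_routing_stop = wait_fn()
    except DataplaneTimeout:
        routing_always = True  # dataplane never went down; record it, don't crash
    return no_routing_start, no_routing_stop, routing_always
```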

Contributor

  1. Thanks.
  2. The dataplane must go down in case of fast-reboot.

Author

Could you please propose a way to detect small periods using PTF? We need this because traffic disruption in our case is ~70ms. So the dataplane goes down, but PTF can't detect such a small interval. The current test implementation is able to detect such a small interval only once per 4..5 tries (Ixia, for example, shows the disruption reliably). In this PR I've tried to set intervals shorter than some minimum period to "0:00:00".

Contributor

@pavel-shirshov pavel-shirshov left a comment

Please address my comments.

@pavel-shirshov
Contributor

I'd suggest improving this PR:

  1. Remove time.sleep. We want to send packets as fast as possible.
  2. If the goal of this PR is to fix the crash, let's do that: we need to define upper_replies in case of a timeout exception.
    Also I think as soon as we have a timeout exception, we should fail the test.

@pyadvichuk
Author

@yxieca @pavel-shirshov is there anything blocking us from getting this merged?

@pavel-shirshov pavel-shirshov merged commit 2b6bf7f into sonic-net:master Jun 8, 2019
yxieca pushed a commit that referenced this pull request Jun 13, 2019
* Do not crash in case data plane never stop on fast-reboot
deerao02 pushed a commit to deerao02/sonic-mgmt that referenced this pull request Dec 18, 2025
