Fix python crash in case data plane never stop on fast-reboot #893
pavel-shirshov merged 5 commits into sonic-net:master from pyadvichuk:pyadvychuk/fast-reboot-fix
Should we treat this as a bug on 'some' platform?
The problem appears to be present even on platforms with plenty of CPU (>= 4 cores), and from my understanding has a bit more relation to the
Also, the issue I've faced from time to time is very similar to the one mentioned in the testcase code.
In my investigation, it seems that packets sent in a loop get stacked inside one of the system buffers due to heavy virtualization (python-vm) and are then pushed out to the network as one big burst. By inserting a 1 ms delay we just reschedule the process queue and push packets to the network one by one.
Can you retry with the latest master branch, which includes #890? That change addressed some IO conflicts by setting proper filters. Please check whether you still need the sleep with this change. If you do still need it in the end, please add a configuration entry that keeps the sleep disabled by default; on your testbed, you can pass in a parameter to enable it.
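A minimal sketch of what such an opt-in delay could look like. The names `send_in_loop`, `send_fn`, `inter_packet_delay`, and `sleep_fn` are hypothetical and not taken from the test script; the only point illustrated is that the delay defaults to off.

```python
import time


def send_in_loop(packets, send_fn, inter_packet_delay=0.0, sleep_fn=time.sleep):
    """Send packets back to back; pause between them only when asked to.

    inter_packet_delay is a hypothetical option, in seconds. The default of
    0.0 keeps the sleep disabled, so packets go out as fast as possible.
    """
    sent = 0
    for packet in packets:
        send_fn(packet)
        sent += 1
        if inter_packet_delay > 0:
            # Opt-in pause, e.g. 0.001 on a testbed that needs it.
            sleep_fn(inter_packet_delay)
    return sent
```

A testbed that needs the pause would pass something like `inter_packet_delay=0.001`; every other run keeps the default and never sleeps.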
on our setup |
    if no_routing_stop != datetime.datetime.min:
        self.log("Reboot time was %s" % str(no_routing_stop - self.reboot_start))
    else:
        self.log("Reboot time was minimal")
I'd put here: Failed to record reboot time.
I'm afraid that the "Failed" keyword would confuse whoever is watching the log. Another reason to keep the message: an external regexp may be used to parse the CT results, and in that case we will not break the regex.
Anyway, the situation just shows a particular dataplane feature, not an issue.
Let me know if you insist on the change.
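For illustration, an external parser of the kind mentioned here might look like the sketch below. The pattern and the helper name `parse_reboot_time` are assumptions; the only thing taken from the thread is the "Reboot time was" prefix used by both log messages in the diff.

```python
import re

# Assumed log format: both messages in the diff start with this prefix.
REBOOT_TIME_RE = re.compile(r"Reboot time was (?P<value>.+)$")


def parse_reboot_time(line):
    """Return the text after 'Reboot time was ', or None if the line differs."""
    match = REBOOT_TIME_RE.search(line)
    return match.group("value") if match else None
```

A line reworded to start with "Failed" would return `None` here, which is the breakage the comment is concerned about.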
I think it's impossible for us to measure data plane disruption time with such fine granularity.
Also, you're using datetime.min here as a marker of a timeout while waiting for the dataplane to start/stop.
So I would suggest not using this marker value for anything.
Just added one more variable indicating that the dataplane never stopped.
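A rough sketch of the idea, assuming a boolean flag replaces the comparison against `datetime.datetime.min`. The function `reboot_time_message` is hypothetical, the flag name `routing_always` is borrowed from later in this thread, and the wording for the never-stopped case is an assumption rather than the merged code.

```python
import datetime


def reboot_time_message(reboot_start, no_routing_stop, routing_always):
    """Build the log line without overloading datetime.min as a marker."""
    if routing_always:
        # The data plane was never seen going down, so there is no
        # interval to subtract. Message wording here is assumed.
        return "Reboot time was minimal"
    return "Reboot time was %s" % str(no_routing_stop - reboot_start)
```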
I'd suggest improving this PR:
- Remove time.sleep. We want to send packets as fast as possible.
- If the goal of this PR is to fix the crash, let's do that: we need to define upper_replies in case of a timeout exception.
Also, I think that as soon as we have a timeout exception, we should fail the test.
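The second point can be sketched as follows. This is not the test script's code: `WaitTimeout` stands in for whatever timeout exception the script raises, and `collect_upper_replies` and `wait_for_stop` are hypothetical names. The sketch only shows binding `upper_replies` before the `try`, so that later code never hits an undefined name.

```python
class WaitTimeout(Exception):
    """Stand-in for the timeout exception raised by the test (assumed)."""


def collect_upper_replies(wait_for_stop):
    """Call wait_for_stop() and always return a defined upper_replies list."""
    upper_replies = []  # bound before the try, so it exists on every path
    timed_out = False
    try:
        upper_replies = wait_for_stop()
    except WaitTimeout:
        # Record the timeout; the caller decides whether to fail the test.
        timed_out = True
    return upper_replies, timed_out
```

Returning `timed_out` separately leaves the pass/fail decision, which is still under discussion in this review, to the caller.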
- I have removed time.sleep.
- I introduced additional logic (variable 'routing_always') to be able to print the "Reboot time was" message correctly.
- I don't agree with failing the test on a 'timeout exception', as it just indicates that the dataplane never went down. Nothing criminal from the fast-reboot case perspective.
- Thanks
-
- The dataplane must go down in case of fast-reboot.
Could you please propose a way to detect small periods using PTF? We need this because traffic disruption in our case is ~70 ms. So the dataplane does go down, but PTF can't detect such a small interval. The current test implementation detects it only about once per 4..5 tries (whereas Ixia, for example, shows the disruption reliably). In this PR I've tried to set intervals shorter than some minimum period to "0:00:00".
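The clamping described in the last sentence could be sketched like this. The 100 ms floor and the names `MIN_DETECTABLE` and `clamp_disruption` are hypothetical; the thread only says "some minimum period".

```python
import datetime

# Hypothetical floor; the PR does not state the actual minimum period.
MIN_DETECTABLE = datetime.timedelta(milliseconds=100)


def clamp_disruption(interval, minimum=MIN_DETECTABLE):
    """Report intervals shorter than `minimum` as a zero-length timedelta."""
    if interval < minimum:
        return datetime.timedelta(0)
    return interval
```

With these assumptions, `str(clamp_disruption(datetime.timedelta(milliseconds=70)))` gives `"0:00:00"`, while longer intervals pass through unchanged.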
pavel-shirshov left a comment
Please address my comments.
@yxieca @pavel-shirshov is there anything blocking us from getting this merged?
* Do not crash in case data plane never stop on fast-reboot
…nic-net#893) What is the motivation for this PR? Add tests for high frequency telemetry How did you do it? Add new test cases How did you verify/test it? Check in the sn5600 platform locally
If for some reason the data plane never stops on fast reboot, the script will
fail because of some undefined variables.
Description of PR
swss was not fast enough to insert all the FDB entries. Have tested the CT with ReloadTest::max_nr_vl_pkts == 1000 and verified that it passed.
Summary:
Fixes # (issue)
Type of change
Approach
How did you do it?
Changed Python script
How did you verify/test it?
Run fast-reboot CT and verify it will pass even in case ReloadTest::max_nr_vl_pkts == 1000
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation
@yxieca please review and merge this