
Fix python crash in case data plane never stop on fast-reboot #893

Merged
pavel-shirshov merged 5 commits into sonic-net:master from pyadvichuk:pyadvychuk/fast-reboot-fix
Jun 8, 2019

Conversation

@pyadvichuk

If for some reason the data plane never stops on fast reboot, the script will
fail because of some undefined variables.

Description of PR

  • Do not crash in case the data plane never stops on fast-reboot;
  • Insert a 1ms inter-packet gap, as on some platforms we have seen that swss was not fast enough to insert all the FDB entries.

I have tested the CT with ReloadTest::max_nr_vl_pkts == 1000 and verified that it passed.

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • [ ] Testbed and Framework (new/improvement)
  • [ ] Test case (new/improvement)

Approach

How did you do it?

Changed the Python script.

How did you verify/test it?

Ran the fast-reboot CT and verified that it passes even with ReloadTest::max_nr_vl_pkts == 1000.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@yxieca please review and merge this

@qiluo-msft
Contributor

Insert a 1ms inter-packet gap, as on some platforms we have seen that swss was not fast enough to insert all the FDB entries.

Should we treat this as a bug on 'some' platform?

@pyadvichuk
Author

Insert a 1ms inter-packet gap, as on some platforms we have seen that swss was not fast enough to insert all the FDB entries.

Should we treat this as a bug on 'some' platform?

The problem appeared even on platforms with a rich CPU (>= 4 cores), and from my understanding it is more related to swss performance. Actually, this is mentioned in the comments, which is why I pointed out that I have tested even with ReloadTest::max_nr_vl_pkts == 1000:

        self.max_nr_vl_pkts = 500 # FIXME: should be 1000.
                                  # But ptf is not fast enough + swss is slow for FDB and ARP entries insertions

Also, the issue I've faced from time to time is very similar to the one mentioned in the test case code:

    def wait_dut_to_warm_up(self):
        # When the DUT is freshly rebooted, it appears that it needs to warm
        # up towards PTF docker. In practice, I've seen this warm up taking
        # up to ~70 seconds.

In my investigation, it seems that packets sent in a loop get stacked inside one of the system buffers due to heavy virtualization (python-vm) and are then pushed out to the network as one big burst. By inserting a 1ms delay we just reschedule the process queue and push packets to the network one by one.
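To illustrate the idea, here is a minimal sketch with hypothetical names (the actual test calls PTF's testutils.send_packet inside its send loop):

```python
import time

def send_with_gap(packets, send_fn, gap_s=0.001):
    """Send packets one at a time with a small inter-packet gap.

    The 1 ms sleep yields the scheduler between sends, so packets are
    flushed to the network individually instead of accumulating in a
    system buffer and leaving as one large burst.
    """
    sent = 0
    for pkt in packets:
        time.sleep(gap_s)  # reschedule before each send
        send_fn(pkt)       # in the real test: testutils.send_packet(self, pkt)
        sent += 1
    return sent
```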

@yxieca
Collaborator

yxieca commented Apr 30, 2019

Can you retry with the latest master branch with #890? That change addressed some IO conflicts by setting proper filters. Please see if you still need to add the sleep with that change.

If you still have to add the sleep in the end, please add a configuration entry to disable the sleep by default. On your testbed, you can pass in a parameter to enable the sleep.

@pyadvichuk
Author

Can you retry with the latest master branch with #890? That change addressed some IO conflicts by setting proper filters. Please see if you still need to add the sleep with that change.

If you still have to add the sleep in the end, please add a configuration entry to disable the sleep by default. On your testbed, you can pass in a parameter to enable the sleep.

On our setup, the reachability_watcher thread crashed during fast-reboot on the first iteration with #890.
By the way, the send_in_background thread also defines the inter-packet gap, but without any conditions:

            for entry in packets_list:
                time.sleep(interval)
                testutils.send_packet(self, *entry)

    if no_routing_stop != datetime.datetime.min:
        self.log("Reboot time was %s" % str(no_routing_stop - self.reboot_start))
    else:
        self.log("Reboot time was minimal")
Contributor

I'd put here: Failed to record reboot time.

Author

I'm afraid the Failed keyword would confuse a watcher. Another reason to keep the message: it is possible to use an external regexp to parse CT results, and this way we will not break the regex.

Anyway, the situation just shows a particular dataplane behavior, not an issue.
Let me know if you insist on the change.

Contributor

I think it's impossible for us to measure data plane disruption time with such fine granularity.
Also, you're using datetime.min here as a marker of timeout while waiting for dataplane start/stop.
So I would suggest not using this marker value for anything.
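A hedged sketch of that suggestion (function and argument names are hypothetical): carry an explicit flag alongside the timestamps, so datetime.min never doubles as a "timeout happened" marker in arithmetic or comparisons.

```python
import datetime

def reboot_time_message(no_routing_stop, reboot_start, dataplane_stopped):
    """Format the reboot-time log line without relying on a sentinel.

    `dataplane_stopped` is an explicit flag; datetime.min is never
    subtracted from or compared against real timestamps.
    """
    if dataplane_stopped:
        return "Reboot time was %s" % (no_routing_stop - reboot_start)
    return "Reboot time was minimal"
```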

Author

Just added one more variable indicating that the dataplane never stopped.

Author

I'd suggest improving this PR:

  1. Remove time.sleep. We want to send packets as fast as possible.
  2. If the goal of this PR is to fix the crash, let's do that: we need to define upper_replies in case of a timeout exception.
    Also I think as soon as we have a timeout exception, we should fail the test.

  1. I have removed time.sleep.
  2. I introduced additional logic (variable 'routing_always') to be able to print the "Reboot time was" message correctly.
  3. I don't agree with failing the test in case of a 'timeout exception', as it just indicates that the dataplane never went down. Nothing criminal from the fast-reboot case perspective.
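The crash fix itself can be sketched as follows (an assumed shape; all names other than routing_always are hypothetical): initialize the results before waiting, and convert a dataplane-wait timeout into a flag instead of leaving variables undefined.

```python
import datetime

class DataplaneTimeout(Exception):
    """Raised when the data plane never stops within the wait window."""

def wait_for_dataplane_stop(wait_fn):
    # Pre-define results so a timeout cannot leave them undefined.
    no_routing_start = datetime.datetime.min
    no_routing_stop = datetime.datetime.min
    routing_always = False
    try:
        no_routing_start, no_routing_stop = wait_fn()
    except DataplaneTimeout:
        routing_always = True  # dataplane never went down; record it, don't crash
    return no_routing_start, no_routing_stop, routing_always
```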

Contributor

  1. Thanks.
  2. The dataplane must go down in case of fast-reboot.

Author

Could you please propose a way to detect small periods using PTF? We need this because traffic disruption in our case is ~70ms. So the dataplane goes down, but PTF can't detect such a small interval. The current test implementation is able to detect such a small interval only once per 4..5 tries (Ixia, for example, shows the disruption reliably). In this PR I've tried to set intervals shorter than some minimum period to "0:00:00".

Contributor

@pavel-shirshov pavel-shirshov left a comment

Please address my comments.

@pavel-shirshov
Contributor

I'd suggest improving this PR:

  1. Remove time.sleep. We want to send packets as fast as possible.
  2. If the goal of this PR is to fix the crash, let's do that: we need to define upper_replies in case of a timeout exception.
    Also I think as soon as we have a timeout exception, we should fail the test.

@pyadvichuk
Author

@yxieca @pavel-shirshov is there anything blocking us from getting this merged?

@pavel-shirshov pavel-shirshov merged commit 2b6bf7f into sonic-net:master Jun 8, 2019
yxieca pushed a commit that referenced this pull request Jun 13, 2019
* Do not crash in case data plane never stop on fast-reboot
deerao02 pushed a commit to deerao02/sonic-mgmt that referenced this pull request Dec 18, 2025
