
[sequential restart] making swss restart test failing as it should be #1991

Merged
yxieca merged 1 commit into master from restart on Jul 29, 2020
Conversation

@yxieca (Collaborator) commented Jul 29, 2020

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Approach

What is the motivation for this PR?

The test should have failed due to a critical process crash (an image issue), but it failed to detect the crash and left the system in an unhealthy state.

How did you do it?

The swss restart test exposes an image issue that causes orchagent to crash. As a result, the test should fail instead of passing and leaving the system in a bad state.

This PR addresses the test's false-negative issue. The 'leaving the system unhealthy' part will be addressed by a subsequent PR.

Signed-off-by: Ying Xie <[email protected]>

How did you verify/test it?

yinxi@acs-trusty8:/var/src/sonic-mgmt/tests$ ./run_tests.sh -d str-dx010-acs-1 -n vms3-t1-dx010-1 -i /var/src/sonic-mgmt/ansible/str,/var/src/sonic-mgmt/ansible/veos -u -p /tmp/logs-01 -c platform_tests/test_sequential_restart.py::test_restart_swss
=== Running tests in groups ===
================================================================================== test session starts ===================================================================================
platform linux2 -- Python 2.7.12, pytest-4.6.5, py-1.8.1, pluggy-0.13.1
ansible: 2.8.7
rootdir: /var/src/sonic-mgmt/tests, inifile: pytest.ini
plugins: ansible-2.2.2, xdist-1.28.0, forked-1.1.3, repeat-0.8.0
collected 1 item

platform_tests/test_sequential_restart.py::test_restart_swss FAILED [100%]

======================================================================================== FAILURES ========================================================================================
___________________________________________________________________________________ test_restart_swss ____________________________________________________________________________________

duthost = <tests.common.devices.SonicHost object at 0x7f8abb159950>, localhost = <tests.common.devices.Localhost object at 0x7f8abb1592d0>
conn_graph_facts = {'device_conn': {'Ethernet0': {'peerdevice': u'str-7060cx-32s-21', 'peerport': u'Ethernet1/1', 'speed': u'100000'}, 'E...ss', 'vlanids': u'2006', 'vlanlist': [2006]}, ...}, 'device_vlan_list': [1981, 1979, 1980, 2006, 2004, 2005, ...], ...}

def test_restart_swss(duthost, localhost, conn_graph_facts):
    """
    @summary: This test case is to restart the swss service and check platform status
    """
  restart_service_and_check(localhost, duthost, "swss", conn_graph_facts["device_conn"])

conn_graph_facts = {'device_conn': {'Ethernet0': {'peerdevice': u'str-7060cx-32s-21', 'peerport': u'Ethernet1/1', 'speed': u'100000'}, 'E...ss', 'vlanids': u'2006', 'vlanlist': [2006]}, ...}, 'device_vlan_list': [1981, 1979, 1980, 2006, 2004, 2005, ...], ...}
duthost = <tests.common.devices.SonicHost object at 0x7f8abb159950>
localhost = <tests.common.devices.Localhost object at 0x7f8abb1592d0>

platform_tests/test_sequential_restart.py:61:


platform_tests/test_sequential_restart.py:54: in restart_service_and_check
check_critical_processes(dut, 60)


dut = <tests.common.devices.SonicHost object at 0x7f8abb159950>, watch_secs = 60

def check_critical_processes(dut, watch_secs=0):
    """
    @summary: check all critical processes. They should be all running.
              keep on checking every 5 seconds until watch_secs drops below 0.
    @param dut: The AnsibleHost object of DUT. For interacting with DUT.
    @param watch_secs: all processes should remain healthy for watch_secs seconds.
    """
    logging.info("Check all critical processes are healthy for {} seconds".format(watch_secs))
    while watch_secs >= 0:
        status, details = _get_critical_processes_status(dut)
      pytest_assert(status, "Not all critical processes are healthy: {}".format(details))

E Failed: Not all critical processes are healthy: {'lldp': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'lldp-syncd', u'lldpd', u'lldpmgrd']}, 'pmon': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'psud', u'xcvrd']}, 'database': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'redis']}, 'snmp': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'snmp-subagent', u'snmpd']}, 'bgp': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'bgpcfgd', u'bgpd', u'fpmsyncd', u'staticd', u'zebra']}, 'teamd': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'teammgrd', u'teamsyncd']}, 'syncd': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'syncd']}, 'swss': {'status': False, 'exited_critical_process': [u'orchagent'], 'running_critical_process': [u'buffermgrd', u'intfmgrd', u'nbrmgrd', u'neighsyncd', u'portmgrd', u'portsyncd', u'vlanmgrd', u'vrfmgrd', u'vxlanmgrd']}}

details = {'bgp': {'exited_critical_process': [], 'running_critical_process': ['bgpcfgd', 'bgpd', 'fpmsyncd', 'staticd', 'zebra'...s': True}, 'pmon': {'exited_critical_process': [], 'running_critical_process': ['psud', 'xcvrd'], 'status': True}, ...}
dut = <tests.common.devices.SonicHost object at 0x7f8abb159950>
status = False
watch_secs = 60

common/platform/processes_utils.py:37: Failed
==================================================================================== warnings summary ====================================================================================
/usr/local/lib/python2.7/dist-packages/_pytest/cacheprovider.py:127
/usr/local/lib/python2.7/dist-packages/_pytest/cacheprovider.py:127: PytestCacheWarning: could not create cache path /var/src/sonic-mgmt/tests/.pytest_cache/v/cache/stepwise
self.warn("could not create cache path {path}", path=path)

/usr/local/lib/python2.7/dist-packages/_pytest/cacheprovider.py:127
/usr/local/lib/python2.7/dist-packages/_pytest/cacheprovider.py:127: PytestCacheWarning: could not create cache path /var/src/sonic-mgmt/tests/.pytest_cache/v/cache/nodeids
self.warn("could not create cache path {path}", path=path)

/usr/local/lib/python2.7/dist-packages/_pytest/cacheprovider.py:127
/usr/local/lib/python2.7/dist-packages/_pytest/cacheprovider.py:127: PytestCacheWarning: could not create cache path /var/src/sonic-mgmt/tests/.pytest_cache/v/cache/lastfailed
self.warn("could not create cache path {path}", path=path)

-- Docs: https://docs.pytest.org/en/latest/warnings.html
------------------------------------------------------------------------ generated xml file: /tmp/logs-01/tr.xml -------------------------------------------------------------------------
================================================================================ short test summary info =================================================================================
FAILED platform_tests/test_sequential_restart.py::test_restart_swss - Failed: Not all critical processes are healthy: {'lldp': {'status': True, 'exited_critical_process': [], 'running...
========================================================================= 1 failed, 3 warnings in 242.44 seconds =========================================================================

@yxieca yxieca requested review from a team and stephenxs July 29, 2020 16:56
    status, _ = _get_critical_processes_status(dut)
    return status

def check_critical_processes(dut, watch_secs=0):
Contributor:

Why are we not using the existing 'all_critical_process_status' in devices.py?

Collaborator (Author):

We are using it. See line 14 :-)

Contributor:

ah I see. My bad

    logging.info("Check all critical processes are healthy for {} seconds".format(watch_secs))
    while watch_secs >= 0:
        status, details = _get_critical_processes_status(dut)
        pytest_assert(status, "Not all critical processes are healthy: {}".format(details))
Contributor:

shouldn't this be just logging error since we want this loop to continue?

@yxieca (Collaborator, Author) commented Jul 29, 2020:

The purpose of this loop is to make sure that there is no critical process failure for 60 seconds (or a spot check if watch_secs == 0). If there is a failure, then the test should fail.
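The loop semantics described in this comment can be sketched as follows. This is a minimal, self-contained sketch, not the actual sonic-mgmt code: `_get_critical_processes_status` and `pytest_assert` are stand-ins for the real helpers in `common/platform/processes_utils.py` and the test framework, and the 5-second poll interval is taken from the docstring in the diff above.

```python
import logging
import time


def _get_critical_processes_status(dut):
    # Stand-in for the real sonic-mgmt helper: returns (all_healthy, details).
    return dut.check(), {}


def pytest_assert(condition, message):
    # Stand-in for the test framework's pytest_assert: fail the test on an
    # unhealthy poll instead of merely logging and continuing.
    if not condition:
        raise AssertionError(message)


def check_critical_processes(dut, watch_secs=0):
    """Assert that all critical processes stay healthy for watch_secs seconds.

    Polls every 5 seconds; watch_secs == 0 degenerates to a single spot check.
    Any single unhealthy poll fails immediately rather than waiting out the
    remaining time.
    """
    logging.info("Check all critical processes are healthy for %s seconds", watch_secs)
    while watch_secs >= 0:
        status, details = _get_critical_processes_status(dut)
        pytest_assert(status, "Not all critical processes are healthy: {}".format(details))
        if watch_secs > 0:
            time.sleep(5)
        watch_secs -= 5
```

This makes the contrast with wait_critical_process concrete: that method waits for processes to *become* healthy, while this loop asserts they *remain* healthy for the whole watch window.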

Collaborator (Author):


What you have in mind is the other method: wait_critical_process()

Contributor:

I got confused by the wait_critical_process approach. It is clear now.

@yxieca yxieca merged commit 56b36e2 into master Jul 29, 2020
@yxieca yxieca deleted the restart branch July 29, 2020 18:34
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
swss
73caba3 Allow interface type value none (sonic-net#1991)

utilities
32e530f Allow interface type value none (sonic-net#1902)
53f066c Fix log_ssd_health hang issue (sonic-net#1904)