
[sequential restart] making swss restart test failing as it should be #1991

Merged
yxieca merged 1 commit into master from restart on Jul 29, 2020
Conversation

@yxieca (Collaborator) commented Jul 29, 2020

Summary:
Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Approach

What is the motivation for this PR?

The test should have failed due to a critical process crash (an image issue), but it failed to detect the crash and left the system in an unhealthy state.

How did you do it?

The swss restart test exposes an image issue that causes orchagent to crash. As a result, the test should fail instead of passing and leaving the system in a bad state.

This PR addresses the test's false-negative issue. The 'leaving the system unhealthy' part will be addressed by a subsequent PR.

Signed-off-by: Ying Xie <[email protected]>

How did you verify/test it?

yinxi@acs-trusty8:/var/src/sonic-mgmt/tests$ ./run_tests.sh -d str-dx010-acs-1 -n vms3-t1-dx010-1 -i /var/src/sonic-mgmt/ansible/str,/var/src/sonic-mgmt/ansible/veos -u -p /tmp/logs-01 -c platform_tests/test_sequential_restart.py::test_restart_swss
=== Running tests in groups ===
================================================================================== test session starts ===================================================================================
platform linux2 -- Python 2.7.12, pytest-4.6.5, py-1.8.1, pluggy-0.13.1
ansible: 2.8.7
rootdir: /var/src/sonic-mgmt/tests, inifile: pytest.ini
plugins: ansible-2.2.2, xdist-1.28.0, forked-1.1.3, repeat-0.8.0
collected 1 item

platform_tests/test_sequential_restart.py::test_restart_swss FAILED [100%]

======================================================================================== FAILURES ========================================================================================
___________________________________________________________________________________ test_restart_swss ____________________________________________________________________________________

duthost = <tests.common.devices.SonicHost object at 0x7f8abb159950>, localhost = <tests.common.devices.Localhost object at 0x7f8abb1592d0>
conn_graph_facts = {'device_conn': {'Ethernet0': {'peerdevice': u'str-7060cx-32s-21', 'peerport': u'Ethernet1/1', 'speed': u'100000'}, 'E...ss', 'vlanids': u'2006', 'vlanlist': [2006]}, ...}, 'device_vlan_list': [1981, 1979, 1980, 2006, 2004, 2005, ...], ...}

def test_restart_swss(duthost, localhost, conn_graph_facts):
    """
    @summary: This test case is to restart the swss service and check platform status
    """
  restart_service_and_check(localhost, duthost, "swss", conn_graph_facts["device_conn"])

conn_graph_facts = {'device_conn': {'Ethernet0': {'peerdevice': u'str-7060cx-32s-21', 'peerport': u'Ethernet1/1', 'speed': u'100000'}, 'E...ss', 'vlanids': u'2006', 'vlanlist': [2006]}, ...}, 'device_vlan_list': [1981, 1979, 1980, 2006, 2004, 2005, ...], ...}
duthost = <tests.common.devices.SonicHost object at 0x7f8abb159950>
localhost = <tests.common.devices.Localhost object at 0x7f8abb1592d0>

platform_tests/test_sequential_restart.py:61:


platform_tests/test_sequential_restart.py:54: in restart_service_and_check
check_critical_processes(dut, 60)


dut = <tests.common.devices.SonicHost object at 0x7f8abb159950>, watch_secs = 60

def check_critical_processes(dut, watch_secs=0):
    """
    @summary: check all critical processes. They should be all running.
              keep on checking every 5 seconds until watch_secs drops below 0.
    @param dut: The AnsibleHost object of DUT. For interacting with DUT.
    @param watch_secs: all processes should remain healthy for watch_secs seconds.
    """
    logging.info("Check all critical processes are healthy for {} seconds".format(watch_secs))
    while watch_secs >= 0:
        status, details = _get_critical_processes_status(dut)
      pytest_assert(status, "Not all critical processes are healthy: {}".format(details))

E Failed: Not all critical processes are healthy: {'lldp': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'lldp-syncd', u'lldpd', u'lldpmgrd']}, 'pmon': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'psud', u'xcvrd']}, 'database': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'redis']}, 'snmp': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'snmp-subagent', u'snmpd']}, 'bgp': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'bgpcfgd', u'bgpd', u'fpmsyncd', u'staticd', u'zebra']}, 'teamd': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'teammgrd', u'teamsyncd']}, 'syncd': {'status': True, 'exited_critical_process': [], 'running_critical_process': [u'syncd']}, 'swss': {'status': False, 'exited_critical_process': [u'orchagent'], 'running_critical_process': [u'buffermgrd', u'intfmgrd', u'nbrmgrd', u'neighsyncd', u'portmgrd', u'portsyncd', u'vlanmgrd', u'vrfmgrd', u'vxlanmgrd']}}

details = {'bgp': {'exited_critical_process': [], 'running_critical_process': ['bgpcfgd', 'bgpd', 'fpmsyncd', 'staticd', 'zebra'...s': True}, 'pmon': {'exited_critical_process': [], 'running_critical_process': ['psud', 'xcvrd'], 'status': True}, ...}
dut = <tests.common.devices.SonicHost object at 0x7f8abb159950>
status = False
watch_secs = 60

common/platform/processes_utils.py:37: Failed
==================================================================================== warnings summary ====================================================================================
/usr/local/lib/python2.7/dist-packages/_pytest/cacheprovider.py:127
/usr/local/lib/python2.7/dist-packages/_pytest/cacheprovider.py:127: PytestCacheWarning: could not create cache path /var/src/sonic-mgmt/tests/.pytest_cache/v/cache/stepwise
self.warn("could not create cache path {path}", path=path)

/usr/local/lib/python2.7/dist-packages/_pytest/cacheprovider.py:127
/usr/local/lib/python2.7/dist-packages/_pytest/cacheprovider.py:127: PytestCacheWarning: could not create cache path /var/src/sonic-mgmt/tests/.pytest_cache/v/cache/nodeids
self.warn("could not create cache path {path}", path=path)

/usr/local/lib/python2.7/dist-packages/_pytest/cacheprovider.py:127
/usr/local/lib/python2.7/dist-packages/_pytest/cacheprovider.py:127: PytestCacheWarning: could not create cache path /var/src/sonic-mgmt/tests/.pytest_cache/v/cache/lastfailed
self.warn("could not create cache path {path}", path=path)

-- Docs: https://docs.pytest.org/en/latest/warnings.html
------------------------------------------------------------------------ generated xml file: /tmp/logs-01/tr.xml -------------------------------------------------------------------------
================================================================================ short test summary info =================================================================================
FAILED platform_tests/test_sequential_restart.py::test_restart_swss - Failed: Not all critical processes are healthy: {'lldp': {'status': True, 'exited_critical_process': [], 'running...
========================================================================= 1 failed, 3 warnings in 242.44 seconds =========================================================================

@yxieca yxieca requested review from a team and stephenxs July 29, 2020 16:56
    status, _ = _get_critical_processes_status(dut)
    return status

def check_critical_processes(dut, watch_secs=0):
Contributor:

Why are we not using the existing 'all_critical_process_status' in devices.py?

Collaborator (Author):

We are using it. See line 14 :-)

Contributor:

ah I see. My bad

    logging.info("Check all critical processes are healthy for {} seconds".format(watch_secs))
    while watch_secs >= 0:
        status, details = _get_critical_processes_status(dut)
        pytest_assert(status, "Not all critical processes are healthy: {}".format(details))
Contributor:

shouldn't this be just logging error since we want this loop to continue?

@yxieca (Collaborator, Author) commented Jul 29, 2020:

The purpose of this loop is to make sure that there is no critical process failure for 60 seconds (or a spot check if watch_secs == 0). If there is a failure, then the test should fail.
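The loop semantics described in this comment can be sketched as follows. This is a minimal, self-contained sketch, not the actual sonic-mgmt code: `_get_critical_processes_status` and `pytest_assert` are stand-ins for the real helpers in `common/platform/processes_utils.py` and the test framework, and the 5-second poll interval is taken from the docstring in the diff above.

```python
import logging
import time


def _get_critical_processes_status(dut):
    # Stand-in for the real sonic-mgmt helper: returns (all_healthy, details).
    return dut.check(), {}


def pytest_assert(condition, message):
    # Stand-in for the test framework's pytest_assert: fail the test on an
    # unhealthy poll instead of merely logging and continuing.
    if not condition:
        raise AssertionError(message)


def check_critical_processes(dut, watch_secs=0):
    """Assert that all critical processes stay healthy for watch_secs seconds.

    Polls every 5 seconds; watch_secs == 0 degenerates to a single spot check.
    Any single unhealthy poll fails immediately rather than waiting out the
    remaining time.
    """
    logging.info("Check all critical processes are healthy for %s seconds", watch_secs)
    while watch_secs >= 0:
        status, details = _get_critical_processes_status(dut)
        pytest_assert(status, "Not all critical processes are healthy: {}".format(details))
        if watch_secs > 0:
            time.sleep(5)
        watch_secs -= 5
```

This makes the contrast with wait_critical_process concrete: that method waits for processes to *become* healthy, while this loop asserts they *remain* healthy for the whole watch window.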

Collaborator (Author):


What you have in mind is the other method: wait_critical_process()

Contributor:

I got confused by the wait_critical_process approach. It is clear now.

@yxieca yxieca merged commit 56b36e2 into master Jul 29, 2020
@yxieca yxieca deleted the restart branch July 29, 2020 18:34
kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
swss
73caba3 Allow interface type value none (sonic-net#1991)

utilities
32e530f Allow interface type value none (sonic-net#1902)
53f066c Fix log_ssd_health hang issue (sonic-net#1904)