
Update fan related tests for under/over speed check APIs #8587

Merged
yxieca merged 2 commits into sonic-net:master from spilkey-cisco:spilkey/fan_tolerance on Dec 21, 2023

Conversation

@spilkey-cisco
Contributor

@spilkey-cisco spilkey-cisco commented Jun 13, 2023

Description

Use vendor customizable fan speed threshold checks in sonic-mgmt tests instead of hard-coded calculations based on pwm/percentage.

Motivation and Context

Fan under/over speed checks should be vendor customizable, since a tolerance based off the pwm/percentage fan speed can easily give false failures, especially for low fan speeds.
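A rough sketch of the difference (illustrative only: StubFan and its 0.7/1.3 thresholds are invented for this example; the method names get_speed, get_target_speed, get_speed_tolerance, is_under_speed, and is_over_speed mirror the platform API methods discussed in this PR):

```python
class StubFan:
    """Hypothetical stand-in for a platform Fan object (not real vendor code)."""

    def __init__(self, speed, target, tolerance=20):
        self._speed = speed          # current speed, percent of maximum
        self._target = target        # target speed, percent of maximum
        self._tolerance = tolerance  # percent tolerance

    def get_speed(self):
        return self._speed

    def get_target_speed(self):
        return self._target

    def get_speed_tolerance(self):
        return self._tolerance

    # A vendor implementation can use absolute rpm windows, per-fan curves,
    # etc.; the fixed 0.7/1.3 factors here are just an example.
    def is_under_speed(self):
        return self._speed < self._target * 0.7

    def is_over_speed(self):
        return self._speed > self._target * 1.3


def check_fan_speed_old(fan):
    """Old style: hard-coded tolerance around the target percentage."""
    target = fan.get_target_speed()
    return abs(fan.get_speed() - target) <= target * fan.get_speed_tolerance() / 100.0


def check_fan_speed_new(fan):
    """New style: defer the under/over speed decision to the vendor."""
    return not (fan.is_under_speed() or fan.is_over_speed())


# At a low target (20%), a 20% relative tolerance is only a +/-4 point
# window, so a healthy fan running at 25% fails the old check while
# passing the vendor check.
fan = StubFan(speed=25, target=20)
print(check_fan_speed_old(fan), check_fan_speed_new(fan))  # -> False True
```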

How Has This Been Tested?

root@sonic:/home/cisco# echo 10000 > /opt/cisco/etc/fantray0.fan0.rpm
root@sonic:/home/cisco# grep thermalctld /var/log/syslog
<snip>
May 19 05:09:47.763970 sonic WARNING pmon#thermalctld: Fan high speed warning: fantray0.fan0 current speed=91, target speed=20
May 19 05:09:51.129298 sonic INFO pmon#supervisord 2023-05-19 05:09:51,128 INFO success: thermalctld entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
May 19 05:10:01.347935 sonic INFO pmon#supervisord: thermalctld WARNING:cisco.pacific.thermal.thermal_zone:level minor: fantray0.fan0: pwm 20; motor out of tolerance @ rpm 10000; maximum rpm 2950
root@sonic:/home/cisco# echo 2400 > /opt/cisco/etc/fantray0.fan0.rpm
root@sonic:/home/cisco# grep thermalctld /var/log/syslog
<snip>
May 19 05:09:47.763970 sonic WARNING pmon#thermalctld: Fan high speed warning: fantray0.fan0 current speed=91, target speed=20
May 19 05:09:51.129298 sonic INFO pmon#supervisord 2023-05-19 05:09:51,128 INFO success: thermalctld entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
May 19 05:10:01.347935 sonic INFO pmon#supervisord: thermalctld WARNING:cisco.pacific.thermal.thermal_zone:level minor: fantray0.fan0: pwm 20; motor out of tolerance @ rpm 10000; maximum rpm 2950
May 19 05:10:47.023156 sonic NOTICE pmon#thermalctld: Fan high speed warning cleared: fantray0.fan0 speed is back to normal

…ke delay to wait for fan speeds to settle vendor customizable
Contributor

@gechiang gechiang left a comment

LGTM.

@gechiang
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@gechiang
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@shivuv

shivuv commented Oct 4, 2023

@gechiang @abdosi : Can this be merged?

@gechiang
Contributor

gechiang commented Oct 4, 2023

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@spilkey-cisco
Contributor Author

Can anyone help with the failure observed in the checks: "could not retrieve key state value for port Ethernet8 inside table MUX_CABLE_TABLE"? It doesn't appear related to the changes in this PR.

@gechiang
Contributor

gechiang commented Oct 4, 2023

Looks like the test failure is associated with a known DUALTOR testbed issue which should be corrected soon by this PR:
#10229
Once the above PR is merged, we can re-run the PR tests to see if they pass... then we can request to merge this PR.

@saiarcot895
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@spilkey-cisco
Contributor Author

Is there another known issue in telemetry/test_events.py? AssertionError: No output from PTF docker, thread timed out after 30 seconds

@spilkey-cisco
Contributor Author

/azp run

@azure-pipelines

Commenter does not have sufficient privileges for PR 8587 in repo sonic-net/sonic-mgmt

@spilkey-cisco
Contributor Author

@gechiang do you know why kvmtest might be failing? Was this a known issue? I don't have permissions to restart the tests if so.

@gechiang
Contributor

It failed with this test case failure:

Running test case with Python:
Python 3.8.10
Run: python3 -m pytest telemetry/test_events.py              --inventory=../ansible/veos_vtb --host-pattern=all --testbed=vms-kvm-t0             --testbed_file=../ansible/vtestbed.yaml --log-cli-level=warning --log-file-level=debug             --kube_master=unset --showlocals --assert=plain --show-capture=no -rav             --ignore=ptftests --ignore=acstests --ignore=saitests --ignore=scripts --ignore=k8s --ignore=sai_qualify             --log-file='logs/telemetry/test_events|||2.log' --junit-xml='logs/telemetry/test_events|||2.xml'                --completeness_level=confident --allow_recover --maxfail=1
============================= test session starts ==============================
platform linux -- Python 3.8.10, pytest-7.1.3, pluggy-1.3.0
ansible: 2.9.27
rootdir: /var/src/sonic-mgmt/tests, configfile: pytest.ini
plugins: metadata-3.0.0, xdist-1.28.0, forked-1.6.0, html-3.2.0, repeat-0.9.2, ansible-4.1.0, allure-pytest-2.8.22
collected 1 item

telemetry/test_events.py::test_events[vlab-01] 
-------------------------------- live log call ---------------------------------
23:23:10 __init__.pytest_runtest_call             L0040 ERROR  | Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 1761, in runtest
    self.ihook.pytest_pyfunc_call(pyfuncitem=self)
  File "/usr/local/lib/python3.8/dist-packages/pluggy/_hooks.py", line 493, in __call__
    return self._hookexec(self.name, self._hookimpls, kwargs, firstresult)
  File "/usr/local/lib/python3.8/dist-packages/pluggy/_manager.py", line 115, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 113, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/usr/local/lib/python3.8/dist-packages/pluggy/_callers.py", line 77, in _multicall
    res = hook_impl.function(*args)
  File "/usr/local/lib/python3.8/dist-packages/_pytest/python.py", line 192, in pytest_pyfunc_call
    result = testfunction(**testargs)
  File "/var/src/sonic-mgmt/tests/telemetry/test_events.py", line 55, in test_events
    module.test_event(duthost, gnxi_path, ptfhost, DATA_DIR, validate_yang)
  File "/var/src/sonic-mgmt/tests/./telemetry/events/swss_events.py", line 24, in test_event
    run_test(duthost, gnxi_path, ptfhost, data_dir, validate_yang, shutdown_interface,
  File "/var/src/sonic-mgmt/tests/./telemetry/events/run_events_test.py", line 18, in run_test
    listen_for_events(duthost, gnxi_path, ptfhost, filter_event_regex, op_file,
  File "/var/src/sonic-mgmt/tests/telemetry/telemetry_utils.py", line 111, in listen_for_events
    assert results[0] != "", "No output from PTF docker, thread timed out after {} seconds".format(thread_timeout)
AssertionError: No output from PTF docker, thread timed out after 30 seconds

FAILED                                                                   [100%]INFO:root:Can not get Allure report URL. Please check logs


=================================== FAILURES ===================================
_____________________________ test_events[vlab-01] _____________________________

duthosts = [<MultiAsicSonicHost vlab-01>]
enum_rand_one_per_hwsku_hostname = 'vlab-01'
ptfhost = <tests.common.devices.ptf.PTFHost object at 0x7f96bea94340>
setup_streaming_telemetry = None
localhost = <tests.common.devices.local.Localhost object at 0x7f96bd9c4880>
gnxi_path = '/root/gnxi/'

    @pytest.mark.disable_loganalyzer
    def test_events(duthosts, enum_rand_one_per_hwsku_hostname, ptfhost, setup_streaming_telemetry, localhost, gnxi_path):
        """ Run series of events inside duthost and validate that output is correct
        and conforms to YANG schema
        """
        duthost = duthosts[enum_rand_one_per_hwsku_hostname]
        logger.info("Start events testing")
    
        skip_201911_and_older(duthost)
        do_init(duthost)
    
        # Load all events test code and run
        for file in os.listdir(EVENTS_TESTS_PATH):
            if file.endswith("_events.py"):
                module = __import__(file[:len(file)-3])
>               module.test_event(duthost, gnxi_path, ptfhost, DATA_DIR, validate_yang)

duthost    = <MultiAsicSonicHost vlab-01>
duthosts   = [<MultiAsicSonicHost vlab-01>]
enum_rand_one_per_hwsku_hostname = 'vlab-01'
file       = 'swss_events.py'
gnxi_path  = '/root/gnxi/'
localhost  = <tests.common.devices.local.Localhost object at 0x7f96bd9c4880>
module     = <module 'swss_events' from '/var/src/sonic-mgmt/tests/./telemetry/events/swss_events.py'>
ptfhost    = <tests.common.devices.ptf.PTFHost object at 0x7f96bea94340>
setup_streaming_telemetry = None

telemetry/test_events.py:55: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
telemetry/events/swss_events.py:24: in test_event
    run_test(duthost, gnxi_path, ptfhost, data_dir, validate_yang, shutdown_interface,
        data_dir   = 'logs/telemetry/files'
        duthost    = <MultiAsicSonicHost vlab-01>
        gnxi_path  = '/root/gnxi/'
        ptfhost    = <tests.common.devices.ptf.PTFHost object at 0x7f96bea94340>
        validate_yang = <function validate_yang at 0x7f96beae2d30>
telemetry/events/run_events_test.py:18: in run_test
    listen_for_events(duthost, gnxi_path, ptfhost, filter_event_regex, op_file,
        data_dir   = 'logs/telemetry/files'
        duthost    = <MultiAsicSonicHost vlab-01>
        filter_event_regex = 'sonic-events-swss:if-state'
        gnxi_path  = '/root/gnxi/'
        heartbeat  = False
        json_file  = 'if_state.json'
        op_file    = 'logs/telemetry/files/if_state.json'
        ptfhost    = <tests.common.devices.ptf.PTFHost object at 0x7f96bea94340>
        tag        = 'sonic-events-swss'
        thread_timeout = 30
        trigger    = <function shutdown_interface at 0x7f96bc793d30>
        validate_yang = <function validate_yang at 0x7f96beae2d30>
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

duthost = <MultiAsicSonicHost vlab-01>, gnxi_path = '/root/gnxi/'
ptfhost = <tests.common.devices.ptf.PTFHost object at 0x7f96bea94340>
filter_event_regex = 'sonic-events-swss:if-state'
op_file = 'logs/telemetry/files/if_state.json', thread_timeout = 30

    def listen_for_events(duthost, gnxi_path, ptfhost, filter_event_regex, op_file, thread_timeout):
        cmd = generate_client_cli(duthost=duthost, gnxi_path=gnxi_path, method=METHOD_SUBSCRIBE,
                                  submode=SUBMODE_ONCHANGE, update_count=1, xpath="all[heartbeat=2]",
                                  target="EVENTS", filter_event_regex=filter_event_regex)
        results = [""]
        event_thread = InterruptableThread(target=listen_for_event, args=(ptfhost, cmd, results,))
        event_thread.start()
        event_thread.join(thread_timeout)  # close thread after 30 sec, was not able to find event within reasonable time
>       assert results[0] != "", "No output from PTF docker, thread timed out after {} seconds".format(thread_timeout)
E       AssertionError: No output from PTF docker, thread timed out after 30 seconds

cmd        = 'python /root/gnxi/gnmi_cli_py/py_gnmicli.py -g -t 10.250.0.101 -p 50051 -m subscribe -x all[heartbeat=2] -xt EVENTS -...ode 0 --submode 1 --interval 0 --update_count 1 --create_connections 1 --filter_event_regex sonic-events-swss:if-state'
duthost    = <MultiAsicSonicHost vlab-01>
event_thread = <InterruptableThread(Thread-27, started 140285411542784)>
filter_event_regex = 'sonic-events-swss:if-state'
gnxi_path  = '/root/gnxi/'
op_file    = 'logs/telemetry/files/if_state.json'
ptfhost    = <tests.common.devices.ptf.PTFHost object at 0x7f96bea94340>
results    = ['']
thread_timeout = 30

telemetry/telemetry_utils.py:111: AssertionError
=============================== warnings summary ===============================
../../../../usr/local/lib/python3.8/dist-packages/_yaml/__init__.py:18
  /usr/local/lib/python3.8/dist-packages/_yaml/__init__.py:18: DeprecationWarning: The _yaml extension module is now located at yaml._yaml and its location is subject to change.  To use the LibYAML-based parser and emitter, import from `yaml`: `from yaml import CLoader as Loader, CDumper as Dumper`.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
- generated xml file: /var/src/sonic-mgmt/tests/logs/telemetry/test_events|||2.xml -
=========================== short test summary info ============================
FAILED telemetry/test_events.py::test_events[vlab-01] - AssertionError: No ou...
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
=================== 1 failed, 1 warning in 92.86s (0:01:32) ====================

This differs from the previous test failure, which was due to the known DualToR testbed issue.
Let me trigger another run to see if it was just a glitch.
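For reference, the timeout mechanism behind this assertion can be reproduced with plain threading (InterruptableThread is sonic-mgmt's wrapper; the worker below is a stand-in for the gNMI client call, with made-up delays standing in for the 30 s thread_timeout):

```python
import threading
import time


def listen_for_event(results, delay, payload):
    # Stand-in for the PTF gNMI client call: it only produces output
    # after `delay` seconds.
    time.sleep(delay)
    results[0] = payload


# Worker too slow: join() returns once the timeout elapses, with the
# shared results list untouched -- exactly what trips the assertion.
results = [""]
t = threading.Thread(target=listen_for_event, args=(results, 2.0, "event"), daemon=True)
t.start()
t.join(0.2)  # analogous to thread_timeout=30 in the test
slow_result = results[0]  # still ""

# Worker fast enough: the result arrives before the join deadline.
results2 = [""]
t2 = threading.Thread(target=listen_for_event, args=(results2, 0.0, "event"), daemon=True)
t2.start()
t2.join(2.0)
fast_result = results2[0]  # "event"
```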

@spilkey-cisco
Contributor Author

@gechiang sonic-mgmt all passed now, is anything else needed to merge?

@spilkey-cisco
Contributor Author

@gechiang @prgeor any idea why semgrep is not reporting any results? PR is approved and all other checks have passed.

@gechiang
Contributor

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@prgeor prgeor requested a review from keboliu October 17, 2023 16:29
Contributor

@prgeor prgeor left a comment

@keboliu to review

Contributor

@JibinBao JibinBao left a comment

The reason for removing test_get_fans_speed_tolerance is that we no longer need the get_speed_tolerance API, right?

Contributor

@JibinBao JibinBao left a comment

  1. Are is_under_speed and is_over_speed new APIs? If yes, which version starts supporting them?
  2. From now on we don't need the get_speed_tolerance API, right?

@prgeor
Contributor

prgeor commented Nov 20, 2023

@prgeor Can you help merge this PR?

@wangxin please merge

@spilkey-cisco
Contributor Author

Who is able to merge? "Reviewer with write access" approval is satisfied already.

@prgeor
Contributor

prgeor commented Nov 28, 2023

@yxieca @wangxin could you please trigger the semgrep check again? For some reason it's not running, and it is blocking this PR. I don't have permission either...

@yxieca
Collaborator

yxieca commented Nov 28, 2023

/semgrep

@yxieca
Collaborator

yxieca commented Nov 28, 2023

@yxieca @wangxin could you please trigger the semgrep check again? For some reason it's not running, and it is blocking this PR. I don't have permission either...

Did you try the above method? @prgeor

@gechiang
Contributor

@yxieca , looks like it is still stuck on Semgrep...

@yxieca
Collaborator

yxieca commented Nov 29, 2023

/Semgrep

@yxieca
Collaborator

yxieca commented Nov 29, 2023

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@abdosi abdosi self-requested a review December 15, 2023 22:37
@yxieca
Collaborator

yxieca commented Dec 15, 2023

/Semgrep

@qiluo-msft
Contributor

For the semgrep trigger issue, @spilkey-cisco could you rebase the commits onto the latest master, or merge master into your dev branch? Then the semgrep config will be present, which will trigger the semgrep check properly.

@spilkey-cisco
Contributor Author

For the semgrep trigger issue, @spilkey-cisco could you rebase the commits onto the latest master, or merge master into your dev branch? Then the semgrep config will be present, which will trigger the semgrep check properly.

Semgrep passed, but there are failures now that were not hit last time. I'm unable to view any details about the failures; can anyone help to provide failure details if they are related to these changes, or rerun the tests if not?

@gechiang
Contributor

For the semgrep trigger issue, @spilkey-cisco could you rebase the commits onto the latest master, or merge master into your dev branch? Then the semgrep config will be present, which will trigger the semgrep check properly.

Semgrep passed, but there are failures now that were not hit last time. I'm unable to view any details about the failures; can anyone help to provide failure details if they are related to these changes, or rerun the tests if not?

I reran the failed test just now. Let's see if the rerun passes...

@bpar9
Collaborator

bpar9 commented Dec 21, 2023

@yxieca, @gechiang, now that semgrep and the rerun of the failed tests are successful, can we go ahead and merge this? Thanks.

@yxieca yxieca merged commit 350f4a3 into sonic-net:master Dec 21, 2023
@shivuv

shivuv commented Dec 21, 2023

@gechiang @abdosi : Can you please help to cherry pick this into 202205, 202305 and 202311 branches?

@abdosi
Contributor

abdosi commented Dec 29, 2023

@gechiang @abdosi : Can you please help to cherry pick this into 202205, 202305 and 202311 branches?

We are not taking this bug fix for 202205, so this test case enhancement will be backported to 202305.

@spilkey-cisco
Contributor Author

@gechiang @abdosi Please cherry-pick into 202305. Let me know if a new PR is needed due to merge conflicts.

mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Feb 1, 2024
…ke delay to wait for fan speeds to settle vendor customizable (sonic-net#8587)
@mssonicbld
Collaborator

Cherry-pick PR to 202305: #11482
