Merge EVERFLOW test updates by stcheng · Pull Request #1009 · sonic-net/sonic-mgmt

stcheng · 2019-07-15T18:57:15Z

[everflow]: Replace deprecated mirror_session.py file with CLI (#895)
[everflow]: Change the test command from ip route to vtysh (#920)
[everflow]: Remove deprecated tests (#923)
[everflow]: Remove unused variables (#931)
[everflow]: Add pause after route change (#942)
[EVERFLOW]: Add EVERFLOW policer test with DSCP value/mask (#932)
[everflow]: Fix the tearing down procedure order (#988)

Ansible expires its cached internal variables after 10 minutes by default. However, some of our playbooks run for more than 10 minutes and would lose some Ansible variables during execution. The error message wasn't very clear. It would be something like the following: fatal: [host_name]: FAILED! => {"failed": true, "msg": "ERROR! 'ansible _Ethernet0' is undefined"} And the line number would point at the line after the task that actually took too long. The warm reboot test currently takes about 11 minutes to complete, and it triggered this caching issue. Signed-off-by: Ying Xie <ying.xie@microsoft.com>

…tected (sonic-net#805) * [warm/fast reboot] fail fast/warm reboot test if BGP GR timeout is detected Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [warm/fast reboot] get the actual ASN for printing Signed-off-by: Ying Xie <ying.xie@microsoft.com>

…et#798) The snmp pfc_counter check is to validate if the interfaces have expected attributes. However, when expected attributes are missing, the script failed with error "'dict object' has no attribute". This change improved the error handling using a more robust syntax. Signed-off-by: Xin Wang <xinw@mellanox.com>

The current set of ACL rules will block BGP traffic. All the routes learned via BGP will timeout after the ACL rules are loaded. Then there is no route for the PTF injected packets. Packets accepted by ACL rules would not be forwarded anyway. This update is to add two ACL rules to allow BGP traffic. Signed-off-by: Xin Wang <xinw@mellanox.com>

) In ARP testing, the script needs to run "ip neigh flush all" a couple of times to clean up the ARP table. Occasionally, flushing the ARP table may fail with the error "*** Flush not complete bailing out after 10 rounds" on the ptf32 or ptf64 topology. The reason is that BGP peers are configured on these topologies although the PTF container does not have BGP running. From time to time, the DUT will try to contact the BGP peers, and ARP requests are sent out first. This creates some INCOMPLETE entries in the ARP table for the BGP peers. If "ip neigh flush all" is executed at the same moment, the command may fail with "*** Flush not complete bailing out after 10 rounds". The fix is to simply ignore the error from "ip neigh flush all". The purpose of the "ip neigh flush all" command is to clean up ARP entries generated by previous testing. It doesn't matter if ARP entries generated by other activities are not flushed. Meanwhile, I changed the command to "ip -stats neigh flush all" to have more detailed output for debugging in case of failure. Signed-off-by: Xin Wang <xinw@mellanox.com>

In logAnalyzer related scripts, ansible_date_time is used for generating testname_unique which will be used as folder name for storing logAnalyzer results. The drawback is that ansible_date_time is fixed in single ansible playbook execution. If logAnalyzer is called multiple times, the earlier results could be overwritten by the latest results. The fix is to replace ansible_date_time with: {{lookup('pipe','date +%Y-%m-%d-%H:%M:%S')}} There are two differences: 1. Each time the lookup plugin is called, the returned date and time would be different (interval of two calls longer than 1 second) 2. The lookup plugin gets date and time of the sonic-mgmt container, not date and time of DUT. Signed-off-by: Xin Wang <xinw@mellanox.com>
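
The idea of reading the clock on every call, rather than reusing a value captured once at the start of the run, can be sketched in Python. This is only an illustration and not the playbook change itself: the function name `make_testname_unique` and the `name.timestamp` layout are assumptions, and only the `%Y-%m-%d-%H:%M:%S` format comes from the commit message.

```python
from datetime import datetime


def make_testname_unique(testname, now=None):
    """Build a per-call result folder name (hypothetical layout: name.timestamp)."""
    # Read the clock on every call instead of reusing a timestamp cached at
    # the start of the run, so that separate calls get separate names.
    moment = now if now is not None else datetime.now()
    return "{}.{}".format(testname, moment.strftime("%Y-%m-%d-%H:%M:%S"))


if __name__ == "__main__":
    print(make_testname_unique("acl"))
```

As the commit message notes, the resolution is one second, so two calls made within the same second would still collide.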

…tch (sonic-net#826) 192.168.0.0/21 belongs to the VLAN. The leaf switch should not advertise routes in this segment that carve out IP address ranges from the VLAN. Doing so would cause T1 to VLAN IO validation failure. Fast/warm reboot tests that depend on the IO validation would also fail. Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [arp_responder] only respond to ARP requests, ignore ARP responses - Check the incoming packet type; if it is an ARP response, don't reply again. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [warm/fast reboot] add vlan arp ping check during warm reboot - Make sure that ARP within the VLAN continues working during warm reboot Signed-off-by: Ying Xie <ying.xie@microsoft.com> * Define and use ARP_OP_REQUEST
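
The opcode check described above can be sketched as follows. This is a minimal standalone illustration, not the arp_responder code from the repository: `should_reply` and `build_frame` are hypothetical helpers, untagged Ethernet framing is assumed, and only the name ARP_OP_REQUEST comes from the commit message (its value, 1, is the standard ARP request opcode).

```python
import struct

ARP_OP_REQUEST = 1   # standard ARP opcode for a request
ARP_OP_REPLY = 2     # standard ARP opcode for a reply
ETHERTYPE_ARP = 0x0806
ETH_HEADER_LEN = 14  # assumes an untagged Ethernet frame


def should_reply(frame):
    """Return True only when the frame carries an ARP request."""
    if len(frame) < ETH_HEADER_LEN + 8:
        return False
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_ARP:
        return False
    # ARP header layout: htype(2) ptype(2) hlen(1) plen(1) oper(2)
    start = ETH_HEADER_LEN + 6
    (opcode,) = struct.unpack("!H", frame[start:start + 2])
    return opcode == ARP_OP_REQUEST


def build_frame(opcode):
    """Build a dummy broadcast ARP frame with the given opcode (test helper)."""
    eth = b"\xff" * 6 + b"\x00\x11\x22\x33\x44\x55" + struct.pack("!H", ETHERTYPE_ARP)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, opcode) + b"\x00" * 20
    return eth + arp
```

Dropping replies matters because a responder that answers every ARP packet it sees would also answer other hosts' replies.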

…c-net#833) - Refactoring repeating code to StateMachine class. - Tracking flooding state in StateMachine too. - Also watch VLAN ARP state transitions but not failing on it. Signed-off-by: Ying Xie <ying.xie@microsoft.com>

…t#831) Previous PR sonic-net#822 broke the log analyzer. This commit fixes the issue introduced by PR sonic-net#822. PR sonic-net#822 replaced the method of getting the timestamp, from ansible_date_time to the pipe plugin. The log analyzer needs a variable testname_unique to have a timestamp in its value. The script run_command_with_log_analyzer.yml uses include_vars to include variables defined in vars/run_config_test_vars.yml. Ansible includes variables dynamically when 'include_vars' is used. The consequence is that the testname_unique variable defined in vars/run_config_test_vars.yml is re-evaluated each time it is referenced. This caused the log analyzer to use inconsistent start and end marks. To fix this issue, a unique timestamp is always generated before include_vars. Then the timestamp is fed into the testname_unique defined in the vars file. This approach guarantees a consistent run id for each log analyzer execution, and the run id is also different for different log analyzer executions. The other scripts changed in this commit have a similar issue. Signed-off-by: Xin Wang <xinw@mellanox.com>

…onic-net#842) After applying acltb_test_rules_part_1.json BGP sessions may go down before we apply acltb_test_rules_part_2.json (which had BGP ACL forward rules); This results in BGP flap during ptf test run; It is safer to apply BGP ACL forward rules first to avoid BGP flapping. Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>

…cted (sonic-net#844) PR sonic-net#831 does not fully fix the issue introduced by PR sonic-net#822. Ansible's include_vars module could not override a variable value previously defined by set_fact. Variables in vars/run_config_test_vars.yml may still have old values. The change is to avoid using include_vars. The variables defined in run_config_test_vars.yml are moved into the script run_command_with_log_analyzer.yml. The vars files are deleted. The same change is made to other scripts using the same pattern. Signed-off-by: Xin Wang <xinw@mellanox.com>

… PTF container (sonic-net#836) The PTF container will be destroyed if testbed-cli.sh remove-topo is executed. Running testbed-cli.sh add-topo will add a new PTF container. Usually the new PTF container will have a new MAC address. If add-topo is executed immediately after remove-topo, the ARP tables of neighbor switches and hosts may still have an entry for the old PTF MAC address. This would cause connectivity issues to the new PTF container for a while, until the old PTF MAC address expires. This workaround is to send out an ARPing from the PTF container querying mgmt_gw after the new PTF container is deployed and attached to the network. The ARPing request will be broadcast to all neighbors on the same LAN and will refresh the ARP tables of neighbors with the new MAC address of the new PTF. Signed-off-by: Xin Wang <xinw@mellanox.com>

…sonic-net#823) * [ptf_runner] Save ptf log to script executing host in case of failure The PTF log and pcap files are useful for debugging when a PTF script fails. However, these files are in the PTF container and could be lost when the PTF container is re-deployed. This improvement is to save the log and pcap files to the script executing host when the PTF script fails. Signed-off-by: Xin Wang <xinw@mellanox.com> * [ptf_runner] Add option for specifying whether to save ptf log The previous commit changed the default behavior. This change is to add an option for specifying whether to save the ptf log in case of failure. For example: ansible-playbook <some_test>.yml ... -e save_ptf_log=yes

* [fast/warm reboot] improve new image installation code - Allow new_sonic_image to be defined as an empty string, which skips image installation. - Rename new_image_location to a generic name. - Display the defined new image url. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [fast/warm reboot] allow DUT to stay in the warm/fast reboot target release This feature is needed in order to test the upgrade path, where we might upgrade from one version to another, and more. We want the system to stay in the target release for the next steps. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * Address review comments, test issues and some minor touch-ups Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [fast/warm reboot] add knob to clean up old images on DUT before warm/fast reboot When a new image is specified for fast/warm reboot, the new image will be installed. However, if the specified image is already installed on the target DUT, then sonic_install will fail and fast/warm reboot will reboot into the current image. Add a knob to clean up old images so that the installation of the new image will have a better chance to succeed. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * address review issue

…les (sonic-net#865) The ntpd may generate 'ERR ntpd' in syslog and cause unnecessary test case failures. Previous PR sonic-net#816 added a matching pattern of 'ERR ntpd' to the loganalyzer ignore files to ignore the ntpd error messages. However, ntpd may generate two formats of error messages. The previously added matching pattern can only match one of the formats. This change updates the pattern to match both formats. Signed-off-by: Xin Wang <xinw@mellanox.com>

) - Improve the data test warm up code: Let the data plane IO stabilize for 30 seconds before testing. We observed ptf instability causing the test to fail. - Remove config_db.json when fast-rebooting into a new image. We want the new image to reload minigraph in this case. Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [fdb_mac_expire.yml]: FDB MAC Expire test case. [fdb_mac_expire_test.py]: PTF helper to add Mac in L2 table. [fdb.yml]: include fdb_mac_expire.yml. This test case verifies that MAC expires within 10 mins if traffic is not flowing using it. Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com> * [fdb_mac_expire.yml]: FDB MAC Expire test case. [fdb_mac_expire_test.py]: PTF helper to add Mac in L2 table. [testcases.yml]: include fdb_mac_expire.yml. This test case verifies that MAC expires within 10 mins if traffic is not flowing using it. Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com> * [fdb_mac_expire.yml]: Incorporate swssconfig step to set fdb_aging_timer in fdb_mac_expire.yml Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com> * [fdb_mac_expire.yml]: minor changes in logs Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com> * [fdb_mac_expire.yml]: minor log changes to show time correctly. Example: "MAC Entires are Cleared within 100 secs." instead of "MAC Entires are Cleared within 2*50 secs." Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com> * [fdb_mac_expire.yml]: Address review comments related to sonic-clear, -it option and block-always. Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com> * [fdb_mac_expire.yml]: Change "sonic-clear fdb all" to "Clear FDB table". Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com>

…ore rebooting (sonic-net#975) - fast-reboot script is an adapted version from 201811 branch. The change is around syncd stop: in 201803 branch, if it is Broadcom platform, request syncd to perform cold shutdown. - Mellanox 201803 branch has a vlan FDB issue causing all vlan IO to flood. Add a knob allow_vlan_flooding to ignore this symptom and continue with fast-reboot. Signed-off-by: Ying Xie <ying.xie@microsoft.com>

…analyzer.yml (sonic-net#963) The copy files task was after the fail tests. In case of failure, the copy task would never get a chance to run. This commit adjusted the task sequence. In case of failure, copy the files, then fail the test. The original copy task copies files with deep folder structure. This issue was also fixed in this commit. Signed-off-by: Xin Wang <xinw@mellanox.com>

…sonic-net#968) * fix grep ipv6 addr issue * Add Mellanox onyx fanout switch deploy yml and template * fix typo * remove debug code * revert the change to check_pfcwd_fanout.yml and deploy_pfcwd_fanout.yml * fix typo

…sonic-net#997) By default, the log analyzer generates a dump that collects all the available log files in case of failure. This is unnecessary and the dump file could be too big. This fix is to generate a dump that collects logs from the last 1 hour by default. If more logs are needed, the parameter 'dump_since' can be used. Signed-off-by: Xin Wang <xinw@mellanox.com>

) * [warm/fast reboot] make sure that /etc/sonic/config_db.json exists after upgrade Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [warm reboot] save config after warm reboot into new image When a new image is defined, the test removes /host/config_db.json before warm rebooting. So after the device boots up, it will be missing /etc/sonic/config_db.json. This is not an issue for the device to stay up, but it will be an issue when the device reboots again (cold or fast). Signed-off-by: Ying Xie <ying.xie@microsoft.com> * review comments

…#920) Due to the current issues with ip route change with FRR, change all the ip route commands to vtysh commands. Remove the current testcase_6 since it's overlapped with testcase_8. Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>

…d_services.py (sonic-net#1009) ### Description of PR Summary: The test looks for the presence of the telemetry image, and if found tries to start the container. A change upstream resulted in the telemetry container no longer being started by default, so the presence of the image does not imply the container is running. We could change this test to manually start telemetry, but I think a better solution is to move to gNMI, whose image is always present and should always be started. Fixes # (issue) ### Type of change - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request I'm not 100% sure if this is needed in 05, but running this in 05 will not break the test. So either the test is failing and will now pass, or it's passing and will continue to pass - [x] 202505 - [x] 202511 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? Manually ran the test. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation Co-authored-by: Ronan Mac Fhlannchadha <ronan@nexthop.io>

### Description of PR The current calculation method for the RX PPS rate for the COPP tests is not very accurate (in some cases 130-150% above nominal) due to the reliance of sampling received packets at the ptf container on the testbed server after sending the packet stream, while also using a timeout to wait for said packets to finish arriving. This does not capture the in-flight RX PPS rate, but rather takes an average outside of the actual packet transmission window, and incurs additional inaccuracies due to the wait time at the end. A more accurate approach implemented here is to take two snapshots of the RX packet count at the NN agent on the dut itself while the packet stream is already running, and calculate the difference. Summary: Fixes # (issue) ### Type of change - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [x] 202412 - [x] 202511 ### Approach #### What is the motivation for this PR? To fix neighbor_miss tests failing for TH5 duts on 202412, and enhance the COPP tests overall for more accurate results. #### How did you do it? Updated the calculation method used to get the RX PPS rate for the COPP tests. #### How did you verify/test it? Ran the copp tests and verified that the resulting RX PPS values were within range. --------- Signed-off-by: Christopher Croy <ccroy@arista.com>
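
The two-snapshot calculation described above amounts to dividing the change in the RX packet counter by the time elapsed between the two readings, both taken while the stream is still running. The sketch below is an illustration under that reading, not the code from the PR; the function name `rx_pps` and its parameters are hypothetical.

```python
def rx_pps(count_start, time_start, count_end, time_end):
    """Return the receive rate in packets per second between two snapshots.

    Counts are cumulative RX packet counters and times are in seconds.
    Both snapshots are assumed to be taken while traffic is in flight.
    """
    elapsed = time_end - time_start
    if elapsed <= 0:
        raise ValueError("second snapshot must be taken after the first")
    return (count_end - count_start) / elapsed


if __name__ == "__main__":
    # 6000 packets received over 10 seconds
    print(rx_pps(1000, 10.0, 7000, 20.0))
```

Because neither snapshot includes the ramp-up or the post-stream wait, the result is not diluted by idle time the way an average over the whole send-and-wait window is.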

yxieca and others added 30 commits February 4, 2019 02:01

[switch] increase spawn connection timeout (sonic-net#788)

56bea85

Signed-off-by: Mykola Faryma <mykolaf@mellanox.com>

[vxlan_decap]: Fix import issue for the scapy module (sonic-net#801)

94e2ae4

Fix the following issue: AttributeError: 'module' object has no attribute 'Ether' Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>

[ansible] Fix neighbor test case (sonic-net#800)

67d42df

[config test] enable config test case for T0-116 topology (sonic-net#809

1e92b6b

) - Remove topology check in config.yml since testcase is already gating topology list. Signed-off-by: Ying Xie <ying.xie@microsoft.com>

Add 'ERR ntpd' to loganalyzer ignore files (sonic-net#816)

78d53ea

Signed-off-by: Roman Kachur <romankac@mellanox.com>

[lag_rate] Restore lag setting on VMs in case of failure in test (son…

7e0ba68

…ic-net#839) The existing script does not restore lag rate setting on VMs in case of failure. This improvement is to restore lag setting on VMs if the testing failed. Signed-off-by: Xin Wang <xinw@mellanox.com>

[docker] Use recommended CMD for getting docker GPG key (sonic-net#843)

e060b41

If the apt_key module is used to get the Docker official GPG key, there is a cert validation issue. Replace the apt_key module with the 'curl' command recommended on the Docker official documentation site.

Change warm-reboot time limit to 1 second (sonic-net#855)

9d54f43

[link state] look up topology name until the separator char (sonic-ne…

42f4860

…t#857) Otherwise, if 2 systems have names where one is prefix of the other one, parsing of the shorter name will come up with 2 lines. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
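
The prefix problem can be shown with a small sketch. The row layout, the names, and the comma separator below are all assumptions made for illustration; the commit message does not say which file is parsed or which separator character is used.

```python
def find_rows_by_prefix(lines, name):
    """Naive lookup: also matches longer names that start with `name`."""
    return [line for line in lines if line.startswith(name)]


def find_rows_until_separator(lines, name, sep=","):
    """Match the name only when it is followed directly by the separator."""
    needle = name + sep
    return [line for line in lines if line.startswith(needle)]


# Hypothetical rows where one name is a prefix of the other.
rows = ["vms1,t0", "vms1-1,t1"]
```

With these rows, the naive lookup for `vms1` returns both lines, while the separator-aware lookup returns only the first.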

[test]: Change config interface command arguments order (sonic-net#864)

d5b5415

sonic-net/sonic-utilities#504 This is to make all the commands backwards compatible Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>

[pfc_wd] change pfc watchdog table name according to the sonic code c…

dc7d58f

…hange (sonic-net#866) PFC_WD_TABLE --> PFC_WD Signed-off-by: Ying Xie <ying.xie@microsoft.com>

Scrub credential in docker pull command line (sonic-net#869)

0bc2acf

[minigraph]: Fix minigraph parsing error on Mellanox-SN2700-D48C8 (so…

336703a

…nic-net#875) Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>

Add many testcases support to t0-56 (sonic-net#885)

8ac9a88

* Add many testcases support to t0-56 * Fix bgp_speaker for t0-56

yvolynets-mlnx and others added 25 commits June 4, 2019 17:32

Move image processing from advanced-reboot.yml to separate file (soni…

ad550fe

…c-net#930) * Moved image processing from advanced-reboot.yml to separate file reboot-image-handle.yml * Moved image processing from advanced-reboot.yml to separate file reboot-image-handle.yml

[dhcp_relay] More detailed crafting and strict testing of OFFER and A…

80e2c2b

…CK packets (sonic-net#924)

Extend warm-reboot test to include the BGP sad path (sonic-net#926)

57a0ab2

* Extend warm-reboot test to include the BGP sad path

Fix python crash in case data plane never stop on fast-reboot (sonic-…

84f4a6f

…net#893) * Do not crash in case data plane never stop on fast-reboot

[warm-reboot] Add preboot LAG sad path automation (sonic-net#945)

84cd691

* preboot LAG sad path automation for neigh_lag_down and dut_lag_down scenarios

Default to use jumbo frames for this test. MTU is configurable for an…

3f30920

…y smaller values (sonic-net#946)

Fix testbed_mtu for tasks that invoke fib_test (sonic-net#964)

c5846e1

* Fix testbed_mtu for tasks that invoke fib_test * Set socket buffer size to 16k

Remove fast-reboot script and related changes (sonic-net#982)

e52117b

[warm-reboot] Fix the issue where BGP info was not being extracted fr…

e7823f1

…om neigh logs (sonic-net#974)

Improved link flap test: added smart timeout. (sonic-net#977)

a888eff

Signed-off-by: Nazarii Hnydyn <nazariig@mellanox.com>

[fast-reboot] Upgrade FW for mellanox before fast-reboot (sonic-net#1000

e7bb1fd

) * Upgrade FW for mellanox before fast-reboot * Move some condition check to the main file

[everflow]: Replace deprecated mirror_session.py file with CLI (sonic…

b37eeab

…-net#895) mirror_session.py file is deprecated; use config mirror_session command instead Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>

[everflow]: Remove deprecated tests (sonic-net#923)

9057b93

[everflow]: Remove unused variables (sonic-net#931)

922b269

Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>

[everflow]: Add pause after route change (sonic-net#942)

d0440a7

Stabilize the test by adding a pause after the route change Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>

[EVERFLOW]: Add EVERFLOW policer test with DSCP value/mask (sonic-net…

a3fb8ea

…#932) Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>

[everflow]: Fix the tearing down procedure order (sonic-net#988)

22a1721

Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>

stcheng closed this Jul 15, 2019

stcheng deleted the 201811 branch July 15, 2019 18:58


Merge EVERFLOW test updates #1009
stcheng wants to merge 68 commits into sonic-net:master from stcheng:201811


16 participants
