Merge EVERFLOW test updates#1009

Closed
stcheng wants to merge 68 commits into sonic-net:master from stcheng:201811

@stcheng stcheng commented Jul 15, 2019

[everflow]: Replace deprecated mirror_session.py file with CLI (#895)
[everflow]: Change the test command from ip route to vtysh (#920)
[everflow]: Remove deprecated tests (#923)
[everflow]: Remove unused variables (#931)
[everflow]: Add pause after route change (#942)
[EVERFLOW]: Add EVERFLOW policer test with DSCP value/mask (#932)
[everflow]: Fix the tearing down procedure order (#988)

yxieca and others added 30 commits February 4, 2019 02:01
Ansible expires internal variables in its cache after 10 minutes by default.
However, we have some playbooks that run for more than 10 minutes and would
lose some ansible variables during execution. The error message wasn't
super clear. It would be something like following:

fatal: [host_name]: FAILED! => {"failed": true, "msg": "ERROR! 'ansible_Ethernet0' is undefined"}

And the line number would point at the next line after the task actually
took too long.

Warm reboot test is currently taking about 11 minutes to complete and
it triggered this caching issue.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Signed-off-by: Mykola Faryma <mykolaf@mellanox.com>
Fix the following issue:
AttributeError: 'module' object has no attribute 'Ether'

Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
…tected (sonic-net#805)

* [warm/fast reboot] fail fast/warm reboot test if BGP GR timeout is detected

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [warm/fast reboot] get the actual ASN for printing

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
)

- Remove topology check in config.yml since testcase is already gating topology list.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
…et#798)

The snmp pfc_counter check is to validate if the interfaces have
expected attributes. However, when expected attributes are missing, the
script failed with error "'dict object' has no attribute". This change
improved the error handling using a more robust syntax.

Signed-off-by: Xin Wang <xinw@mellanox.com>
The current set of ACL rules will block BGP traffic. All the routes
learned via BGP will timeout after the ACL rules are loaded. Then
there is no route for the PTF injected packets. Packets accepted by
ACL rules would not be forwarded anyway. This update is to add two
ACL rules to allow BGP traffic.

Signed-off-by: Xin Wang <xinw@mellanox.com>
Signed-off-by: Roman Kachur <romankac@mellanox.com>
)

In ARP testing, the script needs to run "ip neigh flush all" a couple of
times to clean up ARP table. Occasionally flushing ARP table may fail
with error "*** Flush not complete bailing out after 10 rounds" on ptf32
or ptf64 topology.

The reason is that BGP peers are configured on these topologies although
the PTF container does not have BGP running. From time to time, DUT will
try to contact the BGP peers. ARP requests are firstly sent out. This
will create some INCOMPLETE entries in ARP table for the BGP peers. If
"ip neigh flush all" is executed at the same moment, the command may
fail with "*** Flush not complete bailing out after 10 rounds".

The fix is to simply ignore the error of "ip neigh flush all". The purpose
of the "ip neigh flush all" command is to clean up ARP entries generated
by previous testing. It doesn't matter if ARP entries generated by other
activities are not flushed.

Meanwhile, I changed the command to "ip -stats neigh flush all" to
have more detailed output for debug in case of failure.
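
The "run it, but tolerate a failure" behavior described above can be sketched in Python as follows. This is only an illustration, not the playbook change itself: the helper names `run_best_effort` and `flush_arp_table` are hypothetical, and the only detail taken from the commit message is the `ip -stats neigh flush all` command.

```python
import subprocess
import sys

# Command taken from the commit message; everything else here is illustrative.
FLUSH_CMD = ["ip", "-stats", "neigh", "flush", "all"]


def run_best_effort(cmd):
    """Run cmd and return (returncode, combined output) without raising on failure."""
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,  # a non-zero exit code is reported, not raised
    )
    return result.returncode, result.stdout


def flush_arp_table(runner=run_best_effort):
    """Flush the ARP table; log a failed flush instead of failing the test."""
    returncode, output = runner(FLUSH_CMD)
    if returncode != 0:
        # Keep the detailed -stats output for debugging, then carry on.
        sys.stderr.write("ignoring flush failure (rc=%d): %s\n" % (returncode, output.strip()))
    return output
```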

Signed-off-by: Xin Wang <xinw@mellanox.com>
In logAnalyzer related scripts, ansible_date_time is used for generating
testname_unique which will be used as folder name for storing
logAnalyzer results. The drawback is that ansible_date_time is fixed
in single ansible playbook execution. If logAnalyzer is called
multiple times, the earlier results could be overwritten by the latest
results.

The fix is to replace ansible_date_time with:
{{lookup('pipe','date +%Y-%m-%d-%H:%M:%S')}}

There are two differences:
1. Each time the lookup plugin is called, the returned date and time
   would be different (interval of two calls longer than 1 second)
2. The lookup plugin gets date and time of the sonic-mgmt container,
   not date and time of DUT.
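
As a rough Python analogue of the first difference, the sketch below builds a folder name from a timestamp that is read at call time, so two calls more than a second apart yield different names. The function name `unique_test_name` and the `<testname>.<timestamp>` layout are assumptions for illustration; only the date format string comes from the commit message.

```python
from datetime import datetime

# Same format as the `date +%Y-%m-%d-%H:%M:%S` call in the commit message.
TIMESTAMP_FORMAT = "%Y-%m-%d-%H:%M:%S"


def unique_test_name(testname, now=None):
    """Return a result-folder name that embeds the time of this call.

    `now` can be injected for testing; by default the local clock of the
    machine running the script is read on every call.
    """
    if now is None:
        now = datetime.now()
    return "{}.{}".format(testname, now.strftime(TIMESTAMP_FORMAT))
```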

Signed-off-by: Xin Wang <xinw@mellanox.com>
…tch (sonic-net#826)

192.168.0.0/21 belongs to the VLAN. The leaf switch should not advertise routes
in this segment to carve out IP address ranges from the VLAN. Doing so would
cause T1 to VLAN IO validation failure. Fast/warm reboot tests, which depend on
the IO validation, would also fail.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* [arp_responder] only respond to arp request, ignore arp response

- Check incoming packet type, if it is ARP response, don't reply again.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [warm/fast reboot] add vlan arp ping check during warm reboot

- Making sure that ARP within vlan continue working during warm reboot

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* Define and use ARP_OP_REQUEST
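
A minimal sketch of the opcode check described in these commits, assuming the responder sees raw Ethernet frames that are already known to carry ARP. The helper names and the frame builder are hypothetical; the opcode values 1 (request) and 2 (reply) and the field offsets follow the standard ARP packet layout.

```python
import struct

ETH_HEADER_LEN = 14                   # dst MAC (6) + src MAC (6) + EtherType (2)
ARP_OP_OFFSET = ETH_HEADER_LEN + 6    # skip htype (2), ptype (2), hlen (1), plen (1)
ARP_OP_REQUEST = 1
ARP_OP_REPLY = 2


def arp_opcode(frame):
    """Extract the 16-bit ARP operation field from a raw Ethernet frame."""
    (opcode,) = struct.unpack_from("!H", frame, ARP_OP_OFFSET)
    return opcode


def should_respond(frame):
    """Answer ARP requests only; ignore replies and truncated frames."""
    if len(frame) < ARP_OP_OFFSET + 2:
        return False
    return arp_opcode(frame) == ARP_OP_REQUEST


def build_arp_frame(opcode):
    """Build a dummy ARP-over-Ethernet frame for experiments (addresses are zeroed)."""
    eth = b"\xff" * 6 + b"\x00\x11\x22\x33\x44\x55" + struct.pack("!H", 0x0806)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, opcode) + b"\x00" * 20
    return eth + arp
```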
…c-net#833)

- Refactoring repeating code to StateMachine class.
- Tracking flooding state in StateMachine too.
- Also watch VLAN ARP state transitions but not failing on it.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
…t#831)

Previous PR sonic-net#822 broke the log analyzer. This commit is to fix the issue
introduced by PR sonic-net#822.

PR sonic-net#822 is to replace the method of getting timestamp from
ansible_date_time to the pipe plugin. Log analyzer needs a variable
testname_unique to have timestamp in its value.

Script run_command_with_log_analyzer.yml uses include_vars to include
variables defined in vars/run_config_test_vars.yml. Ansible includes
variables dynamically when 'include_vars' is used. Consequence is that
the testname_unique variable defined in the vars/run_config_test_vars.yml
is re-evaluated when it is referenced. This caused log analyzer to use
inconsistent start and end mark.

To fix this issue, a unique timestamp is always generated before
include_vars. Then the timestamp is fed to the testname_unique
defined in the vars file. This approach can guarantee consistent
run id for each log analyzer execution. And the run id would also
be different for different log analyzer execution.
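
The difference between re-evaluating the timestamp on every reference and generating it once can be illustrated with the Python sketch below. It is an analogy rather than the Ansible change: the marker strings, function names, and the injected `clock` callable are all hypothetical.

```python
def markers_reevaluated(testname, clock):
    """Bug pattern: the timestamp is looked up again on each reference."""
    start = "start-{}.{}".format(testname, clock())
    end = "end-{}.{}".format(testname, clock())
    return start, end


def markers_fixed(testname, clock):
    """Fix pattern: read the clock once and reuse the resulting run id."""
    run_id = "{}.{}".format(testname, clock())
    return "start-" + run_id, "end-" + run_id


def fake_clock(ticks):
    """Return a callable that yields the given timestamps one by one."""
    iterator = iter(ticks)
    return lambda: next(iterator)
```

With a clock that advances between calls, `markers_reevaluated` produces start and end markers with different suffixes, while `markers_fixed` keeps them identical within one run and different across runs.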

The other scripts changed in this commit have a similar issue.

Signed-off-by: Xin Wang <xinw@mellanox.com>
…ic-net#839)

The existing script does not restore lag rate setting on VMs in case
of failure. This improvement is to restore lag setting on VMs if the
testing failed.

Signed-off-by: Xin Wang <xinw@mellanox.com>
…onic-net#842)

After applying acltb_test_rules_part_1.json, BGP sessions may go down
before we apply acltb_test_rules_part_2.json (which has the BGP ACL forward
rules). This results in BGP flaps during the ptf test run.
It is safer to apply the BGP ACL forward rules first to avoid BGP flapping.

Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
…cted (sonic-net#844)

PR sonic-net#831 does not fully fix the issue introduced by PR sonic-net#822. Ansible's
include_vars module cannot override a variable value previously defined
by set_fact. Variables in vars/run_config_test_vars.yml may still have
old values.

The change is to avoid using include_vars. The variables defined in
run_config_test_vars.yml are moved into script
run_command_with_log_analyzer.yml. The vars files are deleted.

The same change is made to other scripts using the same pattern.

Signed-off-by: Xin Wang <xinw@mellanox.com>
Using the apt_key module to get the official Docker GPG key causes a
cert validation issue. Replace the apt_key module with the 'curl' command
recommended on the official Docker documentation site.
… PTF container (sonic-net#836)

The PTF container will be destroyed if testbed-cli.sh remove-topo is
executed. Running testbed-cli.sh add-topo will add a new PTF container.
Usually the new PTF container will have a new MAC address. If add-topo
is executed immediately after remove-topo, the ARP tables of neighbor
switches and hosts may still have an entry for the old PTF MAC address. This
would cause connectivity issues to the new PTF container for a while,
until the old PTF MAC address expires.

This workaround is to send out an ARPing from the PTF container querying
mgmt_gw after the new PTF container is deployed and attached to the network.
The ARPing request will be broadcast to all neighbors on the same LAN and
will refresh the ARP tables of neighbors with the new MAC address of the new PTF.

Signed-off-by: Xin Wang <xinw@mellanox.com>
…t#857)

Otherwise, if 2 systems have names where one is prefix of the other one, parsing of the
shorter name will come up with 2 lines.
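
The prefix problem can be reproduced with a small Python sketch. The host names, the line layout, and both helper functions are hypothetical; the commit title is truncated, so this only illustrates the failure mode and one possible exact-match remedy, not the actual change.

```python
def lines_matching_prefix(lines, name):
    """Bug pattern: a prefix match also returns hosts whose name merely starts with `name`."""
    return [line for line in lines if line.split() and line.split()[0].startswith(name)]


def lines_matching_exact(lines, name):
    """Compare the whole first field so each host name selects exactly its own line."""
    return [line for line in lines if line.split() and line.split()[0] == name]


# Hypothetical inventory: the first name is a prefix of the second.
INVENTORY = [
    "str-dut-1 10.0.0.1",
    "str-dut-10 10.0.0.10",
]
```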

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
sonic-net/sonic-utilities#504

This is to make all the commands backwards compatible

Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
…sonic-net#823)

* [ptf_runner] Save ptf log to script executing host in case of failure

The PTF log and pcap files are useful for debugging when a PTF
script fails. However, these files are in the PTF container and could
be lost when the PTF container is re-deployed.

This improvement is to save the log and pcap files to the script
executing host when the PTF script fails.

Signed-off-by: Xin Wang <xinw@mellanox.com>

* [ptf_runner] Add option for specifying whether to save ptf log

The previous commit changed the default behavior. This change is to add
an option for specifying whether to save ptf log in case of failure.
For example: ansible-playbook <some_test>.yml ... -e save_ptf_log=yes
…hange (sonic-net#866)

PFC_WD_TABLE --> PFC_WD

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
…nic-net#875)

Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
* [fast/warm reboot] improve new image installation code

- Allow new_sonic_image being defined as empty string. It causes skipping image installation.
- Rename new_image_location to a generic name.
- Display defined new image url.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [fast/warm reboot] allow DUT to stay in the warm/fast reboot target release

This feature is needed in order to test the upgrade path, where we might upgrade from one version
to another, and more. We want the system to stay in the target release for the next steps.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* Address review comments, test issues and some minor touch-ups

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [fast/warm reboot] add knob to clean up old images on DUT before warm/fast reboot

When a new image is specified for fast/warm reboot, the new image will be installed.
However, if the specified image is already installed on the target DUT, then
sonic_install will fail and fast/warm reboot will reboot into the current image.

Add a knob to clean up old images so that installing the new image has a
better chance to succeed.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* address review issue
…les (sonic-net#865)

ntpd may generate 'ERR ntpd' in syslog and cause unnecessary test
case failures. Previous PR sonic-net#816
added a matching pattern of 'ERR ntpd' to the loganalyzer ignore files to
ignore the ntpd error messages. However, ntpd may generate two formats
of error messages. The previously added matching pattern can only match
one of the formats. This change is to update the pattern to match both
of the formats.

Signed-off-by: Xin Wang <xinw@mellanox.com>
* Add many testcases support to t0-56
* Fix bgp_speaker for t0-56
yvolynets-mlnx and others added 25 commits June 4, 2019 17:32
…c-net#930)

* Moved image processing from advanced-reboot.yml to separate file reboot-image-handle.yml
* Extend warm-reboot test to include the BGP sad path
…net#893)

* Do not crash in case data plane never stop on fast-reboot
* preboot LAG sad path automation for neigh_lag_down and dut_lag_down scenarios
* Fix testbed_mtu for tasks that invoke fib_test

* Set socket buffer size to 16k
)

- Improve the data test warm up code:
  Let the data plane IO stabilize for 30 seconds before testing.
  We observed ptf instability causing the test to fail.
- Remove config_db.json when fast-reboot into a new image.
  We want the new image to reload minigraph in this case.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* [fdb_mac_expire.yml]: FDB MAC Expire test case.
[fdb_mac_expire_test.py]: PTF helper to add Mac in L2 table.
[fdb.yml]: include fdb_mac_expire.yml.

This test case verifies that MAC expires within 10 mins if traffic
is not flowing using it.

Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com>

* [fdb_mac_expire.yml]: FDB MAC Expire test case.
[fdb_mac_expire_test.py]: PTF helper to add Mac in L2 table.
[testcases.yml]: include fdb_mac_expire.yml.

This test case verifies that MAC expires within 10 mins if traffic
is not flowing using it.

Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com>

* [fdb_mac_expire.yml]: Incorporate swssconfig step to set fdb_aging_timer in fdb_mac_expire.yml

Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com>

* [fdb_mac_expire.yml]: minor changes in logs

Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com>

* [fdb_mac_expire.yml]: minor log changes to show time correctly.
Example:
"MAC Entires are Cleared within 100 secs."
instead of
"MAC Entires are Cleared within 2*50 secs."

Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com>

* [fdb_mac_expire.yml]: Address review comments related to sonic-clear, -it option and block-always.

Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com>

* [fdb_mac_expire.yml]: Change "sonic-clear fdb all" to "Clear FDB table".

Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com>
…ore rebooting (sonic-net#975)

- fast-reboot script is an adapted version from 201811 branch. The change is around syncd
  stop: in 201803 branch, if it is Broadcom platform, request syncd to perform cold shutdown.
- Mellanox 201803 branch has a vlan FDB issue causing all vlan IO to flood. Add a knob
  allow_vlan_flooding to ignore this symptom and continue with fast-reboot.

Signed-off-by: Ying Xie <ying.xie@microsoft.com>
…analyzer.yml (sonic-net#963)

The copy files task was after the fail tests. In case of failure, the
copy task would never get a chance to run. This commit
adjusted the task sequence. In case of failure, copy the files, then
fail the test.
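
The reordered sequence (collect the files first, then fail) can be sketched in Python as below. The function and parameter names are hypothetical stand-ins for the playbook tasks, and raising `AssertionError` is just one way to represent "fail the test".

```python
def run_with_log_collection(check_logs, copy_result_files):
    """Collect result files before reporting a failure.

    check_logs: callable returning a list of error lines found in the logs.
    copy_result_files: callable that copies analyzer results off the device.
    """
    errors = check_logs()
    # Copy first, so the results survive even when the test is about to fail.
    copy_result_files()
    if errors:
        raise AssertionError("log analyzer found {} error(s): {}".format(len(errors), errors))
    return errors
```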

The original copy task copies files with deep folder structure.
This issue was also fixed in this commit.

Signed-off-by: Xin Wang <xinw@mellanox.com>
…sonic-net#968)

* fix grep ipv6 addr issue

* Add Mellanox onyx fanout switch deploy yml and template

* fix typo

* remove debug code

* revert the change to check_pfcwd_fanout.yml and deploy_pfcwd_fanout.yml

* fix typo
Signed-off-by: Nazarii Hnydyn <nazariig@mellanox.com>
…sonic-net#997)

By default, the log analyzer generates a dump that collects all the
available log files in case of failure. This is unnecessary and
the dump file could be too big.
This fix is to generate a dump to collect log within 1 hour by default.
If more log is needed, parameter 'dump_since' can be used.

Signed-off-by: Xin Wang <xinw@mellanox.com>
)

* Upgrade FW for mellanox before fast-reboot

* Move some condition check to the main file
)

* [warm/fast reboot] make sure that /etc/sonic/config_db.json exists after upgrade

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* [warm reboot] save config after warm reboot into new image

When a new image is defined, the test removes /host/config_db.json
before warm rebooting. So after the device boots up, it will
miss /etc/sonic/config_db.json. It is not an issue for the
device to stay up. But it will be an issue when the device reboots
again (cold or fast).

Signed-off-by: Ying Xie <ying.xie@microsoft.com>

* review comments
…-net#895)

mirror_session.py file is deprecated; use config mirror_session command instead

Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
…#920)

Due to the current issues with ip route change with FRR,
change all the ip route commands to vtysh commands.

Remove the current testcase_6 since it's overlapped with testcase_8.

Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
Stabilize the test by adding a pause after the route change

Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
…#932)

Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
@stcheng stcheng closed this Jul 15, 2019
@stcheng stcheng deleted the 201811 branch July 15, 2019 18:58
ronan-nexthop added a commit to nexthop-ai/sonic-mgmt that referenced this pull request Feb 13, 2026
…d_services.py (sonic-net#1009)

### Description of PR


Summary: The test looks for the presence of the telemetry image, and if
found tries to start the container. A change upstream resulted in the
telemetry container no longer being started by default, so the presence
of the image does not imply the container is running.

We could change this test to manually start telemetry, but I think a
better solution is to move to gNMI, whose image is always present and
should always be started.

Fixes # (issue)

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement


### Back port request
I'm not 100% sure if this is needed in 05, but running this in 05 will not
break the test. So either the test is failing and will now pass, or it's
passing and will continue to pass.
- [x] 202505
- [x] 202511

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?
Manually ran the test.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Co-authored-by: Ronan Mac Fhlannchadha <ronan@nexthop.io>
fraserg-arista pushed a commit to fraserg-arista/sonic-mgmt that referenced this pull request Feb 24, 2026
### Description of PR
The current calculation method for the RX PPS rate for the COPP tests is
not very accurate (in some cases 130-150% above nominal) due to the
reliance of sampling received packets at the ptf container on the
testbed server after sending the packet stream, while also using a
timeout to wait for said packets to finish arriving. This does not
capture the in-flight RX PPS rate, but rather takes an average outside
of the actual packet transmission window, and incurs additional
inaccuracies due to the wait time at the end.

A more accurate approach implemented here is to take two snapshots of
the RX packet count at the NN agent on the dut itself while the packet
stream is already running, and calculate the difference.
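
A minimal sketch of the two-snapshot calculation, assuming a callable that returns the current RX packet count. The names `rx_pps`, `measure_rx_pps`, and `read_rx_count` are hypothetical and do not reflect the actual test code or the agent's API.

```python
import time


def rx_pps(count_start, count_end, t_start, t_end):
    """Packets per second between two counter snapshots."""
    elapsed = t_end - t_start
    if elapsed <= 0:
        raise ValueError("second snapshot must be taken after the first")
    return (count_end - count_start) / float(elapsed)


def measure_rx_pps(read_rx_count, interval_s=1.0, clock=time.monotonic, sleep=time.sleep):
    """Take two snapshots while traffic is flowing and return the in-flight rate.

    read_rx_count: callable returning the current RX packet count (hypothetical).
    clock and sleep are injectable so the function can be exercised without waiting.
    """
    t_start, count_start = clock(), read_rx_count()
    sleep(interval_s)
    t_end, count_end = clock(), read_rx_count()
    return rx_pps(count_start, count_end, t_start, t_end)
```

Because both snapshots fall inside the transmission window, the result is not skewed by the idle wait at the end of the stream.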



Summary:
Fixes # (issue)

### Type of change
- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [x] 202412
- [x] 202511

### Approach
#### What is the motivation for this PR?
To fix neighbor_miss tests failing for TH5 duts on 202412, and enhance
the COPP tests overall for more accurate results.

#### How did you do it?
Updated the calculation method used to get the RX PPS rate for the COPP
tests.

#### How did you verify/test it?
Ran the copp tests and verified that the resulting RX PPS values were
within range.

---------

Signed-off-by: Christopher Croy <ccroy@arista.com>