Ansible times out internal variables in its cache after 10 minutes by default.
However, some of our playbooks run for more than 10 minutes and would
lose some Ansible variables during execution. The error message wasn't
very clear. It would look something like the following:
fatal: [host_name]: FAILED! => {"failed": true, "msg": "ERROR! 'ansible_Ethernet0' is undefined"}
And the reported line number would point at the line after the task that
actually took too long.
The warm reboot test currently takes about 11 minutes to complete, which
triggered this caching issue.
Signed-off-by: Ying Xie <ying.xie@microsoft.com>
Signed-off-by: Mykola Faryma <mykolaf@mellanox.com>
Fix the following issue: AttributeError: 'module' object has no attribute 'Ether' Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
…tected (sonic-net#805) * [warm/fast reboot] fail fast/warm reboot test if BGP GR timeout is detected Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [warm/fast reboot] get the actual ASN for printing Signed-off-by: Ying Xie <ying.xie@microsoft.com>
…et#798) The snmp pfc_counter check is to validate if the interfaces have expected attributes. However, when expected attributes are missing, the script failed with error "'dict object' has no attribute". This change improved the error handling using a more robust syntax. Signed-off-by: Xin Wang <xinw@mellanox.com>
The current set of ACL rules blocks BGP traffic. All the routes learned via BGP time out after the ACL rules are loaded, so there is no route for the PTF-injected packets. As a result, even packets accepted by the ACL rules would not be forwarded. This update adds two ACL rules to allow BGP traffic. Signed-off-by: Xin Wang <xinw@mellanox.com>
Signed-off-by: Roman Kachur <romankac@mellanox.com>
) In ARP testing, the script needs to run "ip neigh flush all" a couple of times to clean up the ARP table. Occasionally, flushing the ARP table fails with the error "*** Flush not complete bailing out after 10 rounds" on the ptf32 or ptf64 topology. The reason is that BGP peers are configured on these topologies although the PTF container does not have BGP running. From time to time, the DUT tries to contact the BGP peers, and ARP requests are sent out first. This creates some INCOMPLETE entries in the ARP table for the BGP peers. If "ip neigh flush all" is executed at the same moment, the command may fail with "*** Flush not complete bailing out after 10 rounds". The fix is to simply ignore the error from "ip neigh flush all". The purpose of the "ip neigh flush all" command is to clean up ARP entries generated by previous testing. It doesn't matter if ARP entries generated by other activities are not flushed. Meanwhile, I changed the command to "ip -stats neigh flush all" to get more detailed output for debugging in case of failure. Signed-off-by: Xin Wang <xinw@mellanox.com>
In logAnalyzer-related scripts, ansible_date_time is used to generate
testname_unique, which is used as the folder name for storing
logAnalyzer results. The drawback is that ansible_date_time is fixed
within a single Ansible playbook execution. If logAnalyzer is called
multiple times, the earlier results could be overwritten by the latest
results.
The fix is to replace ansible_date_time with:
{{lookup('pipe','date +%Y-%m-%d-%H:%M:%S')}}
There are two differences:
1. Each time the lookup plugin is called, the returned date and time
are different (provided the two calls are more than 1 second apart).
2. The lookup plugin gets the date and time of the sonic-mgmt container,
not the date and time of the DUT.
Signed-off-by: Xin Wang <xinw@mellanox.com>
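The difference between a timestamp fixed once per run and one taken at each call can be sketched in Python. This is only an illustration of the idea, not the playbook code: the helper name `unique_testname` and its `now` parameter are hypothetical, and the format string simply mirrors `date +%Y-%m-%d-%H:%M:%S`.

```python
import time


def unique_testname(testname, now=None):
    """Return testname suffixed with a timestamp taken at call time.

    `now` (seconds since the epoch) is a hypothetical parameter that only
    exists to make the sketch testable; by default the current time is used.
    """
    stamp = time.strftime("%Y-%m-%d-%H:%M:%S", time.localtime(now))
    return "{}.{}".format(testname, stamp)


# Two calls more than one second apart yield different folder names,
# so a later logAnalyzer run does not overwrite an earlier one.
first = unique_testname("acl", now=0)
second = unique_testname("acl", now=2)
```

Two calls within the same second would still collide, which matches the first difference listed above.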
…tch (sonic-net#826) 192.168.0.0/21 belongs to the VLAN. The leaf switch should not advertise routes in this segment, because that carves IP address ranges out of the VLAN. Doing so would cause T1-to-VLAN IO validation to fail. Fast/warm reboot tests, which depend on the IO validation, would also fail. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* [arp_responder] only respond to arp request, ignore arp response - Check the incoming packet type; if it is an ARP response, don't reply again. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [warm/fast reboot] add vlan arp ping check during warm reboot - Make sure that ARP within the vlan continues working during warm reboot Signed-off-by: Ying Xie <ying.xie@microsoft.com> * Define and use ARP_OP_REQUEST
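The opcode check described in the arp_responder commit can be sketched as follows. This is a minimal illustration, not the actual arp_responder code: it assumes an untagged Ethernet frame (no 802.1Q header), and apart from `ARP_OP_REQUEST`, which the commit names, every identifier here is hypothetical.

```python
import struct

ETH_HEADER_LEN = 14      # dst MAC (6) + src MAC (6) + ethertype (2)
ETHERTYPE_ARP = 0x0806
ARP_OP_REQUEST = 1       # ARP opcode for a request
ARP_OP_REPLY = 2         # ARP opcode for a reply


def is_arp_request(frame):
    """Return True only for ARP requests; replies and non-ARP frames are ignored."""
    if len(frame) < ETH_HEADER_LEN + 8:
        return False
    (ethertype,) = struct.unpack("!H", frame[12:14])
    if ethertype != ETHERTYPE_ARP:
        return False
    # The opcode sits after htype (2), ptype (2), hlen (1) and plen (1).
    (opcode,) = struct.unpack("!H", frame[ETH_HEADER_LEN + 6:ETH_HEADER_LEN + 8])
    return opcode == ARP_OP_REQUEST


def build_arp_frame(opcode):
    """Build a dummy Ethernet/IPv4 ARP frame with the given opcode (illustration only)."""
    eth = b"\xff" * 6 + b"\x02" * 6 + struct.pack("!H", ETHERTYPE_ARP)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, opcode) + b"\x00" * 20
    return eth + arp
```

A responder that replies only when `is_arp_request` returns True never answers an ARP reply.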
…c-net#833) - Refactoring repeating code to StateMachine class. - Tracking flooding state in StateMachine too. - Also watch VLAN ARP state transitions but not failing on it. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
…t#831) Previous PR sonic-net#822 broke the log analyzer. This commit fixes the issue introduced by PR sonic-net#822. PR sonic-net#822 replaced the method of getting the timestamp, from ansible_date_time to the pipe plugin. The log analyzer needs a variable testname_unique with a timestamp in its value. The script run_command_with_log_analyzer.yml uses include_vars to include variables defined in vars/run_config_test_vars.yml. Ansible includes variables dynamically when 'include_vars' is used. The consequence is that the testname_unique variable defined in vars/run_config_test_vars.yml is re-evaluated each time it is referenced. This caused the log analyzer to use inconsistent start and end marks. To fix this issue, a unique timestamp is always generated before include_vars. The timestamp is then fed to the testname_unique defined in the vars file. This approach guarantees a consistent run id within each log analyzer execution, and a different run id for each separate execution. The other scripts changed in this commit have a similar issue. Signed-off-by: Xin Wang <xinw@mellanox.com>
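The re-evaluation problem can be illustrated outside Ansible with a small Python sketch. It is an analogy under stated assumptions, not the playbook logic: `fresh_stamp` stands in for the pipe lookup, and the marker names and helper functions are hypothetical.

```python
import itertools

_counter = itertools.count()


def fresh_stamp():
    """Stand-in for the pipe lookup: returns a new value on every call."""
    return "stamp-{}".format(next(_counter))


def lazy_testname_unique():
    """Behaves like a templated variable that is re-evaluated on each reference."""
    return "test.{}".format(fresh_stamp())


# Broken pattern: each reference produces a different value,
# so the start and end marks no longer share a run id.
lazy_start = "start-" + lazy_testname_unique()
lazy_end = "end-" + lazy_testname_unique()

# Fixed pattern: generate the timestamp once, then derive everything from it.
run_id = fresh_stamp()
testname_unique = "test.{}".format(run_id)
start_marker = "start-" + testname_unique
end_marker = "end-" + testname_unique
```

Capturing the value once before it is consumed is the same idea as generating the timestamp before include_vars.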
…ic-net#839) The existing script does not restore lag rate setting on VMs in case of failure. This improvement is to restore lag setting on VMs if the testing failed. Signed-off-by: Xin Wang <xinw@mellanox.com>
…onic-net#842) After applying acltb_test_rules_part_1.json BGP sessions may go down before we apply acltb_test_rules_part_2.json (which had BGP ACL forward rules); This results in BGP flap during ptf test run; It is safer to apply BGP ACL forward rules first to avoid BGP flapping. Signed-off-by: Stepan Blyschak <stepanb@mellanox.com>
…cted (sonic-net#844) PR sonic-net#831 does not fully fix the issue introduced by PR sonic-net#822. Ansible's include_vars module cannot override a variable value previously defined by set_fact. Variables in vars/run_config_test_vars.yml may still have their old values. The change is to avoid using include_vars. The variables defined in run_config_test_vars.yml are moved into the script run_command_with_log_analyzer.yml. The vars files are deleted. The same change is made to other scripts using the same pattern. Signed-off-by: Xin Wang <xinw@mellanox.com>
If the apt_key module is used to get the official Docker GPG key, there is a cert validation issue. Replace the apt_key module with the 'curl' command recommended on the official Docker documentation site.
… PTF container (sonic-net#836) The PTF container is destroyed if testbed-cli.sh remove-topo is executed. Running testbed-cli.sh add-topo adds a new PTF container. Usually the new PTF container has a new MAC address. If add-topo is executed immediately after remove-topo, the ARP tables of neighbor switches and hosts may still have an entry for the old PTF MAC address. This causes connectivity issues to the new PTF container for a while, until the old PTF MAC address entry expires. This workaround sends out an ARPing from the PTF container querying mgmt_gw after the new PTF container is deployed and attached to the network. The ARPing request is broadcast to all neighbors on the same LAN and refreshes their ARP tables with the new MAC address of the new PTF. Signed-off-by: Xin Wang <xinw@mellanox.com>
…t#857) Otherwise, if 2 systems have names where one is a prefix of the other, parsing for the shorter name will come up with 2 lines. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
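The prefix problem can be reproduced with a short Python sketch. The host names, the line layout, and both helper functions are hypothetical; the commit does not show the file being parsed, so this only illustrates why an exact match on the name is needed.

```python
def lines_by_prefix(lines, name):
    """Naive match: also returns lines for hosts whose name merely starts with `name`."""
    return [line for line in lines if line.startswith(name)]


def lines_by_exact_name(lines, name):
    """Exact match on the first whitespace-separated field."""
    return [line for line in lines if line.split() and line.split()[0] == name]


# Hypothetical inventory: the first host name is a prefix of the second.
inventory = [
    "str-dut-1 10.0.0.1",
    "str-dut-10 10.0.0.10",
]

naive = lines_by_prefix(inventory, "str-dut-1")      # two lines, ambiguous
exact = lines_by_exact_name(inventory, "str-dut-1")  # one line, as intended
```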
sonic-net/sonic-utilities#504 This is to make all the commands backwards compatible Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
…sonic-net#823) * [ptf_runner] Save ptf log to script executing host in case of failure The PTF log and pcap files are useful for debugging when a PTF script fails. However, these files are in the PTF container and could be lost when the PTF container is re-deployed. This improvement saves the log and pcap files to the script executing host when the PTF script fails. Signed-off-by: Xin Wang <xinw@mellanox.com> * [ptf_runner] Add option for specifying whether to save ptf log The previous commit changed the default behavior. This change adds an option for specifying whether to save the ptf log in case of failure. For example: ansible-playbook <some_test>.yml ... -e save_ptf_log=yes
…hange (sonic-net#866) PFC_WD_TABLE --> PFC_WD Signed-off-by: Ying Xie <ying.xie@microsoft.com>
…nic-net#875) Signed-off-by: Qi Luo <qiluo-msft@users.noreply.github.com>
* [fast/warm reboot] improve new image installation code - Allow new_sonic_image to be defined as an empty string, which skips image installation. - Rename new_image_location to a generic name. - Display the defined new image URL. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [fast/warm reboot] allow DUT to stay in the warm/fast reboot target release This feature is needed in order to test the upgrade path, where we might upgrade from one version to another, and more. We want the system to stay in the target release for the next steps. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * Address review comments, test issues and some minor touch-ups Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [fast/warm reboot] add knob to clean up old images on DUT before warm/fast reboot When a new image is specified for fast/warm reboot, the new image will be installed. However, if the specified image is already installed on the target DUT, then sonic_install will fail and fast/warm reboot will reboot into the current image. Add a knob to clean up old images so that installing the new image has a better chance of succeeding. Signed-off-by: Ying Xie <ying.xie@microsoft.com> * address review issue
…les (sonic-net#865) ntpd may generate 'ERR ntpd' in syslog and cause unnecessary test case failures. Previous PR sonic-net#816 added a matching pattern for 'ERR ntpd' to the loganalyzer ignore files to ignore the ntpd error messages. However, ntpd may generate two formats of error messages, and the previously added pattern can only match one of them. This change updates the pattern to match both formats. Signed-off-by: Xin Wang <xinw@mellanox.com>
* Add many testcases support to t0-56 * Fix bgp_speaker for t0-56
…c-net#930) * Moved image processing from advanced-reboot.yml to separate file reboot-image-handle.yml * Moved image processing from advanced-reboot.yml to separate file reboot-image-handle.yml
* Extend warm-reboot test to include the BGP sad pass
…net#893) * Do not crash in case data plane never stop on fast-reboot
* preboot LAG sad path automation for neigh_lag_down and dut_lag_down scenarios
* Fix testbed_mtu for tasks that invoke fib_test * Set socket buffer size to 16k
) - Improve the data test warm up code: Let the data plane IO stabilize for 30 seconds before testing. We observed ptf instability causing the test to fail. - Remove config_db.json when fast-rebooting into a new image. We want the new image to reload minigraph in this case. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
* [fdb_mac_expire.yml]: FDB MAC Expire test case. [fdb_mac_expire_test.py]: PTF helper to add Mac in L2 table. [fdb.yml]: include fdb_mac_expire.yml. This test case verifies that MAC expires within 10 mins if traffic is not flowing using it. Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com> * [fdb_mac_expire.yml]: FDB MAC Expire test case. [fdb_mac_expire_test.py]: PTF helper to add Mac in L2 table. [testcases.yml]: include fdb_mac_expire.yml. This test case verifies that MAC expires within 10 mins if traffic is not flowing using it. Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com> * [fdb_mac_expire.yml]: Incorporate swssconfig step to set fdb_aging_timer in fdb_mac_expire.yml Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com> * [fdb_mac_expire.yml]: minor changes in logs Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com> * [fdb_mac_expire.yml]: minor log changes to show time correctly. Example: "MAC Entires are Cleared within 100 secs." instead of "MAC Entires are Cleared within 2*50 secs." Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com> * [fdb_mac_expire.yml]: Address review comments related to sonic-clear, -it option and block-always. Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com> * [fdb_mac_expire.yml]: Change "sonic-clear fdb all" to "Clear FDB table". Signed-off-by: Praveen Chaudhary<pchaudhary@linkedin.com>
…ore rebooting (sonic-net#975) - fast-reboot script is an adapted version from 201811 branch. The change is around syncd stop: in 201803 branch, if it is Broadcom platform, request syncd to perform cold shutdown. - Mellanox 201803 branch has a vlan FDB issue causing all vlan IO to flood. Add a knob allow_vlan_flooding to ignore this symptom and continue with fast-reboot. Signed-off-by: Ying Xie <ying.xie@microsoft.com>
…analyzer.yml (sonic-net#963) The copy files task was after the fail tests. In case of failure, the copy task would never get a chance to run. This commit adjusted the task sequence. In case of failure, copy the files, then fail the test. The original copy task copies files with deep folder structure. This issue was also fixed in this commit. Signed-off-by: Xin Wang <xinw@mellanox.com>
…sonic-net#968) * fix grep ipv6 addr issue * Add Mellanox onyx fanout switch deploy yml and template * fix typo * remove debug code * revert the change to check_pfcwd_fanout.yml and deploy_pfcwd_fanout.yml * fix typo
Signed-off-by: Nazarii Hnydyn <nazariig@mellanox.com>
…sonic-net#997) By default, in case of failure the log analyzer generates a dump that collects all the available log files. This is unnecessary, and the dump file could be too big. This fix generates a dump that collects logs from the last hour by default. If more logs are needed, the parameter 'dump_since' can be used. Signed-off-by: Xin Wang <xinw@mellanox.com>
) * [warm/fast reboot] make sure that /etc/sonic/config_db.json exists after upgrade Signed-off-by: Ying Xie <ying.xie@microsoft.com> * [warm reboot] save config after warm reboot into new image When a new image is defined, the test removes /host/config_db.json before warm rebooting. So after the device boots up, it is missing /etc/sonic/config_db.json. This is not an issue for the device to stay up, but it will be an issue when the device reboots again (cold or fast). Signed-off-by: Ying Xie <ying.xie@microsoft.com> * review comments
…-net#895) mirror_session.py file is deprecated; use config mirror_session command instead Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
…#920) Due to the current issues with ip route changes under FRR, change all the ip route commands to vtysh commands. Remove the current testcase_6 since it overlaps with testcase_8. Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
Stabilize the test by adding a pause after the route change Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
…#932) Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
Signed-off-by: Shu0T1an ChenG <shuche@microsoft.com>
ronan-nexthop
added a commit
to nexthop-ai/sonic-mgmt
that referenced
this pull request
Feb 13, 2026
…d_services.py (sonic-net#1009) ### Description of PR Summary: The test looks for the presence of the telemetry image, and if found tries to start the container. A change upstream resulted in the telemetry container no longer being started by default, so the presence of the image does not imply the container is running. We could change this test to manually start telemetry, but I think a better solution is to move to gNMI, whose image is always present and should always be started. Fixes # (issue) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request I'm not 100% sure if this is needed in 05, but running this in 05 will not break the test. So either the test is failing and will now pass, or it's passing and will continue to pass - [x] 202505 - [x] 202511 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? Manually ran the test. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Co-authored-by: Ronan Mac Fhlannchadha <ronan@nexthop.io>
ronan-nexthop
added a commit
to nexthop-ai/sonic-mgmt
that referenced
this pull request
Feb 13, 2026
…d_services.py (sonic-net#1009) Summary: The test looks for the presence of the telemetry image, and if found tries to start the container. A change upstream resulted in the telemetry container no longer being started by default, so the presence of the image does not imply the container is running. We could change this test to manually start telemetry, but I think a better solution is to move to gNMI, whose image is always present and should always be started. Fixes # (issue) <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement I'm not 100% sure if this is needed in 05, but running this in 05 will not break the test. So either the test is failing and will now pass, or it's passing and will continue to pass - [x] 202505 - [x] 202511 Manually ran the test. <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? -->
fraserg-arista
pushed a commit
to fraserg-arista/sonic-mgmt
that referenced
this pull request
Feb 24, 2026
### Description of PR
The current calculation method for the RX PPS rate for the COPP tests is
not very accurate (in some cases 130-150% above nominal) due to the
reliance of sampling received packets at the ptf container on the
testbed server after sending the packet stream, while also using a
timeout to wait for said packets to finish arriving. This does not
capture the in-flight RX PPS rate, but rather takes an average outside
of the actual packet transmission window, and incurs additional
inaccuracies due to the wait time at the end.
A more accurate approach implemented here is to take two snapshots of
the RX packet count at the NN agent on the dut itself while the packet
stream is already running, and calculate the difference.
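The two-snapshot calculation described above amounts to a counter difference divided by the elapsed time. A minimal sketch, assuming the counter values and timestamps have already been read while the stream is running; the function and parameter names are hypothetical and do not come from the test code.

```python
def rx_pps(count_start, count_end, t_start, t_end):
    """Compute the in-flight RX packet rate from two counter snapshots.

    count_start / count_end: RX packet counts read while traffic is flowing.
    t_start / t_end: times (in seconds) at which each snapshot was taken.
    """
    elapsed = t_end - t_start
    if elapsed <= 0:
        raise ValueError("second snapshot must be taken after the first")
    return (count_end - count_start) / elapsed


# Example: 6000 packets received between snapshots taken 10 seconds apart.
rate = rx_pps(count_start=1000, count_end=7000, t_start=10.0, t_end=20.0)
```

Because both snapshots fall inside the transmission window, the trailing wait time does not dilute the average.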
Summary:
Fixes # (issue)
### Type of change
- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
- [ ] Skipped for non-supported platforms
- [x] Test case improvement
### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [x] 202412
- [x] 202511
### Approach
#### What is the motivation for this PR?
To fix neighbor_miss tests failing for TH5 duts on 202412, and enhance
the COPP tests overall for more accurate results.
#### How did you do it?
Updated the calculation method used to get the RX PPS rate for the COPP
tests.
#### How did you verify/test it?
Ran the copp tests and verified that the resulting RX PPS values were
within range.
---------
Signed-off-by: Christopher Croy <ccroy@arista.com>
[everflow]: Replace deprecated mirror_session.py file with CLI (#895)
[everflow]: Change the test command from ip route to vtysh (#920)
[everflow]: Remove deprecated tests (#923)
[everflow]: Remove unused variables (#931)
[everflow]: Add pause after route change (#942)
[EVERFLOW]: Add EVERFLOW policer test with DSCP value/mask (#932)
[everflow]: Fix the tearing down procedure order (#988)