
Fix test pfcwd cli 202505#20246

Closed
vivekverma-arista wants to merge 471 commits intosonic-net:masterfrom
vivekverma-arista:fix-test-pfcwd-cli-202505

Conversation

@vivekverma-arista
Contributor

Description of PR

Summary:
Fixes #714, #18496

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

Recent fix: #17411

The test was flaky before this fix and remains so. When the test picks an egress interface that is a member of a multi-member LAG, only that member is stormed, and some of the traffic still egresses through the other LAG members. This leads to fewer drops than expected when PFCWD is triggered with the DROP action. The earlier fix shut down all but one LAG member while reducing min_links. However, the matching config was missing on the cEOS side, so the LAG does not come up after the other LAG members are shut down.

This is being rectified in this change for cEOS neighbors.

How did you do it?

The proposed fix is to change the min_link setting for the involved port channel on the cEOS side as well.
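
The dependency on the minimum-links setting can be illustrated with a toy model. This is only a sketch: it assumes the usual rule that a port channel stays up while its number of active members is at least the configured minimum. The function name and the member counts are illustrative and are not taken from the test code.

```python
def lag_is_up(active_members: int, min_links: int) -> bool:
    """Toy model: a port channel is up only if enough members are active."""
    return active_members >= min_links


# Hypothetical 2-member LAG after the test shuts down one member.
# The min_links values below are illustrative assumptions.
neighbor_unchanged = lag_is_up(active_members=1, min_links=2)
neighbor_lowered = lag_is_up(active_members=1, min_links=1)
print(neighbor_unchanged, neighbor_lowered)
```

In this model, lowering the minimum on only one side is not enough, which matches the motivation for applying the same setting on the cEOS neighbor.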

How did you verify/test it?

Stressed this test 10 times on dualtor-120 and t0-116 with Arista 7260CX3 platform.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

echuawu and others added 30 commits July 4, 2025 00:52
…le (sonic-net#16300)

Update background traffic to make pfcwd timer accuracy test more stable

Change-Id: I2d3146b4bd1a0601e4cfed3c5044381577504dcd
What is the motivation for this PR?
Configure t1-isolated-d32 default routes through its TORs
…#19329)

What is the motivation for this PR?
Starting with Python 3.3, collections.Iterable was deprecated in favor of collections.abc.Iterable, though it remained temporarily supported for backward compatibility. However, as of Python 3.10, the old reference has been officially removed. Doc

Since we are upgrading Python from 3.8 to 3.12—where collections.Iterable is no longer supported—we will update all such references to use collections.abc.Iterable to ensure compatibility and prevent runtime errors.

How did you do it?
We will update all such references to use collections.abc.Iterable to ensure compatibility and prevent runtime errors.
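
A minimal sketch of the updated reference is shown here. The helper name `is_iterable` is hypothetical and only demonstrates the import path that keeps working on newer Python versions.

```python
from collections.abc import Iterable


def is_iterable(obj) -> bool:
    """Return True for any object that supports iteration."""
    # collections.abc.Iterable replaces the removed collections.Iterable alias.
    return isinstance(obj, Iterable)


print(is_iterable([1, 2, 3]), is_iterable(5))
```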

How did you verify/test it?
First, we need to make sure that this change does not affect the current tests, which the pipeline itself verifies. Then we need to make sure that the change works with the new Python version, which is tested locally.
…-net#19324)

What is the motivation for this PR?
Previously I xfailed the route perf test with issue sonic-net#18893 in PR test because it had about a 20% chance of failing and blocked PR test. After investigation, the failure reason is that the t1-lag KVM needs more time to install/withdraw routes.

How did you do it?
Add more wait time for KVM to install/withdraw routes in route perf test.

How did you verify/test it?
Tested with Elastictest for 20 runs and all passed.
On certain platforms test_crm_neighbor can create commands that go beyond our character limit, resulting in errors like:
[Errno 7] Argument list too long: '/bin/sh
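
One common way to stay under such a limit is to split the arguments into groups whose joined length is bounded. The sketch below is an assumption about the general technique, not necessarily what this commit does; `split_by_length` and the limit value are hypothetical.

```python
def split_by_length(items, max_chars):
    """Group items so each space-joined batch stays within max_chars.

    An item longer than max_chars is placed in a batch of its own.
    """
    batches, current, length = [], [], 0
    for item in items:
        extra = len(item) + (1 if current else 0)  # +1 for the joining space
        if current and length + extra > max_chars:
            batches.append(current)
            current, length = [], 0
            extra = len(item)
        current.append(item)
        length += extra
    if current:
        batches.append(current)
    return batches


# Each batch can then be passed to a separate shell invocation.
for batch in split_by_length(["aaaa", "bbbb", "cccc"], max_chars=9):
    print(" ".join(batch))
```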
…et#19139)

What is the motivation for this PR?
bgp/test_bgp_suppress_fib.py::test_suppress_fib_stress failing

How did you do it?
Wait for the config to properly take effect in bgp/test_bgp_suppress_fib.py::test_credit_loop so subsequent tests can run on a clean state.

How did you verify/test it?
bgp/test_bgp_suppress_fib.py::test_suppress_fib_stress no longer has packet count mismatch failures.
reboot call is still using the outdated argument `plt_reboot_ctrl_overwrite`
instead of `return_after_reconnect` that was introduced in sonic-net#16031
Description of PR
Summary: Skip BGP check in teardown if --skip-sanity is used
Fixes sonic-net#18407

Type of change
 Bug fix
 Testbed and Framework(new/improvement)
 New Test case
 Skipped for non-supported platforms
 Test case improvement
Back port request
 202012
 202205
 202305
 202311
 202405
 202411
 msft-202405
 msft-202412
Approach
What is the motivation for this PR?
The issue is described here: sonic-net#18407

How did you do it?
Skip the BGP check in teardown if --skip-sanity was passed while running the test.

co-authorized by: [email protected]
…onic-net#18867)

address

Update vlan ping test to override the effect of the secondary vlan ip
address
Related community PR sonic-net#18399
1. Need a post check after restarting pmon; otherwise pmon may not be fully started, which fails the next case.
2. Need to restore the DPU admin on status if the check after shutdown DPUs fails.

Change-Id: I80538d3a66b9c5c7d590f51d7c6703f62e982fe4
…ic-net#18798)

Add Mellanox-SN4700-V64 into mellanox_dualtor_hwskus
Update key sonic_hwsku for parameter host_vars
…est_cacl_application for PR test (sonic-net#19351)

Description of PR
Summary: Original PR: sonic-net#18834
This PR updates the iptables and ip6tables rules to block incoming BGP (TCP port 179) traffic on the eth0 interface. This change ensures that BGP sessions are only allowed on non-management interfaces.
Fixes: N/A

Type of change
 Bug fix
 Testbed and Framework(new/improvement)
 New Test case
 Skipped for non-supported platforms
 Test case improvement
Back port request
 202205
 202305
 202311
 202405
 202411
 202505
Approach
What is the motivation for this PR?
To support test updates in this PR: sonic-host-services#197.
Additionally, it ensures BGP port 179 is not exposed on the management interface (eth0).

How did you do it?
How did you verify/test it?
On t0-64 testbed
What is the motivation for this PR?
There are too many memory-above-threshold alarms in the nightly test

How did you do it?
Update the FRR memory threshold and make the alarm more readable

memory_increase_threshold: FRR has its own memory management system and does not return memory to the system immediately, so increase the threshold.
1: top:zebra: update from 64 to 128M
2: frr_bgp: update from 32 to 64M
3: frr_zebra: update from 16 to 64M

memory_high_threshold: FRR bgp memory usage is related to the number of neighbors, so increase the threshold. We need to set the threshold according to the number of neighbors in the future.
1: frr_bgp: update from 128 to 256M

How did you verify/test it?
Run nightly test
https://elastictest.org/scheduler/testplan/685ac58d2461750d1f5a11c9
…et#19094)

Approach
What is the motivation for this PR?
Remove Ethernet512, Ethernet513 mapping for 7060X 128 port skus as they are not needed
… a WRED profile named 'AZURE_LOSSLESS'. (sonic-net#19246)

What is the motivation for this PR?
The test_ecn_config_update.py test fails on devices that do not have a WRED_PROFILE named AZURE_LOSSLESS.

How did you do it?
Instead of updating the WRED_PROFILE named AZURE_LOSSLESS, the test now updates all WRED profiles found in CONFIG DB and then verifies that these updates are applied to ASIC DB.
Note: In order for this test to pass, changes on the GCU side are also needed. Here is the PR in sonic-utilities for GCU changes: sonic-net/sonic-utilities#3910
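
The idea of updating every WRED profile rather than one fixed name can be sketched as follows. This is a hedged illustration, not the test's actual code: `build_wred_patch` is a hypothetical helper, the profile names are made up, and `green_min_threshold` with its value is only an assumed example field. The output follows the JSON Patch operation shape.

```python
def build_wred_patch(config_db: dict, field: str, value: str) -> list:
    """Build one JSON Patch 'replace' operation per WRED profile found."""
    return [
        {"op": "replace", "path": f"/WRED_PROFILE/{name}/{field}", "value": value}
        for name in sorted(config_db.get("WRED_PROFILE", {}))
    ]


# Hypothetical CONFIG DB excerpt with no profile named AZURE_LOSSLESS.
sample_db = {"WRED_PROFILE": {"PROFILE_A": {}, "PROFILE_B": {}}}
for operation in build_wred_patch(sample_db, "green_min_threshold", "2000000"):
    print(operation["path"])
```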

How did you verify/test it?
Tested on a Mellanox switch with 3 WRED profiles, none of which were named AZURE_LOSSLESS. The old version of the test failed, while the new version passed.

Signed-off-by: Mahdi Ramezani <[email protected]>
…#19426)

What is the motivation for this PR?
Add topo t1-isolated-d510u2 in veos

How did you do it?
Add topo t1-isolated-d510u2 in veos

How did you verify/test it?
Verified by deploy topo.
…19136) (sonic-net#19425)

What is the motivation for this PR?
Support Arista-7050CX3-32S-C28S16 in port_utils

How did you do it?
Update port_alias_to_name_map in port_utils.py

How did you verify/test it?
Verified by deploy C28S16 testbed.
1.Add more timeout for ptf to handle a large scale of bgp packets after
config reload/bgp restart/reboot
2.Add BGP route sync check
What is the motivation for this PR?
Few dut console tests were failing on Dualtor testbeds, because "sonic_lab_console_links.csv" file was not created.

How did you do it?
Added support to generate "sonic_lab_console_links.csv" file from the testbed.yaml file.

How did you verify/test it?
Ran dut_console tests and verified that test_escape_character and test_idle_timeout are passing.
… bug sonic-net/sonic-buildimage#22370. (sonic-net#19311)

What is the motivation for this PR?
test_gcu_acl_scale_rules failed due to a timeout on this platform.
This issue has an open bug sonic-net/sonic-buildimage#22370.

How did you do it?
Increase the timeout for running the command.

How did you verify/test it?
rerun the test.
…nder everflow/test_everflow_testbed on Arista-7260CX3 (sonic-net#19308)

What is the motivation for this PR?
Support for everflow over ipv6 encap cases was added by PR 16836
However, this does not appear to have SAI support on 7260CX3
When a mirror session for everflow over v6 becomes active on 7260CX3, the orchagent crashes
|E|SAI_STATUS_NOT_SUPPORTED is seen in sairedis.rec
This is being tracked in 627 and public issue 19096
The DUT will never forward the test traffic encapsulated over IPv6+GRE/ERSPAN to the collector since it's unsupported
How did you do it?
How did you verify/test it?
Tested with Arista-7260CX3-D108C8 DUT in a dt120 topology

Any platform specific information?
Yes - Arista-7260CX3(TH2)
…emove packet count noise (sonic-net#19380)

What is the motivation for this PR?
We run test cases one by one; however, when counting packets in the next test case, some packets from the previous test case may be counted.

How did you do it?
To remove the noise, we use a different icmp type for each traffic thread in the test case, so that the packet count is more accurate.

How did you verify/test it?
Run test on 5640 testbed with 510 bgp session
Description of PR
Summary:
Fixes #33668010
systemctl restart bgp.service fails on multi-asic devices.
Enhance and add compatible logic for multi-asic devices.

Type of change
 Bug fix
 Testbed and Framework(new/improvement)
 New Test case
 Skipped for non-supported platforms
 Test case improvement
Back port request
 202205
 202305
 202311
 202405
 202411
 202505
Approach
What is the motivation for this PR?
systemctl restart bgp.service fails on multi-asic devices.

How did you do it?
Enhance and add compatible logic for multi-asic devices.

How did you verify/test it?
https://elastictest.org/scheduler/testplan/686a3db4c452a23450444da8?testcase=test_pretest.py%7C%7C%7Cvms-kvm-four-asic-t1-lag_219086&type=log

signed-off-by: [email protected]
…or default route has not populated yet (sonic-net#19316)

What is the motivation for this PR?
This PR does the following:

Uses netstat to ensure that there is an established TCP connection between the client and server which is a more reliable check instead of pid check. This new check will help ensure that client is connected and can receive all notifications.

Ensures that we are giving enough time for default route to populate in APPL_DB after bgp sessions have been restored. There are some situations where we grab 10 updates, but default route has not been populated yet, so we miss the update. We will let the query run for longer and check less frequently to ensure that we see the default route entries after restoring bgp sessions.
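
The connection check described above can be sketched as a parser over netstat-style text. This is an illustration under assumptions: the column layout follows the common `netstat -tn` format, and the helper name, addresses, and port number are hypothetical rather than taken from the test.

```python
def has_established_connection(netstat_output: str, port: int) -> bool:
    """Return True if any TCP line involving the port is ESTABLISHED."""
    port_suffix = f":{port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed columns: proto, recv-q, send-q, local, foreign, state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local, foreign, state = fields[3], fields[4], fields[5]
        if state == "ESTABLISHED" and (
            local.endswith(port_suffix) or foreign.endswith(port_suffix)
        ):
            return True
    return False


sample = (
    "tcp        0      0 0.0.0.0:50051      0.0.0.0:*          LISTEN\n"
    "tcp        0      0 10.0.0.1:50051     10.0.0.2:43210     ESTABLISHED\n"
)
print(has_established_connection(sample, 50051))
```

Unlike a pid check, this only succeeds once the TCP session itself exists, which is the property the commit message relies on.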

How did you do it?
Code change

How did you verify/test it?
202411 test
sonic-net#19431)

Description of PR
Summary:
Fixes 33680685
Recover with golden config in pretest always fails.

The sanity check recovery requires the running golden config,
but the running golden config is generated after the sanity check.
Hence the recovery in pre-sanity in pretest always fails because the running golden config doesn't exist.

Type of change
 Bug fix
 Testbed and Framework(new/improvement)
 New Test case
 Skipped for non-supported platforms
 Test case improvement

Approach
What is the motivation for this PR?
Fix PR test instability.

How did you do it?
Fall back to config_db.json if the running golden config file does not exist.
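
A minimal sketch of this fallback is given here, assuming the decision is a simple file-existence check. The function name and both paths are hypothetical placeholders, not the paths used by the framework.

```python
import os


def pick_recovery_config(golden_path: str, fallback_path: str) -> str:
    """Return the golden config path if that file exists, else the fallback."""
    return golden_path if os.path.isfile(golden_path) else fallback_path


# Placeholder paths for illustration only.
chosen = pick_recovery_config("/nonexistent/running_golden_config.json",
                              "/etc/sonic/config_db.json")
print(chosen)
```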

How did you verify/test it?
Verified on physical testbeds

co-authorized by: [email protected]
saiarcot895 and others added 21 commits August 8, 2025 10:39
What is the motivation for this PR?
Update the mgmtvrf test case for ntp by having it use Chrony

How did you do it?
Reuse existing code that is in the common ntp_helper module instead of copy-pasting code here.

Signed-off-by: Saikrishna Arcot <[email protected]>
issue seen on 2700 https://dev.azure.com/mssonic/internal/_build/results?buildId=913225&view=logs&j=76acabad-01e9-5c52-6fe6-d396d63e85d2&t=55864d99-7fe9-5504-0078-bfbb010fc228&l=4109

2025-08-01T12:37:49.4730701Z 2025-08-01 12:37:43 : --------------------------------------------------
2025-08-01T12:37:49.4731666Z 2025-08-01 12:37:43 : Fails:
2025-08-01T12:37:49.4732373Z 2025-08-01 12:37:43 : --------------------------------------------------
2025-08-01T12:37:49.4733177Z 2025-08-01 12:37:43 : FAILED:dut:Traceback (most recent call last):
2025-08-01T12:37:49.4733944Z   File "/root/ptftests/py3/advanced-reboot.py", line 1445, in runTest
2025-08-01T12:37:49.4734679Z     self.handle_advanced_reboot_health_check()
2025-08-01T12:37:49.4735464Z   File "/root/ptftests/py3/advanced-reboot.py", line 1167, in handle_advanced_reboot_health_check
2025-08-01T12:37:49.4736205Z     self.examine_flow()
2025-08-01T12:37:49.4736865Z   File "/root/ptftests/py3/advanced-reboot.py", line 2138, in examine_flow
2025-08-01T12:37:49.4737694Z     self.disruption_stop = datetime.datetime.fromtimestamp(
2025-08-01T12:37:49.4738332Z                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-08-01T12:37:49.4739025Z TypeError: 'EDecimal' object cannot be interpreted as an integer
2025-08-01T12:37:49.4739366Z
2025-08-01T12:37:49.4740035Z 2025-08-01 12:37:43 : ==================================================
2025-08-01T12:37:49.4740823Z 2025-08-01 12:37:43 : Disabling arp_responder
2025-08-01T12:37:49.4741111Z
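
The traceback shows `datetime.datetime.fromtimestamp` rejecting a non-float timestamp object. One plausible way to avoid that, which may differ from the actual change in this commit, is to convert the value to `float` first. The sketch assumes the offending object behaves like `decimal.Decimal`, which is used here as a stand-in; `to_datetime` is a hypothetical helper.

```python
import datetime
from decimal import Decimal


def to_datetime(timestamp) -> datetime.datetime:
    """Convert an int, float, or Decimal-like timestamp to a datetime."""
    # float() normalizes Decimal subclasses before fromtimestamp() sees them.
    return datetime.datetime.fromtimestamp(float(timestamp))


print(to_datetime(Decimal("1.5")) == datetime.datetime.fromtimestamp(1.5))
```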
…c-net#19003)

* [crm] Fix test failures on SN4700 by batching neighbor creation

test_crm_nexthop_group was failing on sn4700 testbeds due to a known
orchagent crash. The crash is caused by a few factors:

- Expected neighbor entry missing from kernel, causing tunnel-route to be programmed
  for unresolved nexthop.
- tunnel-route being removed when nexthop group was deleted
- route bulker ending early due to ITEM_NOT_FOUND (not processing
  remaining bulk removes)
- next-hop group delete failing due to OBJECT_IN_USE.

This PR addresses the first issue.

During neighbor generation for the test, some of the neighbor updates
were being missed from APPL_DB on sn4700 devices. Previously, the
neighbors were programmed in the kernel in a single ip command, this fix splits
up the neighbor programming into batches in order to prevent updates
from being dropped.
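
The batching idea can be sketched as below. This is an illustration only: `batched`, the batch size, and the addresses, MAC, and interface in the sample commands are hypothetical, and the real test may build and run its commands differently.

```python
def batched(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder neighbor commands; each batch would be sent as one command.
commands = [
    f"ip neigh replace 10.0.0.{i} lladdr 00:11:22:33:44:55 dev Ethernet0"
    for i in range(1, 8)
]
for batch in batched(commands, 3):
    print(len(batch))
```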

Signed-off-by: Nikola Dancejic <[email protected]>

* fix pre-commit errors

* fix pre-commit

---------

Signed-off-by: Nikola Dancejic <[email protected]>
…onic-net#19989)

* Check the sai.profile lines for comments.

* Add to check for space before the # sign.
Description of PR
Summary:
Fixes # (issue)

Type of change
 Bug fix
 Testbed and Framework(new/improvement)
 New Test case
 Skipped for non-supported platforms
 Test case improvement

Approach
What is the motivation for this PR?
There is a bmp container per ASIC so we need to use the correct per-ASIC names if we are on a multi-ASIC DUT.

How did you do it?
I updated the restart command to use the correct API for restarting a per-ASIC service.

How did you verify/test it?
We ran the test locally on an Arista multi-ASIC DUT.

signed-off-by: [email protected]
…ic-net#20116)

What is the motivation for this PR?
Increase the timeout since the test drives all 16 cores to 100% for over 20 seconds, which leads to bulk counter timeouts.

How did you do it?
Increase the timeout

How did you verify/test it?
Verified it in the internal tests.

collected 1 item

snmp/test_snmp_cpu.py::test_snmp_cpu[str4-sn5640-3] PASSED               [100%]
DEBUG:tests.conftest:[log_custom_msg] item: <Function test_snmp_cpu[str4-sn5640-3]>
Any platform specific information?
str4-sn5640-3

Supported testbed topology if it's a new test case?
t1-isolated-d56u1-lag
Summary:
Currently the 'autoneg' column in links.csv only supports 'on'. If it is 'off' or any other value, the setting defaults to the platform.json behavior.
This PR adds support for the 'off' setting.
A DUT that wants the default behavior can leave that column empty or use any other value, e.g. 'none'.

Note: There will be a behavior change if a user is already using off in links.csv for their DUT.
old behavior: the autoneg setting is derived from platform.json, which could be on, off, or not defined.
new behavior: autoneg is always off.
Users who already use off need to update their links.csv and leave the autoneg field empty if they want the default settings in platform.json.

What is the motivation for this PR?
update autoneg setting to support 'off'

How did you do it?
Check autoneg value for both on and off in minigraph
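
The mapping described in the summary can be sketched as a small function. This is an assumption-laden illustration: `resolve_autoneg` is a hypothetical name, and treating the value case-insensitively and trimming whitespace are choices made here, not something the PR states.

```python
def resolve_autoneg(csv_value):
    """Map a links.csv autoneg cell to 'on', 'off', or None.

    None means "defer to the platform.json default".
    """
    normalized = (csv_value or "").strip().lower()
    if normalized in ("on", "off"):
        return normalized
    return None


for cell in ("on", "off", "", "none"):
    print(repr(cell), resolve_autoneg(cell))
```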
…l. (sonic-net#20106)

Description of PR
Since the latest image mounts /tmp as tmpfs, it uses RAM instead of disk storage. To simulate a disk full scenario, use the /host directory instead.

The /host path is backed by the host’s actual disk (/dev/sda1, ext4), not memory. Therefore, operations performed in /host consume real disk space, and commands like fallocate behave as expected.

Type of change
 Bug fix
 Testbed and Framework(new/improvement)
 New Test case
 Skipped for non-supported platforms
 Test case improvement

Approach
What is the motivation for this PR?
Since the latest image mounts /tmp as tmpfs, it uses RAM instead of disk storage. To simulate a disk full scenario, use the /host directory instead.

How did you do it?
Replace /tmp with /host

How did you verify/test it?
Test locally on testbed.

signed-off-by: [email protected]
What is the motivation for this PR?
Add autoneg config for sonic 202505 fanout. Set the autoneg according to the data in device_conn.

How did you do it?
Add autoneg config to sonic_deploy_202405.j2

How did you verify/test it?
deploy fanout with 202405 image

Any platform specific information?
Sonic switch
cherry-pick sonic-net#20005

Approach
What is the motivation for this PR?
Add the missing template for t1-isolated topo.

How did you do it?
Add the missing template for t1-isolated topo.

signed-off-by: [email protected]
What is the motivation for this PR?
It's an improvement for testbed vms75-t0-7050cx3-1. For this SKU, the kubelet needs more time(around 70s) to join the minikube cluster.

How did you do it?
Increased the wait time for joining node to cluster.

How did you verify/test it?
Run this test in testbed vms75-t0-7050cx3-1 to see if it passes.
…t correctly sent to the DUT (sonic-net#18012)

In case of a weak nic, when a packet is not received, resend the packet
Related PR: sonic-net#14139
… configuration (sonic-net#19965)

Currently we have logging logs attached to allure report but these logs do not
have date and time information and it makes debugging difficult when we
need to align the date and time logs from allure with other logs.

This change is to add customized format for the log messages that will
be attached to allure report
Add function to set counter poll interval
Ignore any router advertisements sent by a DUT, and don't set a default
route or an address based on it. This could happen if a T0 testbed with
radv running sends router advertisements on a VLAN interface, which may
result in the PTF container adding a default route on all of the VLAN
interfaces. This could result in some IPv6 test cases breaking.

Signed-off-by: Saikrishna Arcot <[email protected]>
…t#19869) (sonic-net#20198)

Ignore error during config reload in BGP/QOS/FPC test cases
Cherry-pick for sonic-net#19869

Why I did it
BGP/QOS/FPC test cases failed because of the following error:
E 2025 Jul 28 04:41:57.469604 str2-msn2700-spy-1 ERR iptables: tac_connect_single: connection to 10.64.246.145:49 failed: Network is unreachable

These test cases call reload_config but do not set the ignore_loganalyzer parameter.
Reloading the config restarts the networking service, which makes the TACACS server unreachable while the networking service shuts down.

Work item tracking
Microsoft ADO (number only):
How I did it
Set reload_config ignore_loganalyzer parameter in BGP/QOS/FPC test cases.

How to verify it
Pass all test cases.

Tested branch (Please provide the tested image version)


Description for the changelog
Ignore error during config reload in BGP/QOS/FPC test cases

co-authorized by: [email protected]
@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

