Fix test pfcwd cli 202505 by vivekverma-arista · Pull Request #20246 · sonic-net/sonic-mgmt

vivekverma-arista · 2025-08-14T06:12:55Z

Description of PR

Summary:
Fixes #714, #18496

Type of change

Back port request

Approach

What is the motivation for this PR?

Recent fix: #17411

The test was flaky before that fix and remains so. When the test picks an egress interface that is a member of a multi-member LAG, only that member is stormed, and some of the traffic still egresses through the other LAG members. This leads to fewer drops than expected when PFCWD is triggered with the DROP action. The earlier fix shut down all but one LAG member and reduced min_links so that the LAG stays up. However, the matching config was missing on the cEOS side, so the LAG does not come up after the other LAG members are shut down.

This is being rectified in this change for cEOS neighbors.

How did you do it?

The proposed fix is to change the min_link setting for the involved port channel on the cEOS side as well.
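As a rough illustration of the idea (not the code in this PR), the sketch below builds the min-links setting for both ends of the port channel. It assumes the DUT side is expressed as a `min_links` field under the CONFIG_DB `PORTCHANNEL` table and the cEOS side uses the `port-channel min-links` interface command. The helper names and the port channel names are hypothetical.

```python
from typing import Dict, List


def sonic_min_links_fragment(portchannel: str, min_links: int) -> Dict[str, Dict[str, Dict[str, str]]]:
    """Return a CONFIG_DB-style fragment setting min_links on the DUT side.

    Assumption: the value lives in PORTCHANNEL|<name> as a string field.
    """
    return {"PORTCHANNEL": {portchannel: {"min_links": str(min_links)}}}


def eos_min_links_commands(portchannel: str, min_links: int) -> List[str]:
    """Return the CLI lines that would lower min-links on the cEOS neighbor.

    Assumption: the neighbor accepts the usual EOS interface-mode syntax.
    """
    return [
        "configure",
        f"interface {portchannel}",
        f"port-channel min-links {min_links}",
        "end",
    ]


# Hypothetical names: one LAG seen from both ends.
dut_fragment = sonic_min_links_fragment("PortChannel101", 1)
neighbor_cli = eos_min_links_commands("Port-Channel1", 1)
```

With both ends at a minimum of one link, the LAG can stay up while only the stormed member remains active.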

How did you verify/test it?

Stressed this test 10 times on dualtor-120 and t0-116 with Arista 7260CX3 platform.

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

…#19329) What is the motivation for this PR? Starting with Python 3.3, collections.Iterable was deprecated in favor of collections.abc.Iterable, though it remained temporarily supported for backward compatibility. As of Python 3.10, the old alias has been removed. Since we are upgrading Python from 3.8 to 3.12, where collections.Iterable is no longer supported, all such references need to use collections.abc.Iterable to ensure compatibility and prevent runtime errors. How did you do it? Update all such references to use collections.abc.Iterable. How did you verify/test it? First, make sure that this change does not affect the current tests, which the pipeline itself verifies. Then, make sure that this change works on the new Python version, which is tested locally.
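A minimal sketch of the import change described above. The `flatten` helper is hypothetical and only exists to exercise the `isinstance` check. The import line is the part that matters.

```python
# Works on Python 3.3 and later, including 3.10+ where
# `collections.Iterable` no longer exists.
from collections.abc import Iterable


def flatten(items):
    """Yield leaf values from nested iterables; strings and bytes count as leaves."""
    for item in items:
        if isinstance(item, Iterable) and not isinstance(item, (str, bytes)):
            yield from flatten(item)
        else:
            yield item


flat = list(flatten([1, [2, (3, 4)], "ab"]))
```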

…-net#19324) What is the motivation for this PR? Previously I xfailed the route perf test in PR test with issue sonic-net#18893 because it had about a 20% chance of failing and blocked PR test. Investigation showed that the failure occurs because the t1-lag KVM needs more time to install/withdraw routes. How did you do it? Add more wait time for KVM to install/withdraw routes in route perf test. How did you verify/test it? Tested with Elastictest for 20 runs and all passed.

…et#19139) What is the motivation for this PR? bgp/test_bgp_suppress_fib.py::test_suppress_fib_stress is failing. How did you do it? Wait for config to properly take effect in bgp/test_bgp_suppress_fib.py::test_credit_loop so subsequent tests can run on a clean state. How did you verify/test it? bgp/test_bgp_suppress_fib.py::test_suppress_fib_stress no longer has packet count mismatch failures.

Description of PR Summary: Skip BGP check in teardown if --skip-sanity is used Fixes sonic-net#18407 Type of change Bug fix Testbed and Framework(new/improvement) New Test case Skipped for non-supported platforms Test case improvement Back port request 202012 202205 202305 202311 202405 202411 msft-202405 msft-202412 Approach What is the motivation for this PR? The issue is described here: sonic-net#18407 How did you do it? Skip the BGP check in teardown if --skip-sanity was passed while running the test. co-authorized by: [email protected]

1. Need a post check after restarting pmon; otherwise pmon may not be fully started and the next case will fail. 2. Need to restore the DPU admin status to on if the check after shutting down the DPUs fails. Change-Id: I80538d3a66b9c5c7d590f51d7c6703f62e982fe4

…est_cacl_application for PR test (sonic-net#19351) Description of PR Summary: Original PR: sonic-net#18834 This PR updates the iptables and ip6tables rules to block incoming BGP (TCP port 179) traffic on the eth0 interface. This change ensures that BGP sessions are only allowed on non-management interfaces. Fixes: N/A Type of change Bug fix Testbed and Framework(new/improvement) New Test case Skipped for non-supported platforms Test case improvement Back port request 202205 202305 202311 202405 202411 202505 Approach What is the motivation for this PR? To support test updates in this PR: sonic-host-services#197. Additionally, it ensures BGP port 179 is not exposed on the management interface (eth0). How did you do it? How did you verify/test it? On t0-64 testbed

What is the motivation for this PR? There are many memory-above-threshold alarms in the nightly test.
How did you do it? Update the FRR memory thresholds and make the alarm more readable.
memory_increase_threshold: FRR has its own memory management system and does not return memory to the system immediately, so increase the threshold.
1: top:zebra: update from 64 to 128M
2: frr_bgp: update from 32 to 64M
3: frr_zebra: update from 16 to 64M
memory_high_threshold: FRR BGP memory usage is related to the number of neighbors, so increase the threshold. In the future, the threshold needs to be set according to the number of neighbors.
1: frr_bgp: update from 128 to 256M
How did you verify/test it? Run nightly test https://elastictest.org/scheduler/testplan/685ac58d2461750d1f5a11c9

… a WRED profile named 'AZURE_LOSSLESS'. (sonic-net#19246) What is the motivation for this PR? The test_ecn_config_update.py test fails on devices that do not have a WRED_PROFILE named AZURE_LOSSLESS. How did you do it? Instead of updating the WRED_PROFILE named AZURE_LOSSLESS, the test now updates all WRED profiles found in CONFIG DB and then verifies that these updates are applied to ASIC DB. Note: In order for this test to pass, changes on the GCU side are also needed. Here is the PR in sonic-utilities for GCU changes: sonic-net/sonic-utilities#3910 How did you verify/test it? Tested on a Mellanox switch with 3 WRED profiles, none of which were named AZURE_LOSSLESS. The old version of the test failed, while the new version passed. Signed-off-by: Mahdi Ramezani <[email protected]>

…19136) (sonic-net#19425) What is the motivation for this PR? Support Arista-7050CX3-32S-C28S16 in port_utils How did you do it? Update port_alias_to_name_map in port_utils.py How did you verify/test it? Verified by deploy C28S16 testbed.

What is the motivation for this PR? A few DUT console tests were failing on dualtor testbeds because the "sonic_lab_console_links.csv" file was not created. How did you do it? Added support to generate the "sonic_lab_console_links.csv" file from the testbed.yaml file. How did you verify/test it? Ran dut_console tests and verified that test_escape_character and test_idle_timeout are passing.

… bug sonic-net/sonic-buildimage#22370. (sonic-net#19311) What is the motivation for this PR? test_gcu_acl_scale_rules failed due to a timeout on this platform. This issue has an open bug sonic-net/sonic-buildimage#22370. How did you do it? Increase the timeout for running the command. How did you verify/test it? Reran the test.

…nder everflow/test_everflow_testbed on Arista-7260CX3 (sonic-net#19308) What is the motivation for this PR? Support for everflow over IPv6 encap cases was added by PR 16836. However, this does not appear to have SAI support on 7260CX3. When a mirror session for everflow over v6 becomes active on 7260CX3, the orchagent crashes. |E|SAI_STATUS_NOT_SUPPORTED is seen in sairedis.rec. This is being tracked in 627 and public issue 19096. The DUT will never forward the test traffic encapsulated over IPv6+GRE/ERSPAN to the collector since it's unsupported. How did you do it? How did you verify/test it? Tested with Arista-7260CX3-D108C8 DUT in a dt120 topology. Any platform specific information? Yes - Arista-7260CX3(TH2)

…emove packet count noise (sonic-net#19380) What is the motivation for this PR? We run test cases one by one; however, when counting packets in the next test case, some packets from the previous test case may be counted as well. How did you do it? To remove the noise, we use a different ICMP type for each traffic thread in a test case, so that the packet count is more accurate. How did you verify/test it? Ran the test on a 5640 testbed with 510 BGP sessions.

Description of PR Summary: Fixes #33668010 systemctl restart bgp.service fails on multi-asic devices. Enhance and add compatible logic for multi-asic devices. Type of change Bug fix Testbed and Framework(new/improvement) New Test case Skipped for non-supported platforms Test case improvement Back port request 202205 202305 202311 202405 202411 202505 Approach What is the motivation for this PR? systemctl restart bgp.service fails on multi-asic devices. How did you do it? Enhance and add compatible logic for multi-asic devices. How did you verify/test it? https://elastictest.org/scheduler/testplan/686a3db4c452a23450444da8?testcase=test_pretest.py%7C%7C%7Cvms-kvm-four-asic-t1-lag_219086&type=log signed-off-by: [email protected]

…or default route has not populated yet (sonic-net#19316) What is the motivation for this PR? This PR does the following: Uses netstat to ensure that there is an established TCP connection between the client and server which is a more reliable check instead of pid check. This new check will help ensure that client is connected and can receive all notifications. Ensures that we are giving enough time for default route to populate in APPL_DB after bgp sessions have been restored. There are some situations where we grab 10 updates, but default route has not been populated yet, so we miss the update. We will let the query run for longer and check less frequently to ensure that we see the default route entries after restoring bgp sessions. How did you do it? Code change How did you verify/test it? 202411 test
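The connection check described above could look roughly like the following. This is a sketch under assumptions: it parses text in the column layout printed by `netstat -tn` (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), and the function name, the port number, and the sample output are hypothetical rather than taken from the test.

```python
def has_established_connection(netstat_output: str, server_port: int) -> bool:
    """Return True if any TCP row shows an ESTABLISHED session on server_port."""
    suffix = f":{server_port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers and anything that is not a TCP row with a state column.
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        local_addr, remote_addr, state = fields[3], fields[4], fields[5]
        if state == "ESTABLISHED" and (
            local_addr.endswith(suffix) or remote_addr.endswith(suffix)
        ):
            return True
    return False


# Hypothetical sample resembling `netstat -tn` output.
SAMPLE = """Active Internet connections (w/o servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 127.0.0.1:50051         127.0.0.1:43210         ESTABLISHED
tcp        0      0 127.0.0.1:22            127.0.0.1:51000         TIME_WAIT
"""
connected = has_established_connection(SAMPLE, 50051)
```

Checking the socket state rather than a pid confirms that the client is actually attached and able to receive notifications, which is the property the PR description cares about.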

sonic-net#19431) Description of PR Summary: Fixes 33680685 Recovery with golden config in pretest always fails. The sanity check recovery requires the running golden config, but the running golden config is generated after the sanity check. Hence the recovery in pre-sanity in pretest always fails because the running golden config doesn't exist. Type of change Bug fix Testbed and Framework(new/improvement) New Test case Skipped for non-supported platforms Test case improvement Approach What is the motivation for this PR? Fix PR test instability. How did you do it? Fall back to config_db.json if the running golden config file does not exist. How did you verify/test it? Verified on physical testbeds co-authorized by: [email protected]
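The fallback can be pictured with a small helper like the one below. This is only a sketch: the function name and the golden config path are assumptions, and the real recovery code runs against the DUT rather than the local filesystem.

```python
import os
from typing import Sequence


def pick_recovery_config(candidates: Sequence[str]) -> str:
    """Return the first candidate config file that exists, in priority order."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"none of the candidate configs exist: {list(candidates)}")


# Hypothetical priority order: running golden config first, then config_db.json.
DEFAULT_CANDIDATES = (
    "/etc/sonic/running_golden_config.json",  # assumed path, may not exist yet
    "/etc/sonic/config_db.json",
)
```

Passing the candidates as an ordered sequence keeps the preference explicit and makes the fallback easy to extend.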

What is the motivation for this PR? Update the mgmtvrf test case for ntp by having it use Chrony How did you do it? Reuse existing code that is in the common ntp_helper module instead of copy-pasting code here. Signed-off-by: Saikrishna Arcot <[email protected]>

issue seen on 2700
https://dev.azure.com/mssonic/internal/_build/results?buildId=913225&view=logs&j=76acabad-01e9-5c52-6fe6-d396d63e85d2&t=55864d99-7fe9-5504-0078-bfbb010fc228&l=4109
2025-08-01T12:37:49.4730701Z 2025-08-01 12:37:43 : --------------------------------------------------
2025-08-01T12:37:49.4731666Z 2025-08-01 12:37:43 : Fails:
2025-08-01T12:37:49.4732373Z 2025-08-01 12:37:43 : --------------------------------------------------
2025-08-01T12:37:49.4733177Z 2025-08-01 12:37:43 : FAILED:dut:Traceback (most recent call last):
2025-08-01T12:37:49.4733944Z File "/root/ptftests/py3/advanced-reboot.py", line 1445, in runTest
2025-08-01T12:37:49.4734679Z self.handle_advanced_reboot_health_check()
2025-08-01T12:37:49.4735464Z File "/root/ptftests/py3/advanced-reboot.py", line 1167, in handle_advanced_reboot_health_check
2025-08-01T12:37:49.4736205Z self.examine_flow()
2025-08-01T12:37:49.4736865Z File "/root/ptftests/py3/advanced-reboot.py", line 2138, in examine_flow
2025-08-01T12:37:49.4737694Z self.disruption_stop = datetime.datetime.fromtimestamp(
2025-08-01T12:37:49.4738332Z ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2025-08-01T12:37:49.4739025Z TypeError: 'EDecimal' object cannot be interpreted as an integer
2025-08-01T12:37:49.4739366Z
2025-08-01T12:37:49.4740035Z 2025-08-01 12:37:43 : ==================================================
2025-08-01T12:37:49.4740823Z 2025-08-01 12:37:43 : Disabling arp_responder
2025-08-01T12:37:49.4741111Z

…c-net#19003)
* [crm] Fix test failures on SN4700 by batching neighbor creation
test_crm_nexthop_group was failing on sn4700 testbeds due to a known orchagent crash. The crash is caused by a few factors:
- Expected neighbor entry missing from kernel, causing tunnel-route to be programmed for unresolved nexthop.
- tunnel-route being removed when nexthop group was deleted
- route bulker ending early due to ITEM_NOT_FOUND (not processing remaining bulk removes)
- next-hop group delete failing due to OBJECT_IN_USE.
This PR addresses the first issue. During neighbor generation for the test, some of the neighbor updates were being missed from APPL_DB on sn4700 devices. Previously, the neighbors were programmed in the kernel in a single ip command; this fix splits up the neighbor programming into batches in order to prevent updates from being dropped.
Signed-off-by: Nikola Dancejic <[email protected]>
* fix pre-commit errors
* fix pre-commit
---------
Signed-off-by: Nikola Dancejic <[email protected]>

Description of PR Summary: Fixes # (issue) Type of change Bug fix Testbed and Framework(new/improvement) New Test case Skipped for non-supported platforms Test case improvement Approach What is the motivation for this PR? There is a bmp container per ASIC so we need to use the correct per-ASIC names if we are on a multi-ASIC DUT. How did you do it? I updated the restart command to use the correct API for restarting a per-ASIC service. How did you verify/test it? We ran the test locally on an Arista multi-ASIC DUT. signed-off-by: [email protected]

…ic-net#20116) What is the motivation for this PR? Increase the timeout since the test drives all 16 cores to 100% for over 20 seconds, which leads to bulk counter timeouts. How did you do it? Increase the timeout How did you verify/test it? Verified it in the internal tests. collected 1 item snmp/test_snmp_cpu.py::test_snmp_cpu[str4-sn5640-3] PASSED [100%]DEBUG:tests.conftest:[log_custom_msg] item: <Function test_snmp_cpu[str4-sn5640-3]> Any platform specific information? str4-sn5640-3 Supported testbed topology if it's a new test case? t1-isolated-d56u1-lag

Summary: Currently the 'autoneg' column in links.csv only supports 'on'. If it is 'off' or any other setting, it defaults to the platform.json behavior. This PR adds support for the 'off' setting. A DUT that wants the default behavior can leave that column empty or use any other value, e.g. 'none'. Note: There will be a behavior change if a user is already using off in links.csv for their DUT. Old behavior: the autoneg setting is derived from platform.json, which could be on, off, or not defined. New behavior: it will always be off. Users who are already using off need to update their links.csv to leave the autoneg field empty if they want to use the default settings in platform.json. What is the motivation for this PR? Update the autoneg setting to support 'off'. How did you do it? Check the autoneg value for both on and off in minigraph.
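The new column semantics can be summarized in a few lines. The sketch below is illustrative only: the function name is hypothetical, the platform.json default is passed in as a plain value, and the case-insensitive matching is an assumption rather than something the PR states.

```python
from typing import Optional


def resolve_autoneg(links_csv_value: Optional[str], platform_default: Optional[str]) -> Optional[str]:
    """Resolve the effective autoneg setting for one link.

    'on' and 'off' in links.csv are honored directly; an empty field or any
    other value (e.g. 'none') falls back to the platform.json default, which
    may itself be 'on', 'off', or undefined (None).
    """
    value = (links_csv_value or "").strip().lower()  # assumption: case-insensitive
    if value in ("on", "off"):
        return value
    return platform_default


# Before this change, 'off' would have fallen through to the platform default.
effective = resolve_autoneg("off", platform_default="on")
```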

…l. (sonic-net#20106) Description of PR Since the latest image mounts /tmp as tmpfs, it uses RAM instead of disk storage. To simulate a disk full scenario, use the /host directory instead. The /host path is backed by the host’s actual disk (/dev/sda1, ext4), not memory. Therefore, operations performed in /host consume real disk space, and commands like fallocate behave as expected. Type of change Bug fix Testbed and Framework(new/improvement) New Test case Skipped for non-supported platforms Test case improvement Approach What is the motivation for this PR? Since the latest image mounts /tmp as tmpfs, it uses RAM instead of disk storage. To simulate a disk full scenario, use the /host directory instead. How did you do it? Replace /tmp with /host How did you verify/test it? Test locally on testbed. signed-off-by: [email protected]

What is the motivation for this PR? Add autoneg config for sonic 202505 fanout. Set the autoneg according to the data in device_conn. How did you do it? Add autoneg config to sonic_deploy_202405.j2 How did you verify/test it? deploy fanout with 202405 image Any platform specific information? Sonic switch

What is the motivation for this PR? It's an improvement for testbed vms75-t0-7050cx3-1. For this SKU, the kubelet needs more time (around 70s) to join the minikube cluster. How did you do it? Increased the wait time for the node to join the cluster. How did you verify/test it? Ran this test on testbed vms75-t0-7050cx3-1 to see if it passes.

… configuration (sonic-net#19965) Currently the logs attached to the allure report do not include date and time information, which makes debugging difficult when the allure logs need to be aligned with other logs by timestamp. This change adds a customized format for the log messages that are attached to the allure report.
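A timestamped format of this kind can be sketched with the standard `logging` module. The format string, the constant names, and the helper below are assumptions for illustration. How the handler is wired into the allure attachment is not shown in the source and is left out here.

```python
import io
import logging

# Hypothetical format: date and time first so lines can be aligned with other logs.
ALLURE_LOG_FORMAT = "%(asctime)s %(levelname)s %(filename)s:%(lineno)d: %(message)s"
ALLURE_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def make_timestamped_handler(stream=None) -> logging.Handler:
    """Return a stream handler whose records carry date and time."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(ALLURE_LOG_FORMAT, datefmt=ALLURE_DATE_FORMAT))
    return handler


# Usage: capture formatted records in memory, as an attachment would.
buffer = io.StringIO()
demo_logger = logging.getLogger("allure_format_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(make_timestamped_handler(buffer))
demo_logger.info("pfcwd storm started")
captured = buffer.getvalue()
```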

Ignore any router advertisements sent by a DUT, and don't set a default route or an address based on it. This could happen if a T0 testbed with radv running sends router advertisements on a VLAN interface, which may result in the PTF container adding a default route on all of the VLAN interfaces. This could result in some IPv6 test cases breaking. Signed-off-by: Saikrishna Arcot <[email protected]>

…t#19869) (sonic-net#20198) Ignore error during config reload in BGP/QOS/FPC test cases Cherry-pick for sonic-net#19869 Why I did it BGP/QOS/FPC test cases failed because of the following error: E 2025 Jul 28 04:41:57.469604 str2-msn2700-spy-1 ERR iptables: tac_connect_single: connection to 10.64.246.145:49 failed: Network is unreachable These test cases call reload_config but do not set the ignore_loganalyzer parameter. Reloading config restarts the networking service, which makes the TACACS server unreachable during networking service shutdown. Work item tracking Microsoft ADO (number only): How I did it Set reload_config ignore_loganalyzer parameter in BGP/QOS/FPC test cases. How to verify it Pass all test cases. Tested branch (Please provide the tested image version) Description for the changelog Ignore error during config reload in BGP/QOS/FPC test cases co-authorized by: [email protected]

mssonicbld · 2025-08-14T06:13:03Z

/azp run

azure-pipelines · 2025-08-14T06:13:10Z

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

echuawu and others added 30 commits July 4, 2025 00:52

Update background traffic to make pfcwd timer accuracy test more stab…

a6e2e18

…le (sonic-net#16300) Update background traffic to make pfcwd timer accuracy test more stable Change-Id: I2d3146b4bd1a0601e4cfed3c5044381577504dcd

Add t1-isolated-d32 to tor_default_route (sonic-net#18969)

1e9bd27

What is the motivation for this PR? Configure t1-isolated-d32 default routes through its TORs

[sanity] Fix sanity check bgp not ready after restart (sonic-net#19296)

cbf4584

Signed-off-by: Longxiang Lyu <[email protected]>

[dualtor][cisco] Fix test_encap_with_mirror_session (sonic-net#19337)

132d3e0

Signed-off-by: Longxiang Lyu <[email protected]>

Batch neighbor add/del commands in test_crm_neighbor (sonic-net#18625)

4613b98

On certain platforms test_crm_neighbor can create commands that go beyond our character limit, resulting in errors like: [Errno 7] Argument list too long: '/bin/sh
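One way to keep each shell invocation under the argument length limit is to group the per-neighbor commands into bounded batches. The sketch below shows the general technique only. The function name, the separator, and the character budget are assumptions and do not reflect the actual values used in test_crm_neighbor.

```python
from typing import Iterable, Iterator, List


def batched_commands(commands: Iterable[str], max_chars: int = 32000, sep: str = " && ") -> Iterator[str]:
    """Join commands into shell strings that each stay within max_chars.

    A single command longer than max_chars is still emitted on its own.
    """
    batch: List[str] = []
    size = 0
    for cmd in commands:
        extra = len(cmd) + (len(sep) if batch else 0)
        if batch and size + extra > max_chars:
            # Flush the current batch before it would exceed the budget.
            yield sep.join(batch)
            batch, size = [], 0
            extra = len(cmd)
        batch.append(cmd)
        size += extra
    if batch:
        yield sep.join(batch)


# Hypothetical usage: many neighbor additions split across several invocations.
neighbor_cmds = [
    f"ip neigh replace 10.0.0.{i} lladdr 00:11:22:33:44:55 dev Ethernet0"
    for i in range(1, 201)
]
shell_batches = list(batched_commands(neighbor_cmds, max_chars=2000))
```

Each element of `shell_batches` can then be run as a separate command, so no single invocation grows with the total number of neighbors.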

Fix tests/common/snappi_tests/traffic_generation.py (sonic-net#18934)

3b88a6a

reboot call is still using the outdated argument `plt_reboot_ctrl_overwrite` instead of `return_after_reconnect` that was introduced in sonic-net#16031

Ignore the expected bfd errors (sonic-net#18144)

44ffa09

Update vlan ping test to override the affection of secondary vlan ip (s…

7d32f38

…onic-net#18867) address Update vlan ping test to override the affection of secondary vlan ip address Related community PR sonic-net#18399

Update script to make script test_mux_port_iptables_entries pass (son…

ffab540

…ic-net#18798) Add Mellanox-SN4700-V64 into mellanox_dualtor_hwskus Update key sonic_hwsku for parameter host_vars

Remove Ethernet512, Ethernet513 mapping for Arista O128 skus (sonic-n…

86f5e62

…et#19094) Approach What is the motivation for this PR? Remove Ethernet512, Ethernet513 mapping for 7060X 128 port skus as they are not needed

[T1] Add topo t1-isolated-d510u2 in veos (sonic-net#19160) (sonic-net…

5731052

…#19426) What is the motivation for this PR? Add topo t1-isolated-d510u2 in veos How did you do it? Add topo t1-isolated-d510u2 in veos How did you verify/test it? Verified by deploy topo.

Enhance msft srv6 test cases (sonic-net#18866)

5a3152a

1. Add more timeout for ptf to handle a large scale of bgp packets after config reload/bgp restart/reboot 2. Add BGP route sync check

Fix github issue sonic-net#16529 (sonic-net#18117)

3093f5f

[dualtor] Skip warm/fast reboot cases on dualtor (sonic-net#19443)

88fbfca

Signed-off-by: Longxiang Lyu <[email protected]>

saiarcot895 and others added 21 commits August 8, 2025 10:39

Fix fib/test_fib.py::test_ecmp_group_member_flap to not filter port…

9181598

…s on `t2_single_node` (sonic-net#18420)

[platform] Fix the test_status_led error in case if the chassis led i…

b56aa9d

…s not controllable (sonic-net#16936) Signed-off-by: vhlushko <[email protected]>

Check for comments in sai.profile lines in the test_link_local_ip.py (s…

53c2e2b

…onic-net#19989) * Check the sai.profile lines for comments. * Add to check for space before the # sign.

fix template (sonic-net#20195)

6673a1e

cherry-pick sonic-net#20005 Approach What is the motivation for this PR? Add the missing template for t1-isolated topo. How did you do it? Add the missing template for t1-isolated topo. signed-off-by: [email protected]

Add xfail for generic hash case test_lag_hash due to github issue 225…

e37e33b

…86 (sonic-net#19909) Add xfail for generic hash case test_lag_hash due to github issue sonic-net/sonic-buildimage#22586

added a retry of sending traffic in ptf test_fib in case packet wasn'…

17ab5c3

…t correctly sent to the DUT (sonic-net#18012) In case of a weak nic, when a packet is not received, resend the packet Related PR: sonic-net#14139

Add function to set counter poll interval (sonic-net#20062)

bf88f63

Add function to set counter poll interval

Enhance the qos_remap test case with flushing the buffer (sonic-net#2…

dc462c3

…0212) Signed-off-by: Kevin Wang <[email protected]>

Fix flakiness in pfcwd/test_pfcwd_cli.py

b97aa13

vivekverma-arista requested review from a team, StormLiangMS, bingwang-ms, prgeor, wangxin and yxieca as code owners August 14, 2025 06:12

vivekverma-arista closed this Aug 14, 2025

vivekverma-arista wants to merge 471 commits into sonic-net:master from vivekverma-arista:fix-test-pfcwd-cli-202505


20 participants
