[fanout]: remove vrf management in arista fanout deploy templates#562
Merged
sihuihan88 merged 1 commit intosonic-net:masterfrom Apr 4, 2018
Merged
[fanout]: remove vrf management in arista fanout deploy templates#562sihuihan88 merged 1 commit intosonic-net:masterfrom
sihuihan88 merged 1 commit intosonic-net:masterfrom
Conversation
Signed-off-by: Sihui Han <[email protected]>
prsunny
reviewed
Apr 4, 2018
| ! | ||
| interface Management 1 | ||
| description TO LAB MGMT SWITCH | ||
| vrf forwarding management |
Contributor
There was a problem hiding this comment.
Guess this is required for management traffic isolation. Why do we've to remove this?
Contributor
|
With this patch we have to check all our management routing and dataplane routing carefully. |
Contributor
Author
|
The fanout switches are basically layer-2 switches and not configured bgp or ip. we should be fine. We tested with the new template and all nightly tests have passed. |
Contributor
|
Sihui, You're right. Sorry I didn't read your PR carefully. |
yxieca
approved these changes
Apr 4, 2018
praveen-li
pushed a commit
to praveen-li/sonic-mgmt
that referenced
this pull request
Jun 20, 2019
* msft_github/master: (111 commits) add disconnect/connect vm to testbed-cli.sh (sonic-net#566) [dhcp_relay] Increase sleep duration to allow LAG and BGP to come up (sonic-net#565) [pfcwd]: support t0-116 and clean up the code (sonic-net#563) [fanout]: remove vrf management in arista fanout deploy templates (sonic-net#562) [fanout-switch-deploy] Support multiple speeds and port breakout (sonic-net#561) [extract_log] Improve extract_log script (sonic-net#559) Disable upgrade_sonic retry (sonic-net#560) Support for Sonic fanout (sonic-net#555) [Fanout deploy template] enable root user on fanout switches (sonic-net#557) [link state] match exact dut name in the link list (sonic-net#556) [pfcwd]: cache the ansible facts (sonic-net#554) Fix kernel version check (sonic-net#553) [pfcwd]: increase the pause waiting time and ingore snmp errors (sonic-net#551) fix typo (sonic-net#552) [dir_bcast] enable dir_bcast test on t0-116 topology (sonic-net#550) use command to gather host distribution, kernel version facts (sonic-net#549) [minigraph templte] use consistent VLAN subnet (sonic-net#546) [VM config] skip podset 0 tor 0 routing entry (sonic-net#545) Remove job minigraph_facts from boot_onie (sonic-net#548) [fast-reboot] pass VM IP in as ASCII strings (sonic-net#547) [fast-reboot test] fix syslog reading issue (sonic-net#543) add dataacl to minigraph template (sonic-net#544) [service_acl] Make test reliable when testing Arista service ACL solution (sonic-net#542) [pfcwd]: Iterate functional test over all ports (sonic-net#490) [service_acl] Detect expected output message even if it is followed by other text (sonic-net#541) [testbed]: Remove connection local for port_alias module (sonic-net#540) Revert "[minigraph-gen] fix AclInterface entries in minigraph (sonic-net#538)" (sonic-net#539) [minigraph-gen] fix AclInterface entries in minigraph (sonic-net#538) add retries in onie installation (sonic-net#537) Use connection plugin to install sonic image in ONIE. (sonic-net#536) [dhcp_relay]: Add --relax flag to ptf command (sonic-net#535) [minigraph_facts] use minigraph on DUT (sonic-net#534) Fix minigraph_facts: mkdir recursively (sonic-net#533) [minigraph_facts] use mingraph on DUT to test (sonic-net#532) generate minigraph based on topology file (sonic-net#531) Need to double-escape when using 'args' syntax (sonic-net#529) Fix improper 'local_action' syntax (sonic-net#528) Unify style of 'wait_for' actions across playbooks (sonic-net#527) [minigraph_facts] retrieving dhcp server list from vlan configuration instead of DhcpResources (sonic-net#526) [snmp_facts] increase get command timeout to fix cpu test failure (sonic-net#525) [everflow_test]: Add copy ptftests folder to use the remote.py file (sonic-net#522) Fix snmp_facts on PSU oid (sonic-net#520) Fix snmp queue test (sonic-net#519) [lag_test]: Remove the unnecessary testbed_type check (sonic-net#518) [ip_decap_test]: Support t0-64 topology (sonic-net#517) [acl_test]: Copy ptftests folder for the remote.py file (sonic-net#516) Add test case for PSU (sonic-net#514) [mtu]: Add t1-64-lag topology support for MTU test (sonic-net#513) [ip_decap]: Add t1-64-lag support in the script for the list of source port (sonic-net#512) [everflow]: Add missing spaces in ptf command (sonic-net#511) [acl_test]: Add ptf_platform_dir: ptftests to use customized platform code to support 64 ports (sonic-net#510) [crm]: Implement test for CRM (sonic-net#473) [everflow]: Add support for t1-64-lag topology (sonic-net#502) Fix sonic_image_version: get from sonic_version.yml, no dependency on grub (sonic-net#508) [topology]: Update t1-64-lag topology template to add AclInterfaces piece (sonic-net#505) Pull syncd-rpc with sonic version tag (sonic-net#507) Remove leading and trailing whitespaces when reading veos file (sonic-net#506) [lag_2] remove hard coded interval_count so it can be set by test (sonic-net#503) ptf_runner: Add one line comment for ptf_platform_dir (sonic-net#501) [testbed]: add port speed and fec configuration in sonic fanout (sonic-net#498) [typo]: Replace string t1-lag-64 with t1-64-lag (sonic-net#499) Adding sensor data for S6100 (sonic-net#496) [fib_test]: Add t1-64-lag src_ports in FIB test (sonic-net#497) Fix typo in acl test case name (sonic-net#494) Add one more Mellanox SKU string in everflow_tb_test script (sonic-net#495) Adding sensor data for Z9100 (sonic-net#492) [service_acl] Make test more robust and efficient (sonic-net#489) [dhcp relay test] adding more test scenarios (sonic-net#440) fix sanity check failed to recover (sonic-net#488) Fix PFC_WD test (sonic-net#479) add sonic fanout support (sonic-net#485) Update README.test.md Update README.test.md Fix table caption in testbed.csv and documentation (sonic-net#482) [test case] Add test: restart swss service (sonic-net#483) Add test case port toggle (sonic-net#484) [test infrastructure] allow overriding recover system actioin (sonic-net#480) [SNMP]add new SNMP counters tests to snmp.yml (sonic-net#477) add t0-52 topology (sonic-net#476) add command line option for creategraph.py (sonic-net#475) Add support for additional timestamp format (sonic-net#474) Ignore ansible output in extract_log (sonic-net#472) Add extract_logs action to concatenate logs after log rotate (sonic-net#471) Add ACL ICMP test (sonic-net#2) (sonic-net#465) [pfcwd]:add docker exec to avoid tty error (sonic-net#470) [test_tag]remove pfc_wd test tag from test by tag main yaml (sonic-net#469) [ansible]gather fact by default (sonic-net#468) [sensors]fix Dell S6000 sensors test fail (sonic-net#462) [vlan]improve test to wait a bit longer for config reload (sonic-net#463) one more place for hwsku of new Mellanox 2700 (sonic-net#464) [sensors]sensors test add new hwsku for Mellanox 2700 (sonic-net#461) [PFCWD]: test enhancement (sonic-net#456) [sensors] remove redundant sensor data definitin for sku MSN2100 (sonic-net#460) when call testcase by name, fetch all vms management info from testbed_facts (sonic-net#457) [acl test] error in ACL rule json file for destination ip (sonic-net#434) [sensor data] add sensor data for Mellanox MSN2410 and MSN2100 (sonic-net#449) Fix sku-sensors-data for 7050-QX32 (sonic-net#454) [deploy minigraph]add default enable BGP to deploy minigraph step (sonic-net#455) [upgrade]save bgp UP state after they are brough up (sonic-net#453) [reboot test] call sudo reboot to reboot dut (sonic-net#452) ...
rajshekhar-nexthop
added a commit
to nexthop-ai/sonic-mgmt
that referenced
this pull request
Nov 20, 2025
…tart helper (sonic-net#562) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: • Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness • New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py: • Proactively clears systemd failure counters (systemctl reset-failed) • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start • Verifies the target container becomes running within a timeout • Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec") Fixes # (issue) MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles. • Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? • MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. • This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness. #### How did you do it? • Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that: • Detects StartLimitHit pre/post restart attempts • Runs systemctl reset-failed to clear counters • Applies a fixed backoff when rate-limited, then systemctl start • Verifies the container is running within a configurable timeout using existing wait_until/state checks • Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call. #### How did you verify/test it? • Local validation in lab: • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled. • Repeated the restart sequence to emulate rate limiting scenarios. • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? -->
rajshekhar-nexthop
added a commit
to nexthop-ai/sonic-mgmt
that referenced
this pull request
Nov 20, 2025
…tart helper (sonic-net#562) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: • Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness • New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py: • Proactively clears systemd failure counters (systemctl reset-failed) • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start • Verifies the target container becomes running within a timeout • Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec") Fixes # (issue) MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles. • Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? • MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. • This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness. #### How did you do it? • Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that: • Detects StartLimitHit pre/post restart attempts • Runs systemctl reset-failed to clear counters • Applies a fixed backoff when rate-limited, then systemctl start • Verifies the container is running within a configurable timeout using existing wait_until/state checks • Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call. #### How did you verify/test it? • Local validation in lab: • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled. • Repeated the restart sequence to emulate rate limiting scenarios. • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? -->
rajshekhar-nexthop
added a commit
to nexthop-ai/sonic-mgmt
that referenced
this pull request
Nov 20, 2025
…tart helper (sonic-net#562) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: • Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness • New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py: • Proactively clears systemd failure counters (systemctl reset-failed) • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start • Verifies the target container becomes running within a timeout • Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec") Fixes # (issue) MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles. • Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? • MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. • This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness. #### How did you do it? • Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that: • Detects StartLimitHit pre/post restart attempts • Runs systemctl reset-failed to clear counters • Applies a fixed backoff when rate-limited, then systemctl start • Verifies the container is running within a configurable timeout using existing wait_until/state checks • Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call. #### How did you verify/test it? • Local validation in lab: • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled. • Repeated the restart sequence to emulate rate limiting scenarios. • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? -->
rajshekhar-nexthop
added a commit
to nexthop-ai/sonic-mgmt
that referenced
this pull request
Dec 15, 2025
…tart helper (sonic-net#562) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: • Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness • New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py: • Proactively clears systemd failure counters (systemctl reset-failed) • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start • Verifies the target container becomes running within a timeout • Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec") Fixes # (issue) MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles. • Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? • MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. • This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness. #### How did you do it? • Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that: • Detects StartLimitHit pre/post restart attempts • Runs systemctl reset-failed to clear counters • Applies a fixed backoff when rate-limited, then systemctl start • Verifies the container is running within a configurable timeout using existing wait_until/state checks • Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call. #### How did you verify/test it? • Local validation in lab: • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled. • Repeated the restart sequence to emulate rate limiting scenarios. • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]>
auspham
pushed a commit
to auspham/sonic-mgmt
that referenced
this pull request
Feb 3, 2026
<!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Fixes # (issue) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach Update the topo for upper T2 Details - uplinks are connected to odd ports - Downlinks are connected to even ports - Uplinks are single or 4 port portchannels - Downlinks are non portchannels #### How did you do it? Update the t2_single_node_max topo #### How did you verify/test it? #### Any platform specific information? No #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? -->
arlakshm
pushed a commit
that referenced
this pull request
Feb 9, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (#401) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Fixes # (issue) Fixes below json syntax error. It is seen only when dut is prepared with macsec enable flag. json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 3 (char 4) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-2969:Generate golden config only if macsec_profile is defined (#420) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR Redundant override config is avoided as no macsec profile is set in the prepare phase. Below are details how macsec profile configurations are rendered: PREPARE phase: Uses generate_t2_golden_config_db() → template rendering → file-based config RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update Summary: Fixes # (issue) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (#562) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: • Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness • New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py: • Proactively clears systemd failure counters (systemctl reset-failed) • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start • Verifies the target container becomes running within a timeout • Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec") Fixes # (issue) MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles. • Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? • MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. • This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness. #### How did you do it? • Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that: • Detects StartLimitHit pre/post restart attempts • Runs systemctl reset-failed to clear counters • Applies a fixed backoff when rate-limited, then systemctl start • Verifies the container is running within a configurable timeout using existing wait_until/state checks • Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call. #### How did you verify/test it? • Local validation in lab: • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled. • Repeated the restart sequence to emulate rate limiting scenarios. • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * NOS-3311: Fix MACsec test race and cleanup sync (#678) NOS-3311 tracks MACsec test flakiness caused by races between: * wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE DB), and * the test harness eagerly reading that state to build `MACSEC_INFO` (via `get_macsec_attr`). This can manifest as exceptions like `KeyError('sak')` when the MACsec egress SA row does not yet exist, even though `MACSEC_PORT_TABLE` already shows `enable_encrypt="true"`. There are also cleanup races where tests check for removal of MACsec DB entries before the background cleanup logic has finished. This PR adds two pieces of synchronization in sonic-mgmt: 1. Ensure MKA establishment before pre-loading MACsec session info for tests 2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec File: `tests/common/macsec/__init__.py` * The `load_macsec_info` fixture (module-scoped, autouse) previously called `load_all_macsec_info()` immediately when MACsec was enabled and a profile was present. That in turn calls `get_macsec_attr()`, which expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully programmed. * In environments where MACsec is pre-configured before tests start, this created a race: `MACSEC_PORT_TABLE` might already exist (with `enable_encrypt="true"`), but the egress SA row for the active AN might not yet have been written to APP_DB, leading to `KeyError('sak')` when `macsec_sa["sak"]` is accessed. * Fix: * When MACsec is enabled and a profile is present, the fixture now first *attempts* to resolve the `wait_mka_establish` fixture: ```python try: request.getfixturevalue('wait_mka_establish') except Exception: pass ``` * `wait_mka_establish` is defined in `tests/macsec/conftest.py` and internally uses `check_appl_db` plus `wait_until(...)` to ensure APP/STATE DB MACsec SC/SA tables are populated (including `sak`/`auth_key`/PN relationships) before returning. * If the fixture is not defined (e.g., in other environments or test suites), the code falls back to the previous behavior. * After this synchronization point, if `is_macsec_configured(...)` is true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for all control links. Otherwise, the original `macsec_setup` flow is triggered. This makes `get_macsec_attr()` execution order consistent with the rest of the MACsec test suite, which already relies on `wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and SAKs exist before validating state. cleanup File: `tests/common/macsec/macsec_config_helper.py` * Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export it via `__all__`. * This helper is designed for tests that: * disable MACsec on one or more interfaces, and then * need to assert that all associated MACsec entries (port, SC, SA) have been automatically removed from Redis before proceeding. * Behavior: * For EOS neighbors, it is a no-op: they do not use Redis DBs and the function returns `True` immediately. * For SONiC hosts, it: * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics` with patterns `MACSEC_*:{interface}*` (APPL_DB) and `MACSEC_*|{interface}*` (STATE_DB). * Aggregates any remaining keys per DB. * Returns `True` as soon as all such keys are gone for the given interfaces, logging total time taken. * If the `timeout` is exceeded, logs a warning, prints a summary of remaining entries, and returns `False`. * This centralizes the logic for “wait until MACsec entries are gone from Redis” instead of having ad hoc sleeps or partial checks in individual tests. * MACsec control-plane actions (via wpa_supplicant and swss/macsecorch) are asynchronous relative to the tests. It is valid for `MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs and their SAKs are still being programmed. * `get_macsec_attr()` assumes that: * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a valid `encoding_an`, and * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a `sak` field. Without synchronization, tests that pre-load `MACSEC_INFO` can hit a window where the SA row does not yet exist and crash with `KeyError('sak')`. * By tying `load_macsec_info` to `wait_mka_establish` where available, we ensure those pre-loads happen only after the expected MACsec state has been fully written to Redis. * Similarly, when disabling MACsec, asynchronous background cleanup can lag behind the test’s expectations. Having a dedicated, reusable `wait_for_macsec_cleanup` helper lets future tests explicitly wait for cleanup completion instead of guessing with sleeps. * Verified that the new fixtures and helpers are imported and wired correctly: * `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization. * `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests. * Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm: * MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before `load_all_macsec_info` is invoked. * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window. --- Pull Request opened by [Augment Code](https://www.augmentcode.com/) with guidance from the PR author Signed-off-by: rajshekhar <[email protected]> * taking care of review comments - Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited. - Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing. - Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers. Signed-off-by: rajshekhar <[email protected]> --------- Signed-off-by: rajshekhar <[email protected]>
mssonicbld
pushed a commit
to mssonicbld/sonic-mgmt
that referenced
this pull request
Feb 9, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Fixes # (issue) Fixes below json syntax error. It is seen only when dut is prepared with macsec enable flag. json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 3 (char 4) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR Redundant override config is avoided as no macsec profile is set in the prepare phase. Below are details how macsec profile configurations are rendered: PREPARE phase: Uses generate_t2_golden_config_db() → template rendering → file-based config RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update Summary: Fixes # (issue) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: • Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness • New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py: • Proactively clears systemd failure counters (systemctl reset-failed) • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start • Verifies the target container becomes running within a timeout • Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec") Fixes # (issue) MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles. • Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? • MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. • This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness. #### How did you do it? • Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that: • Detects StartLimitHit pre/post restart attempts • Runs systemctl reset-failed to clear counters • Applies a fixed backoff when rate-limited, then systemctl start • Verifies the container is running within a configurable timeout using existing wait_until/state checks • Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call. #### How did you verify/test it? • Local validation in lab: • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled. • Repeated the restart sequence to emulate rate limiting scenarios. • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678) NOS-3311 tracks MACsec test flakiness caused by races between: * wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE DB), and * the test harness eagerly reading that state to build `MACSEC_INFO` (via `get_macsec_attr`). This can manifest as exceptions like `KeyError('sak')` when the MACsec egress SA row does not yet exist, even though `MACSEC_PORT_TABLE` already shows `enable_encrypt="true"`. There are also cleanup races where tests check for removal of MACsec DB entries before the background cleanup logic has finished. This PR adds two pieces of synchronization in sonic-mgmt: 1. Ensure MKA establishment before pre-loading MACsec session info for tests 2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec File: `tests/common/macsec/__init__.py` * The `load_macsec_info` fixture (module-scoped, autouse) previously called `load_all_macsec_info()` immediately when MACsec was enabled and a profile was present. That in turn calls `get_macsec_attr()`, which expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully programmed. * In environments where MACsec is pre-configured before tests start, this created a race: `MACSEC_PORT_TABLE` might already exist (with `enable_encrypt="true"`), but the egress SA row for the active AN might not yet have been written to APP_DB, leading to `KeyError('sak')` when `macsec_sa["sak"]` is accessed. * Fix: * When MACsec is enabled and a profile is present, the fixture now first *attempts* to resolve the `wait_mka_establish` fixture: ```python try: request.getfixturevalue('wait_mka_establish') except Exception: pass ``` * `wait_mka_establish` is defined in `tests/macsec/conftest.py` and internally uses `check_appl_db` plus `wait_until(...)` to ensure APP/STATE DB MACsec SC/SA tables are populated (including `sak`/`auth_key`/PN relationships) before returning. * If the fixture is not defined (e.g., in other environments or test suites), the code falls back to the previous behavior. * After this synchronization point, if `is_macsec_configured(...)` is true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for all control links. Otherwise, the original `macsec_setup` flow is triggered. This makes `get_macsec_attr()` execution order consistent with the rest of the MACsec test suite, which already relies on `wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and SAKs exist before validating state. cleanup File: `tests/common/macsec/macsec_config_helper.py` * Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export it via `__all__`. * This helper is designed for tests that: * disable MACsec on one or more interfaces, and then * need to assert that all associated MACsec entries (port, SC, SA) have been automatically removed from Redis before proceeding. * Behavior: * For EOS neighbors, it is a no-op: they do not use Redis DBs and the function returns `True` immediately. * For SONiC hosts, it: * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics` with patterns `MACSEC_*:{interface}*` (APPL_DB) and `MACSEC_*|{interface}*` (STATE_DB). * Aggregates any remaining keys per DB. * Returns `True` as soon as all such keys are gone for the given interfaces, logging total time taken. * If the `timeout` is exceeded, logs a warning, prints a summary of remaining entries, and returns `False`. * This centralizes the logic for “wait until MACsec entries are gone from Redis” instead of having ad hoc sleeps or partial checks in individual tests. * MACsec control-plane actions (via wpa_supplicant and swss/macsecorch) are asynchronous relative to the tests. It is valid for `MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs and their SAKs are still being programmed. * `get_macsec_attr()` assumes that: * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a valid `encoding_an`, and * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a `sak` field. Without synchronization, tests that pre-load `MACSEC_INFO` can hit a window where the SA row does not yet exist and crash with `KeyError('sak')`. * By tying `load_macsec_info` to `wait_mka_establish` where available, we ensure those pre-loads happen only after the expected MACsec state has been fully written to Redis. * Similarly, when disabling MACsec, asynchronous background cleanup can lag behind the test’s expectations. Having a dedicated, reusable `wait_for_macsec_cleanup` helper lets future tests explicitly wait for cleanup completion instead of guessing with sleeps. * Verified that the new fixtures and helpers are imported and wired correctly: * `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization. * `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests. * Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm: * MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before `load_all_macsec_info` is invoked. * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window. --- Pull Request opened by [Augment Code](https://www.augmentcode.com/) with guidance from the PR author Signed-off-by: rajshekhar <[email protected]> * taking care of review comments - Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited. - Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing. - Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers. Signed-off-by: rajshekhar <[email protected]> --------- Signed-off-by: rajshekhar <[email protected]>
nnelluri-cisco
pushed a commit
to nnelluri-cisco/sonic-mgmt
that referenced
this pull request
Feb 12, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Fixes # (issue) Fixes below json syntax error. It is seen only when dut is prepared with macsec enable flag. json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 3 (char 4) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR Redundant override config is avoided as no macsec profile is set in the prepare phase. Below are details how macsec profile configurations are rendered: PREPARE phase: Uses generate_t2_golden_config_db() → template rendering → file-based config RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update Summary: Fixes # (issue) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: • Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness • New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py: • Proactively clears systemd failure counters (systemctl reset-failed) • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start • Verifies the target container becomes running within a timeout • Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec") Fixes # (issue) MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles. • Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? • MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. • This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness. #### How did you do it? • Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that: • Detects StartLimitHit pre/post restart attempts • Runs systemctl reset-failed to clear counters • Applies a fixed backoff when rate-limited, then systemctl start • Verifies the container is running within a configurable timeout using existing wait_until/state checks • Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call. #### How did you verify/test it? • Local validation in lab: • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled. • Repeated the restart sequence to emulate rate limiting scenarios. • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678) NOS-3311 tracks MACsec test flakiness caused by races between: * wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE DB), and * the test harness eagerly reading that state to build `MACSEC_INFO` (via `get_macsec_attr`). This can manifest as exceptions like `KeyError('sak')` when the MACsec egress SA row does not yet exist, even though `MACSEC_PORT_TABLE` already shows `enable_encrypt="true"`. There are also cleanup races where tests check for removal of MACsec DB entries before the background cleanup logic has finished. This PR adds two pieces of synchronization in sonic-mgmt: 1. Ensure MKA establishment before pre-loading MACsec session info for tests 2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec File: `tests/common/macsec/__init__.py` * The `load_macsec_info` fixture (module-scoped, autouse) previously called `load_all_macsec_info()` immediately when MACsec was enabled and a profile was present. That in turn calls `get_macsec_attr()`, which expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully programmed. * In environments where MACsec is pre-configured before tests start, this created a race: `MACSEC_PORT_TABLE` might already exist (with `enable_encrypt="true"`), but the egress SA row for the active AN might not yet have been written to APP_DB, leading to `KeyError('sak')` when `macsec_sa["sak"]` is accessed. * Fix: * When MACsec is enabled and a profile is present, the fixture now first *attempts* to resolve the `wait_mka_establish` fixture: ```python try: request.getfixturevalue('wait_mka_establish') except Exception: pass ``` * `wait_mka_establish` is defined in `tests/macsec/conftest.py` and internally uses `check_appl_db` plus `wait_until(...)` to ensure APP/STATE DB MACsec SC/SA tables are populated (including `sak`/`auth_key`/PN relationships) before returning. * If the fixture is not defined (e.g., in other environments or test suites), the code falls back to the previous behavior. * After this synchronization point, if `is_macsec_configured(...)` is true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for all control links. Otherwise, the original `macsec_setup` flow is triggered. This makes `get_macsec_attr()` execution order consistent with the rest of the MACsec test suite, which already relies on `wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and SAKs exist before validating state. cleanup File: `tests/common/macsec/macsec_config_helper.py` * Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export it via `__all__`. * This helper is designed for tests that: * disable MACsec on one or more interfaces, and then * need to assert that all associated MACsec entries (port, SC, SA) have been automatically removed from Redis before proceeding. * Behavior: * For EOS neighbors, it is a no-op: they do not use Redis DBs and the function returns `True` immediately. * For SONiC hosts, it: * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics` with patterns `MACSEC_*:{interface}*` (APPL_DB) and `MACSEC_*|{interface}*` (STATE_DB). * Aggregates any remaining keys per DB. * Returns `True` as soon as all such keys are gone for the given interfaces, logging total time taken. * If the `timeout` is exceeded, logs a warning, prints a summary of remaining entries, and returns `False`. * This centralizes the logic for “wait until MACsec entries are gone from Redis” instead of having ad hoc sleeps or partial checks in individual tests. * MACsec control-plane actions (via wpa_supplicant and swss/macsecorch) are asynchronous relative to the tests. It is valid for `MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs and their SAKs are still being programmed. * `get_macsec_attr()` assumes that: * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a valid `encoding_an`, and * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a `sak` field. Without synchronization, tests that pre-load `MACSEC_INFO` can hit a window where the SA row does not yet exist and crash with `KeyError('sak')`. * By tying `load_macsec_info` to `wait_mka_establish` where available, we ensure those pre-loads happen only after the expected MACsec state has been fully written to Redis. * Similarly, when disabling MACsec, asynchronous background cleanup can lag behind the test’s expectations. Having a dedicated, reusable `wait_for_macsec_cleanup` helper lets future tests explicitly wait for cleanup completion instead of guessing with sleeps. * Verified that the new fixtures and helpers are imported and wired correctly: * `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization. * `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests. * Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm: * MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before `load_all_macsec_info` is invoked. * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window. --- Pull Request opened by [Augment Code](https://www.augmentcode.com/) with guidance from the PR author Signed-off-by: rajshekhar <[email protected]> * taking care of review comments - Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited. - Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing. - Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers. Signed-off-by: rajshekhar <[email protected]> --------- Signed-off-by: rajshekhar <[email protected]> Signed-off-by: nnelluri-cisco <[email protected]>
anilal-amd
pushed a commit
to anilal-amd/anilal-forked-sonic-mgmt
that referenced
this pull request
Feb 19, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Fixes # (issue) Fixes below json syntax error. It is seen only when dut is prepared with macsec enable flag. json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 3 (char 4) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR Redundant override config is avoided as no macsec profile is set in the prepare phase. Below are details how macsec profile configurations are rendered: PREPARE phase: Uses generate_t2_golden_config_db() → template rendering → file-based config RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update Summary: Fixes # (issue) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: • Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness • New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py: • Proactively clears systemd failure counters (systemctl reset-failed) • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start • Verifies the target container becomes running within a timeout • Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec") Fixes # (issue) MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles. • Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? • MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. • This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness. #### How did you do it? • Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that: • Detects StartLimitHit pre/post restart attempts • Runs systemctl reset-failed to clear counters • Applies a fixed backoff when rate-limited, then systemctl start • Verifies the container is running within a configurable timeout using existing wait_until/state checks • Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call. #### How did you verify/test it? • Local validation in lab: • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled. • Repeated the restart sequence to emulate rate limiting scenarios. • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678) NOS-3311 tracks MACsec test flakiness caused by races between: * wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE DB), and * the test harness eagerly reading that state to build `MACSEC_INFO` (via `get_macsec_attr`). This can manifest as exceptions like `KeyError('sak')` when the MACsec egress SA row does not yet exist, even though `MACSEC_PORT_TABLE` already shows `enable_encrypt="true"`. There are also cleanup races where tests check for removal of MACsec DB entries before the background cleanup logic has finished. This PR adds two pieces of synchronization in sonic-mgmt: 1. Ensure MKA establishment before pre-loading MACsec session info for tests 2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec File: `tests/common/macsec/__init__.py` * The `load_macsec_info` fixture (module-scoped, autouse) previously called `load_all_macsec_info()` immediately when MACsec was enabled and a profile was present. That in turn calls `get_macsec_attr()`, which expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully programmed. * In environments where MACsec is pre-configured before tests start, this created a race: `MACSEC_PORT_TABLE` might already exist (with `enable_encrypt="true"`), but the egress SA row for the active AN might not yet have been written to APP_DB, leading to `KeyError('sak')` when `macsec_sa["sak"]` is accessed. * Fix: * When MACsec is enabled and a profile is present, the fixture now first *attempts* to resolve the `wait_mka_establish` fixture: ```python try: request.getfixturevalue('wait_mka_establish') except Exception: pass ``` * `wait_mka_establish` is defined in `tests/macsec/conftest.py` and internally uses `check_appl_db` plus `wait_until(...)` to ensure APP/STATE DB MACsec SC/SA tables are populated (including `sak`/`auth_key`/PN relationships) before returning. * If the fixture is not defined (e.g., in other environments or test suites), the code falls back to the previous behavior. * After this synchronization point, if `is_macsec_configured(...)` is true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for all control links. Otherwise, the original `macsec_setup` flow is triggered. This makes `get_macsec_attr()` execution order consistent with the rest of the MACsec test suite, which already relies on `wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and SAKs exist before validating state. cleanup File: `tests/common/macsec/macsec_config_helper.py` * Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export it via `__all__`. * This helper is designed for tests that: * disable MACsec on one or more interfaces, and then * need to assert that all associated MACsec entries (port, SC, SA) have been automatically removed from Redis before proceeding. * Behavior: * For EOS neighbors, it is a no-op: they do not use Redis DBs and the function returns `True` immediately. * For SONiC hosts, it: * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics` with patterns `MACSEC_*:{interface}*` (APPL_DB) and `MACSEC_*|{interface}*` (STATE_DB). * Aggregates any remaining keys per DB. * Returns `True` as soon as all such keys are gone for the given interfaces, logging total time taken. * If the `timeout` is exceeded, logs a warning, prints a summary of remaining entries, and returns `False`. * This centralizes the logic for “wait until MACsec entries are gone from Redis” instead of having ad hoc sleeps or partial checks in individual tests. * MACsec control-plane actions (via wpa_supplicant and swss/macsecorch) are asynchronous relative to the tests. It is valid for `MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs and their SAKs are still being programmed. * `get_macsec_attr()` assumes that: * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a valid `encoding_an`, and * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a `sak` field. Without synchronization, tests that pre-load `MACSEC_INFO` can hit a window where the SA row does not yet exist and crash with `KeyError('sak')`. * By tying `load_macsec_info` to `wait_mka_establish` where available, we ensure those pre-loads happen only after the expected MACsec state has been fully written to Redis. * Similarly, when disabling MACsec, asynchronous background cleanup can lag behind the test’s expectations. Having a dedicated, reusable `wait_for_macsec_cleanup` helper lets future tests explicitly wait for cleanup completion instead of guessing with sleeps. * Verified that the new fixtures and helpers are imported and wired correctly: * `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization. * `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests. * Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm: * MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before `load_all_macsec_info` is invoked. * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window. --- Pull Request opened by [Augment Code](https://www.augmentcode.com/) with guidance from the PR author Signed-off-by: rajshekhar <[email protected]> * taking care of review comments - Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited. - Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing. - Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers. Signed-off-by: rajshekhar <[email protected]> --------- Signed-off-by: rajshekhar <[email protected]> Signed-off-by: Zhuohui Tan <[email protected]>
ravaliyel
pushed a commit
to ravaliyel/sonic-mgmt
that referenced
this pull request
Mar 12, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Fixes # (issue) Fixes below json syntax error. It is seen only when dut is prepared with macsec enable flag. json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 3 (char 4) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR Redundant override config is avoided as no macsec profile is set in the prepare phase. Below are details how macsec profile configurations are rendered: PREPARE phase: Uses generate_t2_golden_config_db() → template rendering → file-based config RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update Summary: Fixes # (issue) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: • Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness • New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py: • Proactively clears systemd failure counters (systemctl reset-failed) • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start • Verifies the target container becomes running within a timeout • Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec") Fixes # (issue) MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles. • Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? • MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. • This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness. #### How did you do it? • Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that: • Detects StartLimitHit pre/post restart attempts • Runs systemctl reset-failed to clear counters • Applies a fixed backoff when rate-limited, then systemctl start • Verifies the container is running within a configurable timeout using existing wait_until/state checks • Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call. #### How did you verify/test it? • Local validation in lab: • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled. • Repeated the restart sequence to emulate rate limiting scenarios. • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678) NOS-3311 tracks MACsec test flakiness caused by races between: * wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE DB), and * the test harness eagerly reading that state to build `MACSEC_INFO` (via `get_macsec_attr`). This can manifest as exceptions like `KeyError('sak')` when the MACsec egress SA row does not yet exist, even though `MACSEC_PORT_TABLE` already shows `enable_encrypt="true"`. There are also cleanup races where tests check for removal of MACsec DB entries before the background cleanup logic has finished. This PR adds two pieces of synchronization in sonic-mgmt: 1. Ensure MKA establishment before pre-loading MACsec session info for tests 2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec File: `tests/common/macsec/__init__.py` * The `load_macsec_info` fixture (module-scoped, autouse) previously called `load_all_macsec_info()` immediately when MACsec was enabled and a profile was present. That in turn calls `get_macsec_attr()`, which expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully programmed. * In environments where MACsec is pre-configured before tests start, this created a race: `MACSEC_PORT_TABLE` might already exist (with `enable_encrypt="true"`), but the egress SA row for the active AN might not yet have been written to APP_DB, leading to `KeyError('sak')` when `macsec_sa["sak"]` is accessed. * Fix: * When MACsec is enabled and a profile is present, the fixture now first *attempts* to resolve the `wait_mka_establish` fixture: ```python try: request.getfixturevalue('wait_mka_establish') except Exception: pass ``` * `wait_mka_establish` is defined in `tests/macsec/conftest.py` and internally uses `check_appl_db` plus `wait_until(...)` to ensure APP/STATE DB MACsec SC/SA tables are populated (including `sak`/`auth_key`/PN relationships) before returning. * If the fixture is not defined (e.g., in other environments or test suites), the code falls back to the previous behavior. * After this synchronization point, if `is_macsec_configured(...)` is true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for all control links. Otherwise, the original `macsec_setup` flow is triggered. This makes `get_macsec_attr()` execution order consistent with the rest of the MACsec test suite, which already relies on `wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and SAKs exist before validating state. cleanup File: `tests/common/macsec/macsec_config_helper.py` * Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export it via `__all__`. * This helper is designed for tests that: * disable MACsec on one or more interfaces, and then * need to assert that all associated MACsec entries (port, SC, SA) have been automatically removed from Redis before proceeding. * Behavior: * For EOS neighbors, it is a no-op: they do not use Redis DBs and the function returns `True` immediately. * For SONiC hosts, it: * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics` with patterns `MACSEC_*:{interface}*` (APPL_DB) and `MACSEC_*|{interface}*` (STATE_DB). * Aggregates any remaining keys per DB. * Returns `True` as soon as all such keys are gone for the given interfaces, logging total time taken. * If the `timeout` is exceeded, logs a warning, prints a summary of remaining entries, and returns `False`. * This centralizes the logic for “wait until MACsec entries are gone from Redis” instead of having ad hoc sleeps or partial checks in individual tests. * MACsec control-plane actions (via wpa_supplicant and swss/macsecorch) are asynchronous relative to the tests. It is valid for `MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs and their SAKs are still being programmed. * `get_macsec_attr()` assumes that: * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a valid `encoding_an`, and * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a `sak` field. Without synchronization, tests that pre-load `MACSEC_INFO` can hit a window where the SA row does not yet exist and crash with `KeyError('sak')`. * By tying `load_macsec_info` to `wait_mka_establish` where available, we ensure those pre-loads happen only after the expected MACsec state has been fully written to Redis. * Similarly, when disabling MACsec, asynchronous background cleanup can lag behind the test’s expectations. Having a dedicated, reusable `wait_for_macsec_cleanup` helper lets future tests explicitly wait for cleanup completion instead of guessing with sleeps. * Verified that the new fixtures and helpers are imported and wired correctly: * `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization. * `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests. * Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm: * MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before `load_all_macsec_info` is invoked. * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window. --- Pull Request opened by [Augment Code](https://www.augmentcode.com/) with guidance from the PR author Signed-off-by: rajshekhar <[email protected]> * taking care of review comments - Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited. - Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing. - Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers. Signed-off-by: rajshekhar <[email protected]> --------- Signed-off-by: rajshekhar <[email protected]> Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>
abhishek-nexthop
pushed a commit
to nexthop-ai/sonic-mgmt
that referenced
this pull request
Mar 17, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Fixes # (issue) Fixes below json syntax error. It is seen only when dut is prepared with macsec enable flag. json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 3 (char 4) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR Redundant override config is avoided as no macsec profile is set in the prepare phase. Below are details how macsec profile configurations are rendered: PREPARE phase: Uses generate_t2_golden_config_db() → template rendering → file-based config RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update Summary: Fixes # (issue) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: • Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness • New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py: • Proactively clears systemd failure counters (systemctl reset-failed) • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start • Verifies the target container becomes running within a timeout • Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec") Fixes # (issue) MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles. • Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? • MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. • This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness. #### How did you do it? • Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that: • Detects StartLimitHit pre/post restart attempts • Runs systemctl reset-failed to clear counters • Applies a fixed backoff when rate-limited, then systemctl start • Verifies the container is running within a configurable timeout using existing wait_until/state checks • Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call. #### How did you verify/test it? • Local validation in lab: • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled. • Repeated the restart sequence to emulate rate limiting scenarios. • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678) NOS-3311 tracks MACsec test flakiness caused by races between: * wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE DB), and * the test harness eagerly reading that state to build `MACSEC_INFO` (via `get_macsec_attr`). This can manifest as exceptions like `KeyError('sak')` when the MACsec egress SA row does not yet exist, even though `MACSEC_PORT_TABLE` already shows `enable_encrypt="true"`. There are also cleanup races where tests check for removal of MACsec DB entries before the background cleanup logic has finished. This PR adds two pieces of synchronization in sonic-mgmt: 1. Ensure MKA establishment before pre-loading MACsec session info for tests 2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec File: `tests/common/macsec/__init__.py` * The `load_macsec_info` fixture (module-scoped, autouse) previously called `load_all_macsec_info()` immediately when MACsec was enabled and a profile was present. That in turn calls `get_macsec_attr()`, which expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully programmed. * In environments where MACsec is pre-configured before tests start, this created a race: `MACSEC_PORT_TABLE` might already exist (with `enable_encrypt="true"`), but the egress SA row for the active AN might not yet have been written to APP_DB, leading to `KeyError('sak')` when `macsec_sa["sak"]` is accessed. * Fix: * When MACsec is enabled and a profile is present, the fixture now first *attempts* to resolve the `wait_mka_establish` fixture: ```python try: request.getfixturevalue('wait_mka_establish') except Exception: pass ``` * `wait_mka_establish` is defined in `tests/macsec/conftest.py` and internally uses `check_appl_db` plus `wait_until(...)` to ensure APP/STATE DB MACsec SC/SA tables are populated (including `sak`/`auth_key`/PN relationships) before returning. * If the fixture is not defined (e.g., in other environments or test suites), the code falls back to the previous behavior. * After this synchronization point, if `is_macsec_configured(...)` is true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for all control links. Otherwise, the original `macsec_setup` flow is triggered. This makes `get_macsec_attr()` execution order consistent with the rest of the MACsec test suite, which already relies on `wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and SAKs exist before validating state. cleanup File: `tests/common/macsec/macsec_config_helper.py` * Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export it via `__all__`. * This helper is designed for tests that: * disable MACsec on one or more interfaces, and then * need to assert that all associated MACsec entries (port, SC, SA) have been automatically removed from Redis before proceeding. * Behavior: * For EOS neighbors, it is a no-op: they do not use Redis DBs and the function returns `True` immediately. * For SONiC hosts, it: * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics` with patterns `MACSEC_*:{interface}*` (APPL_DB) and `MACSEC_*|{interface}*` (STATE_DB). * Aggregates any remaining keys per DB. * Returns `True` as soon as all such keys are gone for the given interfaces, logging total time taken. * If the `timeout` is exceeded, logs a warning, prints a summary of remaining entries, and returns `False`. * This centralizes the logic for “wait until MACsec entries are gone from Redis” instead of having ad hoc sleeps or partial checks in individual tests. * MACsec control-plane actions (via wpa_supplicant and swss/macsecorch) are asynchronous relative to the tests. It is valid for `MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs and their SAKs are still being programmed. * `get_macsec_attr()` assumes that: * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a valid `encoding_an`, and * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a `sak` field. Without synchronization, tests that pre-load `MACSEC_INFO` can hit a window where the SA row does not yet exist and crash with `KeyError('sak')`. * By tying `load_macsec_info` to `wait_mka_establish` where available, we ensure those pre-loads happen only after the expected MACsec state has been fully written to Redis. * Similarly, when disabling MACsec, asynchronous background cleanup can lag behind the test’s expectations. Having a dedicated, reusable `wait_for_macsec_cleanup` helper lets future tests explicitly wait for cleanup completion instead of guessing with sleeps. * Verified that the new fixtures and helpers are imported and wired correctly: * `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization. * `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests. * Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm: * MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before `load_all_macsec_info` is invoked. * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window. --- Pull Request opened by [Augment Code](https://www.augmentcode.com/) with guidance from the PR author Signed-off-by: rajshekhar <[email protected]> * taking care of review comments - Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited. - Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing. - Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers. Signed-off-by: rajshekhar <[email protected]> --------- Signed-off-by: rajshekhar <[email protected]> Signed-off-by: Abhishek <[email protected]>
venu-nexthop
pushed a commit
to venu-nexthop/sonic-mgmt
that referenced
this pull request
Mar 19, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Fixes # (issue) Fixes below json syntax error. It is seen only when dut is prepared with macsec enable flag. json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 3 (char 4) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR Redundant override config is avoided as no macsec profile is set in the prepare phase. Below are details how macsec profile configurations are rendered: PREPARE phase: Uses generate_t2_golden_config_db() → template rendering → file-based config RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update Summary: Fixes # (issue) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: • Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness • New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py: • Proactively clears systemd failure counters (systemctl reset-failed) • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start • Verifies the target container becomes running within a timeout • Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec") Fixes # (issue) MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles. • Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? • MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. • This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness. #### How did you do it? • Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that: • Detects StartLimitHit pre/post restart attempts • Runs systemctl reset-failed to clear counters • Applies a fixed backoff when rate-limited, then systemctl start • Verifies the container is running within a configurable timeout using existing wait_until/state checks • Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call. #### How did you verify/test it? • Local validation in lab: • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled. • Repeated the restart sequence to emulate rate limiting scenarios. • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678) NOS-3311 tracks MACsec test flakiness caused by races between: * wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE DB), and * the test harness eagerly reading that state to build `MACSEC_INFO` (via `get_macsec_attr`). This can manifest as exceptions like `KeyError('sak')` when the MACsec egress SA row does not yet exist, even though `MACSEC_PORT_TABLE` already shows `enable_encrypt="true"`. There are also cleanup races where tests check for removal of MACsec DB entries before the background cleanup logic has finished. This PR adds two pieces of synchronization in sonic-mgmt: 1. Ensure MKA establishment before pre-loading MACsec session info for tests 2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec File: `tests/common/macsec/__init__.py` * The `load_macsec_info` fixture (module-scoped, autouse) previously called `load_all_macsec_info()` immediately when MACsec was enabled and a profile was present. That in turn calls `get_macsec_attr()`, which expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully programmed. * In environments where MACsec is pre-configured before tests start, this created a race: `MACSEC_PORT_TABLE` might already exist (with `enable_encrypt="true"`), but the egress SA row for the active AN might not yet have been written to APP_DB, leading to `KeyError('sak')` when `macsec_sa["sak"]` is accessed. * Fix: * When MACsec is enabled and a profile is present, the fixture now first *attempts* to resolve the `wait_mka_establish` fixture: ```python try: request.getfixturevalue('wait_mka_establish') except Exception: pass ``` * `wait_mka_establish` is defined in `tests/macsec/conftest.py` and internally uses `check_appl_db` plus `wait_until(...)` to ensure APP/STATE DB MACsec SC/SA tables are populated (including `sak`/`auth_key`/PN relationships) before returning. * If the fixture is not defined (e.g., in other environments or test suites), the code falls back to the previous behavior. * After this synchronization point, if `is_macsec_configured(...)` is true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for all control links. Otherwise, the original `macsec_setup` flow is triggered. This makes `get_macsec_attr()` execution order consistent with the rest of the MACsec test suite, which already relies on `wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and SAKs exist before validating state. cleanup File: `tests/common/macsec/macsec_config_helper.py` * Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export it via `__all__`. * This helper is designed for tests that: * disable MACsec on one or more interfaces, and then * need to assert that all associated MACsec entries (port, SC, SA) have been automatically removed from Redis before proceeding. * Behavior: * For EOS neighbors, it is a no-op: they do not use Redis DBs and the function returns `True` immediately. * For SONiC hosts, it: * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics` with patterns `MACSEC_*:{interface}*` (APPL_DB) and `MACSEC_*|{interface}*` (STATE_DB). * Aggregates any remaining keys per DB. * Returns `True` as soon as all such keys are gone for the given interfaces, logging total time taken. * If the `timeout` is exceeded, logs a warning, prints a summary of remaining entries, and returns `False`. * This centralizes the logic for “wait until MACsec entries are gone from Redis” instead of having ad hoc sleeps or partial checks in individual tests. * MACsec control-plane actions (via wpa_supplicant and swss/macsecorch) are asynchronous relative to the tests. It is valid for `MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs and their SAKs are still being programmed. * `get_macsec_attr()` assumes that: * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a valid `encoding_an`, and * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a `sak` field. Without synchronization, tests that pre-load `MACSEC_INFO` can hit a window where the SA row does not yet exist and crash with `KeyError('sak')`. * By tying `load_macsec_info` to `wait_mka_establish` where available, we ensure those pre-loads happen only after the expected MACsec state has been fully written to Redis. * Similarly, when disabling MACsec, asynchronous background cleanup can lag behind the test’s expectations. Having a dedicated, reusable `wait_for_macsec_cleanup` helper lets future tests explicitly wait for cleanup completion instead of guessing with sleeps. * Verified that the new fixtures and helpers are imported and wired correctly: * `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization. * `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests. * Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm: * MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before `load_all_macsec_info` is invoked. * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window. --- Pull Request opened by [Augment Code](https://www.augmentcode.com/) with guidance from the PR author Signed-off-by: rajshekhar <[email protected]> * taking care of review comments - Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited. - Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing. - Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers. Signed-off-by: rajshekhar <[email protected]> --------- Signed-off-by: rajshekhar <[email protected]>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Signed-off-by: Sihui Han [email protected]
Description of PR
Fixes # (issue)
Type of change
Approach
How did you do it?
Remove vrf management configuration on Arista fanouts.
How did you verify/test it?
Tested on DUT.
Any platform specific information?
Supported testbed topology if it's a new test case?
Documentation