[fanout]: remove vrf management in arista fanout deploy templates #562

Merged
sihuihan88 merged 1 commit into sonic-net:master from sihuihan88:dev/sihan/fanout
Apr 4, 2018

Conversation

@sihuihan88 sihuihan88 commented Apr 4, 2018

Signed-off-by: Sihui Han [email protected]

Description of PR

Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Approach

How did you do it?
Remove vrf management configuration on Arista fanouts.
How did you verify/test it?
Tested on DUT.
Any platform specific information?
Supported testbed topology if it's a new test case?

Documentation

@sihuihan88 sihuihan88 requested a review from yxieca April 4, 2018 17:44
!
interface Management 1
description TO LAB MGMT SWITCH
vrf forwarding management

I guess this is required for management traffic isolation. Why do we have to remove it?

@pavel-shirshov

With this patch we have to check all our management routing and dataplane routing carefully.
Otherwise we'll see some unexpected behavior from the vEOS.
I'd rather avoid making such changes. They cause more issues than benefits.

@sihuihan88

The fanout switches are basically layer-2 switches with no BGP or IP configured, so we should be fine. We tested with the new template and all nightly tests passed.

@pavel-shirshov pavel-shirshov left a comment

you're right

@pavel-shirshov

Sihui, you're right. Sorry, I didn't read your PR carefully.

@sihuihan88 sihuihan88 merged commit bd5cb6c into sonic-net:master Apr 4, 2018
@sihuihan88 sihuihan88 deleted the dev/sihan/fanout branch April 4, 2018 20:11
praveen-li pushed a commit to praveen-li/sonic-mgmt that referenced this pull request Jun 20, 2019
* msft_github/master: (111 commits)
  add disconnect/connect vm to testbed-cli.sh (sonic-net#566)
  [dhcp_relay] Increase sleep duration to allow LAG and BGP to come up (sonic-net#565)
  [pfcwd]: support t0-116 and clean up the code (sonic-net#563)
  [fanout]: remove vrf management in arista fanout deploy templates (sonic-net#562)
  [fanout-switch-deploy] Support multiple speeds and port breakout (sonic-net#561)
  [extract_log] Improve extract_log script (sonic-net#559)
  Disable upgrade_sonic retry (sonic-net#560)
  Support for Sonic fanout (sonic-net#555)
  [Fanout deploy template] enable root user on fanout switches (sonic-net#557)
  [link state] match exact dut name in the link list (sonic-net#556)
  [pfcwd]: cache the ansible facts (sonic-net#554)
  Fix kernel version check (sonic-net#553)
  [pfcwd]: increase the pause waiting time and ingore snmp errors (sonic-net#551)
  fix typo (sonic-net#552)
  [dir_bcast] enable dir_bcast test on t0-116 topology (sonic-net#550)
  use command to gather host distribution, kernel version facts (sonic-net#549)
  [minigraph templte] use consistent VLAN subnet (sonic-net#546)
  [VM config] skip podset 0 tor 0 routing entry (sonic-net#545)
  Remove job minigraph_facts from boot_onie (sonic-net#548)
  [fast-reboot] pass VM IP in as ASCII strings (sonic-net#547)
  [fast-reboot test] fix syslog reading issue (sonic-net#543)
  add dataacl to minigraph template (sonic-net#544)
  [service_acl] Make test reliable when testing Arista service ACL solution (sonic-net#542)
  [pfcwd]: Iterate functional test over all ports (sonic-net#490)
  [service_acl] Detect expected output message even if it is followed by other text (sonic-net#541)
  [testbed]: Remove connection local for port_alias module (sonic-net#540)
  Revert "[minigraph-gen] fix AclInterface entries in minigraph (sonic-net#538)" (sonic-net#539)
  [minigraph-gen] fix AclInterface entries in minigraph (sonic-net#538)
  add retries in onie installation (sonic-net#537)
  Use connection plugin to install sonic image in ONIE. (sonic-net#536)
  [dhcp_relay]: Add --relax flag to ptf command (sonic-net#535)
  [minigraph_facts] use minigraph on DUT (sonic-net#534)
  Fix minigraph_facts: mkdir recursively (sonic-net#533)
  [minigraph_facts] use mingraph on DUT to test (sonic-net#532)
  generate minigraph based on topology file (sonic-net#531)
  Need to double-escape when using 'args' syntax (sonic-net#529)
  Fix improper 'local_action' syntax (sonic-net#528)
  Unify style of 'wait_for' actions across playbooks (sonic-net#527)
  [minigraph_facts] retrieving dhcp server list from vlan configuration instead of DhcpResources (sonic-net#526)
  [snmp_facts] increase get command timeout to fix cpu test failure (sonic-net#525)
  [everflow_test]: Add copy ptftests folder to use the remote.py file (sonic-net#522)
  Fix snmp_facts on PSU oid (sonic-net#520)
  Fix snmp queue test (sonic-net#519)
  [lag_test]: Remove the unnecessary testbed_type check (sonic-net#518)
  [ip_decap_test]: Support t0-64 topology (sonic-net#517)
  [acl_test]: Copy ptftests folder for the remote.py file (sonic-net#516)
  Add test case for PSU (sonic-net#514)
  [mtu]: Add t1-64-lag topology support for MTU test (sonic-net#513)
  [ip_decap]: Add t1-64-lag support in the script for the list of source port (sonic-net#512)
  [everflow]: Add missing spaces in ptf command (sonic-net#511)
  [acl_test]: Add ptf_platform_dir: ptftests to use customized platform code to support 64 ports (sonic-net#510)
  [crm]: Implement test for CRM (sonic-net#473)
  [everflow]: Add support for t1-64-lag topology (sonic-net#502)
  Fix sonic_image_version: get from sonic_version.yml, no dependency on grub (sonic-net#508)
  [topology]: Update t1-64-lag topology template to add AclInterfaces piece (sonic-net#505)
  Pull syncd-rpc with sonic version tag (sonic-net#507)
  Remove leading and trailing whitespaces when reading veos file (sonic-net#506)
  [lag_2] remove hard coded interval_count so it can be set by test (sonic-net#503)
  ptf_runner: Add one line comment for ptf_platform_dir (sonic-net#501)
  [testbed]: add port speed and fec configuration in sonic fanout (sonic-net#498)
  [typo]: Replace string t1-lag-64 with t1-64-lag (sonic-net#499)
  Adding sensor data for S6100 (sonic-net#496)
  [fib_test]: Add t1-64-lag src_ports in FIB test (sonic-net#497)
  Fix typo in acl test case name (sonic-net#494)
  Add one more Mellanox SKU string in everflow_tb_test script (sonic-net#495)
  Adding sensor data for Z9100 (sonic-net#492)
  [service_acl] Make test more robust and efficient (sonic-net#489)
  [dhcp relay test] adding more test scenarios (sonic-net#440)
  fix sanity check failed to recover (sonic-net#488)
  Fix PFC_WD test (sonic-net#479)
  add sonic fanout support (sonic-net#485)
  Update README.test.md
  Update README.test.md
  Fix table caption in testbed.csv and documentation (sonic-net#482)
  [test case] Add test: restart swss service (sonic-net#483)
  Add test case port toggle (sonic-net#484)
  [test infrastructure] allow overriding recover system actioin (sonic-net#480)
  [SNMP]add new SNMP counters tests to snmp.yml (sonic-net#477)
  add t0-52 topology (sonic-net#476)
  add command line option for creategraph.py (sonic-net#475)
  Add support for additional timestamp format (sonic-net#474)
  Ignore ansible output in extract_log (sonic-net#472)
  Add extract_logs action to concatenate logs after log rotate (sonic-net#471)
  Add ACL ICMP test (sonic-net#2) (sonic-net#465)
  [pfcwd]:add docker exec to avoid tty error (sonic-net#470)
  [test_tag]remove pfc_wd test tag from test by tag main yaml (sonic-net#469)
  [ansible]gather fact by default (sonic-net#468)
  [sensors]fix Dell S6000 sensors test fail (sonic-net#462)
  [vlan]improve test to wait a bit longer for config reload (sonic-net#463)
  one more place for hwsku of new Mellanox 2700 (sonic-net#464)
  [sensors]sensors test add new hwsku for Mellanox 2700 (sonic-net#461)
  [PFCWD]: test enhancement (sonic-net#456)
  [sensors] remove redundant sensor data definitin for sku MSN2100 (sonic-net#460)
  when call testcase by name, fetch all vms management info from testbed_facts (sonic-net#457)
  [acl test] error in ACL rule json file for destination ip (sonic-net#434)
  [sensor data] add sensor data for Mellanox MSN2410 and MSN2100 (sonic-net#449)
  Fix sku-sensors-data for 7050-QX32 (sonic-net#454)
  [deploy minigraph]add default enable BGP to deploy minigraph step (sonic-net#455)
  [upgrade]save bgp UP state after they are brough up (sonic-net#453)
  [reboot test] call sudo reboot to reboot dut (sonic-net#452)
  ...
rajshekhar-nexthop added a commit to nexthop-ai/sonic-mgmt that referenced this pull request Nov 20, 2025
…tart helper (sonic-net#562)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR

<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
• Add a StartLimitHit-safe restart helper and use it in the MACsec docker restart test to reduce flakiness.
• New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py:
    • Proactively clears systemd failure counters (systemctl reset-failed)
    • Attempts a restart, detects systemd rate limiting (StartLimitHit), applies a bounded backoff (default 35s), then runs start
    • Verifies that the target container is running within a timeout
• Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec")
Fixes # (issue)
• MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles.
• Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior.

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [x] Test case improvement


### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying a bounded backoff, and verifying that the container reaches the running state, thereby reducing test flakiness.
#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that:
    • Detects StartLimitHit before and after restart attempts
    • Runs systemctl reset-failed to clear counters
    • Applies a fixed backoff when rate-limited, then runs systemctl start
    • Verifies the container is running within a configurable timeout using the existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call.
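
The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the helper from tests/common/helpers/dut_utils.py: the function names, the `run` callable (which executes a shell command and returns its exit code and stdout), and the `docker inspect` state check are all hypothetical, and the `Result=start-limit-hit` check assumes a systemd version that reports that value.

```python
import time
from typing import Callable, Tuple

# Hypothetical runner type: executes a shell command, returns (exit code, stdout).
Runner = Callable[[str], Tuple[int, str]]


def start_limit_hit(run: Runner, service: str) -> bool:
    """Return True if systemd reports the unit failed due to start rate limiting."""
    _, out = run("systemctl show " + service + " -p Result")
    return "start-limit-hit" in out


def restart_with_startlimit_guard_sketch(
    run: Runner,
    service: str,
    backoff: float = 35,
    timeout: float = 60,
    interval: float = 1,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Restart `service`, back off once if rate-limited, then wait for its container."""
    # Clear failure counters so earlier restarts do not count against this one.
    run("systemctl reset-failed " + service)
    rc, _ = run("systemctl restart " + service)

    if rc != 0 and start_limit_hit(run, service):
        # Rate-limited: wait a fixed, bounded time, clear the counters, then start.
        sleep(backoff)
        run("systemctl reset-failed " + service)
        run("systemctl start " + service)

    # Poll until the container reports a running state or the timeout elapses.
    waited = 0.0
    while waited <= timeout:
        _, out = run("docker inspect -f '{{.State.Running}}' " + service)
        if out.strip() == "true":
            return True
        sleep(interval)
        waited += interval
    return False
```

Passing `run` and `sleep` in as parameters keeps the sketch independent of any particular host object and lets the rate-limited path be exercised without a real device.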

#### How did you verify/test it?
• Local validation in lab:
    • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled.
    • Repeated the restart sequence to emulate rate-limiting scenarios.
    • Verified that the helper reliably recovers from StartLimitHit and the container is running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->
rajshekhar-nexthop added a commit to nexthop-ai/sonic-mgmt that referenced this pull request Dec 15, 2025
…tart helper (sonic-net#562)

Signed-off-by: rajshekhar <[email protected]>
auspham pushed a commit to auspham/sonic-mgmt that referenced this pull request Feb 3, 2026
<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
Fixes # (issue)

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
 - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
Update the topology for upper T2.
Details:
 - Uplinks are connected to odd ports
 - Downlinks are connected to even ports
 - Uplinks are single or 4 port portchannels
 - Downlinks are not portchannels

#### How did you do it?
Update the t2_single_node_max topo
#### How did you verify/test it?

#### Any platform specific information?
No
#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->
arlakshm pushed a commit that referenced this pull request Feb 9, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (#401)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
Fixes # (issue)
Fixes the JSON syntax error below, which is seen only when the DUT is
prepared with the MACsec enable flag:
json.decoder.JSONDecodeError: Expecting property name enclosed in double
quotes: line 2 column 3 (char 4)
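
For reference, this class of error can be reproduced outside the template. The snippet below is only an illustration: the malformed string is a hypothetical input (a stray comma where the first key should be), not the actual output of golden_config_db_t2.j2, and the key name in it is made up for the example.

```python
import json

# Hypothetical rendered output: a stray comma where the first key should be.
rendered = '{\n  ,"EXAMPLE_TABLE": {}\n}'

try:
    json.loads(rendered)
    error = None
except json.JSONDecodeError as exc:
    # The decoder records where parsing failed: line, column, and character offset.
    error = exc
    print(exc.msg, exc.lineno, exc.colno, exc.pos)
```

The reported position points at the first character the decoder could not accept as the start of a property name, which is usually the quickest way to locate the offending spot in rendered template output.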
### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <[email protected]>

* ISS-2969:Generate golden config only if macsec_profile is defined (#420)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
A redundant override config is avoided, because no MACsec profile is set
in the prepare phase.
MACsec profile configuration is rendered differently in each phase:
PREPARE phase: uses generate_t2_golden_config_db() → template rendering
→ file-based config
RUN phase: uses set_macsec_profile() → direct sonic-db-cli commands →
immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <[email protected]>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (#562)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR

<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
• Add a StartLimitHit-safe restart helper and use it in the MACsec docker restart test to reduce flakiness.
• New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py:
    • Proactively clears systemd failure counters (systemctl reset-failed)
    • Attempts a restart, detects systemd rate limiting (StartLimitHit), applies a bounded backoff (default 35s), then runs start
    • Verifies that the target container is running within a timeout
• Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec")
Fixes # (issue)
• MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles.
• Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior.

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying a bounded backoff, and verifying that the container reaches the running state, thereby reducing test flakiness.
#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that:
    • Detects StartLimitHit before and after restart attempts
    • Runs systemctl reset-failed to clear counters
    • Applies a fixed backoff when rate-limited, then runs systemctl start
    • Verifies the container is running within a configurable timeout using the existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call.

#### How did you verify/test it?
• Local validation in lab:
    • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled.
    • Repeated the restart sequence to emulate rate-limiting scenarios.
    • Verified that the helper reliably recovers from StartLimitHit and the container is running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <[email protected]>

* NOS-3311: Fix MACsec test race and cleanup sync (#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
  * When MACsec is enabled and a profile is present, the fixture now
    first *attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

  * `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
    internally uses `check_appl_db` plus `wait_until(...)` to ensure
    APP/STATE DB MACsec SC/SA tables are populated (including
    `sak`/`auth_key`/PN relationships) before returning.
  * If the fixture is not defined (e.g., in other environments or test
    suites), the code falls back to the previous behavior.
  * After this synchronization point, if `is_macsec_configured(...)` is
    true, `load_all_macsec_info()` is called to populate `MACSEC_INFO`
    for all control links. Otherwise, the original `macsec_setup` flow
    is triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.
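
The optional-fixture fallback described above can be sketched as below. This is an illustration under stated assumptions, not the code in `tests/common/macsec/__init__.py`: `resolve_optional_fixture` is a hypothetical helper name, and the `missing_exc` parameter exists only so the sketch runs without pytest installed. In a real pytest run, a caller would pass `pytest.FixtureLookupError` (a `LookupError` subclass), which is the narrower exception named in the review follow-up.

```python
def resolve_optional_fixture(request, name, missing_exc=LookupError):
    """Try to resolve a fixture that may not exist in every test suite.

    Returns (found, value). Only `missing_exc` is swallowed, so real
    failures raised inside the fixture still propagate to the caller.
    """
    try:
        return True, request.getfixturevalue(name)
    except missing_exc:
        # Fixture is not defined here: fall back to the old behavior.
        return False, None


class FakeRequest:
    """Stand-in for pytest's `request` object (hypothetical, demo only)."""

    def __init__(self, fixtures):
        self._fixtures = fixtures

    def getfixturevalue(self, name):
        if name not in self._fixtures:
            raise LookupError("fixture {!r} not found".format(name))
        return self._fixtures[name]
```

Catching only the lookup error avoids the main drawback of `except Exception: pass`, which would also hide a genuine timeout inside `wait_mka_establish`.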

MACsec DB cleanup

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
  it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
  * need to assert that all associated MACsec entries (port, SC, SA)
    have been automatically removed from Redis before proceeding.
* Behavior:
  * For EOS neighbors, it is a no-op: they do not use Redis DBs, so the
    function returns `True` immediately.
  * For SONiC hosts, it:
    * Polls both `APPL_DB` and `STATE_DB` using
      `redis_get_keys_all_asics` with patterns `MACSEC_*:{interface}*`
      (APPL_DB) and `MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
    * Returns `True` as soon as all such keys are gone for the given
      interfaces, logging total time taken.
    * If the `timeout` is exceeded, logs a warning, prints a summary of
      remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone
  from Redis” instead of having ad hoc sleeps or partial checks in
  individual tests.
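
The polling behavior described above might look roughly like the sketch below. It is not the sonic-mgmt implementation: `get_keys(db, pattern)` is a hypothetical stand-in for `redis_get_keys_all_asics`, the `interval`, `sleep`, and `clock` parameters are added for illustration, and the function returns the leftover keys alongside the boolean instead of logging them.

```python
import time


def wait_for_cleanup(get_keys, interfaces, timeout=90, interval=2,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll until no MACsec keys remain for `interfaces` or time runs out.

    get_keys(db, pattern) -> list of keys is a hypothetical hook.
    Returns (True, {}) on success, or (False, {db: [keys]}) on timeout.
    """
    # Key patterns per DB, as given in the PR description.
    patterns = {"APPL_DB": "MACSEC_*:{}*", "STATE_DB": "MACSEC_*|{}*"}
    deadline = clock() + timeout

    while True:
        remaining = {}
        for db, pattern in patterns.items():
            keys = []
            for intf in interfaces:
                keys.extend(get_keys(db, pattern.format(intf)))
            if keys:
                remaining[db] = keys

        if not remaining:
            return True, {}
        if clock() >= deadline:
            return False, remaining
        sleep(interval)
```

Checking the deadline only after a poll guarantees at least one look at Redis even with a zero timeout.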

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
  are asynchronous relative to the tests. It is valid for
  `MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
  and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
  * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
    valid `encoding_an`, and
  * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has
    a `sak` field.

  Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
  window where the SA row does not yet exist and crash with
  `KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
  we ensure those pre-loads happen only after the expected MACsec state
  has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
  lag behind the test’s expectations. Having a dedicated, reusable
  `wait_for_macsec_cleanup` helper lets future tests explicitly wait for
  cleanup completion instead of guessing with sleeps.
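
The two assumptions made by `get_macsec_attr()` can be expressed as a small readiness check. This is only a sketch: `egress_sa_ready` is a hypothetical name, and the tables are modeled as plain dictionaries keyed by tuples rather than real Redis keys, so no key format is implied.

```python
def egress_sa_ready(sc_table, sa_table, port, sci):
    """Return True once the egress SC and its active SA are programmed.

    sc_table maps (port, sci) -> fields; sa_table maps (port, sci, an)
    -> fields. Both are in-memory stand-ins for the APP_DB tables.
    """
    sc = sc_table.get((port, sci))
    if not sc or "encoding_an" not in sc:
        # SC row missing or not yet carrying the active AN.
        return False

    sa = sa_table.get((port, sci, sc["encoding_an"]))
    # Reading sa["sak"] before this holds is what raises KeyError('sak').
    return bool(sa) and "sak" in sa
```

A caller could poll a check like this before reading `sak`, which is the ordering that waiting on `wait_mka_establish` is meant to provide.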

* Verified that the new fixtures and helpers are imported and wired
  correctly:
  * `load_macsec_info` remains `autouse=True` at module scope, so
    existing MACsec tests automatically benefit from the additional
    synchronization.
  * `wait_for_macsec_cleanup` is exported in `__all__` for use by future
    MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a
  MACsec-enabled testbed (e.g., humm120) to confirm:
  * MACsec sessions establish successfully and APP/STATE DB contain
    expected MACsec entries before `load_all_macsec_info` is invoked.
  * Disabling MACsec followed by `wait_for_macsec_cleanup` results in
    all `MACSEC_*` keys being removed from APPL/STATE DB within the
    timeout window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <[email protected]>

* taking care of review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <[email protected]>

---------

Signed-off-by: rajshekhar <[email protected]>
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Feb 9, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

Signed-off-by: rajshekhar <[email protected]>
nnelluri-cisco pushed a commit to nnelluri-cisco/sonic-mgmt that referenced this pull request Feb 12, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

Signed-off-by: rajshekhar <[email protected]>
Signed-off-by: nnelluri-cisco <[email protected]>
anilal-amd pushed a commit to anilal-amd/anilal-forked-sonic-mgmt that referenced this pull request Feb 19, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

### Description of PR

Summary:
Fixes # (issue)
Fixes the JSON syntax error shown below. It is seen only when the DUT is
prepared with the MACsec enable flag.
json.decoder.JSONDecodeError: Expecting property name enclosed in double
quotes: line 2 column 3 (char 4)

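
For illustration, the snippet below shows one way an error at that position can arise. The rendered text is hypothetical (the actual template output is not shown in this PR); it assumes a stray leading comma on the second line, such as a template conditional might leave behind, and `MACSEC_PROFILE` is used only as a placeholder key.

```python
import json

# Hypothetical rendered output: a stray comma where the first key belongs.
BROKEN = '{\n  ,"MACSEC_PROFILE": {}\n}'
FIXED = '{\n  "MACSEC_PROFILE": {}\n}'


def describe_json_error(text):
    """Return (msg, line, column, char offset) or None if `text` parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.msg, err.lineno, err.colno, err.pos
    return None


if __name__ == "__main__":
    print(describe_json_error(BROKEN))
    print(describe_json_error(FIXED))
```

Parsing the rendered template with `json.loads` in a quick check like this is an inexpensive way to catch such syntax slips before the config reaches the DUT.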
### Type of change

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation

Signed-off-by: rajshekhar <[email protected]>

* ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420)

### Description of PR
Avoids generating a redundant override config, since no macsec profile is
set in the prepare phase.
MACsec profile configuration is rendered differently in each phase:
- PREPARE phase: uses generate_t2_golden_config_db() → template rendering
  → file-based config
- RUN phase: uses set_macsec_profile() → direct sonic-db-cli commands →
  immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation

Signed-off-by: rajshekhar <[email protected]>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

### Description of PR

Summary:
- Add a StartLimitHit-safe restart helper and use it in the MACsec docker
  restart test to reduce flakiness.
- New helper restart_service_with_startlimit_guard() in
  tests/common/helpers/dut_utils.py:
  - Proactively clears systemd failure counters (systemctl reset-failed)
  - Attempts restart, detects systemd rate limiting (StartLimitHit),
    applies bounded backoff (default 35s), then start
  - Verifies the target container becomes running within a timeout
- Update tests/macsec/test_docker_restart.py to use the new helper instead
  of duthost.restart_service("macsec")

Fixes # (issue)
- MACsec docker restart tests can intermittently fail due to systemd rate
  limiting after repeated restarts during teardown/restart cycles.
- Guarding against StartLimitHit with a clear backoff-and-start flow
  improves test reliability without changing device behavior.

### Type of change

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
- MACsec docker restart tests can intermittently fail when systemd
  enforces StartLimitHit due to rapid restart attempts during
  teardown/restart cycles.
- This PR makes the restart path resilient to StartLimitHit by
  proactively clearing counters, applying bounded backoff, and verifying
  the container reaches the running state, thereby reducing test
  flakiness.

#### How did you do it?
- Added a helper restart_service_with_startlimit_guard() in
  tests/common/helpers/dut_utils.py that:
  - Detects StartLimitHit pre/post restart attempts
  - Runs systemctl reset-failed to clear counters
  - Applies a fixed backoff when rate-limited, then systemctl start
  - Verifies the container is running within a configurable timeout using
    existing wait_until/state checks
- Updated tests/macsec/test_docker_restart.py to use the helper instead
  of a direct duthost.restart_service("macsec") call.
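
The flow described above can be sketched roughly as follows. This is not the helper from tests/common/helpers/dut_utils.py: the function name, its parameters, the injected `run`/`is_running` callables, and the use of `systemctl show -p Result` to detect rate limiting are all assumptions made so the sketch runs without a DUT.

```python
import time


def restart_with_startlimit_guard(run, is_running, service, backoff_sec=35,
                                  timeout_sec=60, poll_sec=1, sleep=time.sleep):
    """Restart `service`, backing off once if systemd reports rate limiting.

    `run(cmd)` returns the command's stdout; `is_running()` reports whether
    the target container is up. Both are injected (hypothetical) hooks.
    """
    def rate_limited():
        # Assumed detection: "Result=start-limit-hit" in `systemctl show`.
        return "start-limit-hit" in run(f"systemctl show {service} -p Result")

    run(f"systemctl reset-failed {service}")  # clear failure counters first
    run(f"systemctl restart {service}")
    if rate_limited():
        sleep(backoff_sec)                    # fixed, bounded backoff
        run(f"systemctl reset-failed {service}")
        run(f"systemctl start {service}")

    waited = 0
    while waited <= timeout_sec:              # verify the container comes up
        if is_running():
            return True
        sleep(poll_sec)
        waited += poll_sec
    return False


# Demo with fakes: the first status query reports rate limiting.
calls, naps = [], []
results = iter(["Result=start-limit-hit"])


def fake_run(cmd):
    calls.append(cmd)
    if " show " in cmd:
        return next(results, "Result=success")
    return ""


ok = restart_with_startlimit_guard(fake_run, lambda: True, "macsec",
                                   sleep=naps.append)
```

Injecting `sleep` keeps the backoff observable in a unit test without actually waiting 35 seconds.
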

#### How did you verify/test it?
- Local validation in lab:
  - Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker
    with MACsec enabled.
  - Repeated the restart sequence to emulate rate limiting scenarios.
  - Verified the helper reliably recovers from StartLimitHit and the
    container becomes running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation

Signed-off-by: rajshekhar <[email protected]>

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
  * When MACsec is enabled and a profile is present, the fixture now first
    *attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

  * `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
    internally uses `check_appl_db` plus `wait_until(...)` to ensure
    APP/STATE DB MACsec SC/SA tables are populated (including
    `sak`/`auth_key`/PN relationships) before returning.
  * If the fixture is not defined (e.g., in other environments or test
    suites), the code falls back to the previous behavior.
  * After this synchronization point, if `is_macsec_configured(...)` is
    true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
    all control links. Otherwise, the original `macsec_setup` flow is
    triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.

cleanup

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
  * need to assert that all associated MACsec entries (port, SC, SA) have
    been automatically removed from Redis before proceeding.
* Behavior:
  * For EOS neighbors, it is a no-op: they do not use Redis DBs and the
    function returns `True` immediately.
  * For SONiC hosts, it:
    * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics`
      with patterns `MACSEC_*:{interface}*` (APPL_DB) and
      `MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
    * Returns `True` as soon as all such keys are gone for the given
      interfaces, logging total time taken.
    * If the `timeout` is exceeded, logs a warning, prints a summary of
      remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone
from Redis” instead of having ad hoc sleeps or partial checks in
individual tests.
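
A minimal sketch of the polling behavior described above, under stated assumptions: `wait_for_keys_gone` and the injected `list_keys(db, pattern)` callable are hypothetical stand-ins for `wait_for_macsec_cleanup` and `redis_get_keys_all_asics`, whose real signatures are not shown here. Only the two key patterns and the 90-second default come from the description.

```python
import fnmatch
import time


def wait_for_keys_gone(list_keys, interfaces, timeout=90, poll=1,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until no MACsec keys remain for `interfaces`; True on success."""
    patterns = {"APPL_DB": "MACSEC_*:{}*", "STATE_DB": "MACSEC_*|{}*"}
    deadline = clock() + timeout
    while True:
        # Aggregate any remaining keys per DB across all interfaces.
        remaining = {
            db: [key for intf in interfaces
                 for key in list_keys(db, pat.format(intf))]
            for db, pat in patterns.items()
        }
        if not any(remaining.values()):
            return True
        if clock() >= deadline:
            return False
        sleep(poll)


# Demo with an in-memory fake: the keys disappear after the first poll.
store = {"APPL_DB": ["MACSEC_PORT_TABLE:Ethernet0"],
         "STATE_DB": ["MACSEC_PORT_TABLE|Ethernet0"]}
polls = []


def fake_list_keys(db, pattern):
    return [key for key in store[db] if fnmatch.fnmatchcase(key, pattern)]


def fake_sleep(seconds):
    polls.append(seconds)
    for keys in store.values():
        keys.clear()  # simulate background cleanup finishing


cleaned = wait_for_keys_gone(fake_list_keys, ["Ethernet0"], sleep=fake_sleep)
stuck = wait_for_keys_gone(lambda db, pat: ["MACSEC_PORT_TABLE:Ethernet4"],
                           ["Ethernet4"], timeout=0, sleep=lambda s: None)
```

Returning a boolean instead of raising leaves the decision to fail or retry with the calling test.
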

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
are asynchronous relative to the tests. It is valid for
`MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
  * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
    valid `encoding_an`, and
  * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a
    `sak` field.

  Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
  window where the SA row does not yet exist and crash with
  `KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
we ensure those pre-loads happen only after the expected MACsec state
has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
lag behind the test’s expectations. Having a dedicated, reusable
`wait_for_macsec_cleanup` helper lets future tests explicitly wait for
cleanup completion instead of guessing with sleeps.
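
To make the race window concrete, here is a small sketch that models APP_DB as a plain dictionary. The `read_sak` function, the colon-separated key layout, and the sample SCI value are illustrative assumptions; only the table names and the `encoding_an`/`sak` fields come from the description above.

```python
def read_sak(appl_db, port, sci):
    """Return the egress SAK, or None while the SC/SA rows are incomplete."""
    sc = appl_db.get(f"MACSEC_EGRESS_SC_TABLE:{port}:{sci}", {})
    an = sc.get("encoding_an")
    if an is None:
        return None  # SC row not programmed yet
    sa = appl_db.get(f"MACSEC_EGRESS_SA_TABLE:{port}:{sci}:{an}", {})
    # Indexing sa["sak"] here is what would raise KeyError('sak') in the
    # window where the SA row has not been written yet.
    return sa.get("sak")


# State during programming: the SC row exists but the SA row does not.
mid_programming = {
    "MACSEC_EGRESS_SC_TABLE:Ethernet0:0011": {"encoding_an": "0"},
}
# State after MKA establishment: the SA row with its SAK is present.
programmed = dict(mid_programming)
programmed["MACSEC_EGRESS_SA_TABLE:Ethernet0:0011:0"] = {"sak": "ab" * 16}

early = read_sak(mid_programming, "Ethernet0", "0011")
late = read_sak(programmed, "Ethernet0", "0011")
```

A reader that returns `None` for incomplete state can be polled until it yields a value, which is the kind of guarantee the `wait_mka_establish` synchronization is meant to provide.
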

* Verified that the new fixtures and helpers are imported and wired
correctly:
  * `load_macsec_info` remains `autouse=True` at module scope, so existing
    MACsec tests automatically benefit from the additional synchronization.
  * `wait_for_macsec_cleanup` is exported in `__all__` for use by future
    MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a
MACsec-enabled testbed (e.g., humm120) to confirm:
  * MACsec sessions establish successfully and APP/STATE DB contain
    expected MACsec entries before `load_all_macsec_info` is invoked.
  * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all
    MACSEC_* keys being removed from APPL/STATE DB within the timeout
    window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <[email protected]>

* taking care of review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <[email protected]>

---------

Signed-off-by: rajshekhar <[email protected]>
Signed-off-by: Zhuohui Tan <[email protected]>
ravaliyel pushed a commit to ravaliyel/sonic-mgmt that referenced this pull request Mar 12, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

* ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420)

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

* taking care of review comments


---------

Signed-off-by: rajshekhar <[email protected]>
Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <[email protected]>
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Mar 17, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

* ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420)

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
  * When MACsec is enabled and a profile is present, the fixture now first
    *attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

  * `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
    internally uses `check_appl_db` plus `wait_until(...)` to ensure
    APP/STATE DB MACsec SC/SA tables are populated (including
    `sak`/`auth_key`/PN relationships) before returning.
  * If the fixture is not defined (e.g., in other environments or test
    suites), the code falls back to the previous behavior.
  * After this synchronization point, if `is_macsec_configured(...)` is
    true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
    all control links. Otherwise, the original `macsec_setup` flow is
    triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.

cleanup

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
  * need to assert that all associated MACsec entries (port, SC, SA) have
    been automatically removed from Redis before proceeding.
* Behavior:
  * For EOS neighbors, it is a no-op: they do not use Redis DBs and the
    function returns `True` immediately.
  * For SONiC hosts, it:
    * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics`
      with patterns `MACSEC_*:{interface}*` (APPL_DB) and
      `MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
    * Returns `True` as soon as all such keys are gone for the given
      interfaces, logging total time taken.
    * If the `timeout` is exceeded, logs a warning, prints a summary of
      remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone
  from Redis” instead of having ad hoc sleeps or partial checks in
  individual tests.
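
The polling behavior described above can be sketched roughly as follows. This is a minimal illustration, not the helper's actual code: `wait_for_cleanup_sketch` and the `get_keys(db, pattern)` callable are hypothetical stand-ins (the real helper uses `redis_get_keys_all_asics`, whose signature is not shown here), and the EOS no-op branch and logging are omitted.

```python
import time
from typing import Callable, Dict, Iterable, List


def wait_for_cleanup_sketch(
    get_keys: Callable[[str, str], List[str]],  # hypothetical: (db_name, key_pattern) -> matching keys
    interfaces: Iterable[str],
    timeout: float = 90.0,
    poll: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until no MACSEC_* keys remain for the interfaces, or the timeout expires."""
    interfaces = list(interfaces)
    deadline = time.monotonic() + timeout
    while True:
        remaining: Dict[str, List[str]] = {}
        # APPL_DB keys use ':' as the separator, STATE_DB keys use '|'.
        for db, sep in (("APPL_DB", ":"), ("STATE_DB", "|")):
            keys = [
                key
                for intf in interfaces
                for key in get_keys(db, f"MACSEC_*{sep}{intf}*")
            ]
            if keys:
                remaining[db] = keys
        if not remaining:
            return True  # every MACsec entry for these interfaces is gone
        if time.monotonic() >= deadline:
            return False  # timed out with entries still present
        sleep(poll)
```

A caller would disable MACsec first and then assert on the return value, rather than sleeping for a fixed interval.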

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
are asynchronous relative to the tests. It is valid for
`MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
  * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
    valid `encoding_an`, and
  * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a
    `sak` field.

  Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
  window where the SA row does not yet exist and crash with
  `KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
we ensure those pre-loads happen only after the expected MACsec state
has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
lag behind the test’s expectations. Having a dedicated, reusable
`wait_for_macsec_cleanup` helper lets future tests explicitly wait for
cleanup completion instead of guessing with sleeps.

* Verified that the new fixtures and helpers are imported and wired
  correctly:
  * `load_macsec_info` remains `autouse=True` at module scope, so existing
    MACsec tests automatically benefit from the additional synchronization.
  * `wait_for_macsec_cleanup` is exported in `__all__` for use by future
    MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a
  MACsec-enabled testbed (e.g., humm120) to confirm:
  * MACsec sessions establish successfully and APP/STATE DB contain
    expected MACsec entries before `load_all_macsec_info` is invoked.
  * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all
    MACSEC_* keys being removed from APPL/STATE DB within the timeout
    window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <[email protected]>

* taking care of review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <[email protected]>

---------

Signed-off-by: rajshekhar <[email protected]>
Signed-off-by: Abhishek <[email protected]>
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Mar 19, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
Fixes # (issue)
Fixes the JSON syntax error below, which is seen only when the DUT is
prepared with the MACsec enable flag:
`json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 3 (char 4)`
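
For context, this error is what Python's `json` module raises when the first token inside an object is not a quoted key. The snippet below is only an illustration of that class of failure: the `broken` string is a made-up input (a stray leading comma on the second line), not the actual content rendered by `golden_config_db_t2.j2`, whose defect is not shown in this PR description.

```python
import json

# Hypothetical rendered output: the first token inside the object is a comma,
# so the parser finds no quoted property name where it expects one.
broken = '{\n  ,"MACSEC_PROFILE": {}\n}'

try:
    json.loads(broken)
    error = None
except json.JSONDecodeError as exc:
    error = exc
    # exc.msg, exc.lineno, exc.colno and exc.pos locate the offending token.
    print(f"{exc.msg}: line {exc.lineno} column {exc.colno} (char {exc.pos})")
```

Loading the rendered template with `json.loads` before deploying it is a cheap way to catch this kind of mistake.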
### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <[email protected]>

* ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
The redundant override config is now avoided, because no MACsec profile
is set in the prepare phase.
MACsec profile configuration is rendered differently in each phase:
- PREPARE phase: uses generate_t2_golden_config_db() → template rendering → file-based config
- RUN phase: uses set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <[email protected]>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR

<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
• Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness
• New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py:
  • Proactively clears systemd failure counters (systemctl reset-failed)
  • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start
  • Verifies the target container becomes running within a timeout
• Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec")

Fixes # (issue)
• MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles.
• Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior.

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd
  enforces StartLimitHit due to rapid restart attempts during
  teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by
  proactively clearing counters, applying bounded backoff, and verifying
  the container reaches the running state, thereby reducing test flakiness.
#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in
  tests/common/helpers/dut_utils.py that:
  • Detects StartLimitHit pre/post restart attempts
  • Runs systemctl reset-failed to clear counters
  • Applies a fixed backoff when rate-limited, then systemctl start
  • Verifies the container is running within a configurable timeout using
    existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead
  of a direct duthost.restart_service("macsec") call.
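
The flow above can be sketched roughly as follows. This is an illustration under stated assumptions, not the helper's actual code: `restart_with_startlimit_guard_sketch`, the `run(cmd)` callable (executes a command on the DUT and returns its stdout without raising) and `is_running()` are hypothetical, and detecting the rate limit by looking for `start-limit-hit` in `systemctl show -p Result` output is one plausible approach rather than the method confirmed by this PR.

```python
import time
from typing import Callable


def restart_with_startlimit_guard_sketch(
    run: Callable[[str], str],          # hypothetical: run a command, return stdout
    is_running: Callable[[], bool],     # hypothetical: True once the container is up
    service: str = "macsec",
    backoff_sec: float = 35.0,
    timeout_sec: float = 60.0,
    poll_sec: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Restart a service, backing off once if systemd reports rate limiting."""

    def rate_limited() -> bool:
        # Assumption: systemd exposes the condition as Result=start-limit-hit.
        return "start-limit-hit" in run(f"systemctl show {service} -p Result")

    run(f"systemctl reset-failed {service}")  # clear failure counters first
    run(f"systemctl restart {service}")
    if rate_limited():
        sleep(backoff_sec)  # wait out the rate-limit window
        run(f"systemctl reset-failed {service}")
        run(f"systemctl start {service}")

    # Verify the container reaches the running state within the timeout.
    waited = 0.0
    while waited <= timeout_sec:
        if is_running():
            return True
        sleep(poll_sec)
        waited += poll_sec
    return False
```

Passing `run`, `is_running` and `sleep` in as callables keeps the sketch testable without a device; the real helper works against `duthost` and the existing `wait_until` utilities.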

#### How did you verify/test it?
• Local validation in lab:
  • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled.
  • Repeated the restart sequence to emulate rate limiting scenarios.
  • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <[email protected]>

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
  * When MACsec is enabled and a profile is present, the fixture now first
    *attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

  * `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
    internally uses `check_appl_db` plus `wait_until(...)` to ensure
    APP/STATE DB MACsec SC/SA tables are populated (including
    `sak`/`auth_key`/PN relationships) before returning.
  * If the fixture is not defined (e.g., in other environments or test
    suites), the code falls back to the previous behavior.
  * After this synchronization point, if `is_macsec_configured(...)` is
    true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
    all control links. Otherwise, the original `macsec_setup` flow is
    triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.

cleanup

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
  * need to assert that all associated MACsec entries (port, SC, SA) have
    been automatically removed from Redis before proceeding.
* Behavior:
  * For EOS neighbors, it is a no-op: they do not use Redis DBs and the
    function returns `True` immediately.
  * For SONiC hosts, it:
    * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics`
      with patterns `MACSEC_*:{interface}*` (APPL_DB) and
      `MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
    * Returns `True` as soon as all such keys are gone for the given
      interfaces, logging total time taken.
    * If the `timeout` is exceeded, logs a warning, prints a summary of
      remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone
  from Redis” instead of having ad hoc sleeps or partial checks in
  individual tests.

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
are asynchronous relative to the tests. It is valid for
`MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
  * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
    valid `encoding_an`, and
  * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a
    `sak` field.

  Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
  window where the SA row does not yet exist and crash with
  `KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
we ensure those pre-loads happen only after the expected MACsec state
has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
lag behind the test’s expectations. Having a dedicated, reusable
`wait_for_macsec_cleanup` helper lets future tests explicitly wait for
cleanup completion instead of guessing with sleeps.
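
The readiness condition that `get_macsec_attr()` depends on can be expressed as a small predicate. This is a sketch only: `egress_sa_ready` is a hypothetical name, APP_DB is modeled as a plain dict of key to field map, and the `TABLE:port:sci[:an]` key layout is an assumption made for illustration.

```python
from typing import Mapping


def egress_sa_ready(app_db: Mapping[str, Mapping[str, str]], port: str, sci: str) -> bool:
    """Return True only when the egress SC names an AN and that SA row has a `sak`."""
    sc = app_db.get(f"MACSEC_EGRESS_SC_TABLE:{port}:{sci}")
    if not sc or "encoding_an" not in sc:
        return False  # SC row not programmed yet
    sa = app_db.get(f"MACSEC_EGRESS_SA_TABLE:{port}:{sci}:{sc['encoding_an']}")
    # The SA row for the active AN must exist and carry the key material.
    return bool(sa) and "sak" in sa


# The race window: the SC row exists, but the SA row for the active AN does not yet.
partial = {"MACSEC_EGRESS_SC_TABLE:Ethernet0:0011223344550001": {"encoding_an": "0"}}
complete = dict(partial)
complete["MACSEC_EGRESS_SA_TABLE:Ethernet0:0011223344550001:0"] = {"sak": "00" * 16}
```

Polling a predicate like this until it returns `True` is the kind of check that `wait_mka_establish` provides before `MACSEC_INFO` is pre-loaded.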

* Verified that the new fixtures and helpers are imported and wired
  correctly:
  * `load_macsec_info` remains `autouse=True` at module scope, so existing
    MACsec tests automatically benefit from the additional synchronization.
  * `wait_for_macsec_cleanup` is exported in `__all__` for use by future
    MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a
  MACsec-enabled testbed (e.g., humm120) to confirm:
  * MACsec sessions establish successfully and APP/STATE DB contain
    expected MACsec entries before `load_all_macsec_info` is invoked.
  * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all
    MACSEC_* keys being removed from APPL/STATE DB within the timeout
    window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <[email protected]>

* taking care of review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <[email protected]>

---------

Signed-off-by: rajshekhar <[email protected]>