Skip to content

Fix MACsec test reliability and configuration issues#21372

Merged
arlakshm merged 5 commits intosonic-net:masterfrom
nexthop-ai:rajshekhar-nexthop.macsec_test_fixes
Feb 9, 2026
Merged

Fix MACsec test reliability and configuration issues#21372
arlakshm merged 5 commits intosonic-net:masterfrom
nexthop-ai:rajshekhar-nexthop.macsec_test_fixes

Conversation

@rajshekhar-nexthop
Copy link
Contributor

@rajshekhar-nexthop rajshekhar-nexthop commented Nov 20, 2025

Description of PR

This PR addresses three issues related to MACsec testing and configuration in sonic-mgmt:

  1. Fix JSON syntax error in T2 golden config template - Corrects malformed JSON in golden_config_db_t2.j2 that caused parsing errors when MACsec was enabled
  2. Avoid redundant MACsec profile configuration - Prevents generating MACsec golden config when no profile is defined in the prepare phase
  3. Improve MACsec docker restart test reliability - Adds systemd StartLimitHit protection to prevent test flakiness during rapid restart cycles
  4. Fix MACsec test race and clean-up sync- Ensure MKA establishment before pre-loading MACsec session info for tests and provide a helper to wait for MACsec DB cleanup after disabling MACsec

Summary:
These changes improve MACsec test stability and fix configuration generation issues that occur when MACsec is enabled on T2 testbeds.

Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
    • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

Issue 1 - JSON Syntax Error:

  • When preparing DUT with MACsec enabled, the golden config template generated invalid JSON causing: json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes

Issue 2 - Redundant Configuration:

  • MACsec profile configuration was being generated during PREPARE phase even when no profile was defined, leading to redundant override configs
  • The actual MACsec profile should only be set during RUN phase via set_macsec_profile() using direct sonic-db-cli commands

Issue 3 - Test Flakiness:

  • MACsec docker restart tests intermittently failed when systemd enforced StartLimitHit due to rapid restart attempts during teardown/restart cycles
  • No mechanism existed to handle systemd rate limiting gracefully

Issue 4 - Test Race:

  • Tests fail with exceptions like KeyError('sak') when the MACsec egress SA row does not yet exist, even though MACSEC_PORT_TABLE already shows enable_encrypt="true".
  • There are also cleanup races where tests check for removal of MACsec DB entries before the background cleanup logic has finished

How did you do it?

Fix 1 - JSON Template:

  • Corrected JSON syntax in ansible/templates/golden_config_db_t2.j2 to ensure valid JSON output

Fix 2 - Conditional Config Generation:

  • Modified ansible/config_sonic_basedon_testbed.yml to only generate MACsec golden config when macsec_profile is actually defined
  • This aligns with the two-phase approach:
    • PREPARE phase: Uses generate_t2_golden_config_db() → template rendering → file-based config
    • RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update

Fix 3 - StartLimitHit Guard:

  • Added new helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that:
    • Proactively clears systemd failure counters (systemctl reset-failed)
    • Detects systemd rate limiting (StartLimitHit)
    • Applies bounded backoff (default 35s) when rate-limited
    • Executes systemctl start after backoff
    • Verifies the target container becomes running within timeout
  • Updated tests/macsec/test_docker_restart.py to use the new helper instead of direct duthost.restart_service("macsec")

Fix 4 -Wait for MKA and Cleanup:

  • By tying load_macsec_info to wait_mka_establish where available, we ensure those pre-loads happen only after the expected MACsec state has been fully written to Redis
  • Similarly, when disabling MACsec, asynchronous background cleanup can lag behind the test’s expectations. Having a dedicated, reusable wait_for_macsec_cleanup helper lets future tests explicitly wait for cleanup completion instead of guessing with sleeps

How did you verify/test it?

Fix 1 & 2:

  • Verified T2 testbed preparation with MACsec enabled completes without JSON parsing errors
  • Confirmed no redundant MACsec profile configuration is generated when profile is undefined
  • Validated that MACsec profile is correctly applied during RUN phase via set_macsec_profile()

Fix 3:

  • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled in lab environment
  • Repeated restart sequences to emulate systemd rate limiting scenarios
  • Verified the helper reliably recovers from StartLimitHit and container reaches running state within timeout
  • Confirmed reduced test flakiness in repeated test runs

Fix 4:
Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed to confirm:

  • MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before load_all_macsec_info is invoked.
  • Disabling MACsec followed by wait_for_macsec_cleanup results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window.

Any platform specific information?

These fixes apply to all platforms that support MACsec, particularly affecting T2 topology testbeds.

Supported testbed topology if it's a new test case?

N/A - These are improvements to existing test infrastructure and configuration templates.

Documentation

No documentation updates required as these are bug fixes and internal test improvements that don't change user-facing behavior or add new features.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rajshekhar-nexthop rajshekhar-nexthop force-pushed the rajshekhar-nexthop.macsec_test_fixes branch from 843ecc1 to 8048cad Compare November 20, 2025 05:18
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rajshekhar-nexthop rajshekhar-nexthop force-pushed the rajshekhar-nexthop.macsec_test_fixes branch from 8048cad to d917371 Compare November 20, 2025 06:17
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rajshekhar-nexthop rajshekhar-nexthop marked this pull request as ready for review November 24, 2025 04:35
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@rajshekhar-nexthop rajshekhar-nexthop force-pushed the rajshekhar-nexthop.macsec_test_fixes branch from 35697aa to 7c45afd Compare December 15, 2025 07:56
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

…t#401)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
Fixes # (issue)
Fixes below json syntax error. It is seen only when dut is prepared with
macsec enable flag.
json.decoder.JSONDecodeError: Expecting property name enclosed in double
quotes: line 2 column 3 (char 4)
### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
…nic-net#420)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
Redundant override config is avoided as no macsec profile is set in the
prepare phase.
Below are details how macsec profile configurations are rendered:
PREPARE phase: Uses generate_t2_golden_config_db() → template rendering
→ file-based config
RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands →
immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
…tart helper (sonic-net#562)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR

<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
 • Add a StartLimitHit-safe restart helper and use it in
       MACsec docker restart test to reduce flakiness
     • New helper restart_service_with_startlimit_guard() in
       tests/common/helpers/dut_utils.py:
        • Proactively clears systemd failure counters (systemctl
          reset-failed)
        • Attempts restart, detects systemd rate limiting
          (StartLimitHit), applies bounded backoff (default 35s),
          then start
        • Verifies the target container becomes running within a
          timeout
     • Update tests/macsec/test_docker_restart.py to use the new
       helper instead of duthost.restart_service("macsec")
Fixes # (issue)
MACsec docker restart tests can intermittently fail due to
       systemd rate limiting after repeated restarts during
       teardown/restart cycles.
     • Guarding against StartLimitHit with a clear
       backoff-and-start flow improves test reliability without
       changing device behavior.

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd
enforces StartLimitHit due to rapid restart attempts during
teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by
proactively clearing counters, applying bounded backoff, and verifying
the container reaches
       the running state, thereby reducing test flakiness.
#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in
tests/common/helpers/dut_utils.py that:
        • Detects StartLimitHit pre/post restart attempts
        • Runs systemctl reset-failed to clear counters
• Applies a fixed backoff when rate-limited, then systemctl start
• Verifies the container is running within a configurable timeout using
existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead
of a direct duthost.restart_service("macsec") call.

#### How did you verify/test it?
 • Local validation in lab:
• Executed
tests/macsec/test_docker_restart.py::test_restart_macsec_docker with
MACsec enabled.
• Repeated the restart sequence to emulate rate limiting scenarios.
• Verified the helper reliably recovers from StartLimitHit and the
container becomes running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
@arlakshm
Copy link
Contributor

@liamkearney-msft @judyjoseph @Pterosaur please help signoff on this PR

Copy link
Contributor

@liamkearney-msft liamkearney-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few comments regarding minor code changes, but otherwise these changes look good - thanks for the enhancements

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
@mssonicbld
Copy link
Collaborator

/azp run

@github-actions github-actions bot requested review from r12f, sdszhang and wangxin January 5, 2026 04:32
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

@liamkearney-msft liamkearney-msft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the changes - looks great

@rajshekhar-nexthop
Copy link
Contributor Author

rajshekhar-nexthop commented Jan 6, 2026

Hi @wangxin , @yxieca , Please check and merge this PR. Thank you @liamkearney-msft for the code review

@arlakshm
Copy link
Contributor

@judyjoseph, @tjchadaga can you please help signoff

@rajshekhar-nexthop
Copy link
Contributor Author

Can we get this PR merged? Thanks

@rajshekhar-nexthop
Copy link
Contributor Author

Gentle reminder. Please merge this PR. Thanks

@arlakshm arlakshm merged commit 75e4192 into sonic-net:master Feb 9, 2026
21 checks passed
mssonicbld pushed a commit to mssonicbld/sonic-mgmt that referenced this pull request Feb 9, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
Fixes # (issue)
Fixes below json syntax error. It is seen only when dut is prepared with
macsec enable flag.
json.decoder.JSONDecodeError: Expecting property name enclosed in double
quotes: line 2 column 3 (char 4)
### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
Redundant override config is avoided as no macsec profile is set in the
prepare phase.
Below are details how macsec profile configurations are rendered:
PREPARE phase: Uses generate_t2_golden_config_db() → template rendering
→ file-based config
RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands →
immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR

<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
 • Add a StartLimitHit-safe restart helper and use it in
       MACsec docker restart test to reduce flakiness
     • New helper restart_service_with_startlimit_guard() in
       tests/common/helpers/dut_utils.py:
        • Proactively clears systemd failure counters (systemctl
          reset-failed)
        • Attempts restart, detects systemd rate limiting
          (StartLimitHit), applies bounded backoff (default 35s),
          then start
        • Verifies the target container becomes running within a
          timeout
     • Update tests/macsec/test_docker_restart.py to use the new
       helper instead of duthost.restart_service("macsec")
Fixes # (issue)
MACsec docker restart tests can intermittently fail due to
       systemd rate limiting after repeated restarts during
       teardown/restart cycles.
     • Guarding against StartLimitHit with a clear
       backoff-and-start flow improves test reliability without
       changing device behavior.

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd
enforces StartLimitHit due to rapid restart attempts during
teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by
proactively clearing counters, applying bounded backoff, and verifying
the container reaches
       the running state, thereby reducing test flakiness.
#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in
tests/common/helpers/dut_utils.py that:
        • Detects StartLimitHit pre/post restart attempts
        • Runs systemctl reset-failed to clear counters
• Applies a fixed backoff when rate-limited, then systemctl start
• Verifies the container is running within a configurable timeout using
existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead
of a direct duthost.restart_service("macsec") call.

#### How did you verify/test it?
 • Local validation in lab:
• Executed
tests/macsec/test_docker_restart.py::test_restart_macsec_docker with
MACsec enabled.
• Repeated the restart sequence to emulate rate limiting scenarios.
• Verified the helper reliably recovers from StartLimitHit and the
container becomes running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
* When MACsec is enabled and a profile is present, the fixture now first
*attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

* `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
internally uses `check_appl_db` plus `wait_until(...)` to ensure
APP/STATE DB MACsec SC/SA tables are populated (including
`sak`/`auth_key`/PN relationships) before returning.
* If the fixture is not defined (e.g., in other environments or test
suites), the code falls back to the previous behavior.
* After this synchronization point, if `is_macsec_configured(...)` is
true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
all control links. Otherwise, the original `macsec_setup` flow is
triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.

cleanup

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
* need to assert that all associated MACsec entries (port, SC, SA) have
been automatically removed from Redis before proceeding.
* Behavior:
* For EOS neighbors, it is a no-op: they do not use Redis DBs and the
function returns `True` immediately.
  * For SONiC hosts, it:
* Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics`
with patterns `MACSEC_*:{interface}*` (APPL_DB) and
`MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
* Returns `True` as soon as all such keys are gone for the given
interfaces, logging total time taken.
* If the `timeout` is exceeded, logs a warning, prints a summary of
remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone
from Redis” instead of having ad hoc sleeps or partial checks in
individual tests.

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
are asynchronous relative to the tests. It is valid for
`MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
* APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
valid `encoding_an`, and
* APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a
`sak` field.
Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
window where the SA row does not yet exist and crash with
`KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
we ensure those pre-loads happen only after the expected MACsec state
has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
lag behind the test’s expectations. Having a dedicated, reusable
`wait_for_macsec_cleanup` helper lets future tests explicitly wait for
cleanup completion instead of guessing with sleeps.

* Verified that the new fixtures and helpers are imported and wired
correctly:
* `load_macsec_info` remains `autouse=True` at module scope, so existing
MACsec tests automatically benefit from the additional synchronization.
* `wait_for_macsec_cleanup` is exported in `__all__` for use by future
MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a
MACsec-enabled testbed (e.g., humm120) to confirm:
* MACsec sessions establish successfully and APP/STATE DB contain
expected MACsec entries before `load_all_macsec_info` is invoked.
* Disabling MACsec followed by `wait_for_macsec_cleanup` results in all
MACSEC_* keys being removed from APPL/STATE DB within the timeout
window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* taking care of review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

---------

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
@mssonicbld
Copy link
Collaborator

Cherry-pick PR to 202511: #22298

nnelluri-cisco pushed a commit to nnelluri-cisco/sonic-mgmt that referenced this pull request Feb 12, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
Fixes # (issue)
Fixes below json syntax error. It is seen only when dut is prepared with
macsec enable flag.
json.decoder.JSONDecodeError: Expecting property name enclosed in double
quotes: line 2 column 3 (char 4)
### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
Redundant override config is avoided as no macsec profile is set in the
prepare phase.
Below are details how macsec profile configurations are rendered:
PREPARE phase: Uses generate_t2_golden_config_db() → template rendering
→ file-based config
RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands →
immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR

<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
 • Add a StartLimitHit-safe restart helper and use it in
       MACsec docker restart test to reduce flakiness
     • New helper restart_service_with_startlimit_guard() in
       tests/common/helpers/dut_utils.py:
        • Proactively clears systemd failure counters (systemctl
          reset-failed)
        • Attempts restart, detects systemd rate limiting
          (StartLimitHit), applies bounded backoff (default 35s),
          then start
        • Verifies the target container becomes running within a
          timeout
     • Update tests/macsec/test_docker_restart.py to use the new
       helper instead of duthost.restart_service("macsec")
Fixes # (issue)
MACsec docker restart tests can intermittently fail due to
       systemd rate limiting after repeated restarts during
       teardown/restart cycles.
     • Guarding against StartLimitHit with a clear
       backoff-and-start flow improves test reliability without
       changing device behavior.

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd
enforces StartLimitHit due to rapid restart attempts during
teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by
proactively clearing counters, applying bounded backoff, and verifying
the container reaches
       the running state, thereby reducing test flakiness.
#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in
tests/common/helpers/dut_utils.py that:
        • Detects StartLimitHit pre/post restart attempts
        • Runs systemctl reset-failed to clear counters
• Applies a fixed backoff when rate-limited, then systemctl start
• Verifies the container is running within a configurable timeout using
existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead
of a direct duthost.restart_service("macsec") call.

#### How did you verify/test it?
 • Local validation in lab:
• Executed
tests/macsec/test_docker_restart.py::test_restart_macsec_docker with
MACsec enabled.
• Repeated the restart sequence to emulate rate limiting scenarios.
• Verified the helper reliably recovers from StartLimitHit and the
container becomes running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
* When MACsec is enabled and a profile is present, the fixture now first
*attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

* `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
internally uses `check_appl_db` plus `wait_until(...)` to ensure
APP/STATE DB MACsec SC/SA tables are populated (including
`sak`/`auth_key`/PN relationships) before returning.
* If the fixture is not defined (e.g., in other environments or test
suites), the code falls back to the previous behavior.
* After this synchronization point, if `is_macsec_configured(...)` is
true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
all control links. Otherwise, the original `macsec_setup` flow is
triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.

cleanup

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
* need to assert that all associated MACsec entries (port, SC, SA) have
been automatically removed from Redis before proceeding.
* Behavior:
* For EOS neighbors, it is a no-op: they do not use Redis DBs and the
function returns `True` immediately.
  * For SONiC hosts, it:
* Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics`
with patterns `MACSEC_*:{interface}*` (APPL_DB) and
`MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
* Returns `True` as soon as all such keys are gone for the given
interfaces, logging total time taken.
* If the `timeout` is exceeded, logs a warning, prints a summary of
remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone
from Redis” instead of having ad hoc sleeps or partial checks in
individual tests.

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
are asynchronous relative to the tests. It is valid for
`MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
* APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
valid `encoding_an`, and
* APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a
`sak` field.
Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
window where the SA row does not yet exist and crash with
`KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
we ensure those pre-loads happen only after the expected MACsec state
has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
lag behind the test’s expectations. Having a dedicated, reusable
`wait_for_macsec_cleanup` helper lets future tests explicitly wait for
cleanup completion instead of guessing with sleeps.

* Verified that the new fixtures and helpers are imported and wired
correctly:
* `load_macsec_info` remains `autouse=True` at module scope, so existing
MACsec tests automatically benefit from the additional synchronization.
* `wait_for_macsec_cleanup` is exported in `__all__` for use by future
MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a
MACsec-enabled testbed (e.g., humm120) to confirm:
* MACsec sessions establish successfully and APP/STATE DB contain
expected MACsec entries before `load_all_macsec_info` is invoked.
* Disabling MACsec followed by `wait_for_macsec_cleanup` results in all
MACSEC_* keys being removed from APPL/STATE DB within the timeout
window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* taking care of review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

---------

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
Signed-off-by: nnelluri-cisco <nnelluri@cisco.com>
anilal-amd pushed a commit to anilal-amd/anilal-forked-sonic-mgmt that referenced this pull request Feb 19, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
Fixes # (issue)
Fixes below json syntax error. It is seen only when dut is prepared with
macsec enable flag.
json.decoder.JSONDecodeError: Expecting property name enclosed in double
quotes: line 2 column 3 (char 4)
### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
Redundant override config is avoided as no macsec profile is set in the
prepare phase.
Below are details how macsec profile configurations are rendered:
PREPARE phase: Uses generate_t2_golden_config_db() → template rendering
→ file-based config
RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands →
immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR

<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
 • Add a StartLimitHit-safe restart helper and use it in
       MACsec docker restart test to reduce flakiness
     • New helper restart_service_with_startlimit_guard() in
       tests/common/helpers/dut_utils.py:
        • Proactively clears systemd failure counters (systemctl
          reset-failed)
        • Attempts restart, detects systemd rate limiting
          (StartLimitHit), applies bounded backoff (default 35s),
          then start
        • Verifies the target container becomes running within a
          timeout
     • Update tests/macsec/test_docker_restart.py to use the new
       helper instead of duthost.restart_service("macsec")
Fixes # (issue)
MACsec docker restart tests can intermittently fail due to
       systemd rate limiting after repeated restarts during
       teardown/restart cycles.
     • Guarding against StartLimitHit with a clear
       backoff-and-start flow improves test reliability without
       changing device behavior.

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd
enforces StartLimitHit due to rapid restart attempts during
teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by
proactively clearing counters, applying bounded backoff, and verifying
the container reaches
       the running state, thereby reducing test flakiness.
#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in
tests/common/helpers/dut_utils.py that:
        • Detects StartLimitHit pre/post restart attempts
        • Runs systemctl reset-failed to clear counters
• Applies a fixed backoff when rate-limited, then systemctl start
• Verifies the container is running within a configurable timeout using
existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead
of a direct duthost.restart_service("macsec") call.

#### How did you verify/test it?
 • Local validation in lab:
• Executed
tests/macsec/test_docker_restart.py::test_restart_macsec_docker with
MACsec enabled.
• Repeated the restart sequence to emulate rate limiting scenarios.
• Verified the helper reliably recovers from StartLimitHit and the
container becomes running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
* When MACsec is enabled and a profile is present, the fixture now first
*attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

* `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
internally uses `check_appl_db` plus `wait_until(...)` to ensure
APP/STATE DB MACsec SC/SA tables are populated (including
`sak`/`auth_key`/PN relationships) before returning.
* If the fixture is not defined (e.g., in other environments or test
suites), the code falls back to the previous behavior.
* After this synchronization point, if `is_macsec_configured(...)` is
true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
all control links. Otherwise, the original `macsec_setup` flow is
triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.

cleanup

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
* need to assert that all associated MACsec entries (port, SC, SA) have
been automatically removed from Redis before proceeding.
* Behavior:
* For EOS neighbors, it is a no-op: they do not use Redis DBs and the
function returns `True` immediately.
  * For SONiC hosts, it:
* Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics`
with patterns `MACSEC_*:{interface}*` (APPL_DB) and
`MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
* Returns `True` as soon as all such keys are gone for the given
interfaces, logging total time taken.
* If the `timeout` is exceeded, logs a warning, prints a summary of
remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone
from Redis” instead of having ad hoc sleeps or partial checks in
individual tests.

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
are asynchronous relative to the tests. It is valid for
`MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
* APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
valid `encoding_an`, and
* APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a
`sak` field.
Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
window where the SA row does not yet exist and crash with
`KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
we ensure those pre-loads happen only after the expected MACsec state
has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
lag behind the test’s expectations. Having a dedicated, reusable
`wait_for_macsec_cleanup` helper lets future tests explicitly wait for
cleanup completion instead of guessing with sleeps.

* Verified that the new fixtures and helpers are imported and wired
correctly:
* `load_macsec_info` remains `autouse=True` at module scope, so existing
MACsec tests automatically benefit from the additional synchronization.
* `wait_for_macsec_cleanup` is exported in `__all__` for use by future
MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a
MACsec-enabled testbed (e.g., humm120) to confirm:
* MACsec sessions establish successfully and APP/STATE DB contain
expected MACsec entries before `load_all_macsec_info` is invoked.
* Disabling MACsec followed by `wait_for_macsec_cleanup` results in all
MACSEC_* keys being removed from APPL/STATE DB within the timeout
window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* taking care of review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

---------

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
Signed-off-by: Zhuohui Tan <zhuohui.tan@amd.com>
ravaliyel pushed a commit to ravaliyel/sonic-mgmt that referenced this pull request Mar 12, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
Fixes # (issue)
Fixes below json syntax error. It is seen only when dut is prepared with
macsec enable flag.
json.decoder.JSONDecodeError: Expecting property name enclosed in double
quotes: line 2 column 3 (char 4)
### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
Redundant override config is avoided as no macsec profile is set in the
prepare phase.
Below are details how macsec profile configurations are rendered:
PREPARE phase: Uses generate_t2_golden_config_db() → template rendering
→ file-based config
RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands →
immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR

<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
 • Add a StartLimitHit-safe restart helper and use it in
       MACsec docker restart test to reduce flakiness
     • New helper restart_service_with_startlimit_guard() in
       tests/common/helpers/dut_utils.py:
        • Proactively clears systemd failure counters (systemctl
          reset-failed)
        • Attempts restart, detects systemd rate limiting
          (StartLimitHit), applies bounded backoff (default 35s),
          then start
        • Verifies the target container becomes running within a
          timeout
     • Update tests/macsec/test_docker_restart.py to use the new
       helper instead of duthost.restart_service("macsec")
Fixes # (issue)
MACsec docker restart tests can intermittently fail due to
       systemd rate limiting after repeated restarts during
       teardown/restart cycles.
     • Guarding against StartLimitHit with a clear
       backoff-and-start flow improves test reliability without
       changing device behavior.

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd
enforces StartLimitHit due to rapid restart attempts during
teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by
proactively clearing counters, applying bounded backoff, and verifying
the container reaches
       the running state, thereby reducing test flakiness.
#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in
tests/common/helpers/dut_utils.py that:
        • Detects StartLimitHit pre/post restart attempts
        • Runs systemctl reset-failed to clear counters
• Applies a fixed backoff when rate-limited, then systemctl start
• Verifies the container is running within a configurable timeout using
existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead
of a direct duthost.restart_service("macsec") call.

#### How did you verify/test it?
 • Local validation in lab:
• Executed
tests/macsec/test_docker_restart.py::test_restart_macsec_docker with
MACsec enabled.
• Repeated the restart sequence to emulate rate limiting scenarios.
• Verified the helper reliably recovers from StartLimitHit and the
container becomes running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
* When MACsec is enabled and a profile is present, the fixture now first
*attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

* `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
internally uses `check_appl_db` plus `wait_until(...)` to ensure
APP/STATE DB MACsec SC/SA tables are populated (including
`sak`/`auth_key`/PN relationships) before returning.
* If the fixture is not defined (e.g., in other environments or test
suites), the code falls back to the previous behavior.
* After this synchronization point, if `is_macsec_configured(...)` is
true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
all control links. Otherwise, the original `macsec_setup` flow is
triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.

cleanup

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
* need to assert that all associated MACsec entries (port, SC, SA) have
been automatically removed from Redis before proceeding.
* Behavior:
* For EOS neighbors, it is a no-op: they do not use Redis DBs and the
function returns `True` immediately.
  * For SONiC hosts, it:
* Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics`
with patterns `MACSEC_*:{interface}*` (APPL_DB) and
`MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
* Returns `True` as soon as all such keys are gone for the given
interfaces, logging total time taken.
* If the `timeout` is exceeded, logs a warning, prints a summary of
remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone
from Redis” instead of having ad hoc sleeps or partial checks in
individual tests.

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
are asynchronous relative to the tests. It is valid for
`MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
* APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
valid `encoding_an`, and
* APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a
`sak` field.
Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
window where the SA row does not yet exist and crash with
`KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
we ensure those pre-loads happen only after the expected MACsec state
has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
lag behind the test’s expectations. Having a dedicated, reusable
`wait_for_macsec_cleanup` helper lets future tests explicitly wait for
cleanup completion instead of guessing with sleeps.

* Verified that the new fixtures and helpers are imported and wired
correctly:
* `load_macsec_info` remains `autouse=True` at module scope, so existing
MACsec tests automatically benefit from the additional synchronization.
* `wait_for_macsec_cleanup` is exported in `__all__` for use by future
MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a
MACsec-enabled testbed (e.g., humm120) to confirm:
* MACsec sessions establish successfully and APP/STATE DB contain
expected MACsec entries before `load_all_macsec_info` is invoked.
* Disabling MACsec followed by `wait_for_macsec_cleanup` results in all
MACSEC_* keys being removed from APPL/STATE DB within the timeout
window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* taking care of review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

---------

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
Signed-off-by: Ravali Yeluri (WIPRO LIMITED) <v-ryeluri@microsoft.com>
abhishek-nexthop pushed a commit to nexthop-ai/sonic-mgmt that referenced this pull request Mar 17, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
Fixes # (issue)
Fixes below json syntax error. It is seen only when dut is prepared with
macsec enable flag.
json.decoder.JSONDecodeError: Expecting property name enclosed in double
quotes: line 2 column 3 (char 4)
### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
Redundant override config is avoided as no macsec profile is set in the
prepare phase.
Below are details how macsec profile configurations are rendered:
PREPARE phase: Uses generate_t2_golden_config_db() → template rendering
→ file-based config
RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands →
immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR

<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
 • Add a StartLimitHit-safe restart helper and use it in
       MACsec docker restart test to reduce flakiness
     • New helper restart_service_with_startlimit_guard() in
       tests/common/helpers/dut_utils.py:
        • Proactively clears systemd failure counters (systemctl
          reset-failed)
        • Attempts restart, detects systemd rate limiting
          (StartLimitHit), applies bounded backoff (default 35s),
          then start
        • Verifies the target container becomes running within a
          timeout
     • Update tests/macsec/test_docker_restart.py to use the new
       helper instead of duthost.restart_service("macsec")
Fixes # (issue)
MACsec docker restart tests can intermittently fail due to
       systemd rate limiting after repeated restarts during
       teardown/restart cycles.
     • Guarding against StartLimitHit with a clear
       backoff-and-start flow improves test reliability without
       changing device behavior.

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd
enforces StartLimitHit due to rapid restart attempts during
teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by
proactively clearing counters, applying bounded backoff, and verifying
the container reaches
       the running state, thereby reducing test flakiness.
#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in
tests/common/helpers/dut_utils.py that:
        • Detects StartLimitHit pre/post restart attempts
        • Runs systemctl reset-failed to clear counters
• Applies a fixed backoff when rate-limited, then systemctl start
• Verifies the container is running within a configurable timeout using
existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead
of a direct duthost.restart_service("macsec") call.

#### How did you verify/test it?
 • Local validation in lab:
• Executed
tests/macsec/test_docker_restart.py::test_restart_macsec_docker with
MACsec enabled.
• Repeated the restart sequence to emulate rate limiting scenarios.
• Verified the helper reliably recovers from StartLimitHit and the
container becomes running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
* When MACsec is enabled and a profile is present, the fixture now first
*attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

* `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
internally uses `check_appl_db` plus `wait_until(...)` to ensure
APP/STATE DB MACsec SC/SA tables are populated (including
`sak`/`auth_key`/PN relationships) before returning.
* If the fixture is not defined (e.g., in other environments or test
suites), the code falls back to the previous behavior.
* After this synchronization point, if `is_macsec_configured(...)` is
true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
all control links. Otherwise, the original `macsec_setup` flow is
triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.

cleanup

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
* need to assert that all associated MACsec entries (port, SC, SA) have
been automatically removed from Redis before proceeding.
* Behavior:
* For EOS neighbors, it is a no-op: they do not use Redis DBs and the
function returns `True` immediately.
  * For SONiC hosts, it:
* Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics`
with patterns `MACSEC_*:{interface}*` (APPL_DB) and
`MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
* Returns `True` as soon as all such keys are gone for the given
interfaces, logging total time taken.
* If the `timeout` is exceeded, logs a warning, prints a summary of
remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone
from Redis” instead of having ad hoc sleeps or partial checks in
individual tests.

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
are asynchronous relative to the tests. It is valid for
`MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
* APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
valid `encoding_an`, and
* APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a
`sak` field.
Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
window where the SA row does not yet exist and crash with
`KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
we ensure those pre-loads happen only after the expected MACsec state
has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
lag behind the test’s expectations. Having a dedicated, reusable
`wait_for_macsec_cleanup` helper lets future tests explicitly wait for
cleanup completion instead of guessing with sleeps.

* Verified that the new fixtures and helpers are imported and wired
correctly:
* `load_macsec_info` remains `autouse=True` at module scope, so existing
MACsec tests automatically benefit from the additional synchronization.
* `wait_for_macsec_cleanup` is exported in `__all__` for use by future
MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a
MACsec-enabled testbed (e.g., humm120) to confirm:
* MACsec sessions establish successfully and APP/STATE DB contain
expected MACsec entries before `load_all_macsec_info` is invoked.
* Disabling MACsec followed by `wait_for_macsec_cleanup` results in all
MACSEC_* keys being removed from APPL/STATE DB within the timeout
window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* taking care of review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

---------

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
Signed-off-by: Abhishek <abhishek@nexthop.ai>
venu-nexthop pushed a commit to venu-nexthop/sonic-mgmt that referenced this pull request Mar 19, 2026
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
Fixes # (issue)
Fixes below json syntax error. It is seen only when dut is prepared with
macsec enable flag.
json.decoder.JSONDecodeError: Expecting property name enclosed in double
quotes: line 2 column 3 (char 4)
### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR
Redundant override config is avoided as no macsec profile is set in the
prepare phase.
Below are details how macsec profile configurations are rendered:
PREPARE phase: Uses generate_t2_golden_config_db() → template rendering
→ file-based config
RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands →
immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?

#### How did you do it?

#### How did you verify/test it?

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md

Please provide following information to help code review process a bit
easier:
-->
### Description of PR

<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->

Summary:
 • Add a StartLimitHit-safe restart helper and use it in
       MACsec docker restart test to reduce flakiness
     • New helper restart_service_with_startlimit_guard() in
       tests/common/helpers/dut_utils.py:
        • Proactively clears systemd failure counters (systemctl
          reset-failed)
        • Attempts restart, detects systemd rate limiting
          (StartLimitHit), applies bounded backoff (default 35s),
          then start
        • Verifies the target container becomes running within a
          timeout
     • Update tests/macsec/test_docker_restart.py to use the new
       helper instead of duthost.restart_service("macsec")
Fixes # (issue)
MACsec docker restart tests can intermittently fail due to
       systemd rate limiting after repeated restarts during
       teardown/restart cycles.
     • Guarding against StartLimitHit with a clear
       backoff-and-start flow improves test reliability without
       changing device behavior.

### Type of change

<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->

- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ x] Test case improvement

### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505

### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd
enforces StartLimitHit due to rapid restart attempts during
teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by
proactively clearing counters, applying bounded backoff, and verifying
the container reaches
       the running state, thereby reducing test flakiness.
#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in
tests/common/helpers/dut_utils.py that:
        • Detects StartLimitHit pre/post restart attempts
        • Runs systemctl reset-failed to clear counters
• Applies a fixed backoff when rate-limited, then systemctl start
• Verifies the container is running within a configurable timeout using
existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead
of a direct duthost.restart_service("macsec") call.

#### How did you verify/test it?
 • Local validation in lab:
• Executed
tests/macsec/test_docker_restart.py::test_restart_macsec_docker with
MACsec enabled.
• Repeated the restart sequence to emulate rate limiting scenarios.
• Verified the helper reliably recovers from StartLimitHit and the
container becomes running within the timeout.

#### Any platform specific information?

#### Supported testbed topology if it's a new test case?

### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
* When MACsec is enabled and a profile is present, the fixture now first
*attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

* `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
internally uses `check_appl_db` plus `wait_until(...)` to ensure
APP/STATE DB MACsec SC/SA tables are populated (including
`sak`/`auth_key`/PN relationships) before returning.
* If the fixture is not defined (e.g., in other environments or test
suites), the code falls back to the previous behavior.
* After this synchronization point, if `is_macsec_configured(...)` is
true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
all control links. Otherwise, the original `macsec_setup` flow is
triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.

cleanup

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
* need to assert that all associated MACsec entries (port, SC, SA) have
been automatically removed from Redis before proceeding.
* Behavior:
* For EOS neighbors, it is a no-op: they do not use Redis DBs and the
function returns `True` immediately.
  * For SONiC hosts, it:
* Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics`
with patterns `MACSEC_*:{interface}*` (APPL_DB) and
`MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
* Returns `True` as soon as all such keys are gone for the given
interfaces, logging total time taken.
* If the `timeout` is exceeded, logs a warning, prints a summary of
remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone
from Redis” instead of having ad hoc sleeps or partial checks in
individual tests.

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
are asynchronous relative to the tests. It is valid for
`MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
* APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
valid `encoding_an`, and
* APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a
`sak` field.
Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
window where the SA row does not yet exist and crash with
`KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
we ensure those pre-loads happen only after the expected MACsec state
has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
lag behind the test’s expectations. Having a dedicated, reusable
`wait_for_macsec_cleanup` helper lets future tests explicitly wait for
cleanup completion instead of guessing with sleeps.

* Verified that the new fixtures and helpers are imported and wired
correctly:
* `load_macsec_info` remains `autouse=True` at module scope, so existing
MACsec tests automatically benefit from the additional synchronization.
* `wait_for_macsec_cleanup` is exported in `__all__` for use by future
MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a
MACsec-enabled testbed (e.g., humm120) to confirm:
* MACsec sessions establish successfully and APP/STATE DB contain
expected MACsec entries before `load_all_macsec_info` is invoked.
* Disabling MACsec followed by `wait_for_macsec_cleanup` results in all
MACSEC_* keys being removed from APPL/STATE DB within the timeout
window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

* taking care of review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>

---------

Signed-off-by: rajshekhar <rajshekhar@nexthop.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants