
[action] [PR:21372] Fix MACsec test reliability and configuration issues#22298

Open
mssonicbld wants to merge 1 commit into sonic-net:202511 from mssonicbld:cherry/202511/21372

Conversation

@mssonicbld
Collaborator

Description of PR

This PR addresses four issues related to MACsec testing and configuration in sonic-mgmt:

  1. Fix JSON syntax error in T2 golden config template - Corrects malformed JSON in golden_config_db_t2.j2 that caused parsing errors when MACsec was enabled
  2. Avoid redundant MACsec profile configuration - Prevents generating MACsec golden config when no profile is defined in the prepare phase
  3. Improve MACsec docker restart test reliability - Adds systemd StartLimitHit protection to prevent test flakiness during rapid restart cycles
  4. Fix MACsec test race and cleanup sync - Ensures MKA establishment before pre-loading MACsec session info for tests, and adds a helper to wait for MACsec DB cleanup after disabling MACsec

Summary:
These changes improve MACsec test stability and fix configuration generation issues that occur when MACsec is enabled on T2 testbeds.

Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • New Test case
  • Skipped for non-supported platforms
  • Test case improvement

Back port request

  • 202205
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Approach

What is the motivation for this PR?

Issue 1 - JSON Syntax Error:

  • When preparing DUT with MACsec enabled, the golden config template generated invalid JSON causing: json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes

Issue 2 - Redundant Configuration:

  • MACsec profile configuration was being generated during PREPARE phase even when no profile was defined, leading to redundant override configs
  • The actual MACsec profile should only be set during RUN phase via set_macsec_profile() using direct sonic-db-cli commands

Issue 3 - Test Flakiness:

  • MACsec docker restart tests intermittently failed when systemd enforced StartLimitHit due to rapid restart attempts during teardown/restart cycles
  • No mechanism existed to handle systemd rate limiting gracefully

Issue 4 - Test Race:

  • Tests fail with exceptions like KeyError('sak') when the MACsec egress SA row does not yet exist, even though MACSEC_PORT_TABLE already shows enable_encrypt="true".
  • There are also cleanup races where tests check for removal of MACsec DB entries before the background cleanup logic has finished

How did you do it?

Fix 1 - JSON Template:

  • Corrected JSON syntax in ansible/templates/golden_config_db_t2.j2 to ensure valid JSON output

Fix 2 - Conditional Config Generation:

  • Modified ansible/config_sonic_basedon_testbed.yml to only generate MACsec golden config when macsec_profile is actually defined
  • This aligns with the two-phase approach:
    • PREPARE phase: Uses generate_t2_golden_config_db() → template rendering → file-based config
    • RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update
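To make the RUN-phase mechanism concrete, here is a minimal sketch of building direct sonic-db-cli writes for a profile. The field names, the `build_profile_cmds`/`set_macsec_profile_sketch` helpers, and the `run_cmd` callable are illustrative stand-ins, not the actual set_macsec_profile() implementation:

```python
# Hypothetical sketch of RUN-phase profile writes via sonic-db-cli.
# Field names and the run_cmd callable are illustrative, not the exact
# sonic-mgmt implementation of set_macsec_profile().

def build_profile_cmds(profile_name, fields):
    """Build sonic-db-cli commands that write a MACSEC_PROFILE entry to CONFIG_DB."""
    key = "MACSEC_PROFILE|{}".format(profile_name)
    cmds = []
    for field, value in sorted(fields.items()):
        cmds.append('sonic-db-cli CONFIG_DB hset "{}" {} {}'.format(key, field, value))
    return cmds

def set_macsec_profile_sketch(run_cmd, profile_name, fields):
    """Apply the profile by executing each write on the DUT.

    run_cmd is any callable that executes a shell command on the DUT,
    so CONFIG_DB is updated immediately, with no template or file step.
    """
    for cmd in build_profile_cmds(profile_name, fields):
        run_cmd(cmd)
```

Because the writes go straight to CONFIG_DB, the RUN phase takes effect immediately, unlike the file-based golden config rendered in the PREPARE phase.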

Fix 3 - StartLimitHit Guard:

  • Added a new helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that:
    • Proactively clears systemd failure counters (systemctl reset-failed)
    • Detects systemd rate limiting (StartLimitHit)
    • Applies bounded backoff (default 35s) when rate-limited
    • Executes systemctl start after backoff
    • Verifies the target container becomes running within the timeout
  • Updated tests/macsec/test_docker_restart.py to use the new helper instead of a direct duthost.restart_service("macsec") call
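The guard flow above can be sketched as follows. `run_cmd` and `is_running` are injected stand-ins for DUT command execution and container state checks, so this illustrates the control flow only, not the actual dut_utils.py helper:

```python
import time

def restart_with_startlimit_guard(run_cmd, is_running, service="macsec",
                                  backoff=35, timeout=120, poll=5,
                                  sleep=time.sleep):
    """Sketch of a StartLimitHit-safe restart flow.

    run_cmd(cmd) -> command output string; is_running() -> bool.
    Both are injected so the flow can be shown without a live DUT.
    """
    # Proactively clear systemd failure counters so a prior crash loop
    # does not trip the rate limiter on this restart.
    run_cmd("systemctl reset-failed {}".format(service))
    out = run_cmd("systemctl restart {}".format(service))
    if "start-limit-hit" in out.lower():
        # Rate-limited anyway: back off past the rate-limit window,
        # clear counters again, then start explicitly.
        sleep(backoff)
        run_cmd("systemctl reset-failed {}".format(service))
        run_cmd("systemctl start {}".format(service))
    # Verify the target container reaches the running state in time.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_running():
            return True
        sleep(poll)
    return False
```

The key design point is that `reset-failed` runs before the restart is even attempted, so the common case never hits the limiter, and the backoff path only triggers when systemd reports it anyway.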

Fix 4 - Wait for MKA and Cleanup:

  • By tying load_macsec_info to wait_mka_establish where available, we ensure those pre-loads happen only after the expected MACsec state has been fully written to Redis
  • Similarly, when disabling MACsec, asynchronous background cleanup can lag behind the test’s expectations. Having a dedicated, reusable wait_for_macsec_cleanup helper lets future tests explicitly wait for cleanup completion instead of guessing with sleeps
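The cleanup wait reduces to a bounded polling loop. This sketch injects a hypothetical `list_keys` callable in place of the real Redis queries (`redis_get_keys_all_asics`), so it shows only the wait logic, not the actual helper:

```python
import time

def wait_for_macsec_cleanup_sketch(list_keys, timeout=90, poll=3,
                                   sleep=time.sleep, clock=time.monotonic):
    """Sketch of waiting until MACsec entries are gone from Redis.

    list_keys() -> list of remaining MACSEC_* keys across APPL_DB/STATE_DB;
    it is injected here so the loop can be shown without a live DUT.
    Returns True once no keys remain, False if the timeout expires.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        remaining = list_keys()
        if not remaining:
            return True
        sleep(poll)  # give the asynchronous background cleanup time to run
    return False
```

A test would call this right after disabling MACsec and assert on the boolean, instead of sprinkling fixed sleeps and hoping the cleanup has finished.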

How did you verify/test it?

Fix 1 & 2:

  • Verified T2 testbed preparation with MACsec enabled completes without JSON parsing errors
  • Confirmed no redundant MACsec profile configuration is generated when profile is undefined
  • Validated that MACsec profile is correctly applied during RUN phase via set_macsec_profile()

Fix 3:

  • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled in lab environment
  • Repeated restart sequences to emulate systemd rate limiting scenarios
  • Verified the helper reliably recovers from StartLimitHit and container reaches running state within timeout
  • Confirmed reduced test flakiness in repeated test runs

Fix 4:
Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed to confirm:

  • MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before load_all_macsec_info is invoked.
  • Disabling MACsec followed by wait_for_macsec_cleanup results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window.

Any platform specific information?

These fixes apply to all platforms that support MACsec, particularly affecting T2 topology testbeds.

Supported testbed topology if it's a new test case?

N/A - These are improvements to existing test infrastructure and configuration templates.

Documentation

No documentation updates required as these are bug fixes and internal test improvements that don't change user-facing behavior or add new features.

* ISS-2888: Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

### Description of PR

Summary:
Fixes # (issue)
Fixes the JSON syntax error below. It is seen only when the DUT is prepared with the MACsec enable flag.
json.decoder.JSONDecodeError: Expecting property name enclosed in double
quotes: line 2 column 3 (char 4)
### Type of change


- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement



Signed-off-by: rajshekhar <[email protected]>

* ISS-2969: Generate golden config only if macsec_profile is defined (sonic-net#420)

### Description of PR
Redundant override config is avoided, as no MACsec profile is set in the prepare phase.
Details on how MACsec profile configurations are rendered:
PREPARE phase: Uses generate_t2_golden_config_db() → template rendering
→ file-based config
RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands →
immediate CONFIG_DB update

Summary:
Fixes # (issue)

### Type of change


- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [ ] Test case improvement



Signed-off-by: rajshekhar <[email protected]>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

### Description of PR


Summary:
• Add a StartLimitHit-safe restart helper and use it in the MACsec docker restart test to reduce flakiness
• New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py:
  • Proactively clears systemd failure counters (systemctl reset-failed)
  • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then starts the service
  • Verifies the target container becomes running within a timeout
• Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec")

Fixes # (issue)
• MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles.
• Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior.

### Type of change


- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
    - [ ] Skipped for non-supported platforms
- [x] Test case improvement


### Approach
#### What is the motivation for this PR?
• MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles.
• This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness.

#### How did you do it?
• Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that:
  • Detects StartLimitHit before and after restart attempts
  • Runs systemctl reset-failed to clear counters
  • Applies a fixed backoff when rate-limited, then systemctl start
  • Verifies the container is running within a configurable timeout using existing wait_until/state checks
• Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call.

#### How did you verify/test it?
• Local validation in lab:
  • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled.
  • Repeated the restart sequence to emulate rate limiting scenarios.
  • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout.


Signed-off-by: rajshekhar <[email protected]>

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

NOS-3311 tracks MACsec test flakiness caused by races between:

* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).

This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.

This PR adds two pieces of synchronization in sonic-mgmt:

1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec

File: `tests/common/macsec/__init__.py`

* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix: when MACsec is enabled and a profile is present, the fixture now first *attempts* to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

* `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
internally uses `check_appl_db` plus `wait_until(...)` to ensure
APP/STATE DB MACsec SC/SA tables are populated (including
`sak`/`auth_key`/PN relationships) before returning.
* If the fixture is not defined (e.g., in other environments or test
suites), the code falls back to the previous behavior.
* After this synchronization point, if `is_macsec_configured(...)` is
true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
all control links. Otherwise, the original `macsec_setup` flow is
triggered.

This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.
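In isolation, the resolve-or-fall-back pattern looks roughly like this sketch. `FixtureLookupError` is a local stand-in for pytest's exception class, and the injected callables are hypothetical, not the real fixture wiring:

```python
class FixtureLookupError(LookupError):
    """Local stand-in for pytest.FixtureLookupError in this self-contained sketch."""

def sync_then_preload(getfixturevalue, is_macsec_configured,
                      load_all_macsec_info, macsec_setup):
    """Resolve wait_mka_establish when it exists, then pre-load MACsec info.

    Mirrors the fixture flow: block on MKA establishment if that fixture
    is available, fall back silently if it is not defined in this suite,
    then either pre-load MACSEC_INFO or run the original macsec_setup path.
    """
    try:
        getfixturevalue("wait_mka_establish")  # waits for APP/STATE DB SC/SA rows
    except FixtureLookupError:
        pass  # fixture not defined in this environment: previous behavior
    if is_macsec_configured():
        load_all_macsec_info()
    else:
        macsec_setup()
```

Catching only the lookup error (rather than a bare `Exception`) means a genuine failure inside `wait_mka_establish` still surfaces, while a missing fixture degrades gracefully.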

Cleanup:

File: `tests/common/macsec/macsec_config_helper.py`

* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
  * need to assert that all associated MACsec entries (port, SC, SA) have been automatically removed from Redis before proceeding.
* Behavior:
  * For EOS neighbors, it is a no-op: they do not use Redis DBs and the function returns `True` immediately.
  * For SONiC hosts, it:
    * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics` with patterns `MACSEC_*:{interface}*` (APPL_DB) and `MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
    * Returns `True` as soon as all such keys are gone for the given interfaces, logging the total time taken.
    * If the `timeout` is exceeded, logs a warning, prints a summary of remaining entries, and returns `False`.
* This centralizes the logic for "wait until MACsec entries are gone from Redis" instead of having ad hoc sleeps or partial checks in individual tests.

* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch) are asynchronous relative to the tests. It is valid for `MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
  * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a valid `encoding_an`, and
  * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a `sak` field.
* Without synchronization, tests that pre-load `MACSEC_INFO` can hit a window where the SA row does not yet exist and crash with `KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available, we ensure those pre-loads happen only after the expected MACsec state has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can lag behind the test's expectations. Having a dedicated, reusable `wait_for_macsec_cleanup` helper lets future tests explicitly wait for cleanup completion instead of guessing with sleeps.

* Verified that the new fixtures and helpers are imported and wired correctly:
  * `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization.
  * `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm:
  * MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before `load_all_macsec_info` is invoked.
  * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window.

---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author

Signed-off-by: rajshekhar <[email protected]>

* Address review comments

- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.

- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.

- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.

Signed-off-by: rajshekhar <[email protected]>

---------

Signed-off-by: rajshekhar <[email protected]>
@mssonicbld
Collaborator Author

Original PR: #21372

@mssonicbld
Collaborator Author

/azp run

@github-actions github-actions bot requested review from r12f and sdszhang February 9, 2026 19:59
@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS
Collaborator

/azp run Azure.sonic-mgmt

@azure-pipelines

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@StormLiangMS
Collaborator

/azp run Azure.sonic-mgmt

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@StormLiangMS
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

