[action] [PR:21372] Fix MACsec test reliability and configuration issues#22298
Open
mssonicbld wants to merge 1 commit intosonic-net:202511from
Open
[action] [PR:21372] Fix MACsec test reliability and configuration issues#22298mssonicbld wants to merge 1 commit intosonic-net:202511from
mssonicbld wants to merge 1 commit intosonic-net:202511from
Conversation
* ISS-2888:Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: Fixes # (issue) Fixes below json syntax error. It is seen only when dut is prepared with macsec enable flag. json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 3 (char 4) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-2969:Generate golden config only if macsec_profile is defined (sonic-net#420) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR Redundant override config is avoided as no macsec profile is set in the prepare phase. Below are details how macsec profile configurations are rendered: PREPARE phase: Uses generate_t2_golden_config_db() → template rendering → file-based config RUN phase: Uses set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update Summary: Fixes # (issue) ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [x] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ ] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? #### How did you do it? #### How did you verify/test it? #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562) <!-- Please make sure you've read and understood our contributing guidelines; https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md Please provide following information to help code review process a bit easier: --> ### Description of PR <!-- - Please include a summary of the change and which issue is fixed. - Please also include relevant motivation and context. Where should reviewer start? background context? - List any dependencies that are required for this change. --> Summary: • Add a StartLimitHit-safe restart helper and use it in MACsec docker restart test to reduce flakiness • New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py: • Proactively clears systemd failure counters (systemctl reset-failed) • Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start • Verifies the target container becomes running within a timeout • Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec") Fixes # (issue) MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles. • Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior. ### Type of change <!-- - Fill x for your type of change. - e.g. - [x] Bug fix --> - [ ] Bug fix - [ ] Testbed and Framework(new/improvement) - [ ] New Test case - [ ] Skipped for non-supported platforms - [ x] Test case improvement ### Back port request - [ ] 202205 - [ ] 202305 - [ ] 202311 - [ ] 202405 - [ ] 202411 - [ ] 202505 ### Approach #### What is the motivation for this PR? • MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. • This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness. #### How did you do it? • Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that: • Detects StartLimitHit pre/post restart attempts • Runs systemctl reset-failed to clear counters • Applies a fixed backoff when rate-limited, then systemctl start • Verifies the container is running within a configurable timeout using existing wait_until/state checks • Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call. #### How did you verify/test it? • Local validation in lab: • Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled. • Repeated the restart sequence to emulate rate limiting scenarios. • Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout. #### Any platform specific information? #### Supported testbed topology if it's a new test case? ### Documentation <!-- (If it's a new feature, new test case) Did you update documentation/Wiki relevant to your implementation? Link to the wiki page? --> Signed-off-by: rajshekhar <[email protected]> * NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678) NOS-3311 tracks MACsec test flakiness caused by races between: * wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE DB), and * the test harness eagerly reading that state to build `MACSEC_INFO` (via `get_macsec_attr`). This can manifest as exceptions like `KeyError('sak')` when the MACsec egress SA row does not yet exist, even though `MACSEC_PORT_TABLE` already shows `enable_encrypt="true"`. There are also cleanup races where tests check for removal of MACsec DB entries before the background cleanup logic has finished. This PR adds two pieces of synchronization in sonic-mgmt: 1. Ensure MKA establishment before pre-loading MACsec session info for tests 2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec File: `tests/common/macsec/__init__.py` * The `load_macsec_info` fixture (module-scoped, autouse) previously called `load_all_macsec_info()` immediately when MACsec was enabled and a profile was present. That in turn calls `get_macsec_attr()`, which expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully programmed. * In environments where MACsec is pre-configured before tests start, this created a race: `MACSEC_PORT_TABLE` might already exist (with `enable_encrypt="true"`), but the egress SA row for the active AN might not yet have been written to APP_DB, leading to `KeyError('sak')` when `macsec_sa["sak"]` is accessed. * Fix: * When MACsec is enabled and a profile is present, the fixture now first *attempts* to resolve the `wait_mka_establish` fixture: ```python try: request.getfixturevalue('wait_mka_establish') except Exception: pass ``` * `wait_mka_establish` is defined in `tests/macsec/conftest.py` and internally uses `check_appl_db` plus `wait_until(...)` to ensure APP/STATE DB MACsec SC/SA tables are populated (including `sak`/`auth_key`/PN relationships) before returning. * If the fixture is not defined (e.g., in other environments or test suites), the code falls back to the previous behavior. * After this synchronization point, if `is_macsec_configured(...)` is true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for all control links. Otherwise, the original `macsec_setup` flow is triggered. This makes `get_macsec_attr()` execution order consistent with the rest of the MACsec test suite, which already relies on `wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and SAKs exist before validating state. cleanup File: `tests/common/macsec/macsec_config_helper.py` * Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export it via `__all__`. * This helper is designed for tests that: * disable MACsec on one or more interfaces, and then * need to assert that all associated MACsec entries (port, SC, SA) have been automatically removed from Redis before proceeding. * Behavior: * For EOS neighbors, it is a no-op: they do not use Redis DBs and the function returns `True` immediately. * For SONiC hosts, it: * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics` with patterns `MACSEC_*:{interface}*` (APPL_DB) and `MACSEC_*|{interface}*` (STATE_DB). * Aggregates any remaining keys per DB. * Returns `True` as soon as all such keys are gone for the given interfaces, logging total time taken. * If the `timeout` is exceeded, logs a warning, prints a summary of remaining entries, and returns `False`. * This centralizes the logic for “wait until MACsec entries are gone from Redis” instead of having ad hoc sleeps or partial checks in individual tests. * MACsec control-plane actions (via wpa_supplicant and swss/macsecorch) are asynchronous relative to the tests. It is valid for `MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs and their SAKs are still being programmed. * `get_macsec_attr()` assumes that: * APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a valid `encoding_an`, and * APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a `sak` field. Without synchronization, tests that pre-load `MACSEC_INFO` can hit a window where the SA row does not yet exist and crash with `KeyError('sak')`. * By tying `load_macsec_info` to `wait_mka_establish` where available, we ensure those pre-loads happen only after the expected MACsec state has been fully written to Redis. * Similarly, when disabling MACsec, asynchronous background cleanup can lag behind the test’s expectations. Having a dedicated, reusable `wait_for_macsec_cleanup` helper lets future tests explicitly wait for cleanup completion instead of guessing with sleeps. * Verified that the new fixtures and helpers are imported and wired correctly: * `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization. * `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests. * Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm: * MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before `load_all_macsec_info` is invoked. * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window. --- Pull Request opened by [Augment Code](https://www.augmentcode.com/) with guidance from the PR author Signed-off-by: rajshekhar <[email protected]> * taking care of review comments - Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited. - Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing. - Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers. Signed-off-by: rajshekhar <[email protected]> --------- Signed-off-by: rajshekhar <[email protected]>
11 tasks
Collaborator
Author
|
Original PR: #21372 |
Collaborator
Author
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run Azure.sonic-mgmt |
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
Collaborator
|
/azp run Azure.sonic-mgmt |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Collaborator
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description of PR
This PR addresses three issues related to MACsec testing and configuration in sonic-mgmt:
golden_config_db_t2.j2that caused parsing errors when MACsec was enabledSummary:
These changes improve MACsec test stability and fix configuration generation issues that occur when MACsec is enabled on T2 testbeds.
Fixes # (issue)
Type of change
Back port request
Approach
What is the motivation for this PR?
Issue 1 - JSON Syntax Error:
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotesIssue 2 - Redundant Configuration:
set_macsec_profile()using directsonic-db-clicommandsIssue 3 - Test Flakiness:
Issue 4 - Test Race:
How did you do it?
Fix 1 - JSON Template:
ansible/templates/golden_config_db_t2.j2to ensure valid JSON outputFix 2 - Conditional Config Generation:
ansible/config_sonic_basedon_testbed.ymlto only generate MACsec golden config whenmacsec_profileis actually definedgenerate_t2_golden_config_db()→ template rendering → file-based configset_macsec_profile()→ direct sonic-db-cli commands → immediate CONFIG_DB updateFix 3 - StartLimitHit Guard:
restart_service_with_startlimit_guard()intests/common/helpers/dut_utils.pythat:systemctl reset-failed)systemctl startafter backofftests/macsec/test_docker_restart.pyto use the new helper instead of directduthost.restart_service("macsec")Fix 4 -Wait for MKA and Cleanup:
How did you verify/test it?
Fix 1 & 2:
set_macsec_profile()Fix 3:
tests/macsec/test_docker_restart.py::test_restart_macsec_dockerwith MACsec enabled in lab environmentFix 4:
Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed to confirm:
Any platform specific information?
These fixes apply to all platforms that support MACsec, particularly affecting T2 topology testbeds.
Supported testbed topology if it's a new test case?
N/A - These are improvements to existing test infrastructure and configuration templates.
Documentation
No documentation updates required as these are bug fixes and internal test improvements that don't change user-facing behavior or add new features.