Commit dfbb6e7
Fix MACsec test reliability and configuration issues (sonic-net#21372)
* ISS-2888: Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)
<!--
Please make sure you've read and understood our contributing guidelines;
https://github.com/sonic-net/SONiC/blob/gh-pages/CONTRIBUTING.md
Please provide following information to help code review process a bit
easier:
-->
### Description of PR
<!--
- Please include a summary of the change and which issue is fixed.
- Please also include relevant motivation and context. Where should
reviewer start? background context?
- List any dependencies that are required for this change.
-->
Summary:
Fixes # (issue)
Fixes the JSON syntax error below, which is seen only when the DUT is prepared with
the MACsec enable flag:
`json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 3 (char 4)`
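The description does not show the actual template defect, so the following is only a minimal sketch of how rendered template output could be checked before it is loaded. The helper name `check_rendered_json` and both sample strings are assumptions made for illustration; the stray leading comma is one plausible way to produce an error at the same position.

```python
import json


def check_rendered_json(rendered_text):
    """Return (True, None) if the text parses as JSON, else (False, error)."""
    try:
        json.loads(rendered_text)
    except json.JSONDecodeError as err:
        return False, err
    return True, None


# Hypothetical rendered output: a stray comma before the first key, the kind
# of slip a conditional block in a template can leave behind.
broken = '{\n  ,"MACSEC_PROFILE": {}\n}'
fixed = '{\n  "MACSEC_PROFILE": {}\n}'

ok, err = check_rendered_json(broken)
print(ok, err.lineno, err.colno, err.pos)
print(check_rendered_json(fixed)[0])
```

Running a check like this on the rendered file in the prepare phase would surface the problem before the config is applied to the DUT.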
### Type of change
<!--
- Fill x for your type of change.
- e.g.
- [x] Bug fix
-->
- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
- [ ] Skipped for non-supported platforms
- [ ] Test case improvement
### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505
### Approach
#### What is the motivation for this PR?
#### How did you do it?
#### How did you verify/test it?
#### Any platform specific information?
#### Supported testbed topology if it's a new test case?
### Documentation
<!--
(If it's a new feature, new test case)
Did you update documentation/Wiki relevant to your implementation?
Link to the wiki page?
-->
Signed-off-by: rajshekhar <[email protected]>
* ISS-2969: Generate golden config only if macsec_profile is defined (sonic-net#420)
### Description of PR
Avoids a redundant override config: the golden config is now generated only
when a MACsec profile is defined, and no profile is set in the prepare phase.
How MACsec profile configuration is rendered in each phase:
- PREPARE phase: uses generate_t2_golden_config_db() → template rendering → file-based config
- RUN phase: uses set_macsec_profile() → direct sonic-db-cli commands → immediate CONFIG_DB update
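The guard described above could look roughly like the sketch below. It is not the code from the PR: `maybe_generate_golden_config`, `fake_render`, and the profile name are hypothetical stand-ins used only to show the "render only when a profile is defined" decision.

```python
def maybe_generate_golden_config(macsec_profile, render):
    """Render the golden config only when a MACsec profile is defined."""
    if not macsec_profile:
        # Nothing to override: skip rendering entirely.
        return None
    return render(macsec_profile)


def fake_render(profile):
    # Placeholder renderer for illustration only.
    return {"MACSEC_PROFILE": {profile: {}}}


print(maybe_generate_golden_config(None, fake_render))
print(maybe_generate_golden_config("example_profile", fake_render))
```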
Summary:
Fixes # (issue)
### Type of change
- [x] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
- [ ] Skipped for non-supported platforms
- [ ] Test case improvement
### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505
### Approach
#### What is the motivation for this PR?
#### How did you do it?
#### How did you verify/test it?
#### Any platform specific information?
#### Supported testbed topology if it's a new test case?
### Documentation
Signed-off-by: rajshekhar <[email protected]>
* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)
### Description of PR
Summary:
- Add a StartLimitHit-safe restart helper and use it in the MACsec docker restart test to reduce flakiness.
- New helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py:
  - Proactively clears systemd failure counters (systemctl reset-failed).
  - Attempts restart, detects systemd rate limiting (StartLimitHit), applies bounded backoff (default 35s), then start.
  - Verifies the target container becomes running within a timeout.
- Update tests/macsec/test_docker_restart.py to use the new helper instead of duthost.restart_service("macsec").
Fixes # (issue)
- MACsec docker restart tests can intermittently fail due to systemd rate limiting after repeated restarts during teardown/restart cycles.
- Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior.
### Type of change
- [ ] Bug fix
- [ ] Testbed and Framework(new/improvement)
- [ ] New Test case
- [ ] Skipped for non-supported platforms
- [x] Test case improvement
### Back port request
- [ ] 202205
- [ ] 202305
- [ ] 202311
- [ ] 202405
- [ ] 202411
- [ ] 202505
### Approach
#### What is the motivation for this PR?
- MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles.
- This PR makes the restart path resilient to StartLimitHit by proactively clearing counters, applying bounded backoff, and verifying the container reaches the running state, thereby reducing test flakiness.
#### How did you do it?
- Added a helper restart_service_with_startlimit_guard() in tests/common/helpers/dut_utils.py that:
  - Detects StartLimitHit pre/post restart attempts
  - Runs systemctl reset-failed to clear counters
  - Applies a fixed backoff when rate-limited, then systemctl start
  - Verifies the container is running within a configurable timeout using existing wait_until/state checks
- Updated tests/macsec/test_docker_restart.py to use the helper instead of a direct duthost.restart_service("macsec") call.
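The flow listed above can be sketched as follows. This is an illustration under stated assumptions, not the helper from the PR: `run` and `is_running` are hypothetical injected callables, the function name and the timeout/poll defaults are invented, and the sketch assumes rate limiting is visible as `Result=start-limit-hit` in `systemctl show` output.

```python
import time


def restart_with_startlimit_guard(run, is_running, service,
                                  backoff=35, timeout=60, poll=1,
                                  sleep=time.sleep):
    """Restart a systemd service, backing off if systemd rate-limits it.

    run(cmd) executes a shell command and returns its stdout as a string;
    is_running(service) reports whether the container is up. Both are
    hypothetical callables supplied by the caller.
    """
    unit = service + ".service"
    run("systemctl reset-failed " + unit)  # clear failure counters first
    run("systemctl restart " + unit)

    # Assumption: rate limiting shows up as Result=start-limit-hit.
    result = run("systemctl show " + unit + " --property=Result")
    if "start-limit-hit" in result:
        sleep(backoff)  # fixed, bounded backoff
        run("systemctl reset-failed " + unit)
        run("systemctl start " + unit)

    # Poll until the container is running or the timeout expires.
    waited = 0
    while waited <= timeout:
        if is_running(service):
            return True
        sleep(poll)
        waited += poll
    return False
```

Injecting `run`, `is_running`, and `sleep` keeps the sketch testable without a DUT; in a real test these would wrap the host's command execution and container state checks.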
#### How did you verify/test it?
- Local validation in lab:
  - Executed tests/macsec/test_docker_restart.py::test_restart_macsec_docker with MACsec enabled.
  - Repeated the restart sequence to emulate rate limiting scenarios.
  - Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout.
#### Any platform specific information?
#### Supported testbed topology if it's a new test case?
### Documentation
Signed-off-by: rajshekhar <[email protected]>
* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)
NOS-3311 tracks MACsec test flakiness caused by races between:
* wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE
DB), and
* the test harness eagerly reading that state to build `MACSEC_INFO`
(via `get_macsec_attr`).
This can manifest as exceptions like `KeyError('sak')` when the MACsec
egress SA row does not yet exist, even though `MACSEC_PORT_TABLE`
already shows `enable_encrypt="true"`. There are also cleanup races
where tests check for removal of MACsec DB entries before the background
cleanup logic has finished.
This PR adds two pieces of synchronization in sonic-mgmt:
1. Ensure MKA establishment before pre-loading MACsec session info for
tests
2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec
File: `tests/common/macsec/__init__.py`
* The `load_macsec_info` fixture (module-scoped, autouse) previously
called `load_all_macsec_info()` immediately when MACsec was enabled and
a profile was present. That in turn calls `get_macsec_attr()`, which
expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully
programmed.
* In environments where MACsec is pre-configured before tests start,
this created a race: `MACSEC_PORT_TABLE` might already exist (with
`enable_encrypt="true"`), but the egress SA row for the active AN might
not yet have been written to APP_DB, leading to `KeyError('sak')` when
`macsec_sa["sak"]` is accessed.
* Fix:
* When MACsec is enabled and a profile is present, the fixture now first
*attempts* to resolve the `wait_mka_establish` fixture:
```python
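# Best-effort: wait for MKA to establish before loading MACsec info.
try:
    request.getfixturevalue('wait_mka_establish')
except Exception:
    # Fixture not available in this suite: fall back to previous behavior.
    pass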
```
* `wait_mka_establish` is defined in `tests/macsec/conftest.py` and
internally uses `check_appl_db` plus `wait_until(...)` to ensure
APP/STATE DB MACsec SC/SA tables are populated (including
`sak`/`auth_key`/PN relationships) before returning.
* If the fixture is not defined (e.g., in other environments or test
suites), the code falls back to the previous behavior.
* After this synchronization point, if `is_macsec_configured(...)` is
true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for
all control links. Otherwise, the original `macsec_setup` flow is
triggered.
This makes `get_macsec_attr()` execution order consistent with the rest
of the MACsec test suite, which already relies on
`wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and
SAKs exist before validating state.
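To make the race concrete, here is a minimal polling sketch that waits for the egress SA row to carry a `sak` field before it is read. It is not the `wait_mka_establish` or `wait_until` implementation from sonic-mgmt; `wait_for_sak`, `read_sa_row`, and the default timings are hypothetical.

```python
import time


def wait_for_sak(read_sa_row, timeout=30, interval=1, sleep=time.sleep):
    """Poll until the egress SA row exists and carries a 'sak' field.

    read_sa_row() is a hypothetical callable returning the row as a dict
    (empty while the SA has not been programmed yet).
    """
    waited = 0
    while waited <= timeout:
        row = read_sa_row()
        if "sak" in row:
            return row
        sleep(interval)
        waited += interval
    raise TimeoutError("egress SA row has no 'sak' after %ss" % timeout)
```

Reading `row["sak"]` only after a wait like this returns avoids the `KeyError('sak')` window described above, at the cost of an explicit timeout when MKA never establishes.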
Wait for MACsec DB cleanup
File: `tests/common/macsec/macsec_config_helper.py`
* Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export
it via `__all__`.
* This helper is designed for tests that:
  * disable MACsec on one or more interfaces, and then
  * need to assert that all associated MACsec entries (port, SC, SA) have been automatically removed from Redis before proceeding.
* Behavior:
  * For EOS neighbors, it is a no-op: they do not use Redis DBs and the function returns `True` immediately.
  * For SONiC hosts, it:
    * Polls both `APPL_DB` and `STATE_DB` using `redis_get_keys_all_asics` with patterns `MACSEC_*:{interface}*` (APPL_DB) and `MACSEC_*|{interface}*` (STATE_DB).
    * Aggregates any remaining keys per DB.
    * Returns `True` as soon as all such keys are gone for the given interfaces, logging total time taken.
    * If the `timeout` is exceeded, logs a warning, prints a summary of remaining entries, and returns `False`.
* This centralizes the logic for “wait until MACsec entries are gone from Redis” instead of having ad hoc sleeps or partial checks in individual tests.
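The polling behavior for SONiC hosts can be sketched roughly as below. This is an illustration only: `wait_for_cleanup` and the injected `list_keys(db, pattern)` callable are hypothetical (the latter stands in for `redis_get_keys_all_asics`, whose real signature is not shown here), the poll interval is an assumption, and the EOS no-op branch is omitted.

```python
import time


def wait_for_cleanup(list_keys, interfaces, timeout=90, interval=1,
                     sleep=time.sleep):
    """Return True once no MACSEC_* keys remain for the given interfaces.

    list_keys(db, pattern) is a hypothetical callable returning the Redis
    keys that match the pattern in the named DB.
    """
    # The key separator differs per DB, as in the patterns quoted above.
    templates = {"APPL_DB": "MACSEC_*:{}*", "STATE_DB": "MACSEC_*|{}*"}
    waited = 0
    while True:
        remaining = {}
        for db, template in templates.items():
            keys = [key for intf in interfaces
                    for key in list_keys(db, template.format(intf))]
            if keys:
                remaining[db] = keys
        if not remaining:
            return True
        if waited >= timeout:
            print("MACsec entries still present:", remaining)
            return False
        sleep(interval)
        waited += interval
```

A test would typically disable MACsec, call a helper like this, and assert on the boolean result rather than sleeping for a fixed period.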
* MACsec control-plane actions (via wpa_supplicant and swss/macsecorch)
are asynchronous relative to the tests. It is valid for
`MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs
and their SAKs are still being programmed.
* `get_macsec_attr()` assumes that:
* APP_DB `MACSEC_EGRESS_SC_TABLE` for `(port, sci)` exists and has a
valid `encoding_an`, and
* APP_DB `MACSEC_EGRESS_SA_TABLE` for `(port, sci, an)` exists and has a
`sak` field.
Without synchronization, tests that pre-load `MACSEC_INFO` can hit a
window where the SA row does not yet exist and crash with
`KeyError('sak')`.
* By tying `load_macsec_info` to `wait_mka_establish` where available,
we ensure those pre-loads happen only after the expected MACsec state
has been fully written to Redis.
* Similarly, when disabling MACsec, asynchronous background cleanup can
lag behind the test’s expectations. Having a dedicated, reusable
`wait_for_macsec_cleanup` helper lets future tests explicitly wait for
cleanup completion instead of guessing with sleeps.
* Verified that the new fixtures and helpers are imported and wired correctly:
  * `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization.
  * `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests.
* Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm:
  * MACsec sessions establish successfully and APP/STATE DB contain expected MACsec entries before `load_all_macsec_info` is invoked.
  * Disabling MACsec followed by `wait_for_macsec_cleanup` results in all MACSEC_* keys being removed from APPL/STATE DB within the timeout window.
---
Pull Request opened by [Augment Code](https://www.augmentcode.com/) with
guidance from the PR author
Signed-off-by: rajshekhar <[email protected]>
* taking care of review comments
- Refine restart_service_with_startlimit_guard to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.
- Narrow the exception in MacsecPlugin to pytest.FixtureLookupError so we only fall back when the wait_mka_establish fixture is truly missing.
- Make wait_for_macsec_cleanup more flexible by using a dynamic poll interval and relying on its default timeout from callers.
Signed-off-by: rajshekhar <[email protected]>
---------
Signed-off-by: rajshekhar <[email protected]>
Signed-off-by: Abhishek <[email protected]>
1 parent a1a5262 commit dfbb6e7
6 files changed
Lines changed: 188 additions & 8 deletions