Commit dfbb6e7
Fix MACsec test reliability and configuration issues (sonic-net#21372)
* ISS-2888: Fix JSON syntax in golden_config_db_t2.j2 template (sonic-net#401)

  Fixes a JSON syntax error that is seen only when the DUT is prepared with the MACsec enable flag:

  `json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 3 (char 4)`

  Type of change: Bug fix

  Signed-off-by: rajshekhar <[email protected]>

* ISS-2969: Generate golden config only if macsec_profile is defined (sonic-net#420)

  Redundant override config is avoided, as no MACsec profile is set in the prepare phase. MACsec profile configurations are rendered in two phases:

  - PREPARE phase: uses `generate_t2_golden_config_db()` → template rendering → file-based config
  - RUN phase: uses `set_macsec_profile()` → direct `sonic-db-cli` commands → immediate CONFIG_DB update

  Type of change: Bug fix

  Signed-off-by: rajshekhar <[email protected]>

* ISS-3251: Guard MACsec restart against systemd StartLimitHit; add restart helper (sonic-net#562)

  Summary:
  - Add a StartLimitHit-safe restart helper and use it in the MACsec docker restart test to reduce flakiness.
  - New helper `restart_service_with_startlimit_guard()` in `tests/common/helpers/dut_utils.py`:
    - Detects StartLimitHit before and after restart attempts
    - Proactively clears systemd failure counters (`systemctl reset-failed`)
    - Applies a fixed bounded backoff when rate-limited (default 35s in the test), then `systemctl start`
    - Verifies the target container becomes running within a configurable timeout using existing `wait_until`/state checks
  - Update `tests/macsec/test_docker_restart.py` to use the new helper instead of a direct `duthost.restart_service("macsec")` call.

  Motivation: MACsec docker restart tests can intermittently fail when systemd enforces StartLimitHit due to rapid restart attempts during teardown/restart cycles. Guarding against StartLimitHit with a clear backoff-and-start flow improves test reliability without changing device behavior.

  Verification:
  - Executed `tests/macsec/test_docker_restart.py::test_restart_macsec_docker` in a lab testbed with MACsec enabled.
  - Repeated the restart sequence to emulate rate-limiting scenarios.
  - Verified the helper reliably recovers from StartLimitHit and the container becomes running within the timeout.

  Type of change: Test case improvement

  Signed-off-by: rajshekhar <[email protected]>

* NOS-3311: Fix MACsec test race and cleanup sync (sonic-net#678)

  NOS-3311 tracks MACsec test flakiness caused by races between:
  - wpa_supplicant/MKA programming MACsec state into Redis (APPL/STATE DB), and
  - the test harness eagerly reading that state to build `MACSEC_INFO` (via `get_macsec_attr`).

  This can manifest as exceptions like `KeyError('sak')` when the MACsec egress SA row does not yet exist, even though `MACSEC_PORT_TABLE` already shows `enable_encrypt="true"`. There are also cleanup races where tests check for removal of MACsec DB entries before the background cleanup logic has finished.

  This PR adds two pieces of synchronization in sonic-mgmt:

  1. Ensure MKA establishment before pre-loading MACsec session info for tests.
  2. Provide a helper to wait for MACsec DB cleanup after disabling MACsec.

  File: `tests/common/macsec/__init__.py`
  - The `load_macsec_info` fixture (module-scoped, autouse) previously called `load_all_macsec_info()` immediately when MACsec was enabled and a profile was present. That in turn calls `get_macsec_attr()`, which expects APP/STATE DB MACsec SC/SA entries (including `sak`) to be fully programmed.
  - In environments where MACsec is pre-configured before tests start, this created a race: `MACSEC_PORT_TABLE` might already exist (with `enable_encrypt="true"`), but the egress SA row for the active AN might not yet have been written to APP_DB, leading to `KeyError('sak')` when `macsec_sa["sak"]` is accessed.
  - Fix: when MACsec is enabled and a profile is present, the fixture now first attempts to resolve the `wait_mka_establish` fixture:

    ```python
    try:
        request.getfixturevalue('wait_mka_establish')
    except Exception:
        pass
    ```

    `wait_mka_establish` is defined in `tests/macsec/conftest.py` and internally uses `check_appl_db` plus `wait_until(...)` to ensure APP/STATE DB MACsec SC/SA tables are populated (including `sak`/`auth_key`/PN relationships) before returning. If the fixture is not defined (e.g., in other environments or test suites), the code falls back to the previous behavior.
  - After this synchronization point, if `is_macsec_configured(...)` is true, `load_all_macsec_info()` is called to populate `MACSEC_INFO` for all control links. Otherwise, the original `macsec_setup` flow is triggered.
  - This makes the `get_macsec_attr()` execution order consistent with the rest of the MACsec test suite, which already relies on `wait_mka_establish`/`check_appl_db` to guarantee that egress SAs and SAKs exist before validating state.

  Cleanup file: `tests/common/macsec/macsec_config_helper.py`
  - Add `wait_for_macsec_cleanup(host, interfaces, timeout=90)` and export it via `__all__`.
  - This helper is designed for tests that disable MACsec on one or more interfaces and then need to assert that all associated MACsec entries (port, SC, SA) have been automatically removed from Redis before proceeding.
  - Behavior:
    - For EOS neighbors, it is a no-op: they do not use Redis DBs and the function returns `True` immediately.
    - For SONiC hosts, it polls both APPL_DB and STATE_DB with key patterns `MACSEC_*:{interface}*` (APPL_DB) and `MACSEC_*|{interface}*` (STATE_DB), aggregates any remaining keys per DB, and returns `True` as soon as all such keys are gone for the given interfaces, logging the total time taken.
    - If the `timeout` is exceeded, it logs a warning, prints a summary of the remaining entries, and returns `False`.
  - This centralizes the logic for "wait until MACsec entries are gone from Redis" instead of having ad hoc sleeps or partial checks in individual tests.

  Background:
  - MACsec control-plane actions (via wpa_supplicant and swss/macsecorch) are asynchronous relative to the tests. It is valid for `MACSEC_PORT_TABLE` to show `enable_encrypt="true"` while transmit SAs and their SAKs are still being programmed.
  - `get_macsec_attr()` assumes that the APP_DB `MACSEC_EGRESS_SC_TABLE` entry for `(port, sci)` exists and has a valid `encoding_an`, and that the APP_DB `MACSEC_EGRESS_SA_TABLE` entry for `(port, sci, an)` exists and has a `sak` field. Without synchronization, tests that pre-load `MACSEC_INFO` can hit a window where the SA row does not yet exist and crash with `KeyError('sak')`.
  - By tying `load_macsec_info` to `wait_mka_establish` where available, those pre-loads happen only after the expected MACsec state has been fully written to Redis.
  - Similarly, when disabling MACsec, asynchronous background cleanup can lag behind the test's expectations. A dedicated, reusable `wait_for_macsec_cleanup` helper lets future tests explicitly wait for cleanup completion instead of guessing with sleeps.

  Verification:
  - Verified that the new fixtures and helpers are imported and wired correctly: `load_macsec_info` remains `autouse=True` at module scope, so existing MACsec tests automatically benefit from the additional synchronization, and `wait_for_macsec_cleanup` is exported in `__all__` for use by future MACsec tests.
  - Manually exercised MACsec configuration and teardown flows in a MACsec-enabled testbed (e.g., humm120) to confirm that MACsec sessions establish successfully and APP/STATE DB contain the expected MACsec entries before `load_all_macsec_info` is invoked, and that disabling MACsec followed by `wait_for_macsec_cleanup` results in all `MACSEC_*` keys being removed from APPL/STATE DB within the timeout window.

  Pull Request opened by [Augment Code](https://www.augmentcode.com/) with guidance from the PR author.

  Signed-off-by: rajshekhar <[email protected]>

* Taking care of review comments

  - Refine `restart_service_with_startlimit_guard` to better handle pre-existing StartLimitHit, avoid unnecessary restarts, and apply a shorter backoff when not actually rate-limited.
  - Narrow the exception in MacsecPlugin to `pytest.FixtureLookupError` so we only fall back when the `wait_mka_establish` fixture is truly missing.
  - Make `wait_for_macsec_cleanup` more flexible by using a dynamic poll interval and relying on its default timeout from callers.

  Signed-off-by: rajshekhar <[email protected]>

---

Signed-off-by: rajshekhar <[email protected]>
Signed-off-by: Abhishek <[email protected]>
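The APPL_DB and STATE_DB key patterns used by the cleanup helper differ only in the table/key separator (`:` for APPL_DB, `|` for STATE_DB). A minimal sketch of how those glob patterns match MACsec keys — the sample key names here are illustrative, not taken from a live device:

```python
from fnmatch import fnmatch

# Hypothetical Redis keys as they might appear for interface Ethernet0;
# APPL_DB uses ':' as the table/key separator, STATE_DB uses '|'.
appl_keys = [
    "MACSEC_PORT_TABLE:Ethernet0",
    "MACSEC_EGRESS_SC_TABLE:Ethernet0:5254008f4f1c0001",
]
state_keys = [
    "MACSEC_PORT_TABLE|Ethernet0",
]

interface = "Ethernet0"
appl_pattern = f"MACSEC_*:{interface}*"
state_pattern = f"MACSEC_*|{interface}*"

assert all(fnmatch(k, appl_pattern) for k in appl_keys)
assert all(fnmatch(k, state_pattern) for k in state_keys)
# The APPL_DB pattern does not match STATE_DB-style keys, and vice versa,
# which is why the helper polls the two DBs with separate patterns.
assert not fnmatch(state_keys[0], appl_pattern)
```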
1 parent a1a5262 commit dfbb6e7

6 files changed: 188 additions & 8 deletions

ansible/config_sonic_basedon_testbed.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -756,7 +756,7 @@
     macsec_profile: "{{ macsec_profile }}"
     num_asics: "{{ num_asics }}"
   become: true
-  when: "('t2' in topo) and (enable_macsec is defined)"
+  when: "('t2' in topo) and (macsec_profile is defined)"

 - name: Use minigraph case
   block:
```

ansible/templates/golden_config_db_t2.j2

Lines changed: 1 addition & 3 deletions

```diff
@@ -23,8 +23,7 @@
 {% endif %}
 {%- endfor -%}
 {% else %}
-{
-"MACSEC_PROFILE": {
+"MACSEC_PROFILE": {
 "{{macsec_profile}}": {
 "priority": "{{priority}}",
 "cipher_suite": "{{cipher_suite}}",
@@ -34,6 +33,5 @@
 "send_sci": "{{send_sci}}"
 }
 }
-},
 {%- endif -%}
 }
```

tests/common/helpers/dut_utils.py

Lines changed: 58 additions & 0 deletions

New helper added after `clear_failed_flag_and_restart` (hunk `@@ -126,6 +126,64 @@`):

```python
def restart_service_with_startlimit_guard(duthost, service_name, backoff_seconds=30, verify_timeout=180):
    """
    Restart a systemd-managed service with StartLimitHit guard.

    Strategy:
    0) Pre-detect StartLimitHit and, if present, skip a failing restart
    1) When not rate-limited, reset-failed to clear stale counters and try restart
    2) If restart fails, rate-limit is detected, or container isn't running:
       - 'systemctl reset-failed <service>.service'
       - fixed backoff (default 30s when rate-limited, 1s otherwise)
       - 'systemctl start <service>.service'
       - wait until container is running

    Returns: True when the service is (re)started and running; asserts on failure.
    """
    # 0) Pre-detect StartLimitHit so we can optionally skip a failing restart
    pre_rate_limited = is_hitting_start_limit(duthost, service_name)

    if not pre_rate_limited:
        # 1) Proactively clear stale failure counters and try a normal restart
        duthost.shell(
            f"sudo systemctl reset-failed {service_name}.service",
            module_ignore_errors=True
        )
        ret = duthost.shell(
            f"sudo systemctl restart {service_name}.service",
            module_ignore_errors=True
        )
        rate_limited = is_hitting_start_limit(duthost, service_name)
    else:
        logger.info(
            f"StartLimitHit pre-detected for {service_name}, applying reset-failed and "
            f"fixed backoff {backoff_seconds}s before start"
        )
        # Force the recovery path below without attempting an immediate restart.
        ret = {"rc": 1}
        rate_limited = True

    # 2) Recovery path: reset-failed + backoff + start if needed
    if ret.get("rc", 1) != 0 or rate_limited or not is_container_running(duthost, service_name):
        duthost.shell(
            f"sudo systemctl reset-failed {service_name}.service",
            module_ignore_errors=True
        )
        time.sleep(backoff_seconds if rate_limited else 1)
        duthost.shell(
            f"sudo systemctl start {service_name}.service",
            module_ignore_errors=True
        )
        pytest_assert(
            wait_until(verify_timeout, 1, 0, check_container_state, duthost, service_name, True),
            f"{service_name} container did not become running after recovery start"
        )

    return True
```
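The guard above relies on `is_hitting_start_limit()` to spot systemd rate limiting. As a hedged sketch — the real helper in `dut_utils.py` may use different plumbing — the check boils down to recognizing the `start-limit-hit` result that systemd reports in `systemctl status` output:

```python
# Sketch only: the actual is_hitting_start_limit() in
# tests/common/helpers/dut_utils.py may parse different fields; the key idea
# is matching the "start-limit-hit" result string systemd emits.
def looks_rate_limited(status_output):
    """Return True if systemctl status text indicates StartLimitHit."""
    return ("start-limit-hit" in status_output
            or "Start request repeated too quickly" in status_output)

# Illustrative excerpt of systemd status output for a rate-limited unit.
sample = (
    "macsec.service - MACSEC container\n"
    "   Active: failed (Result: start-limit-hit)\n"
)
assert looks_rate_limited(sample)
assert not looks_rate_limited("Active: active (running)")
```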

tests/common/macsec/__init__.py

Lines changed: 20 additions & 0 deletions

The `load_macsec_info` fixture (hunk `@@ -122,7 +122,27 @@`) now reads:

```python
@pytest.fixture(scope="module", autouse=True)
def load_macsec_info(self, request, macsec_duthost, ctrl_links, macsec_profile, tbinfo):
    """Pre-load MACsec session info for all control links.

    If MACsec is enabled and configured for this DUT/profile, wait for
    MKA establishment (APP/STATE DB populated with SC/SA, including SAK)
    before calling ``load_all_macsec_info``. This avoids races where
    ``get_macsec_attr`` hits APP_DB before the egress SA row (and ``sak``)
    has been written by wpa_supplicant.
    """
    if get_macsec_enable_status(macsec_duthost) and get_macsec_profile(macsec_duthost):
        # Ensure MKA sessions are established (SC/SA present in DB) if the
        # test environment provides the wait_mka_establish fixture
        # (defined in tests/macsec/conftest.py). For environments that do
        # not define it, fall back to the original behaviour.
        try:
            request.getfixturevalue('wait_mka_establish')
        except pytest.FixtureLookupError:
            # Some environments do not define wait_mka_establish; fall back
            # to the original behaviour when the fixture is missing.
            pass
        if is_macsec_configured(macsec_duthost, macsec_profile, ctrl_links):
            load_all_macsec_info(macsec_duthost, ctrl_links, tbinfo)
        else:
            # ... unchanged context: original macsec_setup flow
```
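The `KeyError('sak')` race this fixture change closes can be shown in miniature with plain dictionaries standing in for APP_DB tables — the key names follow the APPL_DB layout, but the SCI value and table contents are invented for illustration:

```python
# Illustrative stand-in for APP_DB mid-establishment: the port table and
# egress SC row are already written, but the egress SA row (which carries
# the SAK) has not yet been programmed by wpa_supplicant.
app_db = {
    "MACSEC_PORT_TABLE:Ethernet0": {"enable_encrypt": "true"},
    "MACSEC_EGRESS_SC_TABLE:Ethernet0:5254008f4f1c0001": {"encoding_an": "0"},
    # "MACSEC_EGRESS_SA_TABLE:Ethernet0:5254008f4f1c0001:0" is missing
}

sa_key = "MACSEC_EGRESS_SA_TABLE:Ethernet0:5254008f4f1c0001:0"
macsec_sa = app_db.get(sa_key, {})
try:
    sak = macsec_sa["sak"]  # roughly what get_macsec_attr() accesses
except KeyError as e:
    # This is the window wait_mka_establish waits out before pre-loading.
    assert str(e) == "'sak'"
```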

tests/common/macsec/macsec_config_helper.py

Lines changed: 105 additions & 3 deletions

Tidy the imports and export the new helper (hunks `@@ -1,6 +1,5 @@` and `@@ -17,7 +16,8 @@`):

```diff
@@ -1,6 +1,5 @@
 import logging
 import time
-
 from tests.common.macsec.macsec_helper import get_mka_session, getns_prefix, wait_all_complete, \
     submit_async_task, load_all_macsec_info
 from tests.common.macsec.macsec_platform_helper import global_cmd, find_portchannel_from_member, get_portchannel
@@ -17,7 +16,8 @@
     'enable_macsec_port',
     'disable_macsec_port',
     'get_macsec_enable_status',
-    'get_macsec_profile'
+    'get_macsec_profile',
+    'wait_for_macsec_cleanup'
 ]
```

`cleanup_macsec_configuration` now waits for automatic cleanup on the DUT and neighbors (hunk `@@ -219,9 +219,21 @@`):

```diff
         submit_async_task(delete_macsec_profile, (d, None, profile_name))
     wait_all_complete(timeout=300)

+    logger.info("Cleanup macsec configuration step3: wait for automatic cleanup")
+
+    # Extract DUT interface names from ctrl_links and wait for automatic
+    # MACsec cleanup on the DUT side.
+    interfaces = list(ctrl_links.keys())
+    wait_for_macsec_cleanup(duthost, interfaces)
+
+    # Also wait for neighbor devices to complete automatic cleanup for their
+    # corresponding ports.
+    for dut_port, nbr in list(ctrl_links.items()):
+        wait_for_macsec_cleanup(nbr["host"], [nbr["port"]])
+
     logger.info("Cleanup macsec configuration finished")

-    # Waiting for all mka session were cleared in all devices
+    # Waiting for all MKA sessions to be cleared on neighbor devices.
     for d in devices:
         if isinstance(d, EosHost):
             continue
```

The new helper appended at the end of the file (hunk `@@ -268,3 +280,93 @@`):

```python
def wait_for_macsec_cleanup(host, interfaces, timeout=90):
    """Wait for MACsec daemon to automatically clean up all MACsec entries.

    This function implements proper synchronization to wait for the automatic
    cleanup process to complete, preserving the intended MACsec cleanup behavior.

    Args:
        host: SONiC DUT or neighbor host object
        interfaces: List of interface names to check
        timeout: Maximum time to wait in seconds for MACsec cleanup to finish (default: 90).

    Returns:
        bool: True if cleanup completed, False if timeout
    """
    if isinstance(host, EosHost):
        # EOS hosts don't use Redis databases
        logger.info("EOS host detected, skipping Redis cleanup verification")
        return True

    logger.info(f"Waiting for automatic MACsec cleanup (timeout: {timeout}s)")

    start_time = time.time()
    # Poll at most ~10 times over the full timeout, capped at 10 seconds between checks.
    poll_interval = min(10, max(1, timeout / 10.0))

    # We only care about APPL_DB and STATE_DB for MACsec tables. Instead of
    # trying to reverse-engineer numeric DB IDs from CONFIG_DB, rely on
    # sonic-db-cli with logical DB names and the same namespace logic used
    # elsewhere in MACsec helpers.
    while time.time() - start_time < timeout:
        all_clean = True
        remaining_entries = {}

        for interface in interfaces:
            ns_prefix = getns_prefix(host, interface)

            for db_name, sep in (("APPL_DB", ":"), ("STATE_DB", "|")):
                pattern = f"MACSEC_*{sep}{interface}*"
                cmd = f"sonic-db-cli {ns_prefix} {db_name} KEYS '{pattern}'"

                try:
                    result = host.command(cmd, verbose=False)
                    out_lines = result.get("stdout_lines", [])
                except Exception as e:
                    logger.warning(
                        "Failed to query MACsec keys on host %s, DB %s, interface %s: %r",
                        getattr(host, 'hostname', host),
                        db_name,
                        interface,
                        e,
                    )
                    # If we cannot query Redis for this DB/interface, be
                    # conservative and assume cleanup is not complete yet.
                    all_clean = False
                    continue

                keys = [k.strip() for k in out_lines if k.strip()]
                if keys:
                    all_clean = False
                    remaining_entries.setdefault((db_name, interface), []).extend(keys)

        elapsed = time.time() - start_time

        if all_clean:
            logger.info(
                f"Automatic MACsec cleanup completed successfully in {elapsed:.1f}s"
            )
            return True

        # Log progress every 30 seconds to reduce verbosity
        if int(elapsed) % 30 == 0 and elapsed > 0:
            logger.info(f"Still waiting for cleanup... ({elapsed:.0f}s elapsed)")

        time.sleep(poll_interval)

    # Timeout reached
    elapsed = time.time() - start_time
    logger.warning(f"Automatic MACsec cleanup timeout after {elapsed:.1f}s")

    # Log summary of remaining entries
    total_remaining = sum(len(entries) for entries in remaining_entries.values())
    if total_remaining > 0:
        logger.warning(
            f" {total_remaining} MACsec entries still remain after timeout"
        )

    return False
```
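The dynamic poll interval used by `wait_for_macsec_cleanup` aims for roughly ten checks over the timeout window, clamped between 1 and 10 seconds:

```python
def poll_interval_for(timeout):
    # Same expression as in wait_for_macsec_cleanup: ~10 polls across the
    # timeout, clamped to the range [1, 10] seconds.
    return min(10, max(1, timeout / 10.0))

assert poll_interval_for(90) == 9.0   # default timeout: poll every 9s
assert poll_interval_for(300) == 10   # long timeouts capped at 10s
assert poll_interval_for(5) == 1      # short timeouts floored at 1s
```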

tests/macsec/test_docker_restart.py

Lines changed: 3 additions & 1 deletion

```diff
@@ -3,6 +3,8 @@

 from tests.common.utilities import wait_until
 from tests.common.macsec.macsec_helper import check_appl_db
+from tests.common.helpers.dut_utils import restart_service_with_startlimit_guard
+

 logger = logging.getLogger(__name__)

@@ -17,6 +19,6 @@ def test_restart_macsec_docker(duthosts, ctrl_links, policy, cipher_suite, send_
     duthost = duthosts[enum_rand_one_per_hwsku_macsec_frontend_hostname]

     logger.info(duthost.shell(cmd="docker ps", module_ignore_errors=True)['stdout'])
-    duthost.restart_service("macsec")
+    restart_service_with_startlimit_guard(duthost, "macsec", backoff_seconds=35, verify_timeout=180)
     logger.info(duthost.shell(cmd="docker ps", module_ignore_errors=True)['stdout'])
     assert wait_until(300, 6, 12, check_appl_db, duthost, ctrl_links, policy, cipher_suite, send_sci)
```
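The `wait_until(300, 6, 12, check_appl_db, ...)` call follows the sonic-mgmt convention of `(timeout, interval, delay, condition, *args)`. A toy reimplementation, assuming that argument order (the real `tests.common.utilities.wait_until` may differ in details), illustrates the polling semantics:

```python
import time

# Minimal sketch of a sonic-mgmt-style wait_until(timeout, interval, delay,
# condition, *args): sleep `delay` once, then poll `condition` every
# `interval` seconds until it returns truthy or `timeout` elapses.
def wait_until(timeout, interval, delay, condition, *args):
    time.sleep(delay)  # initial grace period before the first check
    deadline = time.time() + timeout
    while time.time() < deadline:
        if condition(*args):
            return True
        time.sleep(interval)
    return False

# Usage: a condition that only succeeds on the third poll.
calls = []
def becomes_true_on_third_call():
    calls.append(1)
    return len(calls) >= 3

assert wait_until(2, 0.01, 0, becomes_true_on_third_call)
assert len(calls) == 3
```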
