Add TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED function into watchdog and fix #25962
kperumalbfn merged 1 commit into sonic-net:master
/azp run Azure.sonic-buildimage
Azure Pipelines successfully started running 1 pipeline(s).
Pull request overview
This PR updates SONiC telemetry sidecar/watchdog behavior to support CNAME-based client-certificate authorization by reconciling the correct CONFIG_DB keys and adding new watchdog probes.
Changes:
- Add certificate-based “good/bad” gNMI probes to telemetry-watchdog, controlled by a new env var.
- Update telemetry-sidecar CONFIG_DB reconciliation to use `TELEMETRY|gnmi` and support multiple client cert CNAME entries via `GNMI_CLIENT_CERTS`.
- Update sidecar CLI plugin tests to reflect the new CONFIG_DB key and multi-CNAME behavior.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `dockers/docker-telemetry-watchdog/watchdog/src/main.rs` | Adds cert probe env var + executes additional gNMI probes using hardcoded cert material/CNAMEs. |
| `dockers/docker-telemetry-sidecar/systemd_stub.py` | Adds JSON parsing for multiple client cert entries and switches reconciliation from `GNMI…` |
| `dockers/docker-telemetry-sidecar/cli-plugin-tests/test_systemd_stub.py` | Updates reconcile tests for new key name and adds a multi-CNAME test case. |

                for e in entries:
                    if "cname" not in e or "role" not in e:
                        raise ValueError(f"Each entry needs 'cname' and 'role': {e}")
                return entries
            except (json.JSONDecodeError, ValueError) as exc:
                logger.log_error(f"Bad GNMI_CLIENT_CERTS env var: {exc}; falling back to legacy")

        # Legacy single-entry env vars
        cname = os.getenv("TELEMETRY_CLIENT_CNAME", "")
        role = os.getenv("GNMI_CLIENT_ROLE", "gnmi_show_readonly")

_parse_client_certs() validates that each JSON entry has 'cname'/'role' keys, but it doesn't validate types/emptiness (e.g., entry not a dict, cname not a string). Since reconcile_config_db_once later indexes entry["cname"], malformed JSON can still cause runtime exceptions. Consider validating each entry is a dict and coercing/validating cname/role as non-empty strings before returning.

    -            for e in entries:
    -                if "cname" not in e or "role" not in e:
    -                    raise ValueError(f"Each entry needs 'cname' and 'role': {e}")
    -            return entries
    -        except (json.JSONDecodeError, ValueError) as exc:
    -            logger.log_error(f"Bad GNMI_CLIENT_CERTS env var: {exc}; falling back to legacy")
    -    # Legacy single-entry env vars
    -    cname = os.getenv("TELEMETRY_CLIENT_CNAME", "")
    -    role = os.getenv("GNMI_CLIENT_ROLE", "gnmi_show_readonly")
    +            normalized: List[Dict[str, str]] = []
    +            for e in entries:
    +                if not isinstance(e, dict):
    +                    raise ValueError(f"Each entry must be an object: {e!r}")
    +                if "cname" not in e or "role" not in e:
    +                    raise ValueError(f"Each entry needs 'cname' and 'role': {e}")
    +                cname = str(e.get("cname", "")).strip()
    +                role = str(e.get("role", "")).strip()
    +                if not cname or not role:
    +                    raise ValueError(f"'cname' and 'role' must be non-empty strings: {e}")
    +                normalized.append({"cname": cname, "role": role})
    +            return normalized
    +        except (json.JSONDecodeError, ValueError) as exc:
    +            logger.log_error(f"Bad GNMI_CLIENT_CERTS env var: {exc}; falling back to legacy")
    +    # Legacy single-entry env vars
    +    cname = os.getenv("TELEMETRY_CLIENT_CNAME", "").strip()
    +    role = os.getenv("GNMI_CLIENT_ROLE", "gnmi_show_readonly").strip()

added validation.

    def _parse_client_certs() -> List[Dict[str, str]]:
        """
        Build the list of GNMI client cert entries from env vars.

        Preferred: GNMI_CLIENT_CERTS (JSON array of {"cname": ..., "role": ...})
        Fallback: TELEMETRY_CLIENT_CNAME / GNMI_CLIENT_ROLE (single entry, backward-compat)
        """
        raw = os.getenv("GNMI_CLIENT_CERTS", "").strip()
        if raw:
            try:
                entries = json.loads(raw)
                if not isinstance(entries, list):
                    raise ValueError("GNMI_CLIENT_CERTS must be a JSON array")
                for e in entries:
                    if "cname" not in e or "role" not in e:
                        raise ValueError(f"Each entry needs 'cname' and 'role': {e}")
                return entries
            except (json.JSONDecodeError, ValueError) as exc:
                logger.log_error(f"Bad GNMI_CLIENT_CERTS env var: {exc}; falling back to legacy")

        # Legacy single-entry env vars
        cname = os.getenv("TELEMETRY_CLIENT_CNAME", "")
        role = os.getenv("GNMI_CLIENT_ROLE", "gnmi_show_readonly")
        if cname:
            return [{"cname": cname, "role": role}]
        return []


    GNMI_CLIENT_CERTS: List[Dict[str, str]] = _parse_client_certs()

New GNMI_CLIENT_CERTS parsing / legacy fallback logic in _parse_client_certs() is not covered by tests (e.g., valid JSON array, non-array JSON, invalid JSON fallback to legacy env vars). Adding focused unit tests would help prevent regressions since this runs at import time and affects CONFIG_DB reconciliation behavior.
added test case TestParseClientCerts
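
A minimal sketch of the cases such tests could cover. It does not reproduce the PR's `TestParseClientCerts`: `parse_client_certs` is a hypothetical stand-in that takes the environment as a mapping (instead of reading `os.environ` at import time) and mirrors the parsing, validation, and legacy fallback shown in the diff. The CNAME values are made up for illustration.

```python
import json
from typing import Dict, List, Mapping


def parse_client_certs(env: Mapping[str, str]) -> List[Dict[str, str]]:
    """Hypothetical stand-in for _parse_client_certs(); reads from `env`."""
    raw = env.get("GNMI_CLIENT_CERTS", "").strip()
    if raw:
        try:
            entries = json.loads(raw)
            if not isinstance(entries, list):
                raise ValueError("GNMI_CLIENT_CERTS must be a JSON array")
            normalized: List[Dict[str, str]] = []
            for e in entries:
                if not isinstance(e, dict):
                    raise ValueError(f"Each entry must be an object: {e!r}")
                if "cname" not in e or "role" not in e:
                    raise ValueError(f"Each entry needs 'cname' and 'role': {e}")
                cname = str(e["cname"]).strip()
                role = str(e["role"]).strip()
                if not cname or not role:
                    raise ValueError(f"'cname' and 'role' must be non-empty: {e}")
                normalized.append({"cname": cname, "role": role})
            return normalized
        except (json.JSONDecodeError, ValueError):
            pass  # the real code logs the error before falling back

    # Legacy single-entry env vars
    cname = env.get("TELEMETRY_CLIENT_CNAME", "").strip()
    role = env.get("GNMI_CLIENT_ROLE", "gnmi_show_readonly").strip()
    return [{"cname": cname, "role": role}] if cname else []


# Case 1: a valid JSON array is returned entry by entry.
valid = parse_client_certs({
    "GNMI_CLIENT_CERTS": '[{"cname": "client-a.example", "role": "gnmi_show_readonly"}]'
})

# Case 2: non-array JSON falls back to the legacy env vars.
fallback = parse_client_certs({
    "GNMI_CLIENT_CERTS": '{"cname": "x", "role": "y"}',
    "TELEMETRY_CLIENT_CNAME": "legacy.example",
})

# Case 3: invalid JSON and no legacy CNAME yields an empty list.
empty = parse_client_certs({"GNMI_CLIENT_CERTS": "not json"})
```

Passing the environment in explicitly is only a convenience for this sketch; tests against the real module would more likely patch `os.environ` and reload or call the function directly.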

    let bad_sec = TelemetrySecurityConfig {
        use_client_auth: true,
        ca_crt: DEFAULT_BAD_CA.to_string(),
        server_crt: DEFAULT_BAD_CERT.to_string(),
        server_key: DEFAULT_BAD_KEY.to_string(),
    };
    let mut res_bad = run_gnmi_for_xpath(&xpath_rc, port, &bad_sec, DEFAULT_BAD_CNAME, timeout, "SHOW");
    if res_bad.success {
        res_bad.success = false;
        let msg = "Expected FAILURE with BAD cert but command SUCCEEDED".to_string();
        res_bad.error = Some(match res_bad.error.take() {
            Some(existing) => format!("{existing}; {msg}"),
            None => msg,
        });
        http_status = "HTTP/1.1 500 Internal Server Error";
    }
    cmd_results.push(res_bad);

    let good_sec = TelemetrySecurityConfig {
        use_client_auth: true,
        ca_crt: DEFAULT_GOOD_CA.to_string(),
        server_crt: DEFAULT_GOOD_CERT.to_string(),
        server_key: DEFAULT_GOOD_KEY.to_string(),
    };
    let res_good = run_gnmi_for_xpath(&xpath_rc, port, &good_sec, DEFAULT_GOOD_CNAME, timeout, "SHOW");
    if !res_good.success { http_status = "HTTP/1.1 500 Internal Server Error"; }

Certificate probes force use_client_auth: true and hardcode cert/key/CA paths instead of using the runtime security config from Redis. If telemetry is running with TELEMETRY|gnmi.client_auth disabled (insecure mode), these probes will fail and flip the watchdog to HTTP 500 even though telemetry is healthy. Consider gating the cert probes on the actual client_auth setting (or treating them as skipped when TLS client auth is off), and reuse sec_cfg for the 'good' probe so it matches the device's configured certs.
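
The gating suggested here can be sketched independently of the watchdog's Rust code. The following is a language-neutral illustration written in Python, not the PR's implementation: `ProbeOutcome`, `evaluate_cert_probes`, and the `run_probe` callback are hypothetical names, and the "good cert failed" message is invented for the sketch. The idea is to skip both probes when client auth is off, treat a successful bad-cert request as a failure, and require the good-cert request to succeed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class ProbeOutcome:
    name: str
    passed: bool  # True when the probe behaved as expected
    error: Optional[str] = None


def evaluate_cert_probes(
    client_auth_enabled: bool,
    run_probe: Callable[[str], bool],
) -> Tuple[List[ProbeOutcome], bool]:
    """Return (outcomes, healthy). `run_probe(kind)` reports whether the
    gNMI command succeeded for kind in {"bad", "good"} (hypothetical)."""
    if not client_auth_enabled:
        # Insecure mode: cert probes are skipped and do not affect health.
        return [], True

    outcomes: List[ProbeOutcome] = []

    # The bad cert is expected to be rejected, so success is an error.
    bad_succeeded = run_probe("bad")
    outcomes.append(ProbeOutcome(
        name="bad_cert",
        passed=not bad_succeeded,
        error="Expected FAILURE with BAD cert but command SUCCEEDED"
        if bad_succeeded else None,
    ))

    # The good cert is expected to be accepted.
    good_succeeded = run_probe("good")
    outcomes.append(ProbeOutcome(
        name="good_cert",
        passed=good_succeeded,
        error=None if good_succeeded
        else "Expected SUCCESS with GOOD cert but command FAILED",
    ))

    healthy = all(o.passed for o in outcomes)
    return outcomes, healthy
```

In this sketch `passed` records whether the probe met its expectation, which differs from the Rust snippet, where the bad-cert result keeps the raw command status.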

    Ok(v) if v.eq_ignore_ascii_case("false") => false,
    _ => true, // default enabled

is_cert_probe_enabled defaults to enabled unless TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED is explicitly set to "false". This changes watchdog behavior by default and can introduce new health-check failures (and extra gnmi_get calls) on existing deployments. Consider defaulting this probe to disabled (consistent with the serial number probe) or documenting/rolling it out behind an explicit opt-in env var.

    -    Ok(v) if v.eq_ignore_ascii_case("false") => false,
    -    _ => true, // default enabled
    +    Ok(v) if v.eq_ignore_ascii_case("true") => true,
    +    _ => false, // default disabled

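
For comparison, the opt-in semantics of the suggested change can be sketched in Python. This is only an illustration of the default-disabled behavior, not code from the PR; the function takes the environment as a mapping so it can be exercised without touching the process environment.

```python
from typing import Mapping


def is_cert_probe_enabled(env: Mapping[str, str]) -> bool:
    """Opt-in flag: enabled only when the variable is set to "true"
    (case-insensitive); unset or any other value means disabled."""
    value = env.get("TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED")
    return value is not None and value.lower() == "true"
```

With this shape, existing deployments that never set the variable keep their current watchdog behavior, and values such as `1` or `yes` do not enable the probe.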
This PR was moved from #24743.
sidecar config issue
Signed-off-by: Feng Pan <fenpan@microsoft.com>
Why I did it
Support CNAME-based authorization in telemetry, and fix the CONFIG_DB key used by telemetry.
How I did it
Enable the telemetry-side gnmi_client_role in the sidecar, and add certificate probe functionality to the watchdog.
How to verify it
Before the change, there is no CNAME check:

After the change:

A cert with a known CNAME can access SHOW; a cert with an unknown CNAME is blocked:

The two watchdog probes are shown below:
