Add TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED function into watchdog and fix by FengPan-Frank · Pull Request #25962 · sonic-net/sonic-buildimage

FengPan-Frank · 2026-03-09T04:37:01Z

sidecar config issue

Why I did it

Support cname based authorization in telemetry. Fixed the config DB key used by telemetry.

How I did it

Turn on telemetry side gnmi_client_role in sidecar, and add watchdog probe functionality.

How to verify it

Before added, no cname check

After added.
known cname cert could visit SHOW

unknown cname cert will be blocked:

two watchdog prob as below:

Which release branch to backport (provide reason below if selected)

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

mssonicbld · 2026-03-09T04:37:09Z

/azp run Azure.sonic-buildimage

azure-pipelines · 2026-03-09T04:37:20Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

This PR updates SONiC telemetry sidecar/watchdog behavior to support CNAME-based client-certificate authorization by reconciling the correct CONFIG_DB keys and adding new watchdog probes.

Changes:

Add certificate-based “good/bad” gNMI probes to telemetry-watchdog, controlled by a new env var.
Update telemetry-sidecar CONFIG_DB reconciliation to use TELEMETRY|gnmi and support multiple client cert CNAME entries via GNMI_CLIENT_CERTS.
Update sidecar CLI plugin tests to reflect the new CONFIG_DB key and multi-CNAME behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
`dockers/docker-telemetry-watchdog/watchdog/src/main.rs`	Adds cert probe env var + executes additional gNMI probes using hardcoded cert material/CNAMEs.
`dockers/docker-telemetry-sidecar/systemd_stub.py`	Adds JSON parsing for multiple client cert entries and switches reconciliation from `GNMI
`dockers/docker-telemetry-sidecar/cli-plugin-tests/test_systemd_stub.py`	Updates reconcile tests for new key name and adds a multi-CNAME test case.

dockers/docker-telemetry-sidecar/systemd_stub.py

Copilot · 2026-03-09T04:39:32Z

dockers/docker-telemetry-sidecar/systemd_stub.py

+            for e in entries:
+                if "cname" not in e or "role" not in e:
+                    raise ValueError(f"Each entry needs 'cname' and 'role': {e}")
+            return entries
+        except (json.JSONDecodeError, ValueError) as exc:
+            logger.log_error(f"Bad GNMI_CLIENT_CERTS env var: {exc}; falling back to legacy")
+
+    # Legacy single-entry env vars
+    cname = os.getenv("TELEMETRY_CLIENT_CNAME", "")
+    role = os.getenv("GNMI_CLIENT_ROLE", "gnmi_show_readonly")


_parse_client_certs() validates that each JSON entry has 'cname'/'role' keys, but it doesn't validate types/emptiness (e.g., entry not a dict, cname not a string). Since reconcile_config_db_once later indexes entry["cname"], malformed JSON can still cause runtime exceptions. Consider validating each entry is a dict and coercing/validating cname/role as non-empty strings before returning.

Suggested change

for e in entries:

if "cname" not in e or "role" not in e:

raise ValueError(f"Each entry needs 'cname' and 'role': {e}")

return entries

except (json.JSONDecodeError, ValueError) as exc:

logger.log_error(f"Bad GNMI_CLIENT_CERTS env var: {exc}; falling back to legacy")

# Legacy single-entry env vars

cname = os.getenv("TELEMETRY_CLIENT_CNAME", "")

role = os.getenv("GNMI_CLIENT_ROLE", "gnmi_show_readonly")

normalized: List[Dict[str, str]] = []

for e in entries:

if not isinstance(e, dict):

raise ValueError(f"Each entry must be an object: {e!r}")

if "cname" not in e or "role" not in e:

raise ValueError(f"Each entry needs 'cname' and 'role': {e}")

cname = str(e.get("cname", "")).strip()

role = str(e.get("role", "")).strip()

if not cname or not role:

raise ValueError(f"'cname' and 'role' must be non-empty strings: {e}")

normalized.append({"cname": cname, "role": role})

return normalized

except (json.JSONDecodeError, ValueError) as exc:

logger.log_error(f"Bad GNMI_CLIENT_CERTS env var: {exc}; falling back to legacy")

# Legacy single-entry env vars

cname = os.getenv("TELEMETRY_CLIENT_CNAME", "").strip()

role = os.getenv("GNMI_CLIENT_ROLE", "gnmi_show_readonly").strip()

added validation.

Copilot · 2026-03-09T04:39:33Z

dockers/docker-telemetry-sidecar/systemd_stub.py

+def _parse_client_certs() -> List[Dict[str, str]]:
+    """
+    Build the list of GNMI client cert entries from env vars.
+
+    Preferred: GNMI_CLIENT_CERTS  (JSON array of {"cname": ..., "role": ...})
+    Fallback:  TELEMETRY_CLIENT_CNAME / GNMI_CLIENT_ROLE  (single entry, backward-compat)
+    """
+    raw = os.getenv("GNMI_CLIENT_CERTS", "").strip()
+    if raw:
+        try:
+            entries = json.loads(raw)
+            if not isinstance(entries, list):
+                raise ValueError("GNMI_CLIENT_CERTS must be a JSON array")
+            for e in entries:
+                if "cname" not in e or "role" not in e:
+                    raise ValueError(f"Each entry needs 'cname' and 'role': {e}")
+            return entries
+        except (json.JSONDecodeError, ValueError) as exc:
+            logger.log_error(f"Bad GNMI_CLIENT_CERTS env var: {exc}; falling back to legacy")
+
+    # Legacy single-entry env vars
+    cname = os.getenv("TELEMETRY_CLIENT_CNAME", "")
+    role = os.getenv("GNMI_CLIENT_ROLE", "gnmi_show_readonly")
+    if cname:
+        return [{"cname": cname, "role": role}]
+    return []
+
+
+GNMI_CLIENT_CERTS: List[Dict[str, str]] = _parse_client_certs()


New GNMI_CLIENT_CERTS parsing / legacy fallback logic in _parse_client_certs() is not covered by tests (e.g., valid JSON array, non-array JSON, invalid JSON fallback to legacy env vars). Adding focused unit tests would help prevent regressions since this runs at import time and affects CONFIG_DB reconciliation behavior.

added test case TestParseClientCerts

Copilot · 2026-03-09T04:39:33Z

dockers/docker-telemetry-watchdog/watchdog/src/main.rs

+                                let bad_sec = TelemetrySecurityConfig {
+                                    use_client_auth: true,
+                                    ca_crt: DEFAULT_BAD_CA.to_string(),
+                                    server_crt: DEFAULT_BAD_CERT.to_string(),
+                                    server_key: DEFAULT_BAD_KEY.to_string(),
+                                };
+                                let mut res_bad = run_gnmi_for_xpath(&xpath_rc, port, &bad_sec, DEFAULT_BAD_CNAME, timeout, "SHOW");
+                                if res_bad.success {
+                                    res_bad.success = false;
+                                    let msg = "Expected FAILURE with BAD cert but command SUCCEEDED".to_string();
+                                    res_bad.error = Some(match res_bad.error.take() {
+                                        Some(existing) => format!("{existing}; {msg}"),
+                                        None => msg,
+                                    });
+                                    http_status = "HTTP/1.1 500 Internal Server Error";
+                                }
+                                cmd_results.push(res_bad);
+
+                                let good_sec = TelemetrySecurityConfig {
+                                    use_client_auth: true,
+                                    ca_crt: DEFAULT_GOOD_CA.to_string(),
+                                    server_crt: DEFAULT_GOOD_CERT.to_string(),
+                                    server_key: DEFAULT_GOOD_KEY.to_string(),
+                                };
+                                let res_good = run_gnmi_for_xpath(&xpath_rc, port, &good_sec, DEFAULT_GOOD_CNAME, timeout, "SHOW");
+                                if !res_good.success { http_status = "HTTP/1.1 500 Internal Server Error"; }


Certificate probes force use_client_auth: true and hardcode cert/key/CA paths instead of using the runtime security config from Redis. If telemetry is running with TELEMETRY|gnmi.client_auth disabled (insecure mode), these probes will fail and flip the watchdog to HTTP 500 even though telemetry is healthy. Consider gating the cert probes on the actual client_auth setting (or treating them as skipped when TLS client auth is off), and reuse sec_cfg for the 'good' probe so it matches the device's configured certs.

Copilot · 2026-03-09T04:39:33Z

dockers/docker-telemetry-watchdog/watchdog/src/main.rs

+        Ok(v) if v.eq_ignore_ascii_case("false") => false,
+        _ => true, // default enabled


is_cert_probe_enabled defaults to enabled unless TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED is explicitly set to "false". This changes watchdog behavior by default and can introduce new health-check failures (and extra gnmi_get calls) on existing deployments. Consider defaulting this probe to disabled (consistent with the serial number probe) or documenting/rolling it out behind an explicit opt-in env var.

Suggested change

Ok(v) if v.eq_ignore_ascii_case("false") => false,

_ => true, // default enabled

Ok(v) if v.eq_ignore_ascii_case("true") => true,

_ => false, // default disabled

FengPan-Frank · 2026-03-09T04:40:07Z

PR is moved from #24743

mssonicbld · 2026-03-09T10:17:21Z

/azp run Azure.sonic-buildimage

azure-pipelines · 2026-03-09T10:17:35Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-03-09T10:26:30Z

/azp run Azure.sonic-buildimage

azure-pipelines · 2026-03-09T10:26:41Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

dockers/docker-telemetry-watchdog/watchdog/src/main.rs

dockers/docker-telemetry-sidecar/systemd_stub.py

yxieca · 2026-03-11T04:42:21Z

/azpw ms_conflict

mssonicbld · 2026-03-12T00:06:59Z

/azp run Azure.sonic-buildimage

azure-pipelines · 2026-03-12T00:07:12Z

Azure Pipelines successfully started running 1 pipeline(s).

mssonicbld · 2026-03-12T22:25:39Z

/azp run Azure.sonic-buildimage

azure-pipelines · 2026-03-12T22:25:51Z

Azure Pipelines successfully started running 1 pipeline(s).

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

sidecar config issue Signed-off-by: Feng Pan <fenpan@microsoft.com>

mssonicbld · 2026-03-12T22:28:49Z

/azp run Azure.sonic-buildimage

azure-pipelines · 2026-03-12T22:29:00Z

Azure Pipelines successfully started running 1 pipeline(s).

…fix (sonic-net#25962) Add TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED function into watchdog and fix

FengPan-Frank requested a review from lguohan as a code owner March 9, 2026 04:37

Copilot AI review requested due to automatic review settings March 9, 2026 04:37

Copilot started reviewing on behalf of FengPan-Frank March 9, 2026 04:37 View session

FengPan-Frank requested a review from make1980 March 9, 2026 04:38

Copilot AI reviewed Mar 9, 2026

View reviewed changes

FengPan-Frank mentioned this pull request Mar 9, 2026

Add TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED function into watchdog #24743

Closed

9 tasks

FengPan-Frank force-pushed the watchcertss branch from f3f201e to 63713ee Compare March 9, 2026 10:17

Copilot AI review requested due to automatic review settings March 9, 2026 10:26

FengPan-Frank force-pushed the watchcertss branch from 63713ee to 406d45d Compare March 9, 2026 10:26

Copilot started reviewing on behalf of FengPan-Frank March 9, 2026 10:26 View session

Copilot AI reviewed Mar 9, 2026

View reviewed changes

dockers/docker-telemetry-watchdog/watchdog/src/main.rs Outdated Show resolved Hide resolved

dockers/docker-telemetry-sidecar/systemd_stub.py Show resolved Hide resolved

dockers/docker-telemetry-sidecar/systemd_stub.py Show resolved Hide resolved

FengPan-Frank force-pushed the watchcertss branch from 406d45d to a22d23d Compare March 12, 2026 00:06

FengPan-Frank requested review from Copilot and zbud-msft March 12, 2026 08:46

FengPan-Frank force-pushed the watchcertss branch from a22d23d to a72a19c Compare March 12, 2026 22:25

Copilot started reviewing on behalf of FengPan-Frank March 12, 2026 22:26 View session

Copilot AI reviewed Mar 12, 2026

View reviewed changes

Add TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED function into watchdog and fix

b496819

sidecar config issue Signed-off-by: Feng Pan <fenpan@microsoft.com>

FengPan-Frank force-pushed the watchcertss branch from a72a19c to b496819 Compare March 12, 2026 22:28

zbud-msft approved these changes Mar 12, 2026

View reviewed changes

kperumalbfn merged commit f0f4ec6 into sonic-net:master Mar 13, 2026
20 checks passed

yue-fred-gao pushed a commit to yue-fred-gao/sonic-buildimage that referenced this pull request Mar 16, 2026

Add TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED function into watchdog and …

60a75ed

…fix (sonic-net#25962) Add TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED function into watchdog and fix

		Ok(v) if v.eq_ignore_ascii_case("false") => false,
		_ => true, // default enabled

Conversation

FengPan-Frank commented Mar 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why I did it

How I did it

How to verify it

Which release branch to backport (provide reason below if selected)

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

Uh oh!

mssonicbld commented Mar 9, 2026

Uh oh!

azure-pipelines bot commented Mar 9, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

FengPan-Frank Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

FengPan-Frank Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

FengPan-Frank Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

FengPan-Frank Mar 9, 2026

Choose a reason for hiding this comment

Uh oh!

FengPan-Frank commented Mar 9, 2026

Uh oh!

mssonicbld commented Mar 9, 2026

Uh oh!

azure-pipelines bot commented Mar 9, 2026

Uh oh!

mssonicbld commented Mar 9, 2026

Uh oh!

azure-pipelines bot commented Mar 9, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yxieca commented Mar 11, 2026

Uh oh!

mssonicbld commented Mar 12, 2026

Uh oh!

azure-pipelines bot commented Mar 12, 2026

Uh oh!

mssonicbld commented Mar 12, 2026

Uh oh!

azure-pipelines bot commented Mar 12, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

mssonicbld commented Mar 12, 2026

Uh oh!

azure-pipelines bot commented Mar 12, 2026

FengPan-Frank commented Mar 9, 2026 •

edited

Loading