Skip to content

Add TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED function into watchdog and fix#25962

Merged
kperumalbfn merged 1 commit intosonic-net:masterfrom
FengPan-Frank:watchcertss
Mar 13, 2026
Merged

Add TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED function into watchdog and fix#25962
kperumalbfn merged 1 commit intosonic-net:masterfrom
FengPan-Frank:watchcertss

Conversation

@FengPan-Frank
Copy link
Contributor

@FengPan-Frank FengPan-Frank commented Mar 9, 2026

sidecar config issue

Why I did it

Support cname based authorization in telemetry. Fixed the config DB key used by telemetry.

How I did it

Turn on telemetry side gnmi_client_role in sidecar, and add watchdog probe functionality.

How to verify it

Before added, no cname check
image

After added.
known cname cert could visit SHOW
image

unknown cname cert will be blocked:
image

two watchdog prob as below:
image

Which release branch to backport (provide reason below if selected)

  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@FengPan-Frank FengPan-Frank requested a review from lguohan as a code owner March 9, 2026 04:37
Copilot AI review requested due to automatic review settings March 9, 2026 04:37
@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates SONiC telemetry sidecar/watchdog behavior to support CNAME-based client-certificate authorization by reconciling the correct CONFIG_DB keys and adding new watchdog probes.

Changes:

  • Add certificate-based “good/bad” gNMI probes to telemetry-watchdog, controlled by a new env var.
  • Update telemetry-sidecar CONFIG_DB reconciliation to use TELEMETRY|gnmi and support multiple client cert CNAME entries via GNMI_CLIENT_CERTS.
  • Update sidecar CLI plugin tests to reflect the new CONFIG_DB key and multi-CNAME behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
dockers/docker-telemetry-watchdog/watchdog/src/main.rs Adds cert probe env var + executes additional gNMI probes using hardcoded cert material/CNAMEs.
dockers/docker-telemetry-sidecar/systemd_stub.py Adds JSON parsing for multiple client cert entries and switches reconciliation from `GNMI
dockers/docker-telemetry-sidecar/cli-plugin-tests/test_systemd_stub.py Updates reconcile tests for new key name and adds a multi-CNAME test case.

Comment on lines +38 to +47
for e in entries:
if "cname" not in e or "role" not in e:
raise ValueError(f"Each entry needs 'cname' and 'role': {e}")
return entries
except (json.JSONDecodeError, ValueError) as exc:
logger.log_error(f"Bad GNMI_CLIENT_CERTS env var: {exc}; falling back to legacy")

# Legacy single-entry env vars
cname = os.getenv("TELEMETRY_CLIENT_CNAME", "")
role = os.getenv("GNMI_CLIENT_ROLE", "gnmi_show_readonly")
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_parse_client_certs() validates that each JSON entry has 'cname'/'role' keys, but it doesn't validate types/emptiness (e.g., entry not a dict, cname not a string). Since reconcile_config_db_once later indexes entry["cname"], malformed JSON can still cause runtime exceptions. Consider validating each entry is a dict and coercing/validating cname/role as non-empty strings before returning.

Suggested change
for e in entries:
if "cname" not in e or "role" not in e:
raise ValueError(f"Each entry needs 'cname' and 'role': {e}")
return entries
except (json.JSONDecodeError, ValueError) as exc:
logger.log_error(f"Bad GNMI_CLIENT_CERTS env var: {exc}; falling back to legacy")
# Legacy single-entry env vars
cname = os.getenv("TELEMETRY_CLIENT_CNAME", "")
role = os.getenv("GNMI_CLIENT_ROLE", "gnmi_show_readonly")
normalized: List[Dict[str, str]] = []
for e in entries:
if not isinstance(e, dict):
raise ValueError(f"Each entry must be an object: {e!r}")
if "cname" not in e or "role" not in e:
raise ValueError(f"Each entry needs 'cname' and 'role': {e}")
cname = str(e.get("cname", "")).strip()
role = str(e.get("role", "")).strip()
if not cname or not role:
raise ValueError(f"'cname' and 'role' must be non-empty strings: {e}")
normalized.append({"cname": cname, "role": role})
return normalized
except (json.JSONDecodeError, ValueError) as exc:
logger.log_error(f"Bad GNMI_CLIENT_CERTS env var: {exc}; falling back to legacy")
# Legacy single-entry env vars
cname = os.getenv("TELEMETRY_CLIENT_CNAME", "").strip()
role = os.getenv("GNMI_CLIENT_ROLE", "gnmi_show_readonly").strip()

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added validation.

Comment on lines +25 to +53
def _parse_client_certs() -> List[Dict[str, str]]:
"""
Build the list of GNMI client cert entries from env vars.

Preferred: GNMI_CLIENT_CERTS (JSON array of {"cname": ..., "role": ...})
Fallback: TELEMETRY_CLIENT_CNAME / GNMI_CLIENT_ROLE (single entry, backward-compat)
"""
raw = os.getenv("GNMI_CLIENT_CERTS", "").strip()
if raw:
try:
entries = json.loads(raw)
if not isinstance(entries, list):
raise ValueError("GNMI_CLIENT_CERTS must be a JSON array")
for e in entries:
if "cname" not in e or "role" not in e:
raise ValueError(f"Each entry needs 'cname' and 'role': {e}")
return entries
except (json.JSONDecodeError, ValueError) as exc:
logger.log_error(f"Bad GNMI_CLIENT_CERTS env var: {exc}; falling back to legacy")

# Legacy single-entry env vars
cname = os.getenv("TELEMETRY_CLIENT_CNAME", "")
role = os.getenv("GNMI_CLIENT_ROLE", "gnmi_show_readonly")
if cname:
return [{"cname": cname, "role": role}]
return []


GNMI_CLIENT_CERTS: List[Dict[str, str]] = _parse_client_certs()
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New GNMI_CLIENT_CERTS parsing / legacy fallback logic in _parse_client_certs() is not covered by tests (e.g., valid JSON array, non-array JSON, invalid JSON fallback to legacy env vars). Adding focused unit tests would help prevent regressions since this runs at import time and affects CONFIG_DB reconciliation behavior.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added test case TestParseClientCerts

Comment on lines +412 to +437
let bad_sec = TelemetrySecurityConfig {
use_client_auth: true,
ca_crt: DEFAULT_BAD_CA.to_string(),
server_crt: DEFAULT_BAD_CERT.to_string(),
server_key: DEFAULT_BAD_KEY.to_string(),
};
let mut res_bad = run_gnmi_for_xpath(&xpath_rc, port, &bad_sec, DEFAULT_BAD_CNAME, timeout, "SHOW");
if res_bad.success {
res_bad.success = false;
let msg = "Expected FAILURE with BAD cert but command SUCCEEDED".to_string();
res_bad.error = Some(match res_bad.error.take() {
Some(existing) => format!("{existing}; {msg}"),
None => msg,
});
http_status = "HTTP/1.1 500 Internal Server Error";
}
cmd_results.push(res_bad);

let good_sec = TelemetrySecurityConfig {
use_client_auth: true,
ca_crt: DEFAULT_GOOD_CA.to_string(),
server_crt: DEFAULT_GOOD_CERT.to_string(),
server_key: DEFAULT_GOOD_KEY.to_string(),
};
let res_good = run_gnmi_for_xpath(&xpath_rc, port, &good_sec, DEFAULT_GOOD_CNAME, timeout, "SHOW");
if !res_good.success { http_status = "HTTP/1.1 500 Internal Server Error"; }
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Certificate probes force use_client_auth: true and hardcode cert/key/CA paths instead of using the runtime security config from Redis. If telemetry is running with TELEMETRY|gnmi.client_auth disabled (insecure mode), these probes will fail and flip the watchdog to HTTP 500 even though telemetry is healthy. Consider gating the cert probes on the actual client_auth setting (or treating them as skipped when TLS client auth is off), and reuse sec_cfg for the 'good' probe so it matches the device's configured certs.

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

Comment on lines +322 to +323
Ok(v) if v.eq_ignore_ascii_case("false") => false,
_ => true, // default enabled
Copy link

Copilot AI Mar 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_cert_probe_enabled defaults to enabled unless TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED is explicitly set to "false". This changes watchdog behavior by default and can introduce new health-check failures (and extra gnmi_get calls) on existing deployments. Consider defaulting this probe to disabled (consistent with the serial number probe) or documenting/rolling it out behind an explicit opt-in env var.

Suggested change
Ok(v) if v.eq_ignore_ascii_case("false") => false,
_ => true, // default enabled
Ok(v) if v.eq_ignore_ascii_case("true") => true,
_ => false, // default disabled

Copilot uses AI. Check for mistakes.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed.

@FengPan-Frank
Copy link
Contributor Author

PR is moved from #24743

@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copilot AI review requested due to automatic review settings March 9, 2026 10:26
@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

@yxieca
Copy link
Contributor

yxieca commented Mar 11, 2026

/azpw ms_conflict

@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

sidecar config issue

Signed-off-by: Feng Pan <fenpan@microsoft.com>
@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@kperumalbfn kperumalbfn merged commit f0f4ec6 into sonic-net:master Mar 13, 2026
20 checks passed
yue-fred-gao pushed a commit to yue-fred-gao/sonic-buildimage that referenced this pull request Mar 16, 2026
…fix (sonic-net#25962)

Add TELEMETRY_WATCHDOG_CERT_PROBE_ENABLED function into watchdog and fix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants