Skip to content

Integrate SHOW API probe into telemetry watchdog#24053

Merged
qiluo-msft merged 5 commits intosonic-net:masterfrom
lixiaoyuner:dev/yunli1/integrate-show-cmd-into-telemetry-watchdog-probe
Sep 25, 2025
Merged

Integrate SHOW API probe into telemetry watchdog#24053
qiluo-msft merged 5 commits intosonic-net:masterfrom
lixiaoyuner:dev/yunli1/integrate-show-cmd-into-telemetry-watchdog-probe

Conversation

@lixiaoyuner
Copy link
Contributor

@lixiaoyuner lixiaoyuner commented Sep 19, 2025

Why I did it

Integrate SHOW API probe into telemetry watchdog. We want the watchdog to catch the bad SHOW command implementation.

Work item tracking
  • Microsoft ADO (number only): 34793098

How I did it

Add gnmi_get into telemetry-watchdog, and call gnmi_get in the watchdog http service for each probe, as long as one xpath fails, the probe will return 500.

  • The watchdog will read xpath list from /cmd_list.json and environment variables.

How to verify it

Run the telemetry-watchdog container, call curl http://127.0.0.1:50080/ to check if the output is expected or not.

root@bjw-can-2700-2:/# curl -s http://127.0.0.1:50080/ |jq .
{
  "check_telemetry_port": "OK",
  "xpath_commands": [
    {
      "xpath": "clock",
      "success": true,
      "error": null,
      "duration_ms": 200
    }
  ]
}

Which release branch to backport (provide reason below if selected)

  • 202205
  • 202211
  • 202305
  • 202311
  • 202405
  • 202411
  • 202505

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@@ -1,5 +1,5 @@
{% from "dockers/dockerfile-macros.j2" import install_debian_packages, install_python_wheels, copy_files %}
ARG BASE=docker-config-engine-bookworm-{{DOCKER_USERNAME}}:{{DOCKER_USERTAG}}
ARG BASE=docker-sonic-gnmi-{{DOCKER_USERNAME}}:{{DOCKER_USERTAG}}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Integrate SHOW API probe into telemetry watchdog. We want the watch dog can catch the bad SHOW command implement.

->

Integrate SHOW API probe into telemetry watchdog. We want the watchdog to catch the bad SHOW command implementation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Integrate SHOW API probe into telemetry watchdog. We want the watch dog can catch the bad SHOW command implement.

->

Integrate SHOW API probe into telemetry watchdog. We want the watchdog to catch the bad SHOW command implementation.

Thanks, fixed.

@@ -1,5 +1,5 @@
{% from "dockers/dockerfile-macros.j2" import install_debian_packages, install_python_wheels, copy_files %}
ARG BASE=docker-config-engine-bookworm-{{DOCKER_USERNAME}}:{{DOCKER_USERTAG}}
ARG BASE=docker-sonic-gnmi-{{DOCKER_USERNAME}}:{{DOCKER_USERTAG}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to add DEPENDS in rule to make sure GNMI is built first for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

better to add DEPENDS in rule to make sure GNMI is built first for this.

Thanks, fixed

Err(e) => eprintln!("Failed to parse {}: {}", CMD_LIST_JSON, e),
}
},
Err(e) => eprintln!("Could not read {}: {} (will continue with env var only)", CMD_LIST_JSON, e),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this kind of error be visible to user as well? return in json resp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this kind of error be visible to user as well? return in json resp

Put the error into JSON resp


// dedupe while preserving order
let mut seen = std::collections::HashSet::new();
list.retain(|x| seen.insert(x.clone()));
Copy link

@make1980 make1980 Sep 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't insert to a hashset above directly? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't insert to a hashset above directly?

Yes, it's better to use a HashSet directly, updated the code.

if !blk.is_empty() {
list.retain(|x| !blk.contains(x));
}
}
Copy link

@make1980 make1980 Sep 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need this? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need this?

I'd like to keep this, we may encounter a case that need to ignore some API, we can update the environment variables via k8s yaml file which will be very quick, no need to wait for code change.

}

fn run_gnmi_for_xpath(xpath: &str, port: u16, sec: &TelemetrySecurityConfig, target_name: &str, timeout: Duration) -> CommandResult {
// Build full command: gnmi_get -xpath_target SHOW -xpath <xpath> -target_addr 127.0.0.1:<port> -logtostderr -insecure true
Copy link

@make1980 make1980 Sep 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

longer than 120 characters #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

longer than 120 characters

Thanks, fixed all the lines which are too long.

Ok(c) => c,
Err(e) => {
eprintln!("Redis client error (security): {e}");
return TelemetrySecurityConfig { use_client_auth: false, ca_crt: DEFAULT_CA_CRT.to_string(), server_crt: DEFAULT_SERVER_CRT.to_string(), server_key: DEFAULT_SERVER_KEY.to_string() };
Copy link

@make1980 make1980 Sep 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TelemetrySecurityConfig { use_client_auth: false, ca_crt: DEFAULT_CA_CRT.to_string(), server_crt: DEFAULT_SERVER_CRT.to_string(), server_key: DEFAULT_SERVER_KEY.to_string() };

nit: we should just construct this once and use it in all the failure cases #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, fixed.

TelemetrySecurityConfig { use_client_auth: false, ca_crt: DEFAULT_CA_CRT.to_string(), server_crt: DEFAULT_SERVER_CRT.to_string(), server_key: DEFAULT_SERVER_KEY.to_string() };

nit: we should just construct this once and use it in all the failure cases

let mut get_field = |hash: &str, field: &str| -> Option<String> {
let r: redis::RedisResult<Option<String>> = conn.hget(hash, field);
match r { Ok(v) => v, Err(e) => { eprintln!("Redis HGET error {hash}.{field}: {e}"); None } }
};
Copy link

@make1980 make1980 Sep 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this pattern has appeared in this file multiple times - can we refactor it to a common function? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this pattern has appeared in this file multiple times - can we refactor it to a common function?

Thanks, refactored.

if !res.success { http_status = "HTTP/1.1 500 Internal Server Error"; }
cmd_results.push(res);
}
}
Copy link

@make1980 make1980 Sep 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you introduce another env var to enable/disable this? #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a SHOW API probe switch and enable probe as default.

Copy link

@make1980 make1980 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🕐

@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@@ -0,0 +1,60 @@
{
"xpaths": [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue with this is that

  1. We arent testing options with the xpath in this case, user can pass in options for example --interface.
  2. There are required arguments for some commands that need to be passed in, and these are usually like interface name. I dont know if the best way would to be hardcode an interface to check for all devices like Ethernet0, or if we can grab an oper up/admin up one from the host?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue with this is that

  1. We arent testing options with the xpath in this case, user can pass in options for example --interface.
  2. There are required arguments for some commands that need to be passed in, and these are usually like interface name. I dont know if the best way would to be hardcode an interface to check for all devices like Ethernet0, or if we can grab an oper up/admin up one from the host?

You are right, the commands with parameters are complicated, we don't plan to support for now, we support these common xpath at this stage. We can add more xpath with specific fixed parameters as you mentioned if we need later.

}
} else {
if let Err(e) = fs::read_to_string(CMD_LIST_JSON) {
let msg = format!(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to read the file again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need to read the file again?

That's indeed a bad implementation, fixed.

}
} else {
// Possibly large stderr; truncate if huge (optional threshold 16KB).
let mut truncated = if stderr_string.len() > 16 * 1024 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

16 * 1024

please define this as a constant

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

16 * 1024

please define this as a constant

Fixed.

Copy link

@make1980 make1980 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@mssonicbld
Copy link
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

Cherry-pick PR to msft-202412: Azure/sonic-buildimage-msft#1709

r12f pushed a commit to Azure/sonic-buildimage-msft that referenced this pull request Oct 11, 2025
…1708)

<!--
Please make sure you've read and understood our contributing guidelines:
     https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md

** Make sure all your commits include a signature generated with `git
commit -s` **

If this is a bug fix, make sure your description includes "fixes #xxxx",
or
     "closes #xxxx" or "resolves #xxxx"

     Please provide the following information:
-->

#### Why I did it
Manual cherrypick for
sonic-net/sonic-buildimage#24053 due to
conflict.

Integrate SHOW API probe into telemetry watchdog. We want the watchdog
to catch the bad SHOW command implementation.

##### Work item tracking
- Microsoft ADO **(number only)**:34793098

#### How I did it
Add gnmi_get into telemetry-watchdog, and call gnmi_get in the watchdog
http service for each probe, as long as one xpath fails, the probe will
return 500.

The watchdog will read xpath list from /cmd_list.json and environment
variables.

#### How to verify it
Run the telemetry-watchdog container, call curl http://127.0.0.1:50080/
to check if the output is expected or not.

root@bjw-can-2700-2:/# curl -s http://127.0.0.1:50080/ |jq .
{
  "check_telemetry_port": "OK",
  "xpath_commands": [
    {
      "xpath": "clock",
      "success": true,
      "error": null,
      "duration_ms": 200
    }
  ]
}

<!--
If PR needs to be backported, then the PR must be tested against the
base branch and the earliest backport release branch and provide tested
image version on these two branches. For example, if the PR is requested
for master, 202211 and 202012, then the requester needs to provide test
results on master and 202012.
-->

#### Which release branch to backport (provide reason below if selected)

<!--
- Note we only backport fixes to a release branch, *not* features!
- Please also provide a reason for the backporting below.
- e.g.
- [x] 202006
-->

- [ ] 201811
- [ ] 201911
- [ ] 202006
- [ ] 202012
- [ ] 202106
- [ ] 202111
- [ ] 202205
- [ ] 202211

#### Tested branch (Please provide the tested image version)

<!--
- Please provide tested image version
- e.g.
- [x] 20201231.100
-->

- [ ] <!-- image version 1 -->
- [ ] <!-- image version 2 -->

#### Description for the changelog
<!--
Write a short (one line) summary that describes the changes in this
pull request for inclusion in the changelog:
-->

<!--
Ensure to add label/tag for the feature raised. example - PR#2174 under
sonic-utilities repo. where, Generic Config and Update feature has been
labelled as GCU.
-->

#### Link to config_db schema for YANG module changes
<!--
Provide a link to config_db schema for the table for which YANG model
is defined
Link should point to correct section on
https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md
-->

#### A picture of a cute animal (not mandatory but encouraged)
@r12f
Copy link
Contributor

r12f commented Oct 17, 2025

hi @lixiaoyuner , do you mind to help with a manual pick to 202412? the auto pick detected some conflicts on the code:

image

@make1980
Copy link

make1980 commented Oct 17, 2025 via email

@FengPan-Frank
Copy link
Contributor

FengPan-Frank commented Oct 18, 2025 via email

@r12f
Copy link
Contributor

r12f commented Oct 19, 2025

thanks! updated tag.

FengPan-Frank pushed a commit to FengPan-Frank/sonic-buildimage that referenced this pull request Dec 4, 2025
Why I did it
Integrate SHOW API probe into telemetry watchdog. We want the watchdog to catch the bad SHOW command implementation.

Work item tracking
Microsoft ADO (number only): 34793098
How I did it
Add gnmi_get into telemetry-watchdog, and call gnmi_get in the watchdog http service for each probe, as long as one xpath fails, the probe will return 500.

The watchdog will read xpath list from /cmd_list.json and environment variables.
How to verify it
Run the telemetry-watchdog container, call curl http://127.0.0.1:50080/ to check if the output is expected or not.

root@bjw-can-2700-2:/# curl -s http://127.0.0.1:50080/ |jq .
{
  "check_telemetry_port": "OK",
  "xpath_commands": [
    {
      "xpath": "clock",
      "success": true,
      "error": null,
      "duration_ms": 200
    }
  ]
}

Signed-off-by: Feng Pan <fenpan@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants