Integrate SHOW API probe into telemetry watchdog#24053
Conversation
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| @@ -1,5 +1,5 @@ | |||
| {% from "dockers/dockerfile-macros.j2" import install_debian_packages, install_python_wheels, copy_files %} | |||
| ARG BASE=docker-config-engine-bookworm-{{DOCKER_USERNAME}}:{{DOCKER_USERTAG}} | |||
| ARG BASE=docker-sonic-gnmi-{{DOCKER_USERNAME}}:{{DOCKER_USERTAG}} | |||
There was a problem hiding this comment.
nit:
Integrate SHOW API probe into telemetry watchdog. We want the watch dog can catch the bad SHOW command implement.
->
Integrate SHOW API probe into telemetry watchdog. We want the watchdog to catch the bad SHOW command implementation.
There was a problem hiding this comment.
nit:
Integrate SHOW API probe into telemetry watchdog. We want the watch dog can catch the bad SHOW command implement.
->
Integrate SHOW API probe into telemetry watchdog. We want the watchdog to catch the bad SHOW command implementation.
Thanks, fixed.
| @@ -1,5 +1,5 @@ | |||
| {% from "dockers/dockerfile-macros.j2" import install_debian_packages, install_python_wheels, copy_files %} | |||
| ARG BASE=docker-config-engine-bookworm-{{DOCKER_USERNAME}}:{{DOCKER_USERTAG}} | |||
| ARG BASE=docker-sonic-gnmi-{{DOCKER_USERNAME}}:{{DOCKER_USERTAG}} | |||
There was a problem hiding this comment.
better to add DEPENDS in rule to make sure GNMI is built first for this.
There was a problem hiding this comment.
better to add DEPENDS in rule to make sure GNMI is built first for this.
Thanks, fixed
| Err(e) => eprintln!("Failed to parse {}: {}", CMD_LIST_JSON, e), | ||
| } | ||
| }, | ||
| Err(e) => eprintln!("Could not read {}: {} (will continue with env var only)", CMD_LIST_JSON, e), |
There was a problem hiding this comment.
should this kind of error be visible to user as well? return in json resp
There was a problem hiding this comment.
should this kind of error be visible to user as well? return in json resp
Put the error into JSON resp
|
|
||
| // dedupe while preserving order | ||
| let mut seen = std::collections::HashSet::new(); | ||
| list.retain(|x| seen.insert(x.clone())); |
There was a problem hiding this comment.
why don't insert to a hashset above directly? #Closed
There was a problem hiding this comment.
why don't insert to a hashset above directly?
Yes, it's better to use a HashSet directly, updated the code.
| if !blk.is_empty() { | ||
| list.retain(|x| !blk.contains(x)); | ||
| } | ||
| } |
There was a problem hiding this comment.
do we really need this?
I'd like to keep this, we may encounter a case that need to ignore some API, we can update the environment variables via k8s yaml file which will be very quick, no need to wait for code change.
| } | ||
|
|
||
| fn run_gnmi_for_xpath(xpath: &str, port: u16, sec: &TelemetrySecurityConfig, target_name: &str, timeout: Duration) -> CommandResult { | ||
| // Build full command: gnmi_get -xpath_target SHOW -xpath <xpath> -target_addr 127.0.0.1:<port> -logtostderr -insecure true |
There was a problem hiding this comment.
longer than 120 characters
Thanks, fixed all the lines which are too long.
| Ok(c) => c, | ||
| Err(e) => { | ||
| eprintln!("Redis client error (security): {e}"); | ||
| return TelemetrySecurityConfig { use_client_auth: false, ca_crt: DEFAULT_CA_CRT.to_string(), server_crt: DEFAULT_SERVER_CRT.to_string(), server_key: DEFAULT_SERVER_KEY.to_string() }; |
There was a problem hiding this comment.
There was a problem hiding this comment.
Thanks, fixed.
TelemetrySecurityConfig { use_client_auth: false, ca_crt: DEFAULT_CA_CRT.to_string(), server_crt: DEFAULT_SERVER_CRT.to_string(), server_key: DEFAULT_SERVER_KEY.to_string() };
nit: we should just construct this once and use it in all the failure cases
| let mut get_field = |hash: &str, field: &str| -> Option<String> { | ||
| let r: redis::RedisResult<Option<String>> = conn.hget(hash, field); | ||
| match r { Ok(v) => v, Err(e) => { eprintln!("Redis HGET error {hash}.{field}: {e}"); None } } | ||
| }; |
There was a problem hiding this comment.
nit: this pattern has appeared in this file multiple times - can we refactor it to a common function? #Closed
There was a problem hiding this comment.
nit: this pattern has appeared in this file multiple times - can we refactor it to a common function?
Thanks, refactored.
| if !res.success { http_status = "HTTP/1.1 500 Internal Server Error"; } | ||
| cmd_results.push(res); | ||
| } | ||
| } |
There was a problem hiding this comment.
could you introduce another env var to enable/disable this? #Closed
There was a problem hiding this comment.
Added a SHOW API probe switch and enable probe as default.
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
| @@ -0,0 +1,60 @@ | |||
| { | |||
| "xpaths": [ | |||
There was a problem hiding this comment.
Issue with this is that
- We arent testing options with the xpath in this case, user can pass in options for example --interface.
- There are required arguments for some commands that need to be passed in, and these are usually like interface name. I dont know if the best way would to be hardcode an interface to check for all devices like Ethernet0, or if we can grab an oper up/admin up one from the host?
There was a problem hiding this comment.
Issue with this is that
- We arent testing options with the xpath in this case, user can pass in options for example --interface.
- There are required arguments for some commands that need to be passed in, and these are usually like interface name. I dont know if the best way would to be hardcode an interface to check for all devices like Ethernet0, or if we can grab an oper up/admin up one from the host?
You are right, the commands with parameters are complicated, we don't plan to support for now, we support these common xpath at this stage. We can add more xpath with specific fixed parameters as you mentioned if we need later.
| } | ||
| } else { | ||
| if let Err(e) = fs::read_to_string(CMD_LIST_JSON) { | ||
| let msg = format!( |
There was a problem hiding this comment.
why do we need to read the file again?
There was a problem hiding this comment.
why do we need to read the file again?
That's indeed a bad implementation, fixed.
| } | ||
| } else { | ||
| // Possibly large stderr; truncate if huge (optional threshold 16KB). | ||
| let mut truncated = if stderr_string.len() > 16 * 1024 { |
There was a problem hiding this comment.
16 * 1024
please define this as a constant
Fixed.
|
/azp run Azure.sonic-buildimage |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Cherry-pick PR to msft-202412: Azure/sonic-buildimage-msft#1709 |
…1708) <!-- Please make sure you've read and understood our contributing guidelines: https://github.com/Azure/SONiC/blob/gh-pages/CONTRIBUTING.md ** Make sure all your commits include a signature generated with `git commit -s` ** If this is a bug fix, make sure your description includes "fixes #xxxx", or "closes #xxxx" or "resolves #xxxx" Please provide the following information: --> #### Why I did it Manual cherrypick for sonic-net/sonic-buildimage#24053 due to conflict. Integrate SHOW API probe into telemetry watchdog. We want the watchdog to catch the bad SHOW command implementation. ##### Work item tracking - Microsoft ADO **(number only)**:34793098 #### How I did it Add gnmi_get into telemetry-watchdog, and call gnmi_get in the watchdog http service for each probe, as long as one xpath fails, the probe will return 500. The watchdog will read xpath list from /cmd_list.json and environment variables. #### How to verify it Run the telemetry-watchdog container, call curl http://127.0.0.1:50080/ to check if the output is expected or not. root@bjw-can-2700-2:/# curl -s http://127.0.0.1:50080/ |jq . { "check_telemetry_port": "OK", "xpath_commands": [ { "xpath": "clock", "success": true, "error": null, "duration_ms": 200 } ] } <!-- If PR needs to be backported, then the PR must be tested against the base branch and the earliest backport release branch and provide tested image version on these two branches. For example, if the PR is requested for master, 202211 and 202012, then the requester needs to provide test results on master and 202012. --> #### Which release branch to backport (provide reason below if selected) <!-- - Note we only backport fixes to a release branch, *not* features! - Please also provide a reason for the backporting below. - e.g. - [x] 202006 --> - [ ] 201811 - [ ] 201911 - [ ] 202006 - [ ] 202012 - [ ] 202106 - [ ] 202111 - [ ] 202205 - [ ] 202211 #### Tested branch (Please provide the tested image version) <!-- - Please provide tested image version - e.g. - [x] 20201231.100 --> - [ ] <!-- image version 1 --> - [ ] <!-- image version 2 --> #### Description for the changelog <!-- Write a short (one line) summary that describes the changes in this pull request for inclusion in the changelog: --> <!-- Ensure to add label/tag for the feature raised. example - PR#2174 under sonic-utilities repo. where, Generic Config and Update feature has been labelled as GCU. --> #### Link to config_db schema for YANG module changes <!-- Provide a link to config_db schema for the table for which YANG model is defined Link should point to correct section on https://github.com/Azure/sonic-buildimage/blob/master/src/sonic-yang-models/doc/Configuration.md --> #### A picture of a cute animal (not mandatory but encouraged)
|
hi @lixiaoyuner , do you mind to help with a manual pick to 202412? the auto pick detected some conflicts on the code:
|
|
@feng ***@***.***> - could you help with this?
Thanks,
Michael
From: Riff ***@***.***>
Sent: Friday, October 17, 2025 11:35 AM
To: sonic-net/sonic-buildimage ***@***.***>
Cc: Michael Ma ***@***.***>; Comment ***@***.***>
Subject: Re: [sonic-net/sonic-buildimage] Integrate SHOW API probe into telemetry watchdog (PR #24053)
[https://avatars.githubusercontent.com/u/1533278?s=20&v=4]r12f left a comment (sonic-net/sonic-buildimage#24053)<#24053 (comment)>
hi @lixiaoyuner<https://github.com/lixiaoyuner> , do you mind to help with a manual pick to 202412? the auto pick detected some conflicts on the code:
image.png (view on web)<https://github.com/user-attachments/assets/41e831ef-99bc-4ee5-b6c2-3a2136a0468a>
-
Reply to this email directly, view it on GitHub<#24053 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AZ62PEYR2PV3E6Q57GIBXBL3YEZETAVCNFSM6AAAAACG6B2DL2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTIMJWGY3DGNBRG4>.
You are receiving this because you commented.Message ID: ***@***.******@***.***>>
|
|
@r12f @make1980 For this PR, I have already cherrypicked it into 202412 manually
[202412] [PR:24053] Integrate SHOW API probe into telemetry watchdog by FengPan-Frank · Azure/sonic-buildimage-msft<Azure/sonic-buildimage-msft#1708>
Thanks,
Feng
…________________________________
From: Michael Ma ***@***.***>
Sent: Saturday, October 18, 2025 6:17 AM
To: sonic-net/sonic-buildimage ***@***.***>
Cc: Feng Pan ***@***.***>; Comment ***@***.***>
Subject: Re: [sonic-net/sonic-buildimage] Integrate SHOW API probe into telemetry watchdog (PR #24053)
[https://avatars.githubusercontent.com/u/108898195?s=20&v=4]make1980 left a comment (sonic-net/sonic-buildimage#24053)<#24053 (comment)>
@feng ***@***.***> - could you help with this?
Thanks,
Michael
From: Riff ***@***.***>
Sent: Friday, October 17, 2025 11:35 AM
To: sonic-net/sonic-buildimage ***@***.***>
Cc: Michael Ma ***@***.***>; Comment ***@***.***>
Subject: Re: [sonic-net/sonic-buildimage] Integrate SHOW API probe into telemetry watchdog (PR #24053)
[https://avatars.githubusercontent.com/u/1533278?s=20&v=4]r12f left a comment (sonic-net/sonic-buildimage#24053)<#24053 (comment)>
hi @lixiaoyuner<https://github.com/lixiaoyuner> , do you mind to help with a manual pick to 202412? the auto pick detected some conflicts on the code:
image.png (view on web)<https://github.com/user-attachments/assets/41e831ef-99bc-4ee5-b6c2-3a2136a0468a>
-
Reply to this email directly, view it on GitHub<#24053 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AZ62PEYR2PV3E6Q57GIBXBL3YEZETAVCNFSM6AAAAACG6B2DL2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTIMJWGY3DGNBRG4>.
You are receiving this because you commented.Message ID: ***@***.******@***.***>>
—
Reply to this email directly, view it on GitHub<#24053 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AXEWCYEK7UJDZ2OKX3VM2NL3YE6DHAVCNFSM6AAAAACG6B2DL2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTIMJWG44TCNZZHE>.
You are receiving this because you commented.Message ID: ***@***.***>
|
|
thanks! updated tag. |
Why I did it Integrate SHOW API probe into telemetry watchdog. We want the watchdog to catch the bad SHOW command implementation. Work item tracking Microsoft ADO (number only): 34793098 How I did it Add gnmi_get into telemetry-watchdog, and call gnmi_get in the watchdog http service for each probe, as long as one xpath fails, the probe will return 500. The watchdog will read xpath list from /cmd_list.json and environment variables. How to verify it Run the telemetry-watchdog container, call curl http://127.0.0.1:50080/ to check if the output is expected or not. root@bjw-can-2700-2:/# curl -s http://127.0.0.1:50080/ |jq . { "check_telemetry_port": "OK", "xpath_commands": [ { "xpath": "clock", "success": true, "error": null, "duration_ms": 200 } ] } Signed-off-by: Feng Pan <fenpan@microsoft.com>

Why I did it
Integrate SHOW API probe into telemetry watchdog. We want the watchdog to catch the bad SHOW command implementation.
Work item tracking
How I did it
Add gnmi_get into telemetry-watchdog, and call gnmi_get in the watchdog http service for each probe, as long as one xpath fails, the probe will return 500.
How to verify it
Run the telemetry-watchdog container, call curl http://127.0.0.1:50080/ to check if the output is expected or not.
Which release branch to backport (provide reason below if selected)
Tested branch (Please provide the tested image version)
Description for the changelog
Link to config_db schema for YANG module changes
A picture of a cute animal (not mandatory but encouraged)