[dhcp_relay] add retry to checking process dhcprelayd #25470

Merged
yxieca merged 1 commit into sonic-net:master from Xichen96:dev/xichenlin/fix-dhcprelayd
Mar 21, 2026

@Xichen96
Contributor

Why I did it

Occasionally, the psutil library throws an error when trying to read a process name. Retrying fixes the problem.

Work item tracking
  • Microsoft ADO (number only):

How I did it

Retry the operation

How to verify it

Applied the change on a production machine, and it fixed the issue.

Which release branch to backport (provide reason below if selected)

  • 202305
  • 202311
  • 202405
  • 202411
  • 202505
  • 202511

Tested branch (Please provide the tested image version)

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@Xichen96 Xichen96 requested a review from lguohan as a code owner February 12, 2026 15:27
Copilot AI review requested due to automatic review settings February 12, 2026 15:27
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Xichen Lin <lukelin0907@gmail.com>
@Xichen96 Xichen96 force-pushed the dev/xichenlin/fix-dhcprelayd branch from 89abea4 to e8961b2 Compare February 12, 2026 15:28
@mssonicbld
Collaborator

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Contributor

Copilot AI left a comment

Pull request overview

This PR updates dhcprelayd’s process-checking logic to reduce failures caused by transient psutil exceptions when reading process metadata, aiming to avoid unnecessary dhcp_relay container restarts.

Changes:

  • Add a retry loop around psutil process inspection (proc.name() / ppid() / cmdline()) in _check_dhcp_relay_processes.
  • Re-raise the last encountered exception after retrying up to 5 times.
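
The retry behavior summarized in the overview can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: `read_with_retry` and `FlakyReader` are hypothetical names, and `FlakyReader` only stands in for a psutil read such as `proc.name()` that fails transiently.

```python
def read_with_retry(read_fn, attempts=5):
    """Call read_fn() up to `attempts` times (hypothetical helper).

    Returns the first successful result. If every attempt fails,
    the last exception is re-raised so the failure is not hidden.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_exc = None
    for _ in range(attempts):
        try:
            return read_fn()
        except Exception as exc:  # broad on purpose in this sketch
            last_exc = exc
    raise last_exc


class FlakyReader:
    """Stand-in for a process read that fails a fixed number of times."""

    def __init__(self, failures, value):
        self.failures = failures
        self.value = value
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise OSError("transient read failure")
        return self.value


if __name__ == "__main__":
    reader = FlakyReader(failures=2, value="dhcrelay")
    print(read_with_retry(reader), "after", reader.calls, "calls")
```

In dhcprelayd the callable would presumably wrap the psutil reads, for example `read_with_retry(proc.name)`. Because the last exception is re-raised, a persistent failure still reaches the caller.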

@Xichen96
Contributor Author

Xichen96 commented Mar 2, 2026

@yxieca Hi Ying, please help review and merge this PR.

@yxieca
Contributor

yxieca commented Mar 3, 2026

AI agent on behalf of Ying: Reviewed the change in dhcprelayd.py. The retry loop looks reasonable, but consider adding a small backoff sleep between retries to avoid tight looping when psutil hits transient errors. Also consider narrowing the retried exception types (for example AccessDenied vs unexpected exceptions) so we do not mask real bugs. Otherwise looks fine.
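
The reviewer's two suggestions (a small backoff between retries, and retrying only specific exception types) could look something like the sketch below. It is not part of the merged change. `retry_with_backoff` and `TransientReadError` are hypothetical names, the delay values are arbitrary, and `TransientReadError` stands in for a narrowed psutil exception such as `psutil.AccessDenied`.

```python
import time


class TransientReadError(Exception):
    """Hypothetical stand-in for a narrowed, retryable exception type."""


def retry_with_backoff(read_fn, retry_on, attempts=5, base_delay=0.05,
                       sleep=time.sleep):
    """Retry read_fn() only on the exception types listed in `retry_on`.

    Sleeps base_delay * 2**attempt between attempts (exponential backoff).
    Any exception not listed in `retry_on` propagates immediately, so
    unexpected bugs are not retried. `sleep` is injectable for testing.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return read_fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_name():
        # Fails twice with the retryable error, then succeeds.
        state["calls"] += 1
        if state["calls"] < 3:
            raise TransientReadError("transient failure")
        return "dhcrelay"

    print(retry_with_backoff(flaky_name, (TransientReadError,)))
```

With psutil, the tuple passed as `retry_on` would be something like `(psutil.AccessDenied,)`. Which exception types are actually transient in this scenario is not stated in the conversation.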

@Xichen96
Contributor Author

@yxieca Hi Ying,

  1. We are not masking real bugs: if retrying doesn't help, the exception is still raised at the end.
  2. The error we expect doesn't seem to be related to wait time; it appears to be random. Retrying without any additional wait time worked when patching the production device.

@yxieca yxieca merged commit c558955 into sonic-net:master Mar 21, 2026
26 checks passed