Add watchdog mechanism to swss service and generate alert when swss have issue. #15429

Merged
qiluo-msft merged 1 commit into sonic-net:master from liuh-80:dev/liuh/add-heart-beat
Jun 13, 2023

liuh-80 commented Jun 12, 2023

Add a watchdog mechanism to the swss service and generate an alert when swss has an issue.

Work item tracking
Microsoft ADO (number only): 16578912

What I did
Added an orchagent watchdog that monitors orchagent and raises an alert when it gets stuck.

Why I did it
Currently the SONiC monit system only checks whether the orchagent process exists. If the orchagent process gets stuck and stops processing, monit cannot detect or report it.
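
The gap described above, a process that exists but no longer makes progress, is what a heartbeat check closes. The following is a minimal sketch and not the code in this PR: the class name, method names, and the 60-second timeout are hypothetical and chosen only for illustration. It records the time of the last heartbeat per process and reports any process whose heartbeats have gone quiet.

```python
import time


class HeartbeatWatchdog:
    """Track the last heartbeat per process and report the ones that went quiet."""

    def __init__(self, timeout_seconds=60.0):
        # Hypothetical default; the real alert threshold is not stated in this PR.
        self.timeout_seconds = timeout_seconds
        self.last_beat = {}

    def beat(self, process_name, now=None):
        """Record a heartbeat from process_name."""
        self.last_beat[process_name] = time.monotonic() if now is None else now

    def stuck_processes(self, now=None):
        """Return {process_name: silent_seconds} for processes past the timeout."""
        now = time.monotonic() if now is None else now
        return {
            name: now - seen
            for name, seen in self.last_beat.items()
            if now - seen >= self.timeout_seconds
        }
```

With a heartbeat recorded at t=0 and a 60-second timeout, `stuck_processes(now=90.0)` reports orchagent as silent for 90 seconds, even though a check that only looks for the process would still consider it healthy.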

How I verified it
All unit tests pass.

Manually ran process_monitoring/test_critical_process_monitoring.py; it passes.

Added a new test, sonic-net/sonic-mgmt#8306, to check that the watchdog works correctly.

Manual test: after pausing orchagent with 'kill -STOP <pid>', the following alert message appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
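
The shape of the alert text above can be reproduced with a small formatter. This is only a sketch based on that single log line; the function name and parameters are hypothetical and are not claimed to match the listener's actual code.

```python
def format_stuck_alert(process_name, namespace, silent_seconds):
    """Build an alert string shaped like the log line above (hypothetical helper)."""
    # The log reports the silent period in minutes with one decimal place.
    minutes = silent_seconds / 60.0
    return "Process '{}' is stuck in namespace '{}' ({:.1f} minutes).".format(
        process_name, namespace, minutes
    )
```

Calling `format_stuck_alert("orchagent", "host", 60)` yields the message portion of the log entry shown above.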

Details if related
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306

liuh-80 requested a review from qiluo-msft June 12, 2023 08:46
liuh-80 marked this pull request as ready for review June 12, 2023 08:46
liuh-80 requested a review from lguohan as a code owner June 12, 2023 08:46
qiluo-msft merged commit 05f1a5a into sonic-net:master Jun 13, 2023
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…ave issue. (sonic-net#15429)

mint570 commented Dec 13, 2024

This PR introduces some log spam. Filed #21157.
