Add orchagent heart beat message for watchdog. #2737

Merged
liuh-80 merged 9 commits into sonic-net:master from liuh-80:dev/liuh/add-heart-beat on Jun 6, 2023

liuh-80 commented Apr 17, 2023

What I did
Improved orchagent so that it outputs a heartbeat message to systemd.

Why I did it
Currently, the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.

How I verified it
All unit tests pass.
Manually validated that the heartbeat message works correctly.

Details if related
Another in-progress PR adds a watchdog for this heartbeat message:
sonic-net/sonic-buildimage#14686

sonic-mgmt UT PR: sonic-net/sonic-mgmt#8306

liuh-80 changed the title from "Add orchagent heart beat message" to "[POC] Add orchagent heart beat message" Apr 17, 2023
liuh-80 requested a review from qiluo-msft April 28, 2023 10:16
liuh-80 changed the title from "[POC] Add orchagent heart beat message" to "Add orchagent heart beat message for watchdog." Apr 28, 2023
liuh-80 marked this pull request as ready for review April 28, 2023 10:18
liuh-80 requested a review from prsunny as a code owner April 28, 2023 10:18

void OrchDaemon::heartBeat(std::chrono::time_point<std::chrono::high_resolution_clock> tcurrent)
{
    static auto tlast = std::chrono::high_resolution_clock::now();

qiluo-msft (Contributor) commented May 19, 2023

static

You are assuming that OrchDaemon has only a single instance in the process. To be super safe, you can use a static member variable instead of a static function variable. #Closed
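
The concern about a function-local static can be illustrated with a minimal sketch. The `Daemon` class and its method names below are hypothetical and are not taken from the OrchDaemon code; the sketch only shows that a static variable inside a member function is one object shared by every instance, while a non-static data member is per instance. A static data member would be shared in the same way as the function-local static.

```cpp
#include <cassert>

// Hypothetical class for illustration only; it is not the real OrchDaemon.
class Daemon
{
public:
    // Function-local static: a single counter shared by every Daemon
    // instance in the process.
    int tickFunctionStatic()
    {
        static int ticks = 0;
        return ++ticks;
    }

    // Non-static data member: each Daemon instance owns its own counter.
    int tickMember()
    {
        return ++m_ticks;
    }

private:
    int m_ticks = 0;
};

// Small demonstration of the difference between the two kinds of state.
inline void demoSharedVersusPerInstance()
{
    Daemon a;
    Daemon b;

    int firstShared = a.tickFunctionStatic();
    int secondShared = b.tickFunctionStatic();
    assert(secondShared == firstShared + 1); // b continues a's count

    assert(a.tickMember() == 1);
    assert(b.tickMember() == 1); // b starts from its own zero
}
```

With a single instance per process the two approaches behave the same, which is why the function-local static works in practice; the difference only shows up once a second instance exists, for example in a unit test.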

liuh-80 (Contributor, Author) commented May 22, 2023

Fixed: changed to a static member variable.

qiluo-msft (Contributor)

Sorry for the misleading suggestion.

To be super safe, you can use a (non-static) member variable instead of a static function variable.

liuh-80 (Contributor, Author)

Fixed: changed to a non-static member.

#define DEFAULT_MAX_BULK_SIZE 1000
size_t gMaxBulkSize = DEFAULT_MAX_BULK_SIZE;

std::chrono::time_point<std::chrono::high_resolution_clock> OrchDaemon::m_lastHeartBeat = std::chrono::high_resolution_clock::now();

qiluo-msft (Contributor) commented May 30, 2023

m_lastHeartBeat

std::chrono::time_point<std::chrono::high_resolution_clock> OrchDaemon::m_lastHeartBeat = std::chrono::high_resolution_clock::now();

If not static, should it be initialized in ctor? #Closed

liuh-80 (Contributor, Author)

Fixed: it is now initialized in the ctor.
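
For readers following the thread, the shape this review converges on (a non-static time-point member initialized in the constructor) might look roughly like the sketch below. It is not the merged OrchDaemon code: the class name `HeartBeatSource`, the `interval` parameter, the `bool` return value, and the `"heartbeat"` text written to stdout are all assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>

// Hypothetical stand-in for illustration; not the actual OrchDaemon code.
class HeartBeatSource
{
public:
    using Clock = std::chrono::high_resolution_clock;

    // The interval is an assumed parameter; the PR excerpt does not show one.
    explicit HeartBeatSource(std::chrono::milliseconds interval)
        : m_interval(interval),
          m_lastHeartBeat(Clock::now()) // initialized in the ctor, per instance
    {
        assert(interval.count() > 0);
    }

    // Emits at most one heartbeat per interval.
    // Returns true when a heartbeat was written.
    bool heartBeat(Clock::time_point tcurrent)
    {
        if (tcurrent - m_lastHeartBeat < m_interval)
        {
            return false; // too soon since the previous heartbeat
        }
        m_lastHeartBeat = tcurrent;
        std::cout << "heartbeat" << std::endl; // message text is an assumption
        return true;
    }

    Clock::time_point lastHeartBeat() const { return m_lastHeartBeat; }

private:
    std::chrono::milliseconds m_interval;
    Clock::time_point m_lastHeartBeat;
};
```

Passing the current time in as `tcurrent`, as the reviewed signature does, keeps the rate-limiting logic testable without sleeping: a caller can construct the object and then call `heartBeat` with time points offset from `lastHeartBeat()`.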

liuh-80 force-pushed the dev/liuh/add-heart-beat branch from 4af1876 to 7f17fd4 May 31, 2023 00:07
liuh-80 merged commit 99a2a26 into sonic-net:master Jun 6, 2023
qiluo-msft pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 6, 2023
…ave issue. (#14686)

This PR depends on sonic-net/sonic-swss#2737 being merged first.

**What I did**
Added an orchagent watchdog to monitor orchagent and alert when it is stuck.

**Why I did it**
Currently, the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.

**How I verified it**
All unit tests pass.
Added a new UT, sonic-net/sonic-mgmt#8306, to check that the watchdog works correctly.
Manual test: after pausing orchagent with 'kill -STOP <pid>', check that the following warning message appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).
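
The timeout check behind a message like the one above can be sketched as follows. This is not the supervisor-proc-watchdog-listener implementation; the class name `HeartBeatWatchdog`, its methods, the use of `std::chrono::steady_clock`, and the timeout value passed by the caller are assumptions for illustration. Only the wording of the alert string is copied from the log line.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <map>
#include <ratio>
#include <string>
#include <vector>

// Hypothetical watchdog sketch; names and structure are assumptions.
class HeartBeatWatchdog
{
public:
    using Clock = std::chrono::steady_clock;

    explicit HeartBeatWatchdog(std::chrono::seconds timeout)
        : m_timeout(timeout)
    {
        assert(timeout.count() > 0);
    }

    // Record the time at which a heartbeat from `process` was seen.
    void onHeartBeat(const std::string &process, Clock::time_point now)
    {
        m_lastSeen[process] = now;
    }

    // Return one alert per process that has been silent for at least
    // the timeout, formatted like the log line quoted above.
    std::vector<std::string> check(Clock::time_point now,
                                   const std::string &ns) const
    {
        std::vector<std::string> alerts;
        for (const auto &entry : m_lastSeen)
        {
            auto silent = now - entry.second;
            if (silent < m_timeout)
            {
                continue; // heartbeat is recent enough
            }
            double minutes =
                std::chrono::duration<double, std::ratio<60>>(silent).count();
            char buf[256];
            std::snprintf(buf, sizeof(buf),
                          "Process '%s' is stuck in namespace '%s' (%.1f minutes).",
                          entry.first.c_str(), ns.c_str(), minutes);
            alerts.emplace_back(buf);
        }
        return alerts;
    }

private:
    std::chrono::seconds m_timeout;
    std::map<std::string, Clock::time_point> m_lastSeen;
};
```

A process paused with 'kill -STOP <pid>' stops calling the equivalent of `onHeartBeat`, so the silent interval keeps growing until `check` reports it.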

**Details if related**
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
qiluo-msft pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Jun 13, 2023
…ave issue. (#15429)

Add a watchdog mechanism to the swss service and generate an alert when swss has an issue.

**Work item tracking**
Microsoft ADO (number only): 16578912

**What I did**
Added an orchagent watchdog to monitor orchagent and alert when it is stuck.

**Why I did it**
Currently, the SONiC monit system only checks whether the orchagent process exists. If orchagent gets stuck and stops processing, monit cannot detect or report it.

**How I verified it**
All unit tests pass.

Manually verified that process_monitoring/test_critical_process_monitoring.py passes.

Added a new UT, sonic-net/sonic-mgmt#8306, to check that the watchdog works correctly.

Manual test: after pausing orchagent with 'kill -STOP <pid>', check that the following warning message appears in the log:

Apr 28 23:36:41.504923 vlab-01 ERR swss#supervisor-proc-watchdog-listener: Process 'orchagent' is stuck in namespace 'host' (1.0 minutes).

**Details if related**
Heartbeat message PR: sonic-net/sonic-swss#2737
UT PR: sonic-net/sonic-mgmt#8306
theasianpianist pushed a commit to theasianpianist/sonic-swss that referenced this pull request Jul 20, 2023
sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…ave issue. (sonic-net#14686)

sonic-otn pushed a commit to sonic-otn/sonic-buildimage that referenced this pull request Sep 20, 2023
…ave issue. (sonic-net#15429)

Janetxxx pushed a commit to Janetxxx/sonic-swss that referenced this pull request Nov 10, 2025