[Supervisord] Deduplicate the alerting messages of critical processes from Supervisord. #6849

Merged
yozhao101 merged 3 commits into sonic-net:master from yozhao101:deduplicate_syslog_messages
Feb 25, 2021


@yozhao101 yozhao101 commented Feb 23, 2021

Signed-off-by: Yong Zhao [email protected]

Why I did it

rsyslog is configured to suppress duplicate messages and report them in the form "message repeated n times".
As a result, if a critical process in a container exits unexpectedly, the alerting message is written to syslog once
and does not appear again until a second critical process exits. This PR differentiates the alerting messages so that rsyslogd does not suppress them and they appear in syslog periodically.

How I did it

This PR adds a counter to the alerting message that shows how many minutes a critical process has not been running.
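
As a rough illustration of why the counter helps, the sketch below models repeated-message reduction in a simplified way: consecutive identical lines are collapsed into one summary line. The helper names `collapse_repeats` and `alert` are hypothetical and are not taken from this PR or from rsyslog; the message text mirrors the log lines in the verification output, and the real rsyslog behavior may differ in detail.

```python
def collapse_repeats(messages):
    """Simplified model of repeated-message reduction (an assumption, not
    rsyslog's actual code): consecutive identical messages are replaced by
    a single 'message repeated n times' summary line."""
    out = []
    prev = None
    repeats = 0
    for msg in messages:
        if msg == prev:
            # Same text as the previous line: count it instead of emitting it.
            repeats += 1
            continue
        if repeats:
            out.append("message repeated {} times: [{}]".format(repeats, prev))
        out.append(msg)
        prev = msg
        repeats = 0
    if repeats:
        out.append("message repeated {} times: [{}]".format(repeats, prev))
    return out


def alert(process, namespace, minutes=None):
    """Build an alerting message, optionally with a minutes counter."""
    base = "Process '{}' is not running in namespace '{}'".format(process, namespace)
    if minutes is None:
        return base + "."
    return "{}({} minutes).".format(base, minutes)


if __name__ == "__main__":
    # Without a counter, every periodic alert has identical text and is collapsed.
    for line in collapse_repeats([alert("lldpmgrd", "host")] * 3):
        print(line)
    # With a counter, every alert differs from the previous one and is kept.
    for line in collapse_repeats([alert("lldpmgrd", "host", m) for m in (1, 2, 3)]):
        print(line)
```

Because the counter changes on every alert, no two consecutive messages for the same process are identical, so under this model none of them is suppressed.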

How to verify it

I verified and tested this implementation on a physical DUT.

Feb 23 01:24:36.541111 str-dx010-acs-1 INFO lldp#supervisord 2021-02-23 01:24:36,540 INFO exited: lldp-syncd (terminated by SIGKILL; not expected)
Feb 23 01:24:36.543880 str-dx010-acs-1 INFO lldp#supervisord 2021-02-23 01:24:36,543 INFO exited: lldpmgrd (terminated by SIGKILL; not expected)
Feb 23 01:25:36.616111 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldp-syncd' is not running in namespace 'host'(1 minutes).
Feb 23 01:25:36.616207 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldpmgrd' is not running in namespace 'host'(1 minutes).
Feb 23 01:26:36.673000 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldp-syncd' is not running in namespace 'host'(2 minutes).
Feb 23 01:26:36.673443 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldpmgrd' is not running in namespace 'host'(2 minutes).
Feb 23 01:27:36.730690 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldp-syncd' is not running in namespace 'host'(3 minutes).
Feb 23 01:27:36.730817 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldpmgrd' is not running in namespace 'host'(3 minutes).
Feb 23 01:28:36.782367 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldp-syncd' is not running in namespace 'host'(4 minutes).
Feb 23 01:28:36.782818 str-dx010-acs-1 ERR lldp#supervisor-proc-exit-listener: Process 'lldpmgrd' is not running in namespace 'host'(4 minutes).

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • [x] 202012

Description for the changelog

A picture of a cute animal (not mandatory but encouraged)

@yozhao101 yozhao101 requested a review from lguohan as a code owner February 23, 2021 01:31
@yozhao101 yozhao101 requested a review from jleveque February 23, 2021 01:33
1. Fix the format of the alerting message.
2. For each exited process, there are two fields: the time of the last alert
and the number of dead minutes. Use a dict to hold these two fields instead
of a list.
3. Use a formula to calculate how many minutes the process has been in the dead
state instead of a hard-coded value.

Signed-off-by: Yong Zhao <[email protected]>
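
The bookkeeping described in this commit message could look roughly like the sketch below. It is an illustration under stated assumptions, not the code merged in this PR: the function names, the dict keys `time_last_alert` and `dead_minutes`, and the 60-second alert interval are all hypothetical, and the message format is copied from the log lines in the PR description.

```python
ALERT_INTERVAL_SECS = 60  # assumed period; the sample log shows one alert per minute


def record_exit(dead_processes, name, now):
    """Start tracking a critical process that exited unexpectedly."""
    # One dict per process holds both fields, instead of a positional list.
    dead_processes[name] = {"time_last_alert": now, "dead_minutes": 0}


def collect_alerts(dead_processes, namespace, now):
    """Return alerting messages for every tracked process whose alert is due."""
    alerts = []
    for name, state in dead_processes.items():
        elapsed_secs = now - state["time_last_alert"]
        if elapsed_secs >= ALERT_INTERVAL_SECS:
            # Derive the increment from the elapsed time rather than hard-coding 1.
            state["dead_minutes"] += int(elapsed_secs // 60)
            state["time_last_alert"] = now
            alerts.append(
                "Process '{}' is not running in namespace '{}'({} minutes).".format(
                    name, namespace, state["dead_minutes"]
                )
            )
    return alerts
```

A caller would typically pass `time.time()` as `now` on each pass through its event loop; taking the timestamp as a parameter keeps the sketch easy to test. Note that resetting `time_last_alert` to `now` discards any leftover seconds, so the counter is approximate over long outages.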

@jleveque jleveque left a comment

Looks good to me. Please wait for other reviewers.

@yozhao101
> Looks good to me. Please wait for other reviewers.

@lguohan Can you please help me review this change?

@yozhao101 yozhao101 merged commit 21f5e12 into sonic-net:master Feb 25, 2021
yxieca pushed a commit that referenced this pull request Mar 4, 2021
carl-nokia pushed a commit to carl-nokia/sonic-buildimage that referenced this pull request Aug 7, 2021
lolyu pushed a commit to lolyu/sonic-buildimage that referenced this pull request Sep 13, 2021