
[bug fix][test_container_checker] change config of monit to stabilize the test #7 #4008

Merged: lguohan merged 1 commit into sonic-net:master from JibinBao:fix_container_check on Aug 13, 2021


@JibinBao
Contributor

Description of PR

Summary:
The Monit sampling interval is long (60 s), the syncd container can restart quickly (sometimes in about 30 s), and the alert rule is too strict. As a result, Monit sometimes fails to observe syncd down twice within 2 minutes, and no syncd alert messages appear in syslog. Changing the relevant Monit configuration stabilizes the test.
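The timing problem can be sketched in a few lines of Python. This is an illustrative model only, not code from the PR: `polls_during_outage` is a hypothetical helper, and the outage start time of 70 s is an arbitrary assumption chosen to show a restart that falls between two 60 s polls.

```python
def polls_during_outage(interval_s, outage_start_s, outage_len_s, start_delay_s=0):
    """Return the poll times (in seconds) that fall inside the outage window.

    Polls are modeled as happening at start_delay_s, start_delay_s + interval_s, ...
    This is a simplified model of a periodic checker, not Monit's real scheduler.
    """
    outage_end_s = outage_start_s + outage_len_s
    polls = []
    t = start_delay_s
    while t < outage_end_s:
        if t >= outage_start_s:
            polls.append(t)
        t += interval_s
    return polls


# A 30 s syncd restart that begins 70 s after the first poll (assumed timing).
print(polls_during_outage(60, 70, 30))  # 60 s sampling -> []
print(polls_during_outage(10, 70, 30))  # 10 s sampling -> [70, 80, 90]
```

With 60 s sampling the restart can fit entirely between two polls, so the checker never sees syncd down. With 10 s sampling the same restart is observed several times.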

Fixes # (issue)

Type of change

  • Bug fix
  • Testbed and Framework(new/improvement)
  • Test case(new/improvement)

Back port request

  • 201911

Approach

What is the motivation for this PR?

Stabilize test_container_checker by changing some config of Monit.

How did you do it?

Changing the sampling interval to 10 seconds in /etc/monit/monitrc ensures that Monit can detect that the syncd container is down.
Changing the start delay to 10 seconds in /etc/monit/monitrc ensures that Monit starts before syncd does.

## Start Monit in the background (run as a daemon):
#
  set daemon 10             # check services at 10-second intervals (was 60)
    with start delay 10     # delay the start of monitoring by 10 seconds
                            # (was 5 minutes) so that Monit is already running
                            # when syncd restarts.
#
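One way such a change could be applied to the file contents is a pair of regex substitutions. The sketch below is an assumption for illustration, not the code in this PR: `patch_monitrc` is a hypothetical helper, and the original values of 60 and 300 are inferred from the summary and from the "5 minutes" comment.

```python
import re


def patch_monitrc(text, interval_s=10, start_delay_s=10):
    """Rewrite the 'set daemon' and 'with start delay' values in monitrc text."""
    # Replace the number after 'set daemon', keeping the leading whitespace.
    text = re.sub(r"^(\s*set daemon\s+)\d+", rf"\g<1>{interval_s}", text, flags=re.M)
    # Replace the number after 'with start delay' in the same way.
    text = re.sub(
        r"^(\s*with start delay\s+)\d+", rf"\g<1>{start_delay_s}", text, flags=re.M
    )
    return text


# Assumed original settings (60 s interval, 300 s start delay).
original = "  set daemon 60\n    with start delay 300\n"
print(patch_monitrc(original), end="")
```

Monit has to reload its configuration before new values take effect, so a real test would also need to restart or reload the service and restore the original file afterwards.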

Relaxing the alert rule in /etc/monit/conf.d/sonic-host makes Monit send alert messages more readily.

check program container_checker with path "/usr/bin/container_checker"
    if status != 0 for 1 times within 1 cycles then alert repeat every 1 cycles
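The effect of the relaxed rule can be illustrated with a simplified model of a "for N times within M cycles" condition. This is a rough sketch rather than Monit's actual implementation: `alert_cycles` is a hypothetical function, and the stricter 2-of-2 rule is used only as a contrasting example.

```python
from collections import deque


def alert_cycles(statuses, times, within):
    """Return the cycle indices at which an alert would fire.

    An alert fires when at least `times` of the last `within` cycles
    reported a nonzero status (simplified model).
    """
    window = deque(maxlen=within)  # sliding window over the most recent cycles
    fired = []
    for cycle, status in enumerate(statuses):
        window.append(status != 0)
        if sum(window) >= times:
            fired.append(cycle)
    return fired


# Exit status of the checker over five cycles; nonzero means a container is down.
statuses = [0, 0, 1, 0, 0]
print(alert_cycles(statuses, times=1, within=1))  # -> [2]
print(alert_cycles(statuses, times=2, within=2))  # -> []
```

Under the 1-of-1 rule a single failed check is enough to raise an alert, whereas a rule that needs two consecutive failures stays silent when the container recovers before the next cycle.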

How did you verify/test it?

Run the test:
py.test container_checker/test_container_checker.py --inventory "../ansible/inventory, ../ansible/veos" --host-pattern arc-switch1025 --module-path ../ansible/library/ --testbed arc-switch1025-t0 --testbed_file ../ansible/testbed.csv --allow_recover

Any platform specific information?

Supported testbed topology if it's a new test case?

Documentation

@JibinBao JibinBao requested a review from a team as a code owner August 11, 2021 07:32
@lguohan lguohan merged commit 867f87c into sonic-net:master Aug 13, 2021
vmittal-msft pushed a commit to vmittal-msft/sonic-mgmt that referenced this pull request Sep 28, 2021
…the test sonic-net#7 (sonic-net#4008)

kazinator-arista pushed a commit to kazinator-arista/sonic-mgmt that referenced this pull request Mar 4, 2026
…atically (#23559)

#### Why I did it
src/sonic-utilities
```
* a3989b42 - (HEAD -> 202505, origin/202505) Exclude Smart Switch from modular chassis operations/checks (sonic-net#4008) (73 minutes ago) [mssonicbld]
```
#### How I did it
#### How to verify it
#### Description for the changelog
2 participants