Skip to content

System Ready#10479

Merged
sujinmkang merged 6 commits intosonic-net:masterfrom
sg893052:SYSTEM_READY
May 20, 2022
Merged

System Ready#10479
sujinmkang merged 6 commits intosonic-net:masterfrom
sg893052:SYSTEM_READY

Conversation

@sg893052
Copy link
Copy Markdown
Contributor

@sg893052 sg893052 commented Apr 6, 2022

Why I did it

At present, there is no mechanism in an event driven model to know that the system is up with all the essential sonic services and also, all the docker apps are ready along with port ready status to start the network traffic. With the asynchronous architecture of SONiC, we will not be able to verify if the config has been applied all the way down to the HW. But we can get the closest up status of each app and arrive at the system readiness.

How I did it

A new python based system monitor tool is introduced under system-health framework to monitor all the essential system host services including docker wrapper services on an event based model and declare the system is ready. This framework gives provision for docker apps to notify its closest up status. CLIs are provided to fetch the current system status and also service running status and its app ready status along with failure reason if any.

How to verify it

"show system-health sysready-status" click CLI
Syslogs for system ready

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Apr 6, 2022

This pull request introduces 15 alerts when merging 590ee50a9547bf9d9fd88b478aeefdca37e0f571 into b152f2a - view on LGTM.com

new alerts:

  • 9 for Unused import
  • 2 for Variable defined multiple times
  • 1 for Wrong name for an argument in a call
  • 1 for Testing equality to None
  • 1 for Unused local variable
  • 1 for Unreachable code

@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator

Could you please add comments for each new function/class?

@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator

Could you please fix the LGTM issues?

@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator

Could please re-format files? I see many format issues, for example:

SYSLOG_IDENTIFIER="system#monitor"

should be

SYSLOG_IDENTIFIER = "system#monitor"

@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator

Junchao-Mellanox commented Apr 8, 2022

Could you please remove unused comments?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add comment to explain the purpose here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use ConfigDBConnector in swsscommon.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use sonic_py_common.logger.Logger?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a special reason to use this class instead of directly using dict?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When will the file being closed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed lock as it is not need here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the purpose of this sleep?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you need stop the event loop?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the event loop runs, as ProcessTaskBase in src/sonic-py-common/sonic_py_common/task_base.py has task_stop which actually kills the subprocess if not joined(exited) on time.

def task_stop(self):
    # Signal the process to stop
    self.task_stopping_event.set()

    # Wait for the process to exit
    self._task_process.join(self._stop_timeout_secs)

    # If the process didn't exit, attempt to kill it
    if self._task_process.is_alive():
        os.kill(self._task_process.pid, signal.SIGKILL)

    if self._task_process.is_alive():
        return False

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the lock for?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed lock as it is not needed here.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use constant value instead of magic number

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sysmon.system_service is a infinite loop, the rest of the code in this function will never run.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taken care.

@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator

could you please add unit test?

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Apr 24, 2022

This pull request introduces 4 alerts when merging 9388306a2644f61c903aa5b918f7ba5564af4af0 into 5cd6bc4 - view on LGTM.com

new alerts:

  • 2 for Unused import
  • 1 for Use of the return value of a procedure
  • 1 for Unused local variable

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate the reason to keep it as a list? Query/remove an element from set is faster than list.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spl_srv_list not used.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use output.splitlines()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, used glob in place of listdir and modified dnsrvs_name from list to set as per suggestion.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you consider using hmset?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Apr 25, 2022

This pull request introduces 1 alert when merging ee0818c6f8ddc19ae926e7a128016717bd515f84 into 3fc3259 - view on LGTM.com

new alerts:

  • 1 for Unused import

@sg893052 sg893052 force-pushed the SYSTEM_READY branch 3 times, most recently from e6f68dd to 001b7ed Compare April 29, 2022 01:55
@lgtm-com
Copy link
Copy Markdown

lgtm-com bot commented Apr 29, 2022

This pull request introduces 1 alert when merging 001b7ed92156e6ff9099f8a254a0c711ddee122c into e0f5333 - view on LGTM.com

new alerts:

  • 1 for Unused import

@sg893052 sg893052 force-pushed the SYSTEM_READY branch 2 times, most recently from 95e5686 to 9782dbe Compare May 5, 2022 12:59
@sg893052
Copy link
Copy Markdown
Contributor Author

sg893052 commented May 5, 2022

Could please re-format files? I see many format issues, for example:

SYSLOG_IDENTIFIER="system#monitor"

should be

SYSLOG_IDENTIFIER = "system#monitor"

Taken care of all the format issues

@sg893052
Copy link
Copy Markdown
Contributor Author

sg893052 commented May 5, 2022

Could you please add comments for each new function/class?

Added comments for all functions and classes

@sg893052
Copy link
Copy Markdown
Contributor Author

sg893052 commented May 5, 2022

could you please add unit test?

Have added unit test coverage for the system ready code in test_system_health.py
Also, attaching the manual UT performed runtime on a target.

Manual UT runtime - system ready.txt

@adyeung
Copy link
Copy Markdown
Collaborator

adyeung commented May 6, 2022

/azpw run Azure.sonic-buildimage

1 similar comment
@sg893052
Copy link
Copy Markdown
Contributor Author

sg893052 commented May 7, 2022

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Copy Markdown
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

Bring in all global functions under Sysmonitor class
Add unit test code coverage for sysmonitor code
Formatted the code
@mssonicbld
Copy link
Copy Markdown
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@sg893052
Copy link
Copy Markdown
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Copy Markdown
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@sg893052
Copy link
Copy Markdown
Contributor Author

/azpw run Azure.sonic-buildimage

@mssonicbld
Copy link
Copy Markdown
Collaborator

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@sg893052
Copy link
Copy Markdown
Contributor Author

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Commenter does not have sufficient privileges for PR 10479 in repo Azure/sonic-buildimage

@sg893052
Copy link
Copy Markdown
Contributor Author

/AzurePipelines run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Commenter does not have sufficient privileges for PR 10479 in repo Azure/sonic-buildimage

@liushilongbuaa
Copy link
Copy Markdown
Contributor

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@zhangyanzhao
Copy link
Copy Markdown

/azp run Azure.sonic-buildimage

@azure-pipelines
Copy link
Copy Markdown

Azure Pipelines successfully started running 1 pipeline(s).

@sujinmkang sujinmkang merged commit f37dd77 into sonic-net:master May 20, 2022
description "This configuration controls the system ready tool to check
the app ready/up status";
type boolean;
default false;
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add yang model unit test?
We recently fixed a bug #10986

liushilongbuaa pushed a commit to liushilongbuaa/sonic-buildimage that referenced this pull request Jun 20, 2022
…anch

Related work items: #52, #71, #73, #75, #77, sonic-net#1306, sonic-net#1588, sonic-net#1991, sonic-net#2031, sonic-net#2040, sonic-net#2053, sonic-net#2066, sonic-net#2069, sonic-net#2087, sonic-net#2107, sonic-net#2110, sonic-net#2112, sonic-net#2113, sonic-net#2117, sonic-net#2124, sonic-net#2125, sonic-net#2126, sonic-net#2128, sonic-net#2130, sonic-net#2131, sonic-net#2132, sonic-net#2133, sonic-net#2134, sonic-net#2135, sonic-net#2136, sonic-net#2137, sonic-net#2138, sonic-net#2139, sonic-net#2140, sonic-net#2143, sonic-net#2158, sonic-net#2161, sonic-net#2233, sonic-net#2243, sonic-net#2250, sonic-net#2254, sonic-net#2260, sonic-net#2261, sonic-net#2267, sonic-net#2278, sonic-net#2282, sonic-net#2285, sonic-net#2288, sonic-net#2289, sonic-net#2292, sonic-net#2294, sonic-net#8887, sonic-net#9279, sonic-net#9390, sonic-net#9511, sonic-net#9700, sonic-net#10025, sonic-net#10322, sonic-net#10479, sonic-net#10484, sonic-net#10493, sonic-net#10500, sonic-net#10580, sonic-net#10595, sonic-net#10628, sonic-net#10634, sonic-net#10635, sonic-net#10644, sonic-net#10670, sonic-net#10691, sonic-net#10716, sonic-net#10731, sonic-net#10750, sonic-net#10751, sonic-net#10752, sonic-net#10761, sonic-net#10769, sonic-net#10775, sonic-net#10776, sonic-net#10779, sonic-net#10786, sonic-net#10792, sonic-net#10793, sonic-net#10800, sonic-net#10806, sonic-net#10826, sonic-net#10839, sonic-net#10840, sonic-net#10842, sonic-net#10844, sonic-net#10847, sonic-net#10849, sonic-net#10852, sonic-net#10865, sonic-net#10872, sonic-net#10877, sonic-net#10886, sonic-net#10889, sonic-net#10903, sonic-net#10904, sonic-net#10905, sonic-net#10913, sonic-net#10914, sonic-net#10916, sonic-net#10919, sonic-net#10925, sonic-net#10926, sonic-net#10929, sonic-net#10933, sonic-net#10934, sonic-net#10937, sonic-net#10941, sonic-net#10947, sonic-net#10952, sonic-net#10953, sonic-net#10957, sonic-net#10959, sonic-net#10971, sonic-net#10972, sonic-net#10980
@Junchao-Mellanox
Copy link
Copy Markdown
Collaborator

Junchao-Mellanox commented Aug 29, 2025

Hi @sg893052 , the PR introduces some sub-processes to system-health daemon. We observed that each of the daemon uses many memory than it needs:

admin@sonic:~$ ps aux | grep healthd
root       13758  0.0  0.3 227860 58960 ?        Ss   Aug26   2:42 /usr/bin/python3 /usr/local/bin/healthd
root       14828  0.0  0.2 377720 46108 ?        Sl   Aug26   0:21 /usr/bin/python3 /usr/local/bin/healthd
root       14863  0.0  0.3 167212 49248 ?        S    Aug26   0:59 /usr/bin/python3 /usr/local/bin/healthd
root       14881  0.0  0.2 172304 47316 ?        S    Aug26   0:01 /usr/bin/python3 /usr/local/bin/healthd
root       14890  0.0  0.2 160936 41596 ?        S    Aug26   0:17 /usr/bin/python3 /usr/local/bin/healthd

Do you remember why did you decide to use multi-processing instead of multi-threading when you implement this feature?

@qiluo-msft
Copy link
Copy Markdown
Collaborator

@sg893052 could you check above issue?

@sg893052
Copy link
Copy Markdown
Contributor Author

I think, I chose multiprocessing over multithreading for this implementation because, during initial testing with threads, I observed that one of the monitoring threads (particularly the D-Bus listener) could occasionally become unresponsive — either due to blocking I/O, event loop stalls, or external system pressure. When this happened, it not only failed to push events into the queue, but in some cases also interfered with timely queue consumption, leading to missed or delayed system state updates.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants