Skip to content

HLD for Memory_Statistics#1760

Merged
ridahanif96 merged 16 commits intosonic-net:masterfrom
Arham-Nasir:memory_stats
Dec 14, 2024
Merged

HLD for Memory_Statistics#1760
ridahanif96 merged 16 commits intosonic-net:masterfrom
Arham-Nasir:memory_stats

Conversation

@Arham-Nasir
Copy link
Contributor

@Arham-Nasir Arham-Nasir commented Jul 23, 2024

This High-Level Design (HLD) document explains the Memory Statistics feature in SONiC. The aim is to improve memory monitoring and management to optimize network performance and reliability.

The Linked PR are as follows:

Repo PR title State
sonic-host-services Support for Memory Statistics Host-Services GitHub issue/pull request detail
sonic-buildimage Add YANG Model and Configuration Support for Memory Statistics GitHub issue/pull request detail
sonic-buildimage Support for Memory Statisticsd Process and Configurations GitHub issue/pull request detail
sonic-utilities Memory Statistics Config and Show Commands GitHub issue/pull request detail
sonic-swss-common Add definition for MEMORY_STATISTICS table in schema GitHub issue/pull request detail

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jul 23, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@zhangyanzhao
Copy link
Collaborator

reviewer is expected and please leave your comments if you want to be a reviewer. Thanks.

@zhangyanzhao
Copy link
Collaborator

Comment on lines +267 to +277
Metric Current High Low D1-D3 D3-D5 D5-D7 D7-D9 D9-D11 D11-D13 D13-D15
Value Value Value 01Jun24 03Jun24 05Jun24 07Jun24 09Jun24 11Jun24 13Jun24
-------------- -------- -------- -------- --------- --------- --------- --------- --------- --------- ---------
Total Memory 15.6G 15.6G 15.1G 15.1G 15.2G 15.3G 15.3G 15.4G 15.5G 15.5G
Used Memory 2.3G 2.5G 2.0G 2.1G 2.2G 2.2G 2.3G 2.4G 2.3G 2.2G
Free Memory 11.9G 12.4G 10.6G 11.0G 11.2G 11.4G 11.5G 11.7G 11.8G 11.9G
Available Memory 13.0G 14.3G 12.4G 12.6G 12.7G 12.8G 12.9G 13.0G 13.1G 13.2G
Cached Memory 1.2G 1.5G 1.0G 1.1G 1.2G 1.3G 1.4G 1.3G 1.4G 1.2G
Buffer Memory 0.3G 0.4G 0.2G 0.2G 0.3G 0.3G 0.4G 0.3G 0.3G 0.4G
Shared Memory 0.5G 0.6G 0.4G 0.5G 0.5G 0.5G 0.4G 0.5G 0.5G 0.5G

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Arham-Nasirare these stats collected from the output of free linux command?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The data is not directly collected from the output of the free Linux command. Instead, they are gathered using the psutil Python library, which directly accesses system memory information via the same underlying mechanisms that the free command relies on. Specifically, the psutil.virtual_memory() function retrieves memory stats such as total, used, free, buffers, cached, and available memory.This approach allows for more direct access to memory data, avoiding the need to parse command output while providing flexibility for programmatic use.

Analysis Period: From 2024-06-01 09:00:00 to 2024-06-15 09:00:00
Interval: 2 days
--------------------------------------------------------------------------------
Metric Current High Low D1-D3 D3-D5 D5-D7 D7-D9 D9-D11 D11-D13 D13-D15
Copy link
Contributor

@qiluo-msft qiluo-msft Aug 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this design overlapping with existing procdockerd. We should reuse existing feature.

sonic-gnmi (telemetry) already has a comprehensive implementation https://github.com/sonic-net/sonic-gnmi/blob/64ed32b715e1673c3376701ee192d6292f5ec0ec/sonic_data_client/non_db_client.go#L235. Please check and ensure matrix are aligned.
Even better, suggest reuse:
1. delegate telemetry container as the memory collector
2. reduce this HLD scope to only memory-stats and storage, connecting to telemetry container by gnmi protocol
3. ensure this design could run inside and outside sonic switch
4. ensure this feature could be completely disabled inside a sonic switch and no performance burden if disabled

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Thank you for the suggestion. We will review the sonic-gnmi implementation to align our metrics with the existing telemetry framework for system consistency. We will also explore reusing the telemetry container for memory collection and narrow the HLD scope to focus on memory statistics and storage.
  • Currently, we use the psutil.virtual_memory() function from the psutil library to gather memory data, which is similar to the getProcMeminfo() function in non_db_client.go, as both methods pull information from /proc/meminfo. We believe that psutil offers a more efficient and Pythonic approach compared to relying solely on the telemetry container for this purpose.
  • Can you elaborate more on how the design could run inside and outside the sonic switch.

Copy link
Contributor

@qiluo-msft qiluo-msft Nov 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this design rely on telemetry container as the memory stats collector, it can run inside or outside sonic switch.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this thread is still applicable.

Copy link
Contributor Author

@Arham-Nasir Arham-Nasir Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To provide some context on the current design, we gather memory statistics directly from SONiC using the psutil library. The collected data is stored as gzip files in the /var/log directory and is accessible via show commands through Unix sockets.

We appreciate your suggestion to leverage the telemetry container. As part of enhancements for this feature, we will certainly consider integrating with the telemetry container and aligning the design with your recommendations.

Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main issue / concern is that this design overlapping with existing procdockerd. We should reuse existing feature. I will refine above comments, I did not insist that reusing telemetry is only solution to reuse procdockerd feature.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In enable/disable case, do you still use SIGHUP which means memstatsd will keep running? or just SIGTERM the memoryStats daemon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hostcfgd will detect the enable/disable flag in ConfigDB. For disable, it sends SIGTERM to stop the daemon, where the daemon gracefully stops all processes, releases resources, and completely shuts down. For enable, it restarts the daemon using systemd, and SIGHUP is used for configuration reloads while the daemon is running.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, better to mention it explicitly, memorystats is not daemon but process controlled by hostcfgd, what if hostcfgd crashes, will it reload memorystats and recalculate retention?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was a misunderstanding. Yes, we are introducing memorystatsd as a process and not a separate container. If the hostcfgd crashes, memorystats will restart and when the user enables memory statistics feature, it will use the default retention time unless the user explicitly mentions a custom retention period. Lets say, user enters 5 as retention period, then it will start the retention period from that time and will show the last 5 days data.

@Arham-Nasir
Copy link
Contributor Author

Arham-Nasir commented Sep 26, 2024

@qiluo-msft @xincunli-sonic @zbud-msft @FengPan-Frank We have linked the code PRs. Please help review the feature.

Signed-off-by: Arham-Nasir <100487254+Arham-Nasir@users.noreply.github.com>
@ridahanif96
Copy link
Collaborator

ridahanif96 commented Oct 25, 2024

@qiluo-msft @xincunli-sonic @zbud-msft @FengPan-Frank We have linked the code PRs. Please help review the feature.
Target 202411, help with expedite review.

@ridahanif96
Copy link
Collaborator

ridahanif96 commented Nov 6, 2024

@qiluo-msft @xincunli-sonic @FengPan-Frank Please help review the feature.
Target 202411, help with expedite review.

@ridahanif96
Copy link
Collaborator

@FengPan-Frank please help review and approve this feature. Code PRs are already in review. Target is 202411. We are running out of time for HLD review. Please help expedite review.

@FengPan-Frank
Copy link
Contributor

@FengPan-Frank please help review and approve this feature. Code PRs are already in review. Target is 202411. We are running out of time for HLD review. Please help expedite review.

Pls get final signoff from Qi, I think code PR will need to be merged before HLD.

@ridahanif96
Copy link
Collaborator

@FengPan-Frank please help review and approve this feature. Code PRs are already in review. Target is 202411. We are running out of time for HLD review. Please help expedite review.

Pls get final signoff from Qi, I think code PR will need to be merged before HLD.

@qiluo-msft help with this HLD review.

| sampling_interval | unit16 | Interval for memory data collection. |
| retention_period | unit16 | Duration for which memory data is retained. |
| sampling_interval | unit8 | Interval for memory data collection. |
| retention_period | unit8 | Duration for which memory data is retained. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the units for interval and period? in seconds? IF yes, is uint8 too small?
typo: unit8->uint8

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your valuable feedback. To clarify:

  • The sampling_interval is provided by the user in minutes, with a configurable range from 3 to 15 minutes.
  • The retention_period is provided by the user in days, with a configurable range from 1 to 30 days.

Since the values for sampling_interval and retention_period fall within the range of uint8 (0 to 255), we believe uint8 is appropriate as the data type for representing the user input.

Additionally, the typo has been corrected to uint8 as suggested and the descriptions have been updated for further clarity

@zhangyanzhao
Copy link
Collaborator

code PRs and HLD PR are not merged, move to backlog

Copy link
Contributor

@xincunli-sonic xincunli-sonic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ridahanif96 ridahanif96 merged commit 07851ff into sonic-net:master Dec 14, 2024
@anilpannala anilpannala moved this from 🏗 In Progress to ✅ Done in SONiC 202505 Release May 30, 2025
@zhangyanzhao zhangyanzhao moved this from ✅ Done to MovedToBacklog in SONiC 202505 Release Jun 8, 2025
@zhangyanzhao
Copy link
Collaborator

One code PR is not merged yet, move the feature to "MoveToBacklog" @anilpannala

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: MovedToBacklog
Status: MovedToBacklog

Development

Successfully merging this pull request may close these issues.

9 participants