
Support for Memory Statisticsd Process and Configurations #20355

Open

Arham-Nasir wants to merge 27 commits into sonic-net:master from Arham-Nasir:feature/memory-statistics-daemon-process

Conversation

@Arham-Nasir (Contributor) commented Sep 26, 2024

The Memory_Statisticsd Process was added to enhance system monitoring by collecting memory usage data at configurable intervals. This will allow the system to store historical data, making it easier to analyze memory trends and debug memory-related issues.

@Arham-Nasir (Contributor, Author)

Hi @qiluo-msft, @prgeor, and @FengPan-Frank,

Could you please help review the HLD and the linked code PRs for our feature?

Thank you!

    memory_statistics = self.collect_memory_statistics()
    self.store_memory_statistics(memory_statistics)
except Exception as e:
    self.logger.log(f"Error collecting or storing memory statistics: {e}", logging.ERROR)
Contributor

Will this be repeated in a dead loop and generate a bunch of logs?

Contributor Author

Thank you for your feedback. The memory statistics daemon is designed with robust loop control to prevent dead loops. It collects data at a configurable interval, logs it in compressed format for historical data, and shuts down gracefully when triggered. The loop control mechanism ensures it waits according to the sampling interval and only continues running until a shutdown signal is received. Potential issues could arise from misconfigurations (e.g., zero or negative sampling intervals) but the daemon is designed to handle such cases effectively.
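As a sketch of that loop-control pattern (the names `Collector` and `DEFAULT_INTERVAL` are illustrative, not the daemon's actual code), a `threading.Event` lets the loop sleep for the sampling interval yet wake immediately on shutdown, and a non-positive interval falls back to a safe default:

```python
import threading

DEFAULT_INTERVAL = 300  # fallback sampling interval in seconds (assumed value)

class Collector:
    def __init__(self, interval):
        # Guard against zero/negative intervals so the loop can never spin hot.
        self.interval = interval if interval > 0 else DEFAULT_INTERVAL
        self.shutdown = threading.Event()
        self.samples = []

    def collect_once(self):
        self.samples.append("snapshot")  # stand-in for real collection

    def run(self, max_iterations=None):
        count = 0
        while not self.shutdown.is_set():
            self.collect_once()
            count += 1
            if max_iterations is not None and count >= max_iterations:
                break
            # wait() sleeps for the interval but returns early on shutdown.
            self.shutdown.wait(self.interval)
```

Because `Event.wait()` returns as soon as the shutdown flag is set, the loop neither busy-spins on a bad interval nor delays shutdown by a full sampling period.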

Arham-Nasir and others added 10 commits October 14, 2024 18:32
…:Arham-Nasir/sonic-buildimage into feature/memory-statistics-daemon-process

Merge branch 'feature/memory-statistics-daemon-process'
- Integrated changes from the latest upstream.
- Added SyslogLogger class for syslog message logging.
…handling, time delta, and memory size formatting.

Signed-off-by: Arham-Nasir <[email protected]>
…or and MemoryStatisticsCollector class

Signed-off-by: Arham-Nasir <[email protected]>
@ridahanif96 (Contributor)

@qiluo-msft @xincunli-sonic please help review this feature.

@Arham-Nasir (Contributor, Author)

Hi @FengPan-Frank and @qiluo-msft,
When you have a moment, could you please review this PR? I would appreciate your feedback and suggestions. Thank you for your time and guidance.

@FengPan-Frank (Contributor)

> Hi @FengPan-Frank and @qiluo-msft , When you have a moment, could you please review this PR? I would appreciate your feedback and suggestions. Thank you for your time and guidance

Thanks for the work, could you add test case to cover collector service code?

@Arham-Nasir (Contributor, Author)

> > Hi @FengPan-Frank and @qiluo-msft , When you have a moment, could you please review this PR? I would appreciate your feedback and suggestions. Thank you for your time and guidance
>
> Thanks for the work, could you add test case to cover collector service code?

Thank you for the review and suggestion, I’ll add test cases for the collector service code. Please let me know if there are any specific scenarios you'd like me to cover.

@FengPan-Frank (Contributor)

> > > Hi @FengPan-Frank and @qiluo-msft , When you have a moment, could you please review this PR? I would appreciate your feedback and suggestions. Thank you for your time and guidance
> >
> > Thanks for the work, could you add test case to cover collector service code?
>
> Thank you for the review and suggestion, I’ll add test cases for the collector service code. Please let me know if there are any specific scenarios you'd like me to cover.

Just a common sanity check to cover the daemon functions.

@@ -0,0 +1,10 @@
# Memory Statistics Daemon Configuration
@xincunli-sonic (Contributor) commented Dec 2, 2024

Would it be better if these configurations were put into config db instead of a static conf file? #Closed

Contributor Author

We have stored configurations in both config_db and a static conf file for flexibility and reliability.

At startup, the daemon reads default settings from the conf file to ensure a consistent state after a restart. During runtime, the daemon dynamically updates its configuration from config_db without needing a restart. This approach combines the stability of predefined defaults with the convenience of live updates, ensuring predictable behavior and seamless adaptability.
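A minimal sketch of that layering (section and key names here are hypothetical, not the actual SONiC schema): defaults come from the static conf file at startup, and CONFIG_DB values override them at runtime:

```python
import configparser

# Assumed defaults; the real conf file keys may differ.
DEFAULTS = {"enabled": "false", "sampling_interval": "5", "retention_period": "15"}

def load_startup_config(conf_text):
    # The static conf file gives the daemon a predictable state after restart.
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    cfg = dict(DEFAULTS)
    if parser.has_section("default"):
        cfg.update(parser["default"])
    return cfg

def apply_runtime_config(cfg, config_db_entry):
    # CONFIG_DB values win at runtime, so changes apply without a restart.
    merged = dict(cfg)
    merged.update({k: v for k, v in config_db_entry.items() if v is not None})
    return merged
```

The two-stage merge keeps startup deterministic while still letting live CONFIG_DB writes take effect on the next refresh.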

@@ -0,0 +1,1877 @@
#!/usr/bin/env python3
import psutil
@xincunli-sonic (Contributor) commented Dec 2, 2024

Can you sort the imports? (nit) #Closed

Contributor Author

Thank you for the review. I have organized the imports following the standard Python import grouping practice: standard library imports, third-party imports, and local application imports, sorted alphabetically within each group for improved readability and maintainability.

"""Logs a message with the 'DEBUG' level."""
self.log(syslog.LOG_DEBUG, message)

def close_logger(self):
@xincunli-sonic (Contributor) commented Dec 2, 2024

close_logger

There is no caller for this close method? #Closed

Contributor Author

I have addressed it by adding context manager methods (__enter__, __exit__) to manage the syslog connection lifecycle, ensuring it opens and closes cleanly. Additionally, the close method now ensures proper resource management when used explicitly.
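A sketch of that pattern (simplified, not the PR's exact class), where `__enter__`/`__exit__` tie the syslog connection to a `with` block and the explicit close stays safe to call twice:

```python
import syslog

class SyslogLogger:
    """Simplified logger whose syslog connection is tied to a with-block."""

    def __init__(self, ident="memorystatsd"):
        self.ident = ident
        self.is_open = False

    def __enter__(self):
        syslog.openlog(ident=self.ident)
        self.is_open = True
        return self

    def log_debug(self, message):
        syslog.syslog(syslog.LOG_DEBUG, message)

    def close_logger(self):
        # Safe to call explicitly or via __exit__; closing twice is a no-op.
        if self.is_open:
            syslog.closelog()
            self.is_open = False

    def __exit__(self, exc_type, exc, tb):
        self.close_logger()
        return False  # never swallow exceptions
```

With this shape, `close_logger` always has a caller: either the explicit call site or the context manager's exit path.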

logger.log_info(f"Removing outdated entry: {old_entry}")
del total_dict['system_memory'][memory_type][old_entry]

total_dict['system_memory']['count'] = len(total_dict['system_memory'].get('system', {}))
@xincunli-sonic (Contributor) commented Dec 2, 2024

count

A constant? #Closed

        logger.log_info("Accepted new connection")
        self.handle_connection(connection)
    except socket.timeout:
        continue
@xincunli-sonic (Contributor) commented Dec 2, 2024

continue

Add a log in case a timeout happens; otherwise debuggability is lost. #Closed

Contributor Author

Thanks for the feedback! I've added a logger.log_debug for socket.timeout to improve debuggability.
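A sketch of that accept-loop change (hypothetical helper; the daemon's actual code differs): the timeout path now records a debug line instead of silently continuing:

```python
import socket

def accept_once(server_sock, log):
    """Attempt one accept(); on timeout, log a debug line and report False."""
    try:
        conn, _addr = server_sock.accept()
        log.append("Accepted new connection")
        conn.close()
        return True
    except socket.timeout:
        # The debug line keeps idle-loop behavior visible when troubleshooting.
        log.append("accept() timed out; continuing to wait")
        return False
```

The surrounding `while` loop can then simply `continue` after a `False` return without losing the trace of how long the daemon sat idle.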

"""
self.pid_file = pid_file

def daemonize(self):
Contributor

daemonize

Add validation to confirm the daemonization process succeeded and log an appropriate error if it did not.

Contributor Author

Hi @xincunli-sonic ,

The enhanced Daemonizer class is designed to ensure reliable daemon creation, with robust validation for process isolation, session management, and file descriptor redirection.

Thank you.
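One way to sketch such post-daemonization validation (a hypothetical helper, not the PR's Daemonizer class): confirm the PID file was written and that the recorded process is actually alive, raising a clear error otherwise:

```python
import os

def validate_daemonization(pid_file):
    """Raise (with a clear message) if the daemon did not actually start."""
    if not os.path.exists(pid_file):
        raise RuntimeError(f"daemonization failed: {pid_file} was not created")
    with open(pid_file) as f:
        pid = int(f.read().strip())
    try:
        os.kill(pid, 0)  # signal 0 checks existence without sending anything
    except ProcessLookupError:
        raise RuntimeError(f"daemonization failed: process {pid} is not running")
    return pid
```

A caller can log the RuntimeError message before exiting, which gives the "log an appropriate error if it did not" behavior the review asked for.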

if slot < num_columns:
self.add_entry_to_time_group_list(time_entry_summary, slot, memory_entry)

def aggregate_data(self, request_data, time_entry_summary, num_columns):
@xincunli-sonic (Contributor) commented Dec 2, 2024

aggregate_data

In aggregate_data, the logic for adding entries to the time_group_list could be inefficient for large datasets due to multiple nested calls and list manipulations.
Suggestion: Use batch processing or caching mechanisms to reduce redundant operations and improve performance. #Closed

Contributor Author

Thank you for your valuable feedback and suggestion to optimize the aggregate_data function. I have updated the code by implementing batch processing with dynamic sizing to reduce redundant operations, minimizing nested calls for better performance, and adding robust error handling with detailed logging. Let me know if further adjustments are needed.
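A sketch of the batching idea (illustrative names; the real aggregate_data signature differs): group entries into time slots in a single pass over fixed-size chunks rather than per-entry nested calls:

```python
def batched(items, size):
    """Yield fixed-size chunks so downstream work happens per batch, not per item."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def aggregate(entries, num_slots, batch_size=100):
    # One pass over the data; entries whose slot is out of range are skipped,
    # mirroring the `slot < num_columns` guard in the original snippet.
    slots = [[] for _ in range(num_slots)]
    for chunk in batched(entries, batch_size):
        for slot, value in chunk:
            if slot < num_slots:
                slots[slot].append(value)
    return slots
```

Processing per chunk bounds the working set and avoids repeated list reallocation when the historical dataset is large.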

logger.log_info(f"Service initialized with name: {self.name}")

self.config_file_path = config_file_path
self.memory_statistics_lock = threading.Lock()
@xincunli-sonic (Contributor) commented Dec 2, 2024

memory_statistics_lock

Please check all code for shared resource access and ensure that locks are applied consistently, especially when modifying global state. #Closed

Contributor Author

Thank you for pointing this out. I have carefully reviewed the code and ensured that locks are applied consistently for all shared resource access.

Added ThreadSafeConfig class for safe configuration access and consistently used memory_statistics_lock during memory statistics collection to prevent race conditions.

Let me know if further adjustments are needed.
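A minimal sketch of such a ThreadSafeConfig wrapper (simplified relative to the PR): every read and write of the shared dictionary goes through a single lock:

```python
import threading

class ThreadSafeConfig:
    """All access to the shared configuration dict goes through one lock."""

    def __init__(self, initial=None):
        self._lock = threading.Lock()
        self._data = dict(initial or {})

    def get(self, key, default=None):
        with self._lock:
            return self._data.get(key, default)

    def update(self, **changes):
        with self._lock:
            self._data.update(changes)
```

Routing all access through the wrapper's methods makes it hard to forget the lock at any individual call site, which is the consistency the review asked for.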

    if os.path.exists(pid_file):
        os.unlink(pid_file)
        logger.log_info(f"Removed PID file: {pid_file}")
except Exception as e:
Collaborator

except Exception as

The same issue as above: still too general.

Contributor Author

Replaced except Exception with specific exceptions for improved clarity and control.

try:
    with gzip.open(memory_statistics_config['TOTAL_MEMORY_STATISTICS_LOG_FILENAME'], 'wt', encoding='utf-8') as jfile:
        json.dump(total_dict, jfile)
except Exception as e:
Collaborator

except Exception

Still too general an Exception.

Contributor Author

Thank you for the feedback.
I’ve replaced all instances of the generic except Exception with more specific exceptions to ensure better error handling and clarity. Please let me know if you have any further suggestions.
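As an illustration of narrowing that handler (hypothetical function name; the real code logs via the daemon's logger rather than printing), the gzip/JSON write can realistically raise OSError (filesystem problems) or TypeError (non-serializable data), so those are what get caught:

```python
import gzip
import json

def store_statistics(total_dict, path):
    """Write one compressed snapshot, catching only errors this step can raise."""
    try:
        with gzip.open(path, "wt", encoding="utf-8") as jfile:
            json.dump(total_dict, jfile)
        return True
    except (OSError, TypeError) as e:
        # OSError: filesystem problems; TypeError: non-serializable data.
        print(f"failed to store memory statistics: {e}")
        return False
```

Anything outside those two categories now propagates instead of being silently absorbed, which is the clarity the reviewer asked for.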

@mssonicbld (Collaborator)

/azp run Azure.sonic-buildimage

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Arham-Nasir <[email protected]>

@lguohan (Collaborator) commented May 6, 2025

Need to rethink the approach; this looks like reinventing the wheel of a time-series database on SONiC itself.


Signed-off-by: Arham-Nasir <[email protected]>


Signed-off-by: Arham-Nasir <[email protected]>


@Arham-Nasir (Contributor, Author)

> Need to rethink the approach; this looks like reinventing the wheel of a time-series database on SONiC itself.

We want to clarify that the purpose of memorystatsd is not to build a time-series database (TSDB) inside the SONiC Redis DB. Instead, memorystatsd is designed to be a simple and lightweight solution that fits well in SONiC environments.

To keep the system efficient and reduce complexity, memorystatsd avoids using Redis for storing data.

Instead, it saves compressed JSON snapshots to /var/log/memorystats/memory-stats.json.gz.

Each snapshot is very compact (approximately 225 bytes), resulting in a total of around 63 KB per day. Depending on the configured retention period, the storage footprint remains minimal.

Old data is automatically deleted based on the set retention period, helping to meet platform limits and keep resource usage low.
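A sketch of that retention pruning (the field name and the 15-day default here are assumptions, not the daemon's actual values):

```python
import time

RETENTION_DAYS = 15  # assumed default; the daemon's retention is configurable

def prune_old_snapshots(snapshots, now=None):
    """Drop snapshots older than the retention window to bound disk usage."""
    now = time.time() if now is None else now
    cutoff = now - RETENTION_DAYS * 86400
    return [s for s in snapshots if s["timestamp"] >= cutoff]
```

Run on each collection cycle, this keeps the on-disk history at roughly retention_days × 63 KB regardless of uptime.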

@wajahatrazi (Contributor)

Dear @qiluo-msft

This is the only PR left to be merged as per the scope of the HLD, hoping for your support.

@qiluo-msft (Collaborator)

I left a comment in "Add HLD for Memory Statistics Enhancement with New Metrics, Leak Detection, and gNMI Access" by Arham-Nasir · Pull Request #1962 · sonic-net/SONiC.
A comment raised in the community review meeting: why develop a new daemon, MemoryStatsd, for monitoring memory data, which overlaps with the existing procdockerd? Could this feature reuse the existing daemon, or could the two be combined into one?
I see some code PRs are already merged, but this comment is still applicable.

Please resolve this issue at high priority.
