Implementation of a Monitoring Daemon for storage devices in SONiC switches #433

Merged
prgeor merged 31 commits into sonic-net:master from ashwnsri:ssdmon-daemon
May 31, 2024
Conversation

@ashwnsri (Contributor) commented Feb 9, 2024

Description

This commit adds a monitoring daemon for Storage device attributes on a device running SONiC.
SONiC Storage Monitoring Daemon HLD

Motivation and Context

Storage devices degrade in performance over time due to factors such as total disk writes, bad-block management, lack of free space, sub-optimal operating temperature, and general wear and tear, all of which reflect the overall health of the disk.

The goal of the Storage Monitoring Daemon (storagemond) is to provide meaningful metrics for these issues and to enable streaming telemetry for these attributes, so that preventative measures can be triggered when performance degrades.

How Has This Been Tested?

Manually tested on the following platforms:

7050cx3.txt
S6100.txt
SN2700.txt

Additional Information (Optional)

@ashwnsri ashwnsri changed the title Implementation of a Storage Monitoring Daemon for storage devices in SONiC switches Implementation of a Monitoring Daemon for storage devices in SONiC switches Feb 9, 2024
@ashwnsri
Contributor Author

@assrinivasan please add more details for manual testing.

SONiC image upgrade, reboot, crash, fast/warm reboot

Added to the PR.

linux-foundation-easycla bot commented May 30, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

@ashwnsri
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).


STORAGEUTIL_LOAD_ERROR = 127

log = syslogger.SysLogger(SYSLOG_IDENTIFIER)
Collaborator

@assrinivasan can we move this inside the daemon class?

Contributor Author

Done in latest
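
A minimal sketch of what moving the logger into the daemon class could look like. It uses Python's standard `logging` module as a stand-in for `syslogger.SysLogger`; the class name `StorageDaemonSketch`, the identifier value, and the `log_warning` wrapper are assumptions for illustration, not the PR's actual code.

```python
import logging

# Assumed identifier value, for illustration only.
SYSLOG_IDENTIFIER = "stormond"


class StorageDaemonSketch:
    """Hypothetical daemon that owns its logger instead of a module-level one."""

    def __init__(self, log_identifier=SYSLOG_IDENTIFIER):
        # The logger is created per instance, so importing the module has no
        # logging side effects and tests can pass a different identifier.
        self.log = logging.getLogger(log_identifier)

    def log_warning(self, msg):
        # Thin wrapper so call sites do not depend on the logger type.
        self.log.warning(msg)
```

Keeping the logger on the instance also makes it straightforward to substitute a different logger in unit tests.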


if value is None: self.log_warning("{}:{} value = None in StateDB".format(storage_device, field))

self.statedb_storage_info_loaded = True
Collaborator

@assrinivasan what if the value is None? In that case we should fall back to the .json on the disk.

Contributor Author

Fixed this in latest. Also added a check for None values in the _load_fsio_rw_json function. In this scenario, both StateDB and the JSON file have junk values, so it will be considered an init case.
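
The None check described above might look roughly like the following. This is a sketch under stated assumptions, not the PR's `_load_fsio_rw_json`: the function name, the field names in `FSIO_FIELDS`, the JSON layout (device name mapped to a dict of counters), and the `(loaded, data)` return shape are all hypothetical.

```python
import json

# Hypothetical field names; the real daemon's keys may differ.
FSIO_FIELDS = ("total_fsio_reads", "total_fsio_writes")


def load_fsio_rw_json(path, fields=FSIO_FIELDS):
    """Return (loaded, data).

    loaded is False when the file is missing, malformed, or holds a
    None (JSON null) or absent value for any expected field.
    """
    try:
        with open(path) as fd:
            data = json.load(fd)
    except (OSError, ValueError):
        # Missing or unreadable file, or invalid JSON.
        return False, {}

    if not isinstance(data, dict):
        return False, {}

    for stats in data.values():
        if not isinstance(stats, dict):
            return False, {}
        for field in fields:
            # A null or missing counter makes the whole file unusable.
            if stats.get(field) is None:
                return False, {}

    return True, data
```

When this returns `False` and STATE_DB also has no usable values, the caller would treat the start as an init case.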

Comment on lines +191 to +200
if self.statedb_storage_info_loaded == False and self.fsio_json_file_loaded == True:
    self.use_fsio_json_baseline = True
    self.use_statedb_baseline = False

# If stormond is coming back up after a daemon crash, storage information would be saved in the
# STATE_DB. In that scenario, we use the STATE_DB information as the SoT and reconcile the FSIO
# reads and writes values.
elif self.statedb_storage_info_loaded == True:
    self.use_fsio_json_baseline = False
    self.use_statedb_baseline = True
Collaborator

@assrinivasan can you make the logic clearer, i.e., if the stats are available in STATE_DB, then use those, and as a fallback use the .json values from the backup

Contributor Author

Done in latest
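
The ordering the reviewer asks for (STATE_DB first, the JSON backup as fallback, otherwise a fresh start) can be sketched as a single function. The function name and the string return values are hypothetical and are not taken from the merged code.

```python
def choose_baseline(statedb_loaded, fsio_json_loaded):
    """Pick the source of truth for the FSIO read/write baseline.

    Returns "statedb", "fsio_json", or "init" (hypothetical labels).
    """
    # STATE_DB survives a daemon crash, so it wins whenever it loaded cleanly.
    if statedb_loaded:
        return "statedb"
    # Otherwise fall back to the JSON file saved on disk.
    if fsio_json_loaded:
        return "fsio_json"
    # Neither source is usable: start counters from scratch.
    return "init"
```

Checking STATE_DB first makes the precedence explicit and keeps the two flags from ever being set in a contradictory combination.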

@ashwnsri
Contributor Author

/azpw run

@mssonicbld
Collaborator

/AzurePipelines run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

prgeor previously approved these changes May 30, 2024
@prgeor prgeor merged commit f41ecca into sonic-net:master May 31, 2024