Corefile uploader service (#3887)#3914
Corefile uploader service (#3887)#3914renukamanavalan wants to merge 1 commit intosonic-net:201811from
Conversation
* Corefile uploader service 1) A service is added to watch /var/core and upload to Azure storage 2) The service is disabled on boot. One may enable explicitly. 3) The .rc file to be updated with acct credentials and http proxy to use. 4) If service is enabled with no credentials, it would sleep, with periodic log messages 5) For any update in .rc, the service has to be restarted to take effect. * Remove rw permission for .rc file for group & others. * Changes per review comments. Re-ordered .rc file per JSON.dump order. Added a script to enable partial update of .rc, which HWProxy would use to add acct key. * Azure storage upload requires python module futures, hence added it to install list. * Removed trailing spaces. * A mistake in name corrected. Copy the .rc updater script to /usr/bin.
| "core_upload": "/tmp/core_upload/" | ||
| }, | ||
| "azure_sonic_core_storage": { | ||
| "account_name": "corefilecollection", |
There was a problem hiding this comment.
can you use template for both name and share_name?
There was a problem hiding this comment.
You can use /usr/bin/update_json.py to update any of the entries.
| sudo cp $IMAGE_CONFIGS/hostcfgd/*.j2 $FILESYSTEM_ROOT_USR_SHARE_SONIC_TEMPLATES/ | ||
|
|
||
| # copy core file uploader files | ||
| sudo cp $IMAGE_CONFIGS/corefile_uploader/core_uploader.service $FILESYSTEM_ROOT/etc/systemd/system/ |
There was a problem hiding this comment.
it would be better to put all these into sonic-utilities. In the build image repo, we just need to install the core_uploader service.
There was a problem hiding this comment.
I leave core_uploader.py & its rc file here, as they are part of .service.
Moved update_json.py to sonic-utilities.
|
|
||
|
|
||
| @staticmethod | ||
| def wait_for_file_write_complete(path): |
There was a problem hiding this comment.
a better way is to check the update time of the file and compare with current time, if it is more than 10 minutes, then it likely means the file is complete.
There was a problem hiding this comment.
ctime is not create but modified time.
So I just changed it to be more safe.
Theoretically, there can't be one robust safe way, as it all depends on how the write happens.
| tar.close() | ||
| log_debug("Tar file for upload created: " + tarf_name) | ||
|
|
||
| Handler.upload_file(tarf_name, tarf_name) |
There was a problem hiding this comment.
is the core file removed after upload?
There was a problem hiding this comment.
No.
Reasons:
-
I am not attaching core file to ICM now, so as not to increase ICM DB size,. Hence we need it here.
-
With core files remaining, upon service restart, it will re upload,
a) Re-upload is not a problem, as the analyzer would understand that it is duplicate and will not file any ticket
b) If the file is still in azure storage, it would only get re-updated and no duplicates.
In a normal system, the service restart should be pretty seldom (only upon reboot) -
When s process dumps cores repeatedly, only one ticket is created. Even if we attach core file., we would attach only one. But often, we might like to see multiple dumps during core file analysis to help see a pattern.
On a side note: FYI:
5) We have an upcoming core-mgmt service from Broadcom that would limit the count of core files for a process
| log_debug("Remote file created: name{} path{}".format(fname, fpath)) | ||
| break | ||
|
|
||
| except Exception as e: |
There was a problem hiding this comment.
what kind of failures do we see, if it is access failure due to credential, then retry would not help.
There was a problem hiding this comment.
Being a http based service, there can be many different failures. So we retry 3 times with a pause in between. If we still could not, get back to main loop and waiting for next core.
There was a problem hiding this comment.
can you try invalid user name and password, and see what error message we will have. at least if it is account issue, we should print out some message for alert. you want differentiate the network temp error v.s. systematic error, for example no storage space, account error, so that system managers can treat those message differently.
…atically (#22925) #### Why I did it src/sonic-utilities ``` * 6bd88046 - (HEAD -> 202505, origin/202505) Support reboot cause: Kernel Panic - Out of memory (#3921) (2 days ago) [mssonicbld] * 28ae9cef - [db_migrator] Add version_202411_02 (#3912) (5 days ago) [mssonicbld] * 4586e973 - [intfstat] Align output format between cached/non-cached scenarios (#3913) (5 days ago) [mssonicbld] * 11b6fc0c - [Mellanox] Collect sai.xml to sysdump (#3914) (5 days ago) [mssonicbld] ``` #### How I did it #### How to verify it #### Description for the changelog
Remove rw permission for .rc file for group & others.
Changes per review comments.
Re-ordered .rc file per JSON.dump order.
Added a script to enable partial update of .rc, which HWProxy would use to add acct key.
Azure storage upload requires python module futures, hence added it to install list.
Removed trailing spaces.
A mistake in name corrected.
Copy the .rc updater script to /usr/bin.
- What I did
- How I did it
- How to verify it
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)