Delete extracted files when loading dataset #2631
Conversation
stas00
left a comment
Thank you for implementing this time- and space-saving strategy, @albertvillanova
My only request is that there be a way for a user to disable the auto-delete if they need to. For example, if something goes wrong and they need to analyze the extracted files, they should be able to pass some flag that says not to delete them.
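The requested opt-in behaviour could look something like the following sketch. The function and flag names (`load_from_archive`, `delete_extracted`) are hypothetical illustrations, not the PR's actual API:

```python
import os
import shutil
import tempfile
import zipfile

def load_from_archive(archive_path, delete_extracted=True):
    """Extract a zip archive, read its contents, and optionally remove
    the extracted copies afterwards to save disk space.

    Hypothetical sketch of the requested behaviour:
    pass delete_extracted=False to keep the files around for debugging.
    """
    extract_dir = tempfile.mkdtemp()
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(extract_dir)
    # ... a real loader would build the dataset from the extracted files here ...
    names = sorted(os.listdir(extract_dir))
    if delete_extracted:
        shutil.rmtree(extract_dir)  # free the space once loading is done
    return names, extract_dir
```

With `delete_extracted=False` the extraction directory survives the call, so a user can inspect the files when something goes wrong.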
Sure @stas00, it is still a draft pull request. :)
Yes, I noticed it after reviewing - my apologies.
The problem with this approach is that it also deletes the downloaded files (if they do not need to be extracted). 😟
Right! Those probably should not be deleted by default, but what about an option for users who are tight on disk space?
I propose leaving that for another PR and having this one handle only the "extracted" files. Is that OK with you? :)
Awesome, thanks! So for now I would just make
@lhoestq, the current implementation only deletes extracted "files", not extracted "directories", because of the removal call it uses. For audio datasets, the audio files are inside the extracted "directory", so they are not deleted.
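The file-vs-directory distinction matters because a plain file-removal call such as `os.remove()` raises on directories. A small helper along these lines handles both cases; `delete_extracted_path` is a hypothetical name for illustration, not the PR's actual code:

```python
import os
import shutil

def delete_extracted_path(path):
    """Remove an extracted artifact whether it is a file or a directory.

    os.remove() only works on plain files, which is why the contents of
    extracted directories (e.g. audio files) survive a file-only sweep.
    """
    if os.path.isdir(path):
        shutil.rmtree(path)  # recursively delete the extracted directory
    elif os.path.isfile(path):
        os.remove(path)      # delete a single extracted file
```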
I'm still more in favor of having
Let me know what you think!
I think the main issue is that after doing some work users typically move on to other datasets, and the amount of disk space used keeps growing. So your logic is very sound, and perhaps what's really needed is a cleansweep function that can go through all datasets and clean them up to the desired degree:
So a user can launch a little application, choose what they want to clean up, and voilà, they have just freed up a huge amount of disk space. Makes me think of Ubuntu Tweak's Janitor app - very useful. At the moment, this cleanup process is very daunting and error-prone, especially due to all those dirs/files with hashed names.
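A cleansweep along those lines could start from a scanner that reports, per cache entry, its total size and how long since it was last accessed. Everything below (the name `cache_entries`, the layout assumptions) is a hypothetical sketch, not an existing datasets API:

```python
import os
import time

def cache_entries(cache_dir, older_than_days=0):
    """Yield (path, size_in_bytes, days_since_last_access) for each
    top-level cache entry not accessed within the cutoff window."""
    now = time.time()
    for name in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, name)
        size = os.stat(path).st_size if os.path.isfile(path) else 0
        last_access = os.stat(path).st_atime
        # walk directories to total up file sizes and find the newest access
        for root, _dirs, files in os.walk(path):
            for fname in files:
                st = os.stat(os.path.join(root, fname))
                size += st.st_size
                last_access = max(last_access, st.st_atime)
        age_days = (now - last_access) / 86400
        if age_days >= older_than_days:
            yield path, size, age_days
```

A cleanup tool would then present these entries sorted by size or age and delete the ones the user selects.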
@stas00 I've had the same idea. Instead of a full-fledged app, a simpler approach would be to add a new command to the CLI.
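Such a CLI command might only need a few options. Here is a minimal argparse sketch; the subcommand name `clean-cache` and its flags are made up for illustration and are not part of `datasets-cli`:

```python
import argparse

def build_clean_cache_parser():
    """Build the argument parser for a hypothetical `clean-cache` command."""
    parser = argparse.ArgumentParser(prog="datasets-cli clean-cache")
    parser.add_argument("--cache-dir", default="~/.cache/huggingface/datasets",
                        help="cache directory to sweep")
    parser.add_argument("--older-than", type=int, default=0, metavar="DAYS",
                        help="only touch entries not accessed for DAYS days")
    parser.add_argument("--dry-run", action="store_true",
                        help="list what would be removed without deleting")
    return parser
```

A `--dry-run` flag addresses the earlier concern about users needing to inspect things before anything is deleted.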
Oh, a CLI would be perfect. I didn't mean to request a GUI one specifically, I was just using it as an example. One could even set up a crontab to delete old datasets that haven't been accessed in X months.
@lhoestq I totally agree with you. I'm addressing that change. @stas00, @mariosasko, that could eventually be addressed in another pull request. The objective of this PR is:
I also like the idea of having a CLI tool to help users clean their cache and save disk space. Good idea!
lhoestq
left a comment
Thanks! LGTM :)
Close #2481, close #2604, close #2591.
cc: @stas00, @thomwolf, @BirgerMoell