
Conversation

@albertvillanova (Member) commented Jul 12, 2021

Close #2481, close #2604, close #2591.

cc: @stas00, @thomwolf, @BirgerMoell

@stas00 (Contributor) left a comment

Thank you for implementing this time- and space-saving strategy, @albertvillanova!

My only request is that there be a way for a user to disable the auto-delete if they need to. For example, if something goes wrong and they need to analyze the extracted files, they should be able to pass a flag that says not to delete them.
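
To make the request concrete, here is a minimal sketch of the kind of opt-out being asked for, assuming a hypothetical `keep_extracted` flag on `load_dataset` (the name is borrowed from the later discussion in this thread; the real API surface may differ):

```python
# Hypothetical sketch of the requested opt-out; `keep_extracted` is an
# assumed parameter name, not a confirmed part of the load_dataset API.
from datasets import load_dataset

# Assumed default under this PR: extracted archives are removed once
# the Arrow cache has been built.
ds = load_dataset("ag_news")

# Assumed opt-out: keep the extracted files around for debugging.
ds = load_dataset("ag_news", keep_extracted=True)
```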

@albertvillanova (Member, Author)

Sure, @stas00, it is still a draft pull request. :)

@stas00 (Contributor) commented Jul 12, 2021

Yes, I noticed it after reviewing - my apologies.

@albertvillanova (Member, Author)

The problem with this approach is that it also deletes the downloaded files (when they don't need to be extracted). 😟

albertvillanova marked this pull request as ready for review July 13, 2021 14:39
@stas00 (Contributor) commented Jul 13, 2021

> The problem with this approach is that it also deletes the downloaded files (when they don't need to be extracted). 😟

Right! These probably should not be deleted by default, but how about having an option for users who are tight on disk space?

@albertvillanova (Member, Author)

> Right! These probably should not be deleted by default, but how about having an option for users who are tight on disk space?

I propose leaving that for another PR and having this one handle only the "extracted" files. Is that OK with you? :)

@lhoestq (Member) commented Jul 15, 2021

Awesome, thanks!
I just have one question: what about image/audio datasets, for which we store the path to the extracted file in the Arrow data?
In this case the default should be to keep the extracted files.

So for now I would just make `keep_extracted=True` by default, until we have a way to separate extracted files that can be deleted from extracted files that are actual resources of the dataset.

@albertvillanova (Member, Author)

@lhoestq, the current implementation only deletes extracted "files", not extracted "directories", as it uses `os.remove(path)`. I'm going to add a filter on files so that this line does not throw an exception when passed a directory (see the sketch below).

For audio datasets, the audio files are inside the extracted "directory", so they are not deleted.
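
A minimal sketch of the filter being described; the helper name is illustrative, not the library's actual internal function:

```python
import os

def _maybe_delete_extracted(path: str) -> None:
    # os.remove() raises an OSError (IsADirectoryError) when handed a
    # directory, and extracted directories may still back audio/image
    # examples, so only regular files are deleted here.
    if os.path.isfile(path):
        os.remove(path)
```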

@lhoestq (Member) commented Jul 16, 2021

I'm still more in favor of having `keep_extracted=True` by default:

  • When working with a dataset, you call `load_dataset` many times. By default we want to keep extracted objects so they aren't extracted over and over again (which can take a long time). Then, once you know what you're doing and want to optimize disk space, you can set `keep_extracted=False`. Deleting the extracted files by default is a regression that can lead to slowdowns for people calling `load_dataset` many times, which is common when experimenting.
  • This doesn't sound natural as a default behavior. In the rest of the library, things are cached and not removed unless you explicitly say so (`map` caching, for example). Moreover, the function in the download manager is called `download_and_extract`, not `download_and_extract_and_remove_extracted_files`.

Let me know what you think!
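
For reference, a sketch of what the keep-by-default behavior looks like from the caller's side. The `DownloadConfig(delete_extracted=...)` spelling matches the direction the library later took, but treat the exact import path and signature here as assumptions:

```python
from datasets import load_dataset
from datasets import DownloadConfig  # import location may vary by version

# Default: extracted files are kept, so repeated load_dataset calls
# don't pay the extraction cost again.
ds = load_dataset("ag_news")

# Explicit opt-in for users who are tight on disk space.
ds = load_dataset(
    "ag_news",
    download_config=DownloadConfig(delete_extracted=True),
)
```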

@stas00 (Contributor) commented Jul 16, 2021

I think the main issue is that after doing some work users typically move on to other datasets, so the amount of disk space used keeps on growing. Your logic is very sound, and perhaps what's really needed is a cleansweep function that can go through all datasets and clean them up to the desired degree:

  • delete all extracted files
  • delete all sources
  • delete all caches
  • delete all caches that haven't been accessed in 6 months
  • completely delete old datasets that haven't been accessed in 6 months
  • more?

So a user can launch a little application, choose what they want to clean up, and voilà, they have just freed up a huge amount of disk space. Makes me think of Ubuntu Tweak's Janitor app - very useful.

At the moment, this cleanup process is very daunting and error-prone, especially due to all those dirs/files with hash names.
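
Purely as a hypothetical sketch of the cleansweep idea (no such helper ships with the library at the time of this discussion), scanning an assumed cache location and removing entries not accessed within a cutoff:

```python
# Hypothetical "cleansweep": remove cache entries that haven't been
# accessed in roughly six months. The cache path and the reliance on
# filesystem access times are assumptions.
import shutil
import time
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "huggingface" / "datasets"
SIX_MONTHS = 6 * 30 * 24 * 3600  # seconds, approximately

def sweep_stale_entries(root: Path = CACHE_DIR, cutoff: int = SIX_MONTHS) -> None:
    now = time.time()
    for entry in root.iterdir():
        # Use the newest access time under each entry so anything still
        # in use is never treated as stale.
        atimes = [p.stat().st_atime for p in entry.rglob("*") if p.is_file()]
        newest = max(atimes, default=entry.stat().st_atime)
        if now - newest > cutoff:
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            print(f"removed stale cache entry: {entry}")
```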

@mariosasko (Collaborator)

@stas00 I've had the same idea. Instead of a full-fledged app, a simpler approach would be to add a new command to the CLI.

@stas00 (Contributor) commented Jul 16, 2021

Oh, a CLI would be perfect. I didn't mean to request a GUI one specifically, I was just using it as an example.

One could even set up a crontab to delete old datasets that haven't been accessed in X months.

@albertvillanova (Member, Author)

@lhoestq I totally agree with you. I'm addressing that change.

@stas00, @mariosasko, that could eventually be addressed in another pull request. The objective of this PR is to:

  • add an option to pass to `load_dataset` so that extracted files are deleted
  • perform this deletion file by file, once each file has already been used to generate the cache Arrow file (see the sketch below)
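
A minimal sketch of that per-file flow; `generate_examples` and `writer` are placeholders, not the library's actual internals:

```python
import os

def build_arrow_cache(extracted_paths, generate_examples, writer,
                      delete_extracted=False):
    for path in extracted_paths:
        for example in generate_examples(path):
            writer.write(example)
        # Remove each extracted file as soon as it has been consumed,
        # freeing disk space incrementally while the cache is built.
        if delete_extracted and os.path.isfile(path):
            os.remove(path)
```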

@lhoestq (Member) commented Jul 19, 2021

I also like the idea of having a CLI tool to help users clean their cache and save disk space. Good idea!

@lhoestq (Member) left a comment

Thanks! LGTM :)
