
Conversation

@albertvillanova (Member) commented Jul 12, 2021

Close #2481, close #2604, close #2591.

cc: @stas00, @thomwolf, @BirgerMoell

@stas00 (Contributor) left a comment

Thank you for implementing this time- and space-saving strategy, @albertvillanova!

My only request is that there be a way for a user to disable the auto-delete if they need to. For example, if something goes wrong and they need to analyze the extracted files, they should be able to pass a flag that says not to delete them.
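
To make the request concrete, here is a minimal sketch of the kind of opt-out being asked for, assuming a hypothetical `keep_extracted` flag on `load_dataset` (the name is borrowed from the later discussion in this thread; the real API surface may differ):

```python
# Hypothetical sketch of the requested opt-out; `keep_extracted` is an
# assumed parameter name, not a confirmed part of the load_dataset API.
from datasets import load_dataset

# Assumed default under this PR: extracted archives are removed once
# the Arrow cache has been built.
ds = load_dataset("ag_news")

# Assumed opt-out: keep the extracted files around for debugging.
ds = load_dataset("ag_news", keep_extracted=True)
```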

@albertvillanova (Member, Author)

Sure, @stas00, it is still a draft pull request. :)

@stas00 (Contributor) commented Jul 12, 2021

Yes, I noticed it after reviewing - my apologies.

@albertvillanova (Member, Author)

The problem with this approach is that it also deletes the downloaded files (when they don't need to be extracted). 😟

albertvillanova marked this pull request as ready for review July 13, 2021 14:39
@stas00 (Contributor) commented Jul 13, 2021

> The problem with this approach is that it also deletes the downloaded files (when they don't need to be extracted). 😟

Right! These probably should not be deleted by default, but how about having an option for users who are tight on disk space?

@albertvillanova (Member, Author)

> Right! These probably should not be deleted by default, but how about having an option for users who are tight on disk space?

I propose leaving that for another PR and having this one handle only the "extracted" files. Is that OK with you? :)

@lhoestq (Member) commented Jul 15, 2021

Awesome, thanks!
I just have one question: what about image/audio datasets, for which we store the path to the extracted file in the Arrow data?
In this case the default should be to keep the extracted files.

So for now I would just make `keep_extracted=True` by default, until we have a way to separate extracted files that can be deleted from extracted files that are actual resources of the dataset.

@albertvillanova (Member, Author)

@lhoestq, the current implementation only deletes extracted "files", not extracted "directories", as it uses `os.remove(path)`. I'm going to add a filter on files so that this line does not throw an exception when passed a directory (see the sketch below).

For audio datasets, the audio files are inside the extracted "directory", so they are not deleted.
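
A minimal sketch of the filter being described; the helper name is illustrative, not the library's actual internal function:

```python
import os

def _maybe_delete_extracted(path: str) -> None:
    # os.remove() raises an OSError (IsADirectoryError) when handed a
    # directory, and extracted directories may still back audio/image
    # examples, so only regular files are deleted here.
    if os.path.isfile(path):
        os.remove(path)
```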

@lhoestq (Member) commented Jul 16, 2021

I'm still more in favor of having `keep_extracted=True` by default:

  • When working with a dataset, you call `load_dataset` many times. By default we want to keep extracted objects so they aren't extracted over and over again (which can take a long time). Then, once you know what you're doing and want to optimize disk space, you can set `keep_extracted=False`. Deleting the extracted files by default is a regression that can lead to slowdowns for people calling `load_dataset` many times, which is common when experimenting.
  • This doesn't sound natural as a default behavior. In the rest of the library, things are cached and not removed unless you explicitly say so (`map` caching, for example). Moreover, the function in the download manager is called `download_and_extract`, not `download_and_extract_and_remove_extracted_files`.

Let me know what you think!
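
For reference, a sketch of what the keep-by-default behavior looks like from the caller's side. The `DownloadConfig(delete_extracted=...)` spelling matches the direction the library later took, but treat the exact import path and signature here as assumptions:

```python
from datasets import load_dataset
from datasets import DownloadConfig  # import location may vary by version

# Default: extracted files are kept, so repeated load_dataset calls
# don't pay the extraction cost again.
ds = load_dataset("ag_news")

# Explicit opt-in for users who are tight on disk space.
ds = load_dataset(
    "ag_news",
    download_config=DownloadConfig(delete_extracted=True),
)
```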

@stas00 (Contributor) commented Jul 16, 2021

I think the main issue is that after doing some work users typically move on to other datasets, so the amount of disk space used keeps on growing. Your logic is very sound, and perhaps what's really needed is a cleansweep function that can go through all datasets and clean them up to the desired degree:

  • delete all extracted files
  • delete all sources
  • delete all caches
  • delete all caches that haven't been accessed in 6 months
  • completely delete old datasets that haven't been accessed in 6 months
  • more?

So a user can launch a little application, choose what they want to clean up, and voilà, they have just freed up a huge amount of disk space. Makes me think of Ubuntu Tweak's Janitor app - very useful.

At the moment, this cleanup process is very daunting and error-prone, especially due to all those dirs/files with hash names.
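
Purely as a hypothetical sketch of the cleansweep idea (no such helper ships with the library at the time of this discussion), scanning an assumed cache location and removing entries not accessed within a cutoff:

```python
# Hypothetical "cleansweep": remove cache entries that haven't been
# accessed in roughly six months. The cache path and the reliance on
# filesystem access times are assumptions.
import shutil
import time
from pathlib import Path

CACHE_DIR = Path.home() / ".cache" / "huggingface" / "datasets"
SIX_MONTHS = 6 * 30 * 24 * 3600  # seconds, approximately

def sweep_stale_entries(root: Path = CACHE_DIR, cutoff: int = SIX_MONTHS) -> None:
    now = time.time()
    for entry in root.iterdir():
        # Use the newest access time under each entry so anything still
        # in use is never treated as stale.
        atimes = [p.stat().st_atime for p in entry.rglob("*") if p.is_file()]
        newest = max(atimes, default=entry.stat().st_atime)
        if now - newest > cutoff:
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            print(f"removed stale cache entry: {entry}")
```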

@mariosasko (Collaborator)

@stas00 I've had the same idea. Instead of a full-fledged app, a simpler approach would be to add a new command to the CLI.

@stas00 (Contributor) commented Jul 16, 2021

Oh, a CLI would be perfect. I didn't mean to request a GUI one specifically, I was just using it as an example.

One could even set up a crontab to delete old datasets that haven't been accessed in X months.

@albertvillanova (Member, Author)

@lhoestq I totally agree with you. I'm addressing that change.

@stas00, @mariosasko, that could eventually be addressed in another pull request. The objective of this PR is to:

  • add an option to pass to `load_dataset` so that extracted files are deleted
  • perform this deletion file by file, once each file has already been used to generate the cache Arrow file (see the sketch below)
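
A minimal sketch of that per-file flow; `generate_examples` and `writer` are placeholders, not the library's actual internals:

```python
import os

def build_arrow_cache(extracted_paths, generate_examples, writer,
                      delete_extracted=False):
    for path in extracted_paths:
        for example in generate_examples(path):
            writer.write(example)
        # Remove each extracted file as soon as it has been consumed,
        # freeing disk space incrementally while the cache is built.
        if delete_extracted and os.path.isfile(path):
            os.remove(path)
```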

@lhoestq (Member) commented Jul 19, 2021

I also like the idea of having a CLI tool to help users clean their cache and save disk space. Good idea!

@lhoestq (Member) left a comment

Thanks! LGTM :)
