Add cache dir for in-memory datasets #2329
Conversation
lhoestq left a comment
Thanks! That's a really good start :)
Maybe we can have `_cache_dir` as an attribute of the Dataset object itself instead of storing it in the info?
The info is meant to be shared (for example, we store it in the dataset_infos.json file of each dataset). Because of that, I don't think we should end up having user paths in it.
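A minimal sketch of what this suggestion could look like (the constructor shape is hypothetical, not the PR's actual code):

```python
from typing import Optional

class Dataset:
    def __init__(self, arrow_table, info=None, split=None, cache_dir: Optional[str] = None):
        self.info = info          # shareable metadata; no user-specific paths in here
        self._data = arrow_table
        self.split = split
        # Per-instance cache location, kept out of DatasetInfo so that
        # files like dataset_infos.json never contain user paths.
        self._cache_dir = cache_dir
```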
Yes, having `_cache_dir` as an attribute of the Dataset object sounds good.
lhoestq left a comment
Thanks! Looks good :)
I think the next step is to write tests... x)
Since in-memory datasets never had a way to fetch results from the cache in `map`, for example, I think we might have to tweak a few things. For example, I'm pretty sure this could happen (as the code is right now on this PR):
- you load a dataset with `keep_in_memory=True`
- you apply `map` -> it stores the result in a cache file BUT you end up with a dataset that is not in-memory, since the table is loaded via `Dataset.from_file` with the default `in_memory` parameter here:

`datasets/src/datasets/arrow_dataset.py`, line 1925 in 6c5742c:

```python
return Dataset.from_file(cache_file_name, info=info, split=self.split)
```
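A sketch of how this first scenario would play out from the user side, assuming the PR's behavior described above ("imdb" is just a stand-in dataset):

```python
from datasets import load_dataset

# Loaded fully in memory: no cache files back this dataset.
ds = load_dataset("imdb", split="train", keep_in_memory=True)

# map() writes its result to a cache file and reloads it via
# Dataset.from_file(...) with the default in_memory=False, so the
# result is memory-mapped from disk rather than kept in memory.
mapped = ds.map(lambda example: example)
print(mapped.cache_files)  # non-empty: the in-memory property was lost
```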
Another one:
- you load a dataset with `keep_in_memory=False`
- you apply `map` -> it stores the result in a cache file
- you reload the dataset, but this time with `keep_in_memory=True`
- you re-apply `map` with the same function -> it reloads from the cache BUT you end up with a dataset that is not in-memory, since the table is loaded via `Dataset.from_file` with the default `in_memory` parameter here:

`datasets/src/datasets/arrow_dataset.py`, line 1739 in 6c5742c:

```python
return Dataset.from_file(cache_file_name, info=info, split=self.split)
```
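Both call sites could be tweaked the same way; a sketch of one possible fix (it mirrors the change tried later in this thread, and is not necessarily the final code):

```python
# Propagate the in-memory state instead of relying on the default:
# a dataset with no cache files lives in memory, so the dataset
# reloaded from the map cache should stay in memory too.
return Dataset.from_file(
    cache_file_name,
    info=info,
    split=self.split,
    in_memory=not self.cache_files,
)
```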
Finally, I wonder if we should add a `use_caching` parameter to `load_dataset` to give users the possibility to set caching independently of `keep_in_memory`. If `use_caching` is False, then `_cache_dir` would be None, independently of the value of `keep_in_memory`.
Let me know what you think!
I'm glad we finally get a chance to make the user experience better regarding caching. I feel like it wasn't ideal to have caching linked to `keep_in_memory` (in-memory datasets had no caching...)
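For illustration, a sketch of how the suggested `use_caching` flag might look from the user side (the parameter is hypothetical at this point in the thread):

```python
from datasets import load_dataset

# keep_in_memory controls where the table lives; use_caching would
# control whether map results are written to / read from disk.
ds = load_dataset("imdb", split="train", keep_in_memory=True, use_caching=False)

# With use_caching=False, ds._cache_dir would be None, so this call
# would neither write a cache file nor reload a previous result.
mapped = ds.map(lambda example: example)
```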
Good job! Looking forward to this new feature! 🥂
@lhoestq Sorry for the late reply. Yes, I'll start working on tests. Thanks for the detailed explanation of the current issues with caching (I like the idea of adding the `use_caching` parameter).
lhoestq left a comment
Let's try to merge this PR this week :)
It just misses some tests to make sure that the `_cache_dir` attribute is used as expected.
I can take care of the tests tomorrow once I'm done with the dataset streaming documentation.
Also, let's revert the renaming of the `in_memory` parameter.
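A minimal sketch of the kind of test this might need (pytest style; the assertions assume the PR's intended semantics):

```python
from datasets import Dataset

def test_cache_dir_is_used_for_in_memory_map(tmp_path):
    ds = Dataset.from_dict({"a": [1, 2, 3]})  # in-memory, no cache files
    ds._cache_dir = str(tmp_path)  # attribute introduced by this PR

    mapped = ds.map(lambda example: {"b": example["a"] + 1})

    # Under the PR's intended behavior, the map result is written to a
    # cache file inside _cache_dir so it can be reused later...
    assert list(tmp_path.iterdir())
    # ...and the cache dir propagates to the resulting dataset.
    assert mapped._cache_dir == str(tmp_path)
```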
src/datasets/arrow_dataset.py (outdated)

```diff
     split: Optional[NamedSplit] = None,
     indices_filename: Optional[str] = None,
-    in_memory: bool = False,
+    keep_in_memory: bool = False,
```
This is a breaking change. I agree that `keep_in_memory` is nice for consistency, but we try to avoid breaking changes as much as possible.
@lhoestq Sure. I'm aware this is a high-priority issue to some extent, so feel free to take over. A few suggestions I have:
```python
dataset = Dataset.from_file(
    cache_file_name, info=info, split=self.split, in_memory=not self.cache_files
)
dataset._cache_dir = (
    os.path.dirname(cache_file_name)
    if dataset._cache_dir is None and self._cache_dir is not None
    else None
)
return dataset
```
@lhoestq I tried to address the first point from your earlier comment, but this change breaks the tests and it's getting late here, so I don't have time to fix this now. I think this is due to `BaseDatasetTest._to` in `test_arrow_dataset.py` relying on `Dataset.map`.
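One thing that might explain the failures (an observation on the snippet above, not from the thread): the `else` branch resets `_cache_dir` to `None` even when the new dataset already has one. A sketch of what may have been intended:

```python
dataset._cache_dir = (
    os.path.dirname(cache_file_name)
    if dataset._cache_dir is None and self._cache_dir is not None
    else dataset._cache_dir  # keep an existing value instead of discarding it
)
```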
Hi @mariosasko
So in the end we're probably going to close this PR.
Hi, I'm fine with that. I agree this adds too much complexity. Btw, I like the idea of reverting the default in-memory behavior for small datasets that led to this PR.
Adds the cache dir attribute to `DatasetInfo`, as suggested by @lhoestq.
Should fix #2322
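For reference, a minimal sketch of this original approach (the field name is assumed; the review above moves the attribute to the Dataset object instead):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetInfo:
    # ... existing metadata fields ...
    cache_dir: Optional[str] = None  # where map results for in-memory datasets go
```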