Conversation

@mariosasko
Collaborator

Adds the cache dir attribute to DatasetInfo as suggested by @lhoestq.

Should fix #2322

Member

@lhoestq lhoestq left a comment

Thanks ! That's a really good start :)

Maybe we can have _cache_dir as an attribute of the Dataset object itself instead of storing it in the info ?
The info is meant to be shared (for example, we store it in the dataset_infos.json file of each dataset). Because of that, I don't think we should end up having user paths in it.

@mariosasko
Collaborator Author

Yes, having cache_dir as an attribute looks cleaner.

@mariosasko mariosasko requested a review from lhoestq May 11, 2021 14:10
Member

@lhoestq lhoestq left a comment

Thanks ! Looks good :)

I think the next step is to write tests... x)

Since in-memory datasets never had a way to fetch results from the cache in map, for example, I think we might have to tweak a few things. For example, I'm pretty sure this could happen (as the code stands right now on this PR):

  1. you load a dataset with keep_in_memory=True
  2. you apply map -> it stores the result in a cache file BUT you end up with a dataset that is not in-memory since the table is loaded via Dataset.from_file with the default in_memory parameter here:

return Dataset.from_file(cache_file_name, info=info, split=self.split)

Another one:

  1. you load a dataset with keep_in_memory=False
  2. you apply map -> it stores the result in a cache file
  3. you reload the dataset but this time with keep_in_memory=True
  4. you re-apply map with the same function -> it reloads from the cache BUT you end up with a dataset that is not in-memory since the table is loaded via Dataset.from_file with the default in_memory parameter here:

return Dataset.from_file(cache_file_name, info=info, split=self.split)
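Both scenarios boil down to the same problem: Dataset.from_file defaults to in_memory=False, so the in-memory status of the source dataset is lost whenever a map result is reloaded from the cache. A minimal sketch of the behavior and a possible tweak (all names here are illustrative, not the real datasets API):

```python
# Toy model of the two scenarios above (ToyDataset, map_current and
# map_tweaked are hypothetical names, not the real `datasets` API):
# a "dataset" is just a table plus an in_memory flag, and map()
# reloads its result from a cache file.

class ToyDataset:
    def __init__(self, table, in_memory):
        self.table = table
        self.in_memory = in_memory

    @classmethod
    def from_file(cls, cache_file, in_memory=False):
        # Mirrors Dataset.from_file: defaults to a memory-mapped table.
        return cls(table=cache_file, in_memory=in_memory)

    def map_current(self, cache_file):
        # Behavior described above: the default in_memory=False is used,
        # so an in-memory dataset silently becomes memory-mapped.
        return ToyDataset.from_file(cache_file)

    def map_tweaked(self, cache_file):
        # Possible tweak: forward the source dataset's in-memory status
        # when reloading from the cache.
        return ToyDataset.from_file(cache_file, in_memory=self.in_memory)


ds = ToyDataset(table="rows", in_memory=True)   # loaded with keep_in_memory=True
print(ds.map_current("cache.arrow").in_memory)  # False: the problem above
print(ds.map_tweaked("cache.arrow").in_memory)  # True: stays in memory
```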

Finally, I wonder if we should add a use_caching parameter to load_dataset to give users the possibility to set caching independently of keep_in_memory. If use_caching is False, then _cache_dir would be None, independently of the value of keep_in_memory.
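A minimal sketch of the proposed semantics, assuming a hypothetical resolve_cache_dir helper (not part of the datasets API): use_caching=False forces _cache_dir to None regardless of keep_in_memory.

```python
# Hypothetical helper illustrating the proposed use_caching parameter:
# when use_caching is False, _cache_dir ends up None no matter what
# keep_in_memory is set to (keep_in_memory is deliberately unused in
# the disabled branch to show that independence).
def resolve_cache_dir(cache_dir, use_caching=True, keep_in_memory=False):
    if not use_caching:
        return None  # caching fully disabled, independent of keep_in_memory
    return cache_dir

print(resolve_cache_dir("/cache/my_dataset", use_caching=False, keep_in_memory=True))  # None
print(resolve_cache_dir("/cache/my_dataset", use_caching=True, keep_in_memory=True))   # /cache/my_dataset
```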

Let me know what you think !

I'm glad we finally get a chance to improve the user experience regarding caching. I feel like it wasn't ideal to have caching linked to keep_in_memory (in-memory datasets had no caching...).

@mymusise

Good job! Looking forward to this new feature! 🥂

@mariosasko
Collaborator Author

@lhoestq Sorry for the late reply. Yes, I'll start working on tests. Thanks for the detailed explanation of the current issues with caching (and I like the idea of adding the use_caching parameter to load_dataset).

Member

@lhoestq lhoestq left a comment

Let's try to merge this PR this week :)

It just misses some tests to make sure that the _cache_dir attribute is used as expected.
I can take care of the tests tomorrow once I'm done with the dataset streaming documentation.

Also let's revert the renaming of the in_memory parameter

  split: Optional[NamedSplit] = None,
  indices_filename: Optional[str] = None,
- in_memory: bool = False,
+ keep_in_memory: bool = False,
Member

This is a breaking change. I agree that keep_in_memory is nicer for consistency, but we try to avoid breaking changes as much as possible.

@mariosasko
Collaborator Author

@lhoestq Sure. I'm aware this is a high-priority issue to some extent, so feel free to take over.

A few suggestions I have:

  • there is a slight difference between setting use_caching to False in load_dataset and disabling caching globally with set_caching_enabled(False), because the former will never execute the following code (self._cache_dir is always None in that case):
    if self._cache_dir is not None:
        if cache_file_name is None:
            # we create a unique hash from the function,
            # current dataset file and the mapping args
            cache_file_name = self._get_cache_file_path(new_fingerprint)
        if os.path.exists(cache_file_name) and load_from_cache_file:
            logger.warning("Loading cached processed dataset at %s", cache_file_name)
            info = self.info.copy()
            info.features = features
            dataset = Dataset.from_file(
                cache_file_name, info=info, split=self.split, in_memory=not self.cache_files
            )
            dataset._cache_dir = (
                os.path.dirname(cache_file_name)
                if dataset._cache_dir is None and self._cache_dir is not None
                else None
            )
            return dataset

    so I'm just checking whether or not this is intended (if yes, maybe the docs should mention this)?
  • I think we should add the use_caching parameter to every method that has keep_in_memory (and in_memory 😃) in its signature for better consistency, but I say let's address this in a separate PR. IMO we need one more PR that deals exclusively with consistency in the caching logic.
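The difference in the first point can be modeled with a toy guard (illustrative only, not the real map() implementation): set_caching_enabled(False) leaves _cache_dir set, so the branch quoted above is still entered, while the proposed use_caching=False sets _cache_dir to None and skips it entirely.

```python
# Toy model of the guard `if self._cache_dir is not None:` from the
# snippet above (cache_branch_entered is a hypothetical name).
def cache_branch_entered(cache_dir):
    return cache_dir is not None

# Proposed use_caching=False -> _cache_dir is None -> branch skipped:
print(cache_branch_entered(None))              # False
# Global set_caching_enabled(False) keeps _cache_dir set -> the branch
# is still entered; only fingerprint/cache-file handling differs:
print(cache_branch_entered("/cache/my_dataset"))  # True
```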

Comment on lines +2010 to +2018
dataset = Dataset.from_file(
    cache_file_name, info=info, split=self.split, in_memory=not self.cache_files
)
dataset._cache_dir = (
    os.path.dirname(cache_file_name)
    if dataset._cache_dir is None and self._cache_dir is not None
    else None
)
return dataset
Collaborator Author

@mariosasko mariosasko Jun 8, 2021

@lhoestq I tried to address the first point from your earlier comment, but this change breaks the tests, and it's getting late here, so I don't have time to fix this now. I think this is due to BaseDatasetTest._to in test_arrow_dataset.py relying on Dataset.map.

@lhoestq
Member

lhoestq commented Jun 8, 2021

Hi @mariosasko
We discussed internally and we think that this feature might not be the direction we're going to take, for these reasons:

  • it goes against our simple definition of caching: on-disk == uses the file cache, and in-memory == nothing is written to disk. I think it adds too much complexity for only a minimal gain in flexibility
  • there are a few edge cases which are really confusing:
    • map on an in-memory dataset with a cache_file_name specified by the user -> should the result be in memory or read from disk ?
    • it would require a special cache directory just for in memory datasets, since they don’t have a preferred directory for caching
  • it would break a lot of stuff and would require to rewrite significant parts of the core code and the tests

So in the end we're probably going to close this PR.
Let me know what you think, and thank you anyway for your help on this !

@mariosasko
Collaborator Author

Hi,

I'm fine with that. I agree this adds too much complexity. Btw, I like the idea of reverting the default in-memory behavior for small datasets that led to this PR.

@albertvillanova
Member

Superseded by #2460 (to close issue #2458).

@mariosasko mariosasko deleted the add-cache-dir branch June 8, 2021 19:46

Development

Successfully merging this pull request may close these issues.

Calls to map are not cached.
