Add cache dir for in-memory datasets #2329
Conversation
lhoestq left a comment
Thanks! That's a really good start :)
Maybe we can have `_cache_dir` as an attribute of the Dataset object itself instead of storing it in the info?
The info is meant to be shared (for example, we store it in the dataset_infos.json file of each dataset). Because of that, I don't think we should end up having user paths in it.
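A minimal sketch of what this suggestion could look like (the constructor shape is hypothetical, not the PR's actual code):

```python
from typing import Optional

class Dataset:
    def __init__(self, arrow_table, info=None, split=None, cache_dir: Optional[str] = None):
        self.info = info          # shareable metadata; no user-specific paths in here
        self._data = arrow_table
        self.split = split
        # Per-instance cache location, kept out of DatasetInfo so that
        # files like dataset_infos.json never contain user paths.
        self._cache_dir = cache_dir
```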
Yes, having `_cache_dir` as an attribute of the Dataset object sounds good.
lhoestq left a comment
Thanks! Looks good :)
I think the next step is to write tests... x)
Since in-memory datasets never had a way to fetch results from the cache in `map`, for example, I think we might have to tweak a few things. For example, I'm pretty sure this could happen (as the code is right now on this PR):
- you load a dataset with `keep_in_memory=True`
- you apply `map` -> it stores the result in a cache file BUT you end up with a dataset that is not in-memory, since the table is loaded via `Dataset.from_file` with the default `in_memory` parameter here:

`datasets/src/datasets/arrow_dataset.py`, line 1925 in 6c5742c:

```python
return Dataset.from_file(cache_file_name, info=info, split=self.split)
```
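A sketch of how this first scenario would play out from the user side, assuming the PR's behavior described above ("imdb" is just a stand-in dataset):

```python
from datasets import load_dataset

# Loaded fully in memory: no cache files back this dataset.
ds = load_dataset("imdb", split="train", keep_in_memory=True)

# map() writes its result to a cache file and reloads it via
# Dataset.from_file(...) with the default in_memory=False, so the
# result is memory-mapped from disk rather than kept in memory.
mapped = ds.map(lambda example: example)
print(mapped.cache_files)  # non-empty: the in-memory property was lost
```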
Another one:
- you load a dataset with `keep_in_memory=False`
- you apply `map` -> it stores the result in a cache file
- you reload the dataset, but this time with `keep_in_memory=True`
- you re-apply `map` with the same function -> it reloads from the cache BUT you end up with a dataset that is not in-memory, since the table is loaded via `Dataset.from_file` with the default `in_memory` parameter here:

`datasets/src/datasets/arrow_dataset.py`, line 1739 in 6c5742c:

```python
return Dataset.from_file(cache_file_name, info=info, split=self.split)
```
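Both call sites could be tweaked the same way; a sketch of one possible fix (it mirrors the change tried later in this thread, and is not necessarily the final code):

```python
# Propagate the in-memory state instead of relying on the default:
# a dataset with no cache files lives in memory, so the dataset
# reloaded from the map cache should stay in memory too.
return Dataset.from_file(
    cache_file_name,
    info=info,
    split=self.split,
    in_memory=not self.cache_files,
)
```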
Finally, I wonder if we should add a `use_caching` parameter to `load_dataset` to give users the possibility to set caching independently of `keep_in_memory`. If `use_caching` is False, then `_cache_dir` would be None, independently of the value of `keep_in_memory`.
Let me know what you think!
I'm glad we finally get a chance to make the user experience better regarding caching. I feel like it wasn't ideal to have caching linked to `keep_in_memory` (in-memory datasets had no caching...)
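For illustration, a sketch of how the suggested `use_caching` flag might look from the user side (the parameter is hypothetical at this point in the thread):

```python
from datasets import load_dataset

# keep_in_memory controls where the table lives; use_caching would
# control whether map results are written to / read from disk.
ds = load_dataset("imdb", split="train", keep_in_memory=True, use_caching=False)

# With use_caching=False, ds._cache_dir would be None, so this call
# would neither write a cache file nor reload a previous result.
mapped = ds.map(lambda example: example)
```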
Good job! Looking forward to this new feature! 🥂
@lhoestq Sorry for the late reply. Yes, I'll start working on tests. Thanks for the detailed explanation of the current issues with caching (I like the idea of adding the `use_caching` parameter).
lhoestq left a comment
Let's try to merge this PR this week :)
It just misses some tests to make sure that the `_cache_dir` attribute is used as expected.
I can take care of the tests tomorrow once I'm done with the dataset streaming documentation.
Also, let's revert the renaming of the `in_memory` parameter.
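A minimal sketch of the kind of test this might need (pytest style; the assertions assume the PR's intended semantics):

```python
from datasets import Dataset

def test_cache_dir_is_used_for_in_memory_map(tmp_path):
    ds = Dataset.from_dict({"a": [1, 2, 3]})  # in-memory, no cache files
    ds._cache_dir = str(tmp_path)  # attribute introduced by this PR

    mapped = ds.map(lambda example: {"b": example["a"] + 1})

    # Under the PR's intended behavior, the map result is written to a
    # cache file inside _cache_dir so it can be reused later...
    assert list(tmp_path.iterdir())
    # ...and the cache dir propagates to the resulting dataset.
    assert mapped._cache_dir == str(tmp_path)
```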
src/datasets/arrow_dataset.py (outdated)

```diff
     split: Optional[NamedSplit] = None,
     indices_filename: Optional[str] = None,
-    in_memory: bool = False,
+    keep_in_memory: bool = False,
```
This is a breaking change. I agree that `keep_in_memory` is nice for consistency, but we try to avoid breaking changes as much as possible.
@lhoestq Sure. I'm aware this is a high-priority issue to some extent, so feel free to take over. A few suggestions I have:
```python
dataset = Dataset.from_file(
    cache_file_name, info=info, split=self.split, in_memory=not self.cache_files
)
dataset._cache_dir = (
    os.path.dirname(cache_file_name)
    if dataset._cache_dir is None and self._cache_dir is not None
    else None
)
return dataset
```
@lhoestq I tried to address the first point from your earlier comment, but this change breaks the tests and it's getting late here, so I don't have time to fix this now. I think this is due to `BaseDatasetTest._to` in `test_arrow_dataset.py` relying on `Dataset.map`.
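One thing that might explain the failures (an observation on the snippet above, not from the thread): the `else` branch resets `_cache_dir` to `None` even when the new dataset already has one. A sketch of what may have been intended:

```python
dataset._cache_dir = (
    os.path.dirname(cache_file_name)
    if dataset._cache_dir is None and self._cache_dir is not None
    else dataset._cache_dir  # keep an existing value instead of discarding it
)
```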
Hi @mariosasko
So in the end we're probably going to close this PR.
Hi, I'm fine with that. I agree this adds too much complexity. Btw, I like the idea of reverting the default in-memory behavior for small datasets that led to this PR.
Adds the cache dir attribute to `DatasetInfo`, as suggested by @lhoestq.
Should fix #2322
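For reference, a minimal sketch of this original approach (the field name is assumed; the review above moves the attribute to the Dataset object instead):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetInfo:
    # ... existing metadata fields ...
    cache_dir: Optional[str] = None  # where map results for in-memory datasets go
```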