Lazy data files resolution and offline cache reload #6493
Conversation
albertvillanova left a comment:
Thanks for the enhancements.
Huge PR to review... I guess you tested all the edge cases.
Naive question: is there any breaking change when loading?
No breaking changes, except that the cache folders are different, e.g. for glue sst2 (has a Parquet export), for wikimedia/wikipedia 20231101.ab (has metadata configs), and for lhoestq/demo1 (no metadata configs).

There was a last bug I just fixed: if you modify a dataset and reload it from the hub, it won't download the new version. I think I need to use another hash to name the cache directory.
src/datasets/load.py (outdated):

```python
if builder_config and builder_config.data_files is not None:
    builder_config._resolve_data_files(base_path=builder_kwargs["base_path"], download_config=download_config)
    hash = update_hash_for_cache(hash, data_files=builder_config.data_files)
```
I fixed the last bug (and added a test)
The hash is now the git commit sha of the dataset on the hub :)
And for local files it's a hash of the builder configs, which takes into account resolved local files (and their last modified dates)
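The idea of hashing local files together with their last-modified dates can be sketched roughly as follows (this is an illustrative sketch, not the library's actual implementation; `local_files_hash` is a hypothetical name):

```python
import hashlib
import os
import tempfile

def local_files_hash(files):
    # Combine each resolved file path with its last-modified time, so that
    # editing a local data file yields a new hash and therefore a new cache
    # directory, forcing a rebuild instead of reusing stale cached data.
    h = hashlib.sha256()
    for path in sorted(files):
        h.update(path.encode())
        h.update(str(os.path.getmtime(path)).encode())
    return h.hexdigest()

# Usage: the hash changes when a file's mtime changes.
tmp = tempfile.mkdtemp()
data_file = os.path.join(tmp, "train.csv")
open(data_file, "w").close()
h1 = local_files_hash([data_file])
os.utime(data_file, (0, 0))  # simulate a modification by resetting the mtime
h2 = local_files_hash([data_file])
print(h1 != h2)  # → True
```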
Force-pushed from 941a6bc to 5c9a6a2.
I switched to using the git commit sha for the cache directory. And for local files it's a hash that takes into account the resolved files (and their last modified dates).

I also ran the
FYI |
polinaeterna left a comment:
i love the clearer cache structure!
i wonder if it's possible to make loading of Hub datasets backward-compatible (loading from the old cache, without commit sha, if it exists).
apart from that, looks good to me :)
```python
        if os.path.isdir(cached_directory_path)
    ]
    if not config_name and len(other_configs) > 1:
        raise ValueError(
```
nit: i think we usually use `textwrap.dedent` for multiline error messages like this, for readability (maybe in some other places in this PR too)
I just checked and we mostly use `\n` actually - not a big deal imo

Merging this one, and hopefully the cache backward compatibility PR soon too :) Then it will be release time
Benchmarks (PyArrow==8.0.0, and updated): benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, benchmark_map_filter.json

Includes both #6458 and #6459. This PR should be merged instead of the two individually, since they are conflicting.
Offline cache reload
It can reload datasets that were pushed to the Hub if they exist in the cache. For example, you can push a dataset to the Hub, and later, without connection, reload it from the cache.
Updated `CachedDatasetModuleFactory` to look for datasets in the cache at `<namespace>___<dataset_name>/<config_id>`, and the cache now uses `config/version/commit_sha` for hub datasets, which is clean :)

Fix #3547
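The cache layout described above can be sketched as a simple path computation (an illustrative sketch only; `cached_dataset_path` and the argument names are hypothetical, not the library's API):

```python
import os

def cached_dataset_path(cache_dir, namespace, dataset_name, config_id, version, commit_sha):
    # Hub datasets are cached at:
    #   <namespace>___<dataset_name>/<config_id>/<version>/<commit_sha>
    # Using the commit sha as the last component means a new commit on the Hub
    # gets a fresh directory instead of silently reusing an old one.
    return os.path.join(cache_dir, f"{namespace}___{dataset_name}", config_id, version, commit_sha)

path = cached_dataset_path("datasets_cache", "lhoestq", "demo1", "default", "0.0.0", "0123abc")
print(path)
```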
Lazy data files resolution
This makes the loading code run in 2 sec instead of >10 sec.
For some datasets with many configs and files it can be up to 100x faster.
This is particularly important now that some datasets will be loaded from the Parquet export instead of the scripts.
The data files are only resolved in the builder `__init__`. To do so I added `DataFilesPatternsList` and `DataFilesPatternsDict`, which have a `.resolve()` method that returns resolved `DataFilesList` and `DataFilesDict`.
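The lazy-resolution idea can be sketched like this (a minimal sketch, not the actual `datasets` implementation; only the class name and `.resolve()` come from the PR, the rest is illustrative):

```python
import glob
import os
import tempfile

class DataFilesPatternsList(list):
    """Holds unresolved glob patterns; the expensive file listing is deferred
    until .resolve() is called (e.g. in the builder __init__)."""

    @classmethod
    def from_patterns(cls, patterns):
        return cls(patterns)  # cheap: just stores the patterns, no I/O

    def resolve(self, base_path):
        # Expand every pattern relative to base_path into concrete file paths.
        resolved = []
        for pattern in self:
            resolved.extend(sorted(glob.glob(os.path.join(base_path, pattern))))
        return resolved

# Usage: create two CSV files, then resolve the pattern lazily.
tmp = tempfile.mkdtemp()
for name in ("a.csv", "b.csv"):
    open(os.path.join(tmp, name), "w").close()

patterns = DataFilesPatternsList.from_patterns(["*.csv"])  # no I/O yet
files = patterns.resolve(tmp)                              # I/O happens here
print(len(files))  # → 2
```

Deferring resolution this way is what avoids listing every file of every config up front, which is where the speedup for many-config datasets comes from.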