
Conversation

@lhoestq lhoestq commented Dec 12, 2023

Includes both #6458 and #6459

This PR should be merged instead of the two individual ones, since they conflict with each other.

Offline cache reload

It can reload datasets that were pushed to the Hub if they exist in the cache.

Example:

>>> Dataset.from_dict({"a": [1, 2]}).push_to_hub("lhoestq/tmp")
>>> load_dataset("lhoestq/tmp")
DatasetDict({
    train: Dataset({
        features: ['a'],
        num_rows: 2
    })
})

And later, without an internet connection:

>>> load_dataset("lhoestq/tmp")
Using the latest cached version of the dataset since lhoestq/tmp couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /Users/quentinlhoest/.cache/huggingface/datasets/lhoestq___tmp/default/0.0.0/da0e902a945afeb9 (last modified on Wed Dec 13 14:55:52 2023).
DatasetDict({
    train: Dataset({
        features: ['a'],
        num_rows: 2
    })
})
  • Updated CachedDatasetModuleFactory to look for datasets in the cache at <namespace>___<dataset_name>/<config_id>
  • Since the metadata config parameters are not available in offline mode, we don't know which folder to load (the config_id and hash change), so I simply load the latest one
    • I instantiate a BuilderConfig with the right config_name even if there is no metadata config
    • Its config_id is equal to its config_name, so that it can be retrieved from the cache (no more suffix for configs coming from metadata configs)
    • This config can be reloaded in offline mode by specifying the right config_name (same as online!)
  • Consequences of this change:
    • A custom builder config with config_id = config_name + hash of the user's parameters is only created when the user passes parameters
    • The hash used to name the cache folder takes into account the metadata config and the dataset info, so that the right cache can be reloaded when there is an internet connection, without re-downloading the data or resolving the data files. For local directories I hash the builder configs and dataset info, and for datasets on the Hub I use the commit sha as the hash.
    • Cache directories now look like config/version/commit_sha for Hub datasets, which is clean :)
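The new cache layout described above can be illustrated with a small sketch. Note that `cache_dir_for_hub_dataset` is a hypothetical helper written for this example, not the actual `datasets` internals; it only shows how the `<namespace>___<dataset_name>/<config>/<version>/<commit_sha>` path is assembled:

```python
from pathlib import Path

def cache_dir_for_hub_dataset(
    cache_root: str, repo_id: str, config_name: str, version: str, commit_sha: str
) -> Path:
    # Datasets with a namespace are stored as <namespace>___<dataset_name>;
    # canonical datasets without a namespace keep their name as-is.
    namespace, _, name = repo_id.partition("/")
    dataset_dir = f"{namespace}___{name}" if name else namespace
    # New layout: <dataset_dir>/<config>/<version>/<commit_sha>
    return Path(cache_root) / dataset_dir / config_name / version / commit_sha

print(cache_dir_for_hub_dataset(
    "/tmp/hf_cache", "glue", "sst2", "1.0.0",
    "fd8e86499fa5c264fcaad392a8f49ddf58bf4037",
))
```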

Fix #3547

Lazy data files resolution

This makes the following code run in 2 s instead of >10 s:

from datasets import load_dataset

ds = load_dataset("glue", "sst2", streaming=True, trust_remote_code=False)

For some datasets with many configs and files it can be up to 100x faster.
This is particularly important now that some datasets will be loaded from the Parquet export instead of the scripts.

The data files are only resolved in the builder's __init__. To do so, I added DataFilesPatternsList and DataFilesPatternsDict, which have a .resolve() method that returns the resolved DataFilesList and DataFilesDict.
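The lazy-resolution idea can be sketched as follows. These `LazyPatternsList`/`LazyPatternsDict` classes are simplified stand-ins for this example, not the actual DataFilesPatternsList implementation: patterns are stored cheaply at load time, and the expensive expansion into concrete file lists only happens when `.resolve()` is called:

```python
import fnmatch
from typing import Dict, List

class LazyPatternsList:
    """Stores glob patterns; matching against available files is deferred."""

    def __init__(self, patterns: List[str]):
        self.patterns = patterns  # cheap: no file listing happens here

    def resolve(self, available_files: List[str]) -> List[str]:
        # Expensive step, run only when the builder actually needs the files.
        return [
            f for f in available_files
            for p in self.patterns
            if fnmatch.fnmatch(f, p)
        ]

class LazyPatternsDict(Dict[str, LazyPatternsList]):
    def resolve(self, available_files: List[str]) -> Dict[str, List[str]]:
        return {split: plist.resolve(available_files) for split, plist in self.items()}

files = ["data/train-00000.parquet", "data/test-00000.parquet"]
patterns = LazyPatternsDict({
    "train": LazyPatternsList(["data/train-*"]),
    "test": LazyPatternsList(["data/test-*"]),
})
resolved = patterns.resolve(files)
```

Deferring resolution this way means `load_dataset` can skip listing files for every config it does not load, which is where the speedup for many-config datasets comes from.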

@lhoestq lhoestq changed the title Lazy resolve and cache reload Lazy data fiels resolution and offline cache reload Dec 13, 2023
@lhoestq lhoestq changed the title Lazy data fiels resolution and offline cache reload Lazy data files resolution and offline cache reload Dec 13, 2023

@albertvillanova albertvillanova left a comment


Thanks for the enhancements.

Huge PR to review... I guess you tested all the edge cases.

Naive question: is there any breaking change when loading?

@lhoestq

lhoestq commented Dec 13, 2023

Naive question: is there any breaking change when loading?

No breaking changes except that the cache folders are different

e.g. for glue sst2 (has parquet export)

This branch (new format is config/version/commit_sha)
~/.cache/huggingface/datasets/glue/sst2/1.0.0/fd8e86499fa5c264fcaad392a8f49ddf58bf4037
On main
~/.cache/huggingface/datasets/glue/sst2/0.0.0/74a75637ac4acd3f
On 2.15.0
~/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad

e.g. for wikimedia/wikipedia 20231101.ab (has metadata configs)

This branch (new format is config/version/commit_sha)
~/.cache/huggingface/datasets/wikimedia___wikipedia/20231101.ab/0.0.0/4cb9b0d719291f1a10f96f67d609c5d442980dc9
On main (takes ages to load)
~/.cache/huggingface/datasets/wikimedia___wikipedia/20231101.ab/0.0.0/cfa627e27933df13
On 2.15.0 (takes ages to load)
~/.cache/huggingface/datasets/wikimedia___wikipedia/20231101.ab/0.0.0/e92ee7a91c466564

e.g. for lhoestq/demo1 (no metadata configs)

This branch (new format is config/version/commit_sha)
~/.cache/huggingface/datasets/lhoestq___demo1/default/0.0.0/87ecf163bedca9d80598b528940a9c4f99e14c11
On main
~/.cache/huggingface/datasets/lhoestq___demo1/default-8a4a0b7a240d3c5e/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d
On 2.15.0
~/.cache/huggingface/datasets/lhoestq___demo1/default-59d4029e0bb36ae0/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d

@lhoestq

lhoestq commented Dec 13, 2023

There was one last bug, which I just fixed: if you modify a dataset and reload it from the Hub, it won't download the new version. I think I need to use another hash to name the cache directory.
Edit: fixed

Comment on lines 2242 to 2244
if builder_config and builder_config.data_files is not None:
    builder_config._resolve_data_files(base_path=builder_kwargs["base_path"], download_config=download_config)
    hash = update_hash_for_cache(hash, data_files=builder_config.data_files)

@lhoestq lhoestq Dec 13, 2023


I fixed the last bug (and added a test)

The hash is now the git commit sha of the dataset on the hub :)

And for local files it's a hash of the builder configs, which takes into account resolved local files (and their last modified dates)
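A hash over resolved local files and their last-modified dates could look like the sketch below. This is an illustrative stand-in, not the actual `datasets` hashing code; the point is that editing a local file changes its mtime, which changes the hash and therefore the cache directory:

```python
import hashlib
import os
from typing import Iterable

def local_files_hash(paths: Iterable[str]) -> str:
    """Hash resolved local files by path and mtime, so edits invalidate the cache."""
    h = hashlib.sha256()
    for path in sorted(paths):  # sort for a deterministic hash
        mtime = os.path.getmtime(path)
        h.update(f"{path}:{mtime}".encode())
    return h.hexdigest()[:16]
```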

@lhoestq lhoestq force-pushed the lazy-resolve-and-cache-reload branch from 941a6bc to 5c9a6a2 Compare December 13, 2023 15:24
@lhoestq

lhoestq commented Dec 13, 2023

I switched to using the git commit sha for the cache directory, which is now config/version/commit_sha :) much cleaner than before.

And for local files it's a hash that takes into account the resolved files (and their last modified dates)

@lhoestq

lhoestq commented Dec 15, 2023

I also ran the transformers CI on this branch and it's green

@lhoestq

lhoestq commented Dec 15, 2023

FYI, huggingface_hub will have a release on Tuesday/Wednesday (it will speed up load_dataset data files resolution, which is now needed for datasets loaded from the Parquet export), so we can aim to merge this around the same time and do a release on Thursday


@polinaeterna polinaeterna left a comment


I love the clearer cache structure!
I wonder if it's possible to make loading of Hub datasets backward-compatible (loading from the old cache without a commit sha, if it exists).
Apart from that, looks good to me :)

    if os.path.isdir(cached_directory_path)
]
if not config_name and len(other_configs) > 1:
    raise ValueError(

@polinaeterna polinaeterna Dec 19, 2023


Nit: I think we usually use textwrap.dedent for multiline error messages like this, for readability (maybe in some other places in this PR too)

@lhoestq (Member Author)

I just checked and we mostly use \n actually - not a big deal imo

@lhoestq

lhoestq commented Dec 21, 2023

Merging this one, and hopefully the cache backward compatibility PR soon too :)

Then it will be release time

@lhoestq lhoestq merged commit ef3b5dd into main Dec 21, 2023
@lhoestq lhoestq deleted the lazy-resolve-and-cache-reload branch December 21, 2023 15:13
@github-actions

Show benchmarks

[Automated benchmark tables for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter]

Successfully merging this pull request may close these issues.

Datasets created with push_to_hub can't be accessed in offline mode

5 participants