
Conversation

@lhoestq lhoestq commented Dec 12, 2023

Includes both #6458 and #6459

This PR should be merged instead of the two individual ones, since they conflict with each other.

Offline cache reload

It can reload datasets that were pushed to the Hub if they exist in the cache.

Example:

>>> Dataset.from_dict({"a": [1, 2]}).push_to_hub("lhoestq/tmp")
>>> load_dataset("lhoestq/tmp")
DatasetDict({
    train: Dataset({
        features: ['a'],
        num_rows: 2
    })
})

And later, without an internet connection:

>>> load_dataset("lhoestq/tmp")
Using the latest cached version of the dataset since lhoestq/tmp couldn't be found on the Hugging Face Hub
Found the latest cached dataset configuration 'default' at /Users/quentinlhoest/.cache/huggingface/datasets/lhoestq___tmp/default/0.0.0/da0e902a945afeb9 (last modified on Wed Dec 13 14:55:52 2023).
DatasetDict({
    train: Dataset({
        features: ['a'],
        num_rows: 2
    })
})
  • Updated CachedDatasetModuleFactory to look for datasets in the cache at <namespace>___<dataset_name>/<config_id>
  • Since the metadata config parameters are not available in offline mode, we don't know which folder to load (the config_id and hash change), so I simply load the latest one
    • I instantiate a BuilderConfig with the right config_name even if there is no metadata config
    • Its config_id is equal to its config_name, so that it can be retrieved from the cache (no more suffix for configs coming from metadata configs)
    • This config can be reloaded in offline mode by specifying the right config_name (same as online!)
  • Consequences of this change:
    • A custom builder config with config_id = config_name + hash of the user's parameters is only created when the user passes parameters
    • The hash used to name the cache folder takes into account the metadata config and the dataset info, so that the right cache can be reloaded when there is an internet connection, without re-downloading the data or resolving the data files. For local directories I hash the builder configs and dataset info, and for datasets on the Hub I use the commit sha as the hash.
    • Cache directories now look like config/version/commit_sha for Hub datasets, which is clean :)
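The new cache layout described above can be illustrated with a small sketch. Note that `cache_dir_for_hub_dataset` is a hypothetical helper written for this example, not the actual `datasets` internals; it only shows how the `<namespace>___<dataset_name>/<config>/<version>/<commit_sha>` path is assembled:

```python
from pathlib import Path

def cache_dir_for_hub_dataset(
    cache_root: str, repo_id: str, config_name: str, version: str, commit_sha: str
) -> Path:
    # Datasets with a namespace are stored as <namespace>___<dataset_name>;
    # canonical datasets without a namespace keep their name as-is.
    namespace, _, name = repo_id.partition("/")
    dataset_dir = f"{namespace}___{name}" if name else namespace
    # New layout: <dataset_dir>/<config>/<version>/<commit_sha>
    return Path(cache_root) / dataset_dir / config_name / version / commit_sha

print(cache_dir_for_hub_dataset(
    "/tmp/hf_cache", "glue", "sst2", "1.0.0",
    "fd8e86499fa5c264fcaad392a8f49ddf58bf4037",
))
```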

Fix #3547

Lazy data files resolution

This makes the following code run in 2 s instead of >10 s:

from datasets import load_dataset

ds = load_dataset("glue", "sst2", streaming=True, trust_remote_code=False)

For some datasets with many configs and files it can be up to 100x faster.
This is particularly important now that some datasets will be loaded from the Parquet export instead of the scripts.

The data files are only resolved in the builder's __init__. To do so, I added DataFilesPatternsList and DataFilesPatternsDict, which have a .resolve() method that returns the resolved DataFilesList and DataFilesDict.
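The lazy-resolution idea can be sketched as follows. These `LazyPatternsList`/`LazyPatternsDict` classes are simplified stand-ins for this example, not the actual DataFilesPatternsList implementation: patterns are stored cheaply at load time, and the expensive expansion into concrete file lists only happens when `.resolve()` is called:

```python
import fnmatch
from typing import Dict, List

class LazyPatternsList:
    """Stores glob patterns; matching against available files is deferred."""

    def __init__(self, patterns: List[str]):
        self.patterns = patterns  # cheap: no file listing happens here

    def resolve(self, available_files: List[str]) -> List[str]:
        # Expensive step, run only when the builder actually needs the files.
        return [
            f for f in available_files
            for p in self.patterns
            if fnmatch.fnmatch(f, p)
        ]

class LazyPatternsDict(Dict[str, LazyPatternsList]):
    def resolve(self, available_files: List[str]) -> Dict[str, List[str]]:
        return {split: plist.resolve(available_files) for split, plist in self.items()}

files = ["data/train-00000.parquet", "data/test-00000.parquet"]
patterns = LazyPatternsDict({
    "train": LazyPatternsList(["data/train-*"]),
    "test": LazyPatternsList(["data/test-*"]),
})
resolved = patterns.resolve(files)
```

Deferring resolution this way means `load_dataset` can skip listing files for every config it does not load, which is where the speedup for many-config datasets comes from.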

@lhoestq lhoestq changed the title Lazy resolve and cache reload Lazy data fiels resolution and offline cache reload Dec 13, 2023
@lhoestq lhoestq changed the title Lazy data fiels resolution and offline cache reload Lazy data files resolution and offline cache reload Dec 13, 2023

@albertvillanova albertvillanova left a comment


Thanks for the enhancements.

Huge PR to review... I guess you tested all the edge cases.

Naive question: is there any breaking change when loading?

@lhoestq

lhoestq commented Dec 13, 2023

Naive question: is there any breaking change when loading?

No breaking changes except that the cache folders are different

e.g. for glue sst2 (has parquet export)

This branch (new format is config/version/commit_sha)
~/.cache/huggingface/datasets/glue/sst2/1.0.0/fd8e86499fa5c264fcaad392a8f49ddf58bf4037
On main
~/.cache/huggingface/datasets/glue/sst2/0.0.0/74a75637ac4acd3f
On 2.15.0
~/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad

e.g. for wikimedia/wikipedia 20231101.ab (has metadata configs)

This branch (new format is config/version/commit_sha)
~/.cache/huggingface/datasets/wikimedia___wikipedia/20231101.ab/0.0.0/4cb9b0d719291f1a10f96f67d609c5d442980dc9
On main (takes ages to load)
~/.cache/huggingface/datasets/wikimedia___wikipedia/20231101.ab/0.0.0/cfa627e27933df13
On 2.15.0 (takes ages to load)
~/.cache/huggingface/datasets/wikimedia___wikipedia/20231101.ab/0.0.0/e92ee7a91c466564

e.g. for lhoestq/demo1 (no metadata configs)

This branch (new format is config/version/commit_sha)
~/.cache/huggingface/datasets/lhoestq___demo1/default/0.0.0/87ecf163bedca9d80598b528940a9c4f99e14c11
On main
~/.cache/huggingface/datasets/lhoestq___demo1/default-8a4a0b7a240d3c5e/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d
On 2.15.0
~/.cache/huggingface/datasets/lhoestq___demo1/default-59d4029e0bb36ae0/0.0.0/eea64c71ca8b46dd3f537ed218fc9bf495d5707789152eb2764f5c78fa66d59d

@lhoestq

lhoestq commented Dec 13, 2023

There was one last bug, which I just fixed: if you modify a dataset and reload it from the Hub, it won't download the new version. I think I need to use another hash to name the cache directory.
Edit: fixed

Comment on lines 2242 to 2244
if builder_config and builder_config.data_files is not None:
    builder_config._resolve_data_files(base_path=builder_kwargs["base_path"], download_config=download_config)
    hash = update_hash_for_cache(hash, data_files=builder_config.data_files)

@lhoestq lhoestq Dec 13, 2023


I fixed the last bug (and added a test)

The hash is now the git commit sha of the dataset on the hub :)

And for local files it's a hash of the builder configs, which takes into account resolved local files (and their last modified dates)
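A hash over resolved local files and their last-modified dates could look like the sketch below. This is an illustrative stand-in, not the actual `datasets` hashing code; the point is that editing a local file changes its mtime, which changes the hash and therefore the cache directory:

```python
import hashlib
import os
from typing import Iterable

def local_files_hash(paths: Iterable[str]) -> str:
    """Hash resolved local files by path and mtime, so edits invalidate the cache."""
    h = hashlib.sha256()
    for path in sorted(paths):  # sort for a deterministic hash
        mtime = os.path.getmtime(path)
        h.update(f"{path}:{mtime}".encode())
    return h.hexdigest()[:16]
```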

@lhoestq lhoestq force-pushed the lazy-resolve-and-cache-reload branch from 941a6bc to 5c9a6a2 Compare December 13, 2023 15:24
@lhoestq

lhoestq commented Dec 13, 2023

I switched to using the git commit sha for the cache directory, which is now config/version/commit_sha :) much cleaner than before.

And for local files it's a hash that takes into account the resolved files (and their last modified dates)

@lhoestq

lhoestq commented Dec 15, 2023

I also ran the transformers CI on this branch and it's green

@lhoestq

lhoestq commented Dec 15, 2023

FYI, huggingface_hub will have a release on Tuesday/Wednesday (it will speed up load_dataset data files resolution, which is now needed for datasets loaded from the Parquet export), so we can aim to merge this around the same time and do a release on Thursday


@polinaeterna polinaeterna left a comment


I love the clearer cache structure!
I wonder if it's possible to make loading of Hub datasets backward-compatible (loading from the old cache without a commit sha, if it exists).
Apart from that, looks good to me :)

    if os.path.isdir(cached_directory_path)
]
if not config_name and len(other_configs) > 1:
    raise ValueError(

@polinaeterna polinaeterna Dec 19, 2023


Nit: I think we usually use textwrap.dedent for multiline error messages like this, for readability (maybe in some other places in this PR too)

@lhoestq (Member Author)

I just checked and we mostly use \n actually - not a big deal imo

@lhoestq

lhoestq commented Dec 21, 2023

Merging this one, and hopefully the cache backward compatibility PR soon too :)

Then it will be release time

@lhoestq lhoestq merged commit ef3b5dd into main Dec 21, 2023
@lhoestq lhoestq deleted the lazy-resolve-and-cache-reload branch December 21, 2023 15:13
@github-actions

Show benchmarks

[Automated benchmark tables for PyArrow==8.0.0 and PyArrow==latest: benchmark_array_xd, benchmark_getitem_100B, benchmark_indices_mapping, benchmark_iterating, benchmark_map_filter]

Successfully merging this pull request may close these issues.

Datasets created with push_to_hub can't be accessed in offline mode

5 participants