Fix regex `get_data_files` formatting for base paths #6322

ZachNagengast · 2023-10-19T19:45:10Z

Since PR #6309, the entire base path is converted to regex. Because of the .replace("//", "/") line in glob_pattern_to_regex, URL base paths get mangled, which results in a "doesn't match the pattern" error:

Input: hf://datasets/...
Output: hf:/datasets/...

This fix will only convert the split_pattern to regex and keep the base_path unchanged.
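To illustrate the problem, here is a minimal sketch. The two helper functions are simplified stand-ins written for this example, not the library's actual glob_pattern_to_regex or string_to_dict, and the repository path and split pattern are hypothetical. It shows that converting the whole path collapses `hf://` into `hf:/`, while converting only the split pattern leaves the base path intact.

```python
import re


def glob_to_regex_sketch(pattern: str) -> str:
    # Simplified stand-in (assumption): only the replacements needed here,
    # including the "//" -> "/" collapse mentioned above.
    return pattern.replace(".", r"\.").replace("*", ".*").replace("//", "/")


def string_to_dict_sketch(string: str, pattern: str) -> dict:
    # Turn "{name}" placeholders into named groups, then match.
    regex = re.sub(r"{(.+?)}", r"(?P<\1>.+)", pattern)
    match = re.search(regex, string)
    if match is None:
        raise ValueError(f"String {string} doesn't match the pattern {pattern}")
    return match.groupdict()


base_path = "hf://datasets/user/repo"  # hypothetical base path
split_pattern = "data/{split}-*.*"     # hypothetical, simplified pattern
data_file = base_path + "/data/train-00000.parquet"

# Converting the whole path: the "//" after "hf:" is collapsed.
whole_regex = glob_to_regex_sketch(base_path + "/" + split_pattern)
try:
    string_to_dict_sketch(data_file, whole_regex)
    whole_path_matched = True
except ValueError:
    whole_path_matched = False

# Converting only the split pattern: the base path is left unchanged.
fixed_regex = base_path + "/" + glob_to_regex_sketch(split_pattern)
split_name = string_to_dict_sketch(data_file, fixed_regex)["split"]
```

In this sketch the base path is inserted into the regex unescaped, which only works because it contains no regex metacharacters; a real implementation would need to handle that case.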

cc @albertvillanova hopefully this still works with your implementation

albertvillanova

Thanks for the proposed fix, @ZachNagengast.

EDIT:

The reason I applied glob_pattern_to_regex to the entire pattern is that otherwise I got an error for Windows local paths: a base_path like 'C:\\Users\\runneradmin... made the function string_to_dict raise re.error: incomplete escape \U at position 2

See: https://github.com/huggingface/datasets/actions/runs/6544904352/job/17772361643

~~We should include a test that includes the case you mention, and find a solution that works for all cases.~~

That issue was fixed once we pass the base_path as POSIX.

Maybe we could add a test that fails in the case you mention.
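For reference, the Windows failure can be reproduced with the standard library alone. This is a sketch with a hypothetical path (not the one from the CI run): a backslash-separated path used directly as a regex makes `re` read `\U` as the start of a Unicode escape, while the POSIX form of the same path compiles without error.

```python
import re
from pathlib import PureWindowsPath

# Hypothetical Windows-style path, used only for illustration.
windows_path = "C:\\Users\\someone\\data"

# Used as a regex, "\U" is parsed as an (incomplete) Unicode escape.
try:
    re.compile(windows_path)
    error_message = None
except re.error as err:
    error_message = str(err)

# The POSIX form uses forward slashes, so no escape sequence is involved.
posix_path = PureWindowsPath(windows_path).as_posix()
compiled = re.compile(posix_path)
```

PureWindowsPath is used here so the snippet behaves the same on any operating system.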

HuggingFaceDocBuilderDev · 2023-10-20T10:10:58Z

The documentation is not available anymore as the PR was closed or merged.

ZachNagengast · 2023-10-20T18:25:12Z

> The reason why I used the glob_pattern_to_regex in the entire pattern is because otherwise I got an error for Windows local paths: a base_path like 'C:\Users\runneradmin... made the function string_to_dict raise re.error: incomplete escape \U at position 2

What are the expected inputs and outputs for the Windows base_path?

> That issue was fixed once we pass the base_path as POSIX.

I'm not sure what you meant by that. Are there still changes needed?

lhoestq · 2023-10-23T13:57:21Z

We took the liberty of continuing this PR to include it in today's patch release :)
I hope you don't mind

albertvillanova · 2023-10-23T14:12:52Z

src/datasets/data_files.py

+            splits: Set[str] = {
+                string_to_dict(xbasename(p), glob_pattern_to_regex(xbasename(split_pattern)))["split"]
+                for p in data_files
+            }


If you are matching just in the basename, then what is the point of having 2 kinds of patterns?

ALL_SPLIT_PATTERNS: data/{split}-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9]*.*

ALL_DEFAULT_PATTERNS: **/*[{sep}/]{keyword}[{sep}/]**

Maybe I'm missing something, but why do we need the former? I would naively say the latter contains the former.

Only ALL_SPLIT_PATTERNS are parsed to infer custom split names.

While the second only detects train/valid/test

OK, and what is the point of the directory data/ in ALL_SPLIT_PATTERNS if we only match the basename?

This is for old push_to_hub to work: they push custom splits using this pattern in the data directory.
New push_to_hub have some YAML to specify the pattern to use, so get_data_patterns isn't called

OK, all clear now. Thanks.
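The basename-only matching discussed in this thread can be sketched as follows. This is an illustration under assumptions: `posixpath.basename` stands in for xbasename, the glob-to-regex helper is a simplified stand-in rather than the library's glob_pattern_to_regex, and the file paths are hypothetical. Because only the file name is matched, the base path (URL or local) never reaches the regex.

```python
import posixpath
import re

# Split pattern quoted in the discussion above.
SPLIT_PATTERN = "data/{split}-[0-9][0-9][0-9][0-9][0-9]-of-[0-9][0-9][0-9][0-9][0-9]*.*"


def glob_to_regex_sketch(pattern: str) -> str:
    # Simplified stand-in (assumption): escape dots, expand "*" wildcards.
    return pattern.replace(".", r"\.").replace("*", ".*")


def split_from_path(path: str, split_pattern: str = SPLIT_PATTERN) -> str:
    # Match only the file name against only the file-name part of the pattern.
    name_regex = glob_to_regex_sketch(posixpath.basename(split_pattern))
    name_regex = re.sub(r"{(.+?)}", r"(?P<\1>.+)", name_regex)
    match = re.search(name_regex, posixpath.basename(path))
    if match is None:
        raise ValueError(f"String {path} doesn't match the pattern {split_pattern}")
    return match["split"]


# Hypothetical data files, including a custom split name.
data_files = [
    "hf://datasets/user/repo/data/train-00000-of-00001.parquet",
    "hf://datasets/user/repo/data/my_custom_split-00000-of-00002.parquet",
    "hf://datasets/user/repo/data/my_custom_split-00001-of-00002.parquet",
]
splits = {split_from_path(p) for p in data_files}
```

The data/ directory in the pattern still matters when files are globbed, but it plays no role in extracting the split name here.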

* Fix regex from formatting url base_path
* Test test_get_data_patterns from Hub
* simply match basename instead
* more tests
* minor
* remove comment

Co-authored-by: Albert Villanova del Moral <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>

github-actions · 2023-10-23T14:40:44Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007109 / 0.011353 (-0.004244)	0.004209 / 0.011008 (-0.006799)	0.097401 / 0.038508 (0.058892)	0.079532 / 0.023109 (0.056423)	0.341300 / 0.275898 (0.065402)	0.402165 / 0.323480 (0.078685)	0.005838 / 0.007986 (-0.002148)	0.003310 / 0.004328 (-0.001018)	0.072804 / 0.004250 (0.068553)	0.059418 / 0.037052 (0.022366)	0.339277 / 0.258489 (0.080788)	0.418495 / 0.293841 (0.124654)	0.035975 / 0.128546 (-0.092571)	0.008101 / 0.075646 (-0.067546)	0.339236 / 0.419271 (-0.080035)	0.059326 / 0.043533 (0.015794)	0.326880 / 0.255139 (0.071741)	0.393614 / 0.283200 (0.110414)	0.025830 / 0.141683 (-0.115852)	1.657726 / 1.452155 (0.205571)	1.817250 / 1.492716 (0.324534)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.256015 / 0.018006 (0.238008)	0.482447 / 0.000490 (0.481957)	0.012166 / 0.000200 (0.011966)	0.000343 / 0.000054 (0.000288)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029898 / 0.037411 (-0.007514)	0.088218 / 0.014526 (0.073692)	0.102353 / 0.176557 (-0.074203)	0.165863 / 0.737135 (-0.571272)	0.100342 / 0.296338 (-0.195996)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.429362 / 0.215209 (0.214153)	4.147327 / 2.077655 (2.069672)	2.014653 / 1.504120 (0.510533)	1.824394 / 1.541195 (0.283199)	1.936408 / 1.468490 (0.467917)	0.542960 / 4.584777 (-4.041817)	3.917215 / 3.745712 (0.171503)	3.714825 / 5.269862 (-1.555036)	2.180279 / 4.565676 (-2.385398)	0.057808 / 0.424275 (-0.366467)	0.008426 / 0.007607 (0.000819)	0.472372 / 0.226044 (0.246327)	4.879656 / 2.268929 (2.610728)	2.602729 / 55.444624 (-52.841896)	2.142593 / 6.876477 (-4.733884)	2.206070 / 2.142072 (0.063997)	0.635591 / 4.805227 (-4.169636)	0.140928 / 6.500664 (-6.359736)	0.065119 / 0.075469 (-0.010350)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.455909 / 1.841788 (-0.385879)	20.803592 / 8.074308 (12.729284)	14.788713 / 10.191392 (4.597321)	0.170546 / 0.680424 (-0.509878)	0.021189 / 0.534201 (-0.513012)	0.432368 / 0.579283 (-0.146915)	0.444664 / 0.434364 (0.010300)	0.517744 / 0.540337 (-0.022593)	0.699265 / 1.386936 (-0.687671)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007592 / 0.011353 (-0.003760)	0.004045 / 0.011008 (-0.006964)	0.073434 / 0.038508 (0.034926)	0.076962 / 0.023109 (0.053853)	0.468873 / 0.275898 (0.192975)	0.479968 / 0.323480 (0.156488)	0.006270 / 0.007986 (-0.001716)	0.003652 / 0.004328 (-0.000677)	0.069893 / 0.004250 (0.065643)	0.061902 / 0.037052 (0.024850)	0.443379 / 0.258489 (0.184890)	0.492627 / 0.293841 (0.198786)	0.035967 / 0.128546 (-0.092579)	0.009276 / 0.075646 (-0.066370)	0.083060 / 0.419271 (-0.336212)	0.050870 / 0.043533 (0.007337)	0.438246 / 0.255139 (0.183107)	0.472074 / 0.283200 (0.188874)	0.023724 / 0.141683 (-0.117959)	1.677178 / 1.452155 (0.225023)	1.732273 / 1.492716 (0.239557)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.244693 / 0.018006 (0.226687)	0.470067 / 0.000490 (0.469577)	0.005574 / 0.000200 (0.005374)	0.000105 / 0.000054 (0.000051)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.036242 / 0.037411 (-0.001169)	0.099166 / 0.014526 (0.084641)	0.116785 / 0.176557 (-0.059772)	0.174986 / 0.737135 (-0.562149)	0.118130 / 0.296338 (-0.178209)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.475907 / 0.215209 (0.260698)	4.708125 / 2.077655 (2.630470)	2.600855 / 1.504120 (1.096735)	2.446498 / 1.541195 (0.905303)	2.538786 / 1.468490 (1.070296)	0.566787 / 4.584777 (-4.017990)	4.066187 / 3.745712 (0.320475)	3.743632 / 5.269862 (-1.526229)	2.337737 / 4.565676 (-2.227939)	0.068402 / 0.424275 (-0.355873)	0.008674 / 0.007607 (0.001067)	0.593428 / 0.226044 (0.367384)	5.840687 / 2.268929 (3.571759)	3.194937 / 55.444624 (-52.249688)	2.899033 / 6.876477 (-3.977444)	2.977870 / 2.142072 (0.835797)	0.683673 / 4.805227 (-4.121554)	0.154933 / 6.500664 (-6.345731)	0.071619 / 0.075469 (-0.003850)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.501895 / 1.841788 (-0.339893)	21.709792 / 8.074308 (13.635484)	15.679556 / 10.191392 (5.488164)	0.188028 / 0.680424 (-0.492396)	0.022555 / 0.534201 (-0.511646)	0.439840 / 0.579283 (-0.139443)	0.452140 / 0.434364 (0.017776)	0.526421 / 0.540337 (-0.013916)	0.731692 / 1.386936 (-0.655244)

Fix regex from formatting url base_path

a6bd7b4

ZachNagengast changed the title ~~Fix regex get_data_files formatting for url base paths~~ Fix regex get_data_files formatting for base paths Oct 19, 2023

albertvillanova requested changes Oct 20, 2023

albertvillanova and others added 6 commits October 23, 2023 13:40

Merge remote-tracking branch 'upstream/main' into fix-6309

aae3360

Test test_get_data_patterns from Hub

d09a97d

simply match basename instead

c57d2c0

more tests

530f06b

minor

6d8e515

remove comment

db5f817

Merge branch 'main' into fix-6309

2ffbfb7

albertvillanova reviewed Oct 23, 2023

albertvillanova approved these changes Oct 23, 2023

lhoestq merged commit 02ecc84 into huggingface:main Oct 23, 2023

4 participants