Do not filter out .zip extensions from no-script datasets #6208

albertvillanova · 2023-09-04T06:07:12Z

This PR is a hotfix of:

No-script datasets with ZIP files do not load #6207

That PR introduced the filtering out of .zip extensions. This PR reverts that.
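To make the failure mode concrete, here is a minimal sketch of how an extension filter can hide a ZIP-only dataset. It is not the library's actual code: `SUPPORTED_EXTENSIONS`, `select_data_files`, and the file names are hypothetical and only illustrate the idea.

```python
from pathlib import PurePosixPath

# Hypothetical extension list for illustration; the real library keeps its own mapping.
SUPPORTED_EXTENSIONS = {".csv", ".json", ".jsonl", ".parquet", ".txt", ".zip"}


def select_data_files(paths, excluded_extensions=frozenset()):
    """Keep paths whose suffix is supported and not explicitly excluded."""
    allowed = SUPPORTED_EXTENSIONS - set(excluded_extensions)
    return [p for p in paths if PurePosixPath(p).suffix.lower() in allowed]


# Hypothetical repository listing that contains only ZIP archives as data.
repo_files = ["README.md", "data/train.zip", "data/test.zip"]

# With ".zip" filtered out, no data file is left to load.
print(select_data_files(repo_files, excluded_extensions={".zip"}))

# Without that filter, the archives remain candidates.
print(select_data_files(repo_files))
```

Under these assumptions, a repository whose data lives only in archives ends up with an empty file list as soon as `.zip` is excluded, which matches the symptom described in #6207.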

Hot fix #6207.

Maybe we should do patch releases: the bug was introduced in 2.13.1.

CC: @lhoestq

github-actions · 2023-09-04T06:13:42Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006797 / 0.011353 (-0.004556)	0.003966 / 0.011008 (-0.007042)	0.085296 / 0.038508 (0.046788)	0.076873 / 0.023109 (0.053764)	0.355795 / 0.275898 (0.079897)	0.397132 / 0.323480 (0.073652)	0.005325 / 0.007986 (-0.002660)	0.003343 / 0.004328 (-0.000986)	0.064966 / 0.004250 (0.060716)	0.054519 / 0.037052 (0.017467)	0.357864 / 0.258489 (0.099374)	0.409238 / 0.293841 (0.115397)	0.031620 / 0.128546 (-0.096926)	0.008529 / 0.075646 (-0.067117)	0.288502 / 0.419271 (-0.130769)	0.053260 / 0.043533 (0.009728)	0.355245 / 0.255139 (0.100106)	0.384139 / 0.283200 (0.100939)	0.024507 / 0.141683 (-0.117176)	1.494696 / 1.452155 (0.042541)	1.579847 / 1.492716 (0.087130)
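Each cell in these rows has the form `new / old (diff)`. The printed numbers are consistent with the parenthesized value being `new - old`, up to rounding in the last digit. A small sketch for parsing and checking a cell follows; `parse_cell` and `is_consistent` are hypothetical helpers, not part of the benchmark tooling.

```python
import re

# Matches cells such as "0.006797 / 0.011353 (-0.004556)".
CELL = re.compile(r"^\s*([\d.]+) / ([\d.]+) \((-?[\d.]+)\)\s*$")


def parse_cell(cell):
    """Return (new, old, diff) as floats for one benchmark cell."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized benchmark cell: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def is_consistent(cell, tol=2e-6):
    """Check that diff equals new - old, allowing for six-decimal rounding."""
    new, old, diff = parse_cell(cell)
    return abs((new - old) - diff) <= tol


print(parse_cell("0.006797 / 0.011353 (-0.004556)"))
print(is_consistent("3.861552 / 2.077655 (1.783898)"))
```

A negative difference therefore means the new run took less time than the reference value, assuming the metrics are timings in seconds.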

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.204011 / 0.018006 (0.186005)	0.451729 / 0.000490 (0.451239)	0.004628 / 0.000200 (0.004428)	0.000081 / 0.000054 (0.000026)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028342 / 0.037411 (-0.009069)	0.084647 / 0.014526 (0.070121)	0.096174 / 0.176557 (-0.080383)	0.151753 / 0.737135 (-0.585382)	0.096347 / 0.296338 (-0.199991)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.387179 / 0.215209 (0.171970)	3.861552 / 2.077655 (1.783898)	1.844033 / 1.504120 (0.339913)	1.678811 / 1.541195 (0.137616)	1.793207 / 1.468490 (0.324717)	0.485836 / 4.584777 (-4.098941)	3.566274 / 3.745712 (-0.179438)	3.269888 / 5.269862 (-1.999974)	2.042850 / 4.565676 (-2.522827)	0.057088 / 0.424275 (-0.367187)	0.007627 / 0.007607 (0.000019)	0.460510 / 0.226044 (0.234465)	4.602019 / 2.268929 (2.333090)	2.390984 / 55.444624 (-53.053641)	1.976150 / 6.876477 (-4.900327)	2.193394 / 2.142072 (0.051322)	0.582775 / 4.805227 (-4.222453)	0.133408 / 6.500664 (-6.367256)	0.060577 / 0.075469 (-0.014893)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.248505 / 1.841788 (-0.593283)	19.771301 / 8.074308 (11.696993)	14.327871 / 10.191392 (4.136479)	0.155288 / 0.680424 (-0.525136)	0.018310 / 0.534201 (-0.515891)	0.393664 / 0.579283 (-0.185619)	0.410578 / 0.434364 (-0.023786)	0.459301 / 0.540337 (-0.081037)	0.631921 / 1.386936 (-0.755015)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006827 / 0.011353 (-0.004526)	0.004094 / 0.011008 (-0.006915)	0.065299 / 0.038508 (0.026791)	0.079496 / 0.023109 (0.056387)	0.403661 / 0.275898 (0.127763)	0.434449 / 0.323480 (0.110969)	0.005398 / 0.007986 (-0.002588)	0.003410 / 0.004328 (-0.000919)	0.064832 / 0.004250 (0.060582)	0.056303 / 0.037052 (0.019250)	0.397848 / 0.258489 (0.139359)	0.438244 / 0.293841 (0.144403)	0.032637 / 0.128546 (-0.095909)	0.008584 / 0.075646 (-0.067063)	0.071406 / 0.419271 (-0.347866)	0.048265 / 0.043533 (0.004732)	0.397814 / 0.255139 (0.142675)	0.421601 / 0.283200 (0.138402)	0.023815 / 0.141683 (-0.117868)	1.504814 / 1.452155 (0.052659)	1.577185 / 1.492716 (0.084469)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.231775 / 0.018006 (0.213769)	0.445437 / 0.000490 (0.444948)	0.005252 / 0.000200 (0.005052)	0.000093 / 0.000054 (0.000039)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032777 / 0.037411 (-0.004634)	0.095054 / 0.014526 (0.080528)	0.106429 / 0.176557 (-0.070127)	0.160111 / 0.737135 (-0.577024)	0.108075 / 0.296338 (-0.188263)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.426034 / 0.215209 (0.210825)	4.244668 / 2.077655 (2.167013)	2.257938 / 1.504120 (0.753818)	2.087993 / 1.541195 (0.546798)	2.170878 / 1.468490 (0.702387)	0.485228 / 4.584777 (-4.099549)	3.725912 / 3.745712 (-0.019800)	3.286925 / 5.269862 (-1.982937)	2.059929 / 4.565676 (-2.505748)	0.057813 / 0.424275 (-0.366462)	0.007518 / 0.007607 (-0.000089)	0.506632 / 0.226044 (0.280588)	5.048340 / 2.268929 (2.779411)	2.744756 / 55.444624 (-52.699869)	2.406636 / 6.876477 (-4.469841)	2.617552 / 2.142072 (0.475480)	0.588476 / 4.805227 (-4.216751)	0.133518 / 6.500664 (-6.367146)	0.060778 / 0.075469 (-0.014691)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.356416 / 1.841788 (-0.485372)	20.467516 / 8.074308 (12.393208)	15.265443 / 10.191392 (5.074051)	0.169201 / 0.680424 (-0.511223)	0.020087 / 0.534201 (-0.514114)	0.402332 / 0.579283 (-0.176951)	0.414848 / 0.434364 (-0.019516)	0.470422 / 0.540337 (-0.069916)	0.647266 / 1.386936 (-0.739670)

HuggingFaceDocBuilderDev · 2023-09-04T06:20:15Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-09-04T06:20:56Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005804 / 0.011353 (-0.005549)	0.003519 / 0.011008 (-0.007489)	0.080003 / 0.038508 (0.041495)	0.055419 / 0.023109 (0.032309)	0.395254 / 0.275898 (0.119356)	0.432714 / 0.323480 (0.109234)	0.004438 / 0.007986 (-0.003548)	0.002832 / 0.004328 (-0.001496)	0.062026 / 0.004250 (0.057775)	0.044334 / 0.037052 (0.007282)	0.401278 / 0.258489 (0.142789)	0.451516 / 0.293841 (0.157675)	0.026791 / 0.128546 (-0.101755)	0.007946 / 0.075646 (-0.067700)	0.265166 / 0.419271 (-0.154106)	0.044119 / 0.043533 (0.000586)	0.399621 / 0.255139 (0.144482)	0.422808 / 0.283200 (0.139609)	0.019998 / 0.141683 (-0.121685)	1.433559 / 1.452155 (-0.018596)	1.596902 / 1.492716 (0.104186)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.195662 / 0.018006 (0.177656)	0.423167 / 0.000490 (0.422677)	0.003426 / 0.000200 (0.003227)	0.000066 / 0.000054 (0.000012)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.023318 / 0.037411 (-0.014094)	0.072532 / 0.014526 (0.058006)	0.082181 / 0.176557 (-0.094375)	0.142214 / 0.737135 (-0.594921)	0.083423 / 0.296338 (-0.212915)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.402270 / 0.215209 (0.187061)	4.027607 / 2.077655 (1.949953)	2.059803 / 1.504120 (0.555684)	1.865115 / 1.541195 (0.323920)	1.934976 / 1.468490 (0.466485)	0.502145 / 4.584777 (-4.082632)	2.970865 / 3.745712 (-0.774847)	2.784155 / 5.269862 (-2.485707)	1.822003 / 4.565676 (-2.743673)	0.057699 / 0.424275 (-0.366576)	0.006668 / 0.007607 (-0.000939)	0.471164 / 0.226044 (0.245120)	4.733079 / 2.268929 (2.464150)	2.445119 / 55.444624 (-52.999505)	2.132956 / 6.876477 (-4.743521)	2.335998 / 2.142072 (0.193926)	0.594881 / 4.805227 (-4.210347)	0.125801 / 6.500664 (-6.374863)	0.060780 / 0.075469 (-0.014689)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.233170 / 1.841788 (-0.608618)	17.942205 / 8.074308 (9.867897)	13.587020 / 10.191392 (3.395628)	0.142110 / 0.680424 (-0.538314)	0.016600 / 0.534201 (-0.517601)	0.328659 / 0.579283 (-0.250624)	0.347759 / 0.434364 (-0.086605)	0.378651 / 0.540337 (-0.161687)	0.523474 / 1.386936 (-0.863462)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006028 / 0.011353 (-0.005325)	0.003552 / 0.011008 (-0.007456)	0.062175 / 0.038508 (0.023667)	0.057602 / 0.023109 (0.034493)	0.444585 / 0.275898 (0.168687)	0.471238 / 0.323480 (0.147758)	0.004562 / 0.007986 (-0.003423)	0.002871 / 0.004328 (-0.001457)	0.063101 / 0.004250 (0.058851)	0.046072 / 0.037052 (0.009020)	0.448253 / 0.258489 (0.189764)	0.478734 / 0.293841 (0.184893)	0.028463 / 0.128546 (-0.100084)	0.008090 / 0.075646 (-0.067557)	0.068142 / 0.419271 (-0.351130)	0.040517 / 0.043533 (-0.003016)	0.447145 / 0.255139 (0.192006)	0.469472 / 0.283200 (0.186273)	0.019391 / 0.141683 (-0.122291)	1.471195 / 1.452155 (0.019040)	1.532966 / 1.492716 (0.040249)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.259894 / 0.018006 (0.241888)	0.412987 / 0.000490 (0.412497)	0.020780 / 0.000200 (0.020580)	0.000084 / 0.000054 (0.000030)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026352 / 0.037411 (-0.011060)	0.080024 / 0.014526 (0.065498)	0.088041 / 0.176557 (-0.088516)	0.142987 / 0.737135 (-0.594148)	0.090108 / 0.296338 (-0.206231)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.458874 / 0.215209 (0.243665)	4.573005 / 2.077655 (2.495351)	2.507885 / 1.504120 (1.003765)	2.335432 / 1.541195 (0.794238)	2.379617 / 1.468490 (0.911126)	0.503331 / 4.584777 (-4.081446)	3.078284 / 3.745712 (-0.667428)	2.750580 / 5.269862 (-2.519282)	1.828100 / 4.565676 (-2.737577)	0.057572 / 0.424275 (-0.366703)	0.006553 / 0.007607 (-0.001054)	0.532283 / 0.226044 (0.306239)	5.310584 / 2.268929 (3.041656)	2.943559 / 55.444624 (-52.501065)	2.587544 / 6.876477 (-4.288932)	2.718261 / 2.142072 (0.576188)	0.590267 / 4.805227 (-4.214961)	0.123229 / 6.500664 (-6.377435)	0.060219 / 0.075469 (-0.015250)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.340773 / 1.841788 (-0.501014)	18.420766 / 8.074308 (10.346458)	14.630550 / 10.191392 (4.439158)	0.146666 / 0.680424 (-0.533758)	0.017905 / 0.534201 (-0.516296)	0.332483 / 0.579283 (-0.246801)	0.355490 / 0.434364 (-0.078874)	0.382618 / 0.540337 (-0.157720)	0.531336 / 1.386936 (-0.855600)

albertvillanova · 2023-09-04T07:33:52Z

There were CI errors unrelated to this PR.

github-actions · 2023-09-04T07:42:20Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008702 / 0.011353 (-0.002651)	0.005060 / 0.011008 (-0.005948)	0.097017 / 0.038508 (0.058509)	0.073740 / 0.023109 (0.050631)	0.435138 / 0.275898 (0.159240)	0.512776 / 0.323480 (0.189296)	0.006186 / 0.007986 (-0.001800)	0.003970 / 0.004328 (-0.000358)	0.089523 / 0.004250 (0.085273)	0.054441 / 0.037052 (0.017389)	0.447415 / 0.258489 (0.188926)	0.464851 / 0.293841 (0.171010)	0.050264 / 0.128546 (-0.078283)	0.016643 / 0.075646 (-0.059004)	0.350565 / 0.419271 (-0.068707)	0.071220 / 0.043533 (0.027687)	0.432531 / 0.255139 (0.177392)	0.472994 / 0.283200 (0.189795)	0.040229 / 0.141683 (-0.101454)	1.743431 / 1.452155 (0.291276)	1.778653 / 1.492716 (0.285936)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.261992 / 0.018006 (0.243986)	0.571979 / 0.000490 (0.571489)	0.006270 / 0.000200 (0.006071)	0.000109 / 0.000054 (0.000054)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027821 / 0.037411 (-0.009590)	0.081874 / 0.014526 (0.067348)	0.103725 / 0.176557 (-0.072831)	0.170593 / 0.737135 (-0.566542)	0.108749 / 0.296338 (-0.187590)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.690774 / 0.215209 (0.475565)	6.770902 / 2.077655 (4.693247)	2.887218 / 1.504120 (1.383098)	2.456226 / 1.541195 (0.915032)	2.509422 / 1.468490 (1.040932)	0.768451 / 4.584777 (-3.816326)	4.988933 / 3.745712 (1.243221)	4.151460 / 5.269862 (-1.118402)	2.640472 / 4.565676 (-1.925205)	0.093522 / 0.424275 (-0.330753)	0.008614 / 0.007607 (0.001007)	0.696281 / 0.226044 (0.470237)	6.721077 / 2.268929 (4.452149)	3.229760 / 55.444624 (-52.214864)	2.668521 / 6.876477 (-4.207956)	2.866420 / 2.142072 (0.724347)	0.945328 / 4.805227 (-3.859899)	0.197645 / 6.500664 (-6.303019)	0.074442 / 0.075469 (-0.001027)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.630468 / 1.841788 (-0.211320)	22.991661 / 8.074308 (14.917353)	19.816919 / 10.191392 (9.625527)	0.257410 / 0.680424 (-0.423014)	0.027228 / 0.534201 (-0.506973)	0.444515 / 0.579283 (-0.134768)	0.597067 / 0.434364 (0.162703)	0.528151 / 0.540337 (-0.012186)	0.771276 / 1.386936 (-0.615660)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009154 / 0.011353 (-0.002199)	0.004648 / 0.011008 (-0.006360)	0.073054 / 0.038508 (0.034546)	0.077146 / 0.023109 (0.054037)	0.481659 / 0.275898 (0.205761)	0.516985 / 0.323480 (0.193505)	0.007447 / 0.007986 (-0.000538)	0.003890 / 0.004328 (-0.000438)	0.078701 / 0.004250 (0.074450)	0.059183 / 0.037052 (0.022131)	0.475350 / 0.258489 (0.216861)	0.547834 / 0.293841 (0.253993)	0.058440 / 0.128546 (-0.070106)	0.013563 / 0.075646 (-0.062083)	0.084320 / 0.419271 (-0.334951)	0.065965 / 0.043533 (0.022433)	0.483541 / 0.255139 (0.228402)	0.513940 / 0.283200 (0.230740)	0.042889 / 0.141683 (-0.098794)	1.676050 / 1.452155 (0.223895)	1.759206 / 1.492716 (0.266489)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.274848 / 0.018006 (0.256841)	0.588965 / 0.000490 (0.588475)	0.006312 / 0.000200 (0.006112)	0.000120 / 0.000054 (0.000065)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033871 / 0.037411 (-0.003540)	0.104013 / 0.014526 (0.089487)	0.118457 / 0.176557 (-0.058099)	0.178268 / 0.737135 (-0.558868)	0.116972 / 0.296338 (-0.179366)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.609952 / 0.215209 (0.394743)	5.788754 / 2.077655 (3.711100)	2.812166 / 1.504120 (1.308046)	2.362861 / 1.541195 (0.821666)	2.641295 / 1.468490 (1.172804)	0.767601 / 4.584777 (-3.817176)	5.027439 / 3.745712 (1.281727)	4.612511 / 5.269862 (-0.657351)	2.654364 / 4.565676 (-1.911312)	0.103100 / 0.424275 (-0.321175)	0.012233 / 0.007607 (0.004626)	0.749283 / 0.226044 (0.523238)	7.511093 / 2.268929 (5.242165)	3.585867 / 55.444624 (-51.858757)	3.255110 / 6.876477 (-3.621366)	3.260174 / 2.142072 (1.118102)	0.958422 / 4.805227 (-3.846806)	0.209096 / 6.500664 (-6.291568)	0.075014 / 0.075469 (-0.000455)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.728283 / 1.841788 (-0.113504)	25.411147 / 8.074308 (17.336839)	21.335202 / 10.191392 (11.143810)	0.199090 / 0.680424 (-0.481334)	0.031288 / 0.534201 (-0.502913)	0.449226 / 0.579283 (-0.130057)	0.555570 / 0.434364 (0.121206)	0.570297 / 0.540337 (0.029960)	0.758673 / 1.386936 (-0.628263)

lhoestq

Good catch!

Yes, a patch release would be welcome.

github-actions · 2023-09-04T09:22:19Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006862 / 0.011353 (-0.004491)	0.003959 / 0.011008 (-0.007049)	0.087219 / 0.038508 (0.048711)	0.078335 / 0.023109 (0.055226)	0.319019 / 0.275898 (0.043121)	0.342871 / 0.323480 (0.019391)	0.004065 / 0.007986 (-0.003921)	0.004346 / 0.004328 (0.000017)	0.065243 / 0.004250 (0.060993)	0.056698 / 0.037052 (0.019646)	0.326906 / 0.258489 (0.068417)	0.354323 / 0.293841 (0.060482)	0.031252 / 0.128546 (-0.097295)	0.008587 / 0.075646 (-0.067060)	0.300323 / 0.419271 (-0.118948)	0.052810 / 0.043533 (0.009277)	0.323866 / 0.255139 (0.068727)	0.346011 / 0.283200 (0.062811)	0.025584 / 0.141683 (-0.116099)	1.464475 / 1.452155 (0.012320)	1.530868 / 1.492716 (0.038152)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.208927 / 0.018006 (0.190921)	0.454147 / 0.000490 (0.453657)	0.003945 / 0.000200 (0.003746)	0.000081 / 0.000054 (0.000026)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029901 / 0.037411 (-0.007511)	0.088889 / 0.014526 (0.074363)	0.098181 / 0.176557 (-0.078375)	0.156787 / 0.737135 (-0.580349)	0.099015 / 0.296338 (-0.197324)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.384981 / 0.215209 (0.169772)	3.831040 / 2.077655 (1.753386)	1.858312 / 1.504120 (0.354192)	1.686846 / 1.541195 (0.145651)	1.771509 / 1.468490 (0.303019)	0.485618 / 4.584777 (-4.099159)	3.430961 / 3.745712 (-0.314751)	3.264489 / 5.269862 (-2.005372)	2.040125 / 4.565676 (-2.525551)	0.057218 / 0.424275 (-0.367057)	0.007640 / 0.007607 (0.000033)	0.468072 / 0.226044 (0.242027)	4.677214 / 2.268929 (2.408286)	2.348425 / 55.444624 (-53.096199)	1.994352 / 6.876477 (-4.882125)	2.217020 / 2.142072 (0.074948)	0.587467 / 4.805227 (-4.217760)	0.133550 / 6.500664 (-6.367114)	0.060571 / 0.075469 (-0.014898)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.271003 / 1.841788 (-0.570785)	19.986365 / 8.074308 (11.912057)	14.574046 / 10.191392 (4.382654)	0.146212 / 0.680424 (-0.534212)	0.018320 / 0.534201 (-0.515881)	0.394524 / 0.579283 (-0.184759)	0.399707 / 0.434364 (-0.034657)	0.458965 / 0.540337 (-0.081372)	0.619940 / 1.386936 (-0.766996)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006982 / 0.011353 (-0.004371)	0.004061 / 0.011008 (-0.006947)	0.064520 / 0.038508 (0.026012)	0.076828 / 0.023109 (0.053719)	0.402989 / 0.275898 (0.127090)	0.439697 / 0.323480 (0.116217)	0.005511 / 0.007986 (-0.002475)	0.003378 / 0.004328 (-0.000950)	0.064727 / 0.004250 (0.060477)	0.058114 / 0.037052 (0.021062)	0.402054 / 0.258489 (0.143565)	0.442377 / 0.293841 (0.148536)	0.032808 / 0.128546 (-0.095738)	0.008604 / 0.075646 (-0.067043)	0.070994 / 0.419271 (-0.348278)	0.048738 / 0.043533 (0.005205)	0.399786 / 0.255139 (0.144647)	0.423537 / 0.283200 (0.140338)	0.022397 / 0.141683 (-0.119286)	1.504613 / 1.452155 (0.052458)	1.571064 / 1.492716 (0.078348)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.226876 / 0.018006 (0.208870)	0.451477 / 0.000490 (0.450987)	0.004511 / 0.000200 (0.004311)	0.000095 / 0.000054 (0.000041)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.032998 / 0.037411 (-0.004413)	0.095843 / 0.014526 (0.081317)	0.105684 / 0.176557 (-0.070873)	0.158175 / 0.737135 (-0.578960)	0.107297 / 0.296338 (-0.189041)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.434912 / 0.215209 (0.219703)	4.326394 / 2.077655 (2.248740)	2.287310 / 1.504120 (0.783190)	2.127987 / 1.541195 (0.586793)	2.202485 / 1.468490 (0.733995)	0.494305 / 4.584777 (-4.090472)	3.575176 / 3.745712 (-0.170536)	3.354358 / 5.269862 (-1.915504)	2.074293 / 4.565676 (-2.491383)	0.058967 / 0.424275 (-0.365308)	0.007712 / 0.007607 (0.000105)	0.513734 / 0.226044 (0.287690)	5.107538 / 2.268929 (2.838610)	2.776190 / 55.444624 (-52.668434)	2.425051 / 6.876477 (-4.451426)	2.666715 / 2.142072 (0.524643)	0.598844 / 4.805227 (-4.206383)	0.134186 / 6.500664 (-6.366478)	0.062403 / 0.075469 (-0.013066)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.346730 / 1.841788 (-0.495058)	20.533190 / 8.074308 (12.458882)	15.174443 / 10.191392 (4.983051)	0.167204 / 0.680424 (-0.513219)	0.020619 / 0.534201 (-0.513582)	0.399033 / 0.579283 (-0.180250)	0.394428 / 0.434364 (-0.039936)	0.468792 / 0.540337 (-0.071545)	0.640122 / 1.386936 (-0.746814)

* Rename zip_csv_path fixture dirname and filename
* Test load no-script dataset with ZIP file
* Fix style
* Avoid filtering out .zip extension
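
For readers unfamiliar with the change, here is a minimal, standard-library-only sketch of the idea behind these commits: an extension filter over data files that keeps `.zip` archives, plus a reader for CSV members inside such an archive. It is an illustration under stated assumptions, not the code touched by this PR: `DATA_EXTENSIONS`, `keep_data_files`, `read_csv_rows_from_zip`, and `make_zip_with_csv` are hypothetical names, and the allow-list is not the one used by the `datasets` library.

```python
import csv
import io
import zipfile
from pathlib import PurePosixPath

# Hypothetical allow-list for illustration only; NOT the list used by `datasets`.
# The point of the PR is that ".zip" must not be dropped by a filter like this.
DATA_EXTENSIONS = {".csv", ".json", ".parquet", ".zip"}


def keep_data_files(paths):
    """Keep only the paths whose final extension is in the allow-list."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in DATA_EXTENSIONS]


def read_csv_rows_from_zip(zip_bytes):
    """Read every CSV member of an in-memory ZIP archive into a list of dicts."""
    rows = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".csv"):
                with archive.open(name) as member:
                    # Wrap the binary member so the csv module can read text.
                    text = io.TextIOWrapper(member, encoding="utf-8", newline="")
                    rows.extend(csv.DictReader(text))
    return rows


def make_zip_with_csv():
    """Build a tiny ZIP archive in memory that contains a single CSV file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("data.csv", "col_1,col_2\n0,a\n1,b\n")
    return buffer.getvalue()
```

With this sketch, `keep_data_files(["README.md", "train.zip"])` retains `train.zip`, and `read_csv_rows_from_zip(make_zip_with_csv())` yields the two CSV rows as dictionaries of strings.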

albertvillanova added 3 commits September 4, 2023 07:57

Rename zip_csv_path fixture dirname and filename

c6d1f44

Test load no-script dataset with ZIP file

eb001b4

Fix style

d438617

Avoid filtering out .zip extension

23c5c36

albertvillanova changed the title ~~Do not filter .zip extensions~~ Do not filter out .zip extensions from no-script datasets Sep 4, 2023

Merge remote-tracking branch 'upstream/main' into fix-6207

fa696b4

lhoestq approved these changes Sep 4, 2023

albertvillanova merged commit 2c4c2b5 into main Sep 4, 2023

albertvillanova deleted the fix-6207 branch September 4, 2023 09:13

severo mentioned this pull request Sep 6, 2023

update datasets to 2.14.5 huggingface/dataset-viewer#1781

Closed
