Use `filelock` package for file locking #6445

mariosasko · 2023-11-22T19:04:45Z

Use the `filelock` package instead of `datasets.utils.filelock` for file locking, both to be consistent with `huggingface_hub` and so that we are not responsible for improving the filelock capabilities ourselves 🙂.

(Reverts #859, but these INFO logs are not printed by default (anymore?), so this should be okay)
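For readers unfamiliar with the package, here is a minimal sketch of how `filelock` is typically used to guard a shared file. It is an illustration only, not code from this PR: the helper `append_line`, the file name `shared.txt`, and the temporary directory are hypothetical. The last part assumes the package logs through a standard logger named `filelock`, which can be quieted with the standard `logging` module if the lock messages mentioned above ever become noisy.

```python
import logging
import os
import tempfile

from filelock import FileLock  # third-party: pip install filelock


def append_line(path: str, line: str, timeout: float = 10.0) -> None:
    """Append `line` to `path` while holding an exclusive lock on `path + '.lock'`.

    Hypothetical helper for illustration; not part of `datasets`.
    """
    lock = FileLock(path + ".lock", timeout=timeout)
    # The lock is acquired on entering the block and released on exit,
    # even if the body raises.
    with lock:
        with open(path, "a", encoding="utf-8") as f:
            f.write(line + "\n")


# Assumption: the package's messages go through a logger named "filelock".
logging.getLogger("filelock").setLevel(logging.WARNING)

cache_dir = tempfile.mkdtemp()
target = os.path.join(cache_dir, "shared.txt")
append_line(target, "first")
append_line(target, "second")

with open(target, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

If another process holds the lock for longer than `timeout` seconds, `FileLock` raises `filelock.Timeout`, so callers that share a cache directory may want to catch it.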

github-actions · 2023-11-22T19:06:16Z

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005431 / 0.011353 (-0.005922)	0.003255 / 0.011008 (-0.007753)	0.062867 / 0.038508 (0.024359)	0.051917 / 0.023109 (0.028808)	0.254229 / 0.275898 (-0.021669)	0.276949 / 0.323480 (-0.046531)	0.002868 / 0.007986 (-0.005117)	0.002539 / 0.004328 (-0.001789)	0.048366 / 0.004250 (0.044115)	0.038497 / 0.037052 (0.001445)	0.252158 / 0.258489 (-0.006332)	0.288868 / 0.293841 (-0.004973)	0.027956 / 0.128546 (-0.100591)	0.010500 / 0.075646 (-0.065147)	0.209263 / 0.419271 (-0.210008)	0.035415 / 0.043533 (-0.008118)	0.253104 / 0.255139 (-0.002035)	0.274646 / 0.283200 (-0.008554)	0.019923 / 0.141683 (-0.121760)	1.081870 / 1.452155 (-0.370285)	1.157159 / 1.492716 (-0.335557)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.097420 / 0.018006 (0.079414)	0.315021 / 0.000490 (0.314531)	0.000218 / 0.000200 (0.000018)	0.000049 / 0.000054 (-0.000005)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018826 / 0.037411 (-0.018585)	0.061921 / 0.014526 (0.047395)	0.086825 / 0.176557 (-0.089731)	0.120606 / 0.737135 (-0.616529)	0.074344 / 0.296338 (-0.221994)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.283238 / 0.215209 (0.068028)	2.771817 / 2.077655 (0.694162)	1.500194 / 1.504120 (-0.003926)	1.379286 / 1.541195 (-0.161908)	1.447747 / 1.468490 (-0.020743)	0.587176 / 4.584777 (-3.997601)	2.411260 / 3.745712 (-1.334452)	2.897682 / 5.269862 (-2.372180)	1.821720 / 4.565676 (-2.743957)	0.063299 / 0.424275 (-0.360976)	0.004969 / 0.007607 (-0.002639)	0.346417 / 0.226044 (0.120373)	3.432936 / 2.268929 (1.164007)	1.898662 / 55.444624 (-53.545963)	1.624339 / 6.876477 (-5.252138)	1.641653 / 2.142072 (-0.500419)	0.655773 / 4.805227 (-4.149454)	0.118588 / 6.500664 (-6.382076)	0.043919 / 0.075469 (-0.031551)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.949466 / 1.841788 (-0.892322)	12.378025 / 8.074308 (4.303717)	10.750942 / 10.191392 (0.559550)	0.146575 / 0.680424 (-0.533849)	0.015453 / 0.534201 (-0.518748)	0.290608 / 0.579283 (-0.288676)	0.273000 / 0.434364 (-0.161364)	0.328019 / 0.540337 (-0.212318)	0.417396 / 1.386936 (-0.969540)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005363 / 0.011353 (-0.005990)	0.003421 / 0.011008 (-0.007587)	0.049429 / 0.038508 (0.010920)	0.052774 / 0.023109 (0.029664)	0.274058 / 0.275898 (-0.001840)	0.297307 / 0.323480 (-0.026173)	0.004000 / 0.007986 (-0.003986)	0.002463 / 0.004328 (-0.001866)	0.048824 / 0.004250 (0.044574)	0.041064 / 0.037052 (0.004012)	0.279066 / 0.258489 (0.020577)	0.302420 / 0.293841 (0.008579)	0.029665 / 0.128546 (-0.098881)	0.010628 / 0.075646 (-0.065018)	0.057678 / 0.419271 (-0.361594)	0.032731 / 0.043533 (-0.010802)	0.274662 / 0.255139 (0.019523)	0.291878 / 0.283200 (0.008678)	0.018820 / 0.141683 (-0.122863)	1.124042 / 1.452155 (-0.328112)	1.175020 / 1.492716 (-0.317697)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.099419 / 0.018006 (0.081413)	0.311511 / 0.000490 (0.311022)	0.000228 / 0.000200 (0.000028)	0.000051 / 0.000054 (-0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022478 / 0.037411 (-0.014933)	0.071955 / 0.014526 (0.057429)	0.081423 / 0.176557 (-0.095134)	0.119574 / 0.737135 (-0.617561)	0.084724 / 0.296338 (-0.211615)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.295537 / 0.215209 (0.080328)	2.893855 / 2.077655 (0.816201)	1.602065 / 1.504120 (0.097945)	1.478193 / 1.541195 (-0.063002)	1.508250 / 1.468490 (0.039760)	0.566140 / 4.584777 (-4.018637)	2.455474 / 3.745712 (-1.290238)	2.849525 / 5.269862 (-2.420337)	1.763830 / 4.565676 (-2.801846)	0.062375 / 0.424275 (-0.361900)	0.004992 / 0.007607 (-0.002615)	0.346068 / 0.226044 (0.120023)	3.452421 / 2.268929 (1.183492)	1.970346 / 55.444624 (-53.474278)	1.690865 / 6.876477 (-5.185612)	1.705358 / 2.142072 (-0.436714)	0.644261 / 4.805227 (-4.160967)	0.120596 / 6.500664 (-6.380068)	0.042699 / 0.075469 (-0.032770)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.980506 / 1.841788 (-0.861281)	12.401901 / 8.074308 (4.327593)	11.169413 / 10.191392 (0.978021)	0.142540 / 0.680424 (-0.537884)	0.015730 / 0.534201 (-0.518471)	0.288871 / 0.579283 (-0.290412)	0.287487 / 0.434364 (-0.146877)	0.325133 / 0.540337 (-0.215204)	0.417979 / 1.386936 (-0.968957)

HuggingFaceDocBuilderDev · 2023-11-22T19:09:32Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-11-22T20:24:14Z

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005062 / 0.011353 (-0.006291)	0.003024 / 0.011008 (-0.007984)	0.061801 / 0.038508 (0.023293)	0.048934 / 0.023109 (0.025825)	0.248024 / 0.275898 (-0.027874)	0.265665 / 0.323480 (-0.057815)	0.003885 / 0.007986 (-0.004100)	0.002371 / 0.004328 (-0.001957)	0.047895 / 0.004250 (0.043644)	0.039015 / 0.037052 (0.001963)	0.252320 / 0.258489 (-0.006169)	0.286533 / 0.293841 (-0.007308)	0.027694 / 0.128546 (-0.100852)	0.010254 / 0.075646 (-0.065392)	0.206586 / 0.419271 (-0.212685)	0.035681 / 0.043533 (-0.007852)	0.251645 / 0.255139 (-0.003494)	0.285462 / 0.283200 (0.002262)	0.017326 / 0.141683 (-0.124357)	1.086927 / 1.452155 (-0.365228)	1.153172 / 1.492716 (-0.339545)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093020 / 0.018006 (0.075014)	0.300018 / 0.000490 (0.299528)	0.000208 / 0.000200 (0.000008)	0.000047 / 0.000054 (-0.000008)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018828 / 0.037411 (-0.018584)	0.062569 / 0.014526 (0.048043)	0.074130 / 0.176557 (-0.102427)	0.119304 / 0.737135 (-0.617832)	0.076409 / 0.296338 (-0.219930)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.285938 / 0.215209 (0.070729)	2.780662 / 2.077655 (0.703007)	1.522401 / 1.504120 (0.018281)	1.392475 / 1.541195 (-0.148720)	1.412517 / 1.468490 (-0.055973)	0.562768 / 4.584777 (-4.022009)	2.421406 / 3.745712 (-1.324306)	2.786271 / 5.269862 (-2.483591)	1.737193 / 4.565676 (-2.828484)	0.062775 / 0.424275 (-0.361500)	0.004908 / 0.007607 (-0.002699)	0.345070 / 0.226044 (0.119026)	3.383700 / 2.268929 (1.114771)	1.795974 / 55.444624 (-53.648651)	1.527656 / 6.876477 (-5.348820)	1.514035 / 2.142072 (-0.628037)	0.647652 / 4.805227 (-4.157575)	0.120121 / 6.500664 (-6.380543)	0.042259 / 0.075469 (-0.033210)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.948951 / 1.841788 (-0.892837)	11.514971 / 8.074308 (3.440663)	10.722668 / 10.191392 (0.531276)	0.143034 / 0.680424 (-0.537390)	0.014800 / 0.534201 (-0.519401)	0.286189 / 0.579283 (-0.293094)	0.270735 / 0.434364 (-0.163629)	0.323907 / 0.540337 (-0.216430)	0.417569 / 1.386936 (-0.969367)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005670 / 0.011353 (-0.005683)	0.003238 / 0.011008 (-0.007770)	0.048520 / 0.038508 (0.010012)	0.051341 / 0.023109 (0.028232)	0.273883 / 0.275898 (-0.002015)	0.295165 / 0.323480 (-0.028315)	0.004755 / 0.007986 (-0.003231)	0.002471 / 0.004328 (-0.001857)	0.047487 / 0.004250 (0.043237)	0.040225 / 0.037052 (0.003172)	0.276758 / 0.258489 (0.018269)	0.301182 / 0.293841 (0.007341)	0.029749 / 0.128546 (-0.098797)	0.010340 / 0.075646 (-0.065306)	0.057193 / 0.419271 (-0.362079)	0.033067 / 0.043533 (-0.010466)	0.272716 / 0.255139 (0.017577)	0.292301 / 0.283200 (0.009101)	0.019075 / 0.141683 (-0.122608)	1.101778 / 1.452155 (-0.350376)	1.173573 / 1.492716 (-0.319143)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.091008 / 0.018006 (0.073002)	0.300749 / 0.000490 (0.300259)	0.000218 / 0.000200 (0.000018)	0.000052 / 0.000054 (-0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021760 / 0.037411 (-0.015651)	0.071407 / 0.014526 (0.056881)	0.081151 / 0.176557 (-0.095406)	0.120140 / 0.737135 (-0.616995)	0.082408 / 0.296338 (-0.213931)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.294828 / 0.215209 (0.079619)	2.880701 / 2.077655 (0.803047)	1.604187 / 1.504120 (0.100068)	1.479236 / 1.541195 (-0.061959)	1.498875 / 1.468490 (0.030385)	0.561950 / 4.584777 (-4.022827)	2.462531 / 3.745712 (-1.283181)	2.800905 / 5.269862 (-2.468957)	1.746535 / 4.565676 (-2.819141)	0.062732 / 0.424275 (-0.361544)	0.004932 / 0.007607 (-0.002675)	0.347125 / 0.226044 (0.121081)	3.431343 / 2.268929 (1.162415)	1.964999 / 55.444624 (-53.479625)	1.669709 / 6.876477 (-5.206768)	1.675148 / 2.142072 (-0.466924)	0.635436 / 4.805227 (-4.169792)	0.116598 / 6.500664 (-6.384066)	0.041447 / 0.075469 (-0.034022)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.975751 / 1.841788 (-0.866037)	12.060246 / 8.074308 (3.985938)	10.871641 / 10.191392 (0.680249)	0.142936 / 0.680424 (-0.537488)	0.015779 / 0.534201 (-0.518422)	0.287120 / 0.579283 (-0.292163)	0.283963 / 0.434364 (-0.150401)	0.341231 / 0.540337 (-0.199107)	0.419518 / 1.386936 (-0.967418)

lhoestq

Thanks! Always great to remove code ^^

github-actions · 2023-11-23T18:47:30Z

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005105 / 0.011353 (-0.006248)	0.002855 / 0.011008 (-0.008153)	0.062044 / 0.038508 (0.023536)	0.052948 / 0.023109 (0.029839)	0.249841 / 0.275898 (-0.026057)	0.276687 / 0.323480 (-0.046792)	0.003792 / 0.007986 (-0.004194)	0.002385 / 0.004328 (-0.001943)	0.048648 / 0.004250 (0.044398)	0.038317 / 0.037052 (0.001264)	0.255235 / 0.258489 (-0.003254)	0.287870 / 0.293841 (-0.005971)	0.027429 / 0.128546 (-0.101117)	0.010182 / 0.075646 (-0.065464)	0.206980 / 0.419271 (-0.212291)	0.035444 / 0.043533 (-0.008089)	0.255073 / 0.255139 (-0.000066)	0.270636 / 0.283200 (-0.012563)	0.018003 / 0.141683 (-0.123680)	1.124691 / 1.452155 (-0.327463)	1.191872 / 1.492716 (-0.300844)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.088824 / 0.018006 (0.070818)	0.302771 / 0.000490 (0.302281)	0.000210 / 0.000200 (0.000010)	0.000048 / 0.000054 (-0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018102 / 0.037411 (-0.019310)	0.062131 / 0.014526 (0.047605)	0.073230 / 0.176557 (-0.103327)	0.119789 / 0.737135 (-0.617346)	0.074804 / 0.296338 (-0.221534)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.293244 / 0.215209 (0.078035)	2.891401 / 2.077655 (0.813746)	1.504481 / 1.504120 (0.000361)	1.381251 / 1.541195 (-0.159944)	1.387245 / 1.468490 (-0.081245)	0.552732 / 4.584777 (-4.032045)	2.386439 / 3.745712 (-1.359273)	2.718918 / 5.269862 (-2.550944)	1.725401 / 4.565676 (-2.840275)	0.061946 / 0.424275 (-0.362329)	0.004957 / 0.007607 (-0.002650)	0.342776 / 0.226044 (0.116731)	3.418911 / 2.268929 (1.149983)	1.838283 / 55.444624 (-53.606341)	1.538013 / 6.876477 (-5.338464)	1.545144 / 2.142072 (-0.596928)	0.637857 / 4.805227 (-4.167370)	0.116451 / 6.500664 (-6.384213)	0.042228 / 0.075469 (-0.033241)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.943575 / 1.841788 (-0.898212)	11.492939 / 8.074308 (3.418631)	10.601605 / 10.191392 (0.410212)	0.139084 / 0.680424 (-0.541340)	0.013691 / 0.534201 (-0.520510)	0.286696 / 0.579283 (-0.292587)	0.259979 / 0.434364 (-0.174385)	0.322578 / 0.540337 (-0.217759)	0.411950 / 1.386936 (-0.974986)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005168 / 0.011353 (-0.006185)	0.003238 / 0.011008 (-0.007770)	0.049028 / 0.038508 (0.010520)	0.052930 / 0.023109 (0.029821)	0.274750 / 0.275898 (-0.001148)	0.294023 / 0.323480 (-0.029457)	0.003829 / 0.007986 (-0.004157)	0.002372 / 0.004328 (-0.001956)	0.048689 / 0.004250 (0.044439)	0.040056 / 0.037052 (0.003003)	0.280147 / 0.258489 (0.021658)	0.304871 / 0.293841 (0.011030)	0.028734 / 0.128546 (-0.099812)	0.010624 / 0.075646 (-0.065022)	0.058705 / 0.419271 (-0.360566)	0.032140 / 0.043533 (-0.011393)	0.276702 / 0.255139 (0.021563)	0.293186 / 0.283200 (0.009987)	0.018124 / 0.141683 (-0.123559)	1.139398 / 1.452155 (-0.312757)	1.174862 / 1.492716 (-0.317855)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.087627 / 0.018006 (0.069620)	0.298376 / 0.000490 (0.297886)	0.000238 / 0.000200 (0.000038)	0.000052 / 0.000054 (-0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021344 / 0.037411 (-0.016067)	0.070208 / 0.014526 (0.055682)	0.081177 / 0.176557 (-0.095380)	0.120170 / 0.737135 (-0.616965)	0.082472 / 0.296338 (-0.213866)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.293227 / 0.215209 (0.078018)	2.844619 / 2.077655 (0.766964)	1.586922 / 1.504120 (0.082803)	1.460256 / 1.541195 (-0.080938)	1.475955 / 1.468490 (0.007465)	0.553226 / 4.584777 (-4.031551)	2.418869 / 3.745712 (-1.326843)	2.709256 / 5.269862 (-2.560606)	1.705935 / 4.565676 (-2.859741)	0.062391 / 0.424275 (-0.361884)	0.004929 / 0.007607 (-0.002678)	0.350358 / 0.226044 (0.124313)	3.448824 / 2.268929 (1.179896)	1.929451 / 55.444624 (-53.515174)	1.669438 / 6.876477 (-5.207038)	1.660923 / 2.142072 (-0.481150)	0.633107 / 4.805227 (-4.172120)	0.114657 / 6.500664 (-6.386007)	0.041256 / 0.075469 (-0.034214)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.968408 / 1.841788 (-0.873380)	11.749754 / 8.074308 (3.675446)	10.796670 / 10.191392 (0.605278)	0.128881 / 0.680424 (-0.551543)	0.015326 / 0.534201 (-0.518875)	0.286407 / 0.579283 (-0.292876)	0.276324 / 0.434364 (-0.158040)	0.326201 / 0.540337 (-0.214136)	0.419854 / 1.386936 (-0.967082)

mariosasko added 4 commits November 22, 2023 19:07

Use filelock package for file locking

be0a9f6

Fixes

15b082a

Fix test

c94059d

Update setup.py

9656858

Fix metric tests

0943ff0

mariosasko requested a review from lhoestq November 22, 2023 23:05

lhoestq approved these changes Nov 23, 2023

mariosasko merged commit 1731d5a into main Nov 23, 2023

mariosasko deleted the filelock-package branch November 23, 2023 18:41

grapefroot mentioned this pull request Dec 8, 2023

"File name too long" error for file locks #2924

Closed

minhopark-neubla mentioned this pull request Jan 15, 2024

After 2.16.0 version, there are PermissionError when users use shared cache_dir #6589

Closed
