Fix shard retry mechanism in `push_to_hub` #6461

mariosasko · 2023-11-30T14:57:14Z

When it fails, `preupload_lfs_files` raises a `RuntimeError` that chains the original HTTP error. This PR updates the retry mechanism's error handling to account for that.

Fix #6392
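For readers skimming the thread, here is a minimal sketch of the idea described above: catch the wrapping `RuntimeError`, look at the chained exception for an HTTP status code, and retry only when that code is retryable. It is not the code from this PR. `HTTPStatusError`, `chained_status_code`, `upload_with_retries`, the `(500, 503)` tuple, and the backoff values are all hypothetical names and assumptions chosen for illustration.

```python
import time


class HTTPStatusError(Exception):
    """Hypothetical stand-in for an HTTP client error that carries a status code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def chained_status_code(err):
    """Return the status code of the exception chained to `err`, or None."""
    # `raise RuntimeError(...) from http_err` stores http_err in __cause__.
    return getattr(err.__cause__, "status_code", None)


def upload_with_retries(upload, max_retries=5, retry_on=(500, 503), sleep=time.sleep):
    """Call `upload()`, retrying when a RuntimeError wraps a retryable HTTP error."""
    for attempt in range(max_retries + 1):
        try:
            return upload()
        except RuntimeError as err:
            status = chained_status_code(err)
            # Re-raise when the cause is not retryable or retries are exhausted.
            if status not in retry_on or attempt == max_retries:
                raise
            sleep(min(2 ** attempt, 8))  # capped exponential backoff (assumed values)
```

The point of the sketch is that checking the outer exception alone is not enough once the upload call wraps the HTTP error: the status code has to be read from the chained exception.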

mariosasko · 2023-11-30T15:03:45Z

@Wauplin Maybe 504 should be added to the `retry_on_status_codes` tuple here to guard against #3872?

Wauplin · 2023-11-30T15:09:45Z

We could, but I'm not sure I have ever witnessed a 504 on S3 before. The issue reported in #3872 is a 504 on the /upload endpoint on the Hub, and that endpoint is not retried on this line.
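To make the trade-off in this exchange concrete, here is a small sketch of a status-code allowlist. It assumes the retry decision is a plain membership test against a tuple. The default tuple and the `should_retry` helper are hypothetical and do not reflect the actual values used by the library.

```python
from http import HTTPStatus

# Assumed default for illustration only; the real tuple may differ.
RETRY_ON_STATUS_CODES = (HTTPStatus.SERVICE_UNAVAILABLE,)


def should_retry(status_code, retry_on_status_codes=RETRY_ON_STATUS_CODES):
    """Return True when a response with `status_code` should be retried."""
    return status_code in retry_on_status_codes


# Opting in to 504 would be a one-element change to the tuple.
WITH_GATEWAY_TIMEOUT = RETRY_ON_STATUS_CODES + (HTTPStatus.GATEWAY_TIMEOUT,)
```

Under these assumptions, `should_retry(504)` is false with the default tuple and true with `WITH_GATEWAY_TIMEOUT`. As noted above, that only helps if the endpoint returning the 504 actually goes through this retry path.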

github-actions · 2023-11-30T15:17:35Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005110 / 0.011353 (-0.006243)	0.003307 / 0.011008 (-0.007701)	0.062601 / 0.038508 (0.024093)	0.049644 / 0.023109 (0.026534)	0.243195 / 0.275898 (-0.032703)	0.273543 / 0.323480 (-0.049936)	0.003862 / 0.007986 (-0.004123)	0.002624 / 0.004328 (-0.001705)	0.048273 / 0.004250 (0.044023)	0.037820 / 0.037052 (0.000768)	0.249134 / 0.258489 (-0.009355)	0.319359 / 0.293841 (0.025518)	0.027816 / 0.128546 (-0.100730)	0.010422 / 0.075646 (-0.065225)	0.206607 / 0.419271 (-0.212665)	0.035719 / 0.043533 (-0.007814)	0.250300 / 0.255139 (-0.004839)	0.290377 / 0.283200 (0.007177)	0.018459 / 0.141683 (-0.123224)	1.114664 / 1.452155 (-0.337490)	1.171429 / 1.492716 (-0.321288)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.091483 / 0.018006 (0.073477)	0.302770 / 0.000490 (0.302281)	0.000203 / 0.000200 (0.000003)	0.000047 / 0.000054 (-0.000007)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018870 / 0.037411 (-0.018541)	0.062692 / 0.014526 (0.048166)	0.075381 / 0.176557 (-0.101176)	0.122338 / 0.737135 (-0.614797)	0.075608 / 0.296338 (-0.220730)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.288115 / 0.215209 (0.072906)	2.816183 / 2.077655 (0.738528)	1.535601 / 1.504120 (0.031481)	1.409546 / 1.541195 (-0.131648)	1.438569 / 1.468490 (-0.029921)	0.561797 / 4.584777 (-4.022980)	2.373921 / 3.745712 (-1.371791)	2.739437 / 5.269862 (-2.530424)	1.750921 / 4.565676 (-2.814755)	0.062114 / 0.424275 (-0.362161)	0.004965 / 0.007607 (-0.002642)	0.348614 / 0.226044 (0.122569)	3.519631 / 2.268929 (1.250703)	1.910797 / 55.444624 (-53.533827)	1.610541 / 6.876477 (-5.265936)	1.617972 / 2.142072 (-0.524100)	0.639421 / 4.805227 (-4.165806)	0.117371 / 6.500664 (-6.383293)	0.041851 / 0.075469 (-0.033618)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.945563 / 1.841788 (-0.896224)	11.362399 / 8.074308 (3.288090)	10.468468 / 10.191392 (0.277075)	0.128925 / 0.680424 (-0.551499)	0.013892 / 0.534201 (-0.520309)	0.285487 / 0.579283 (-0.293796)	0.269295 / 0.434364 (-0.165069)	0.324843 / 0.540337 (-0.215495)	0.438452 / 1.386936 (-0.948484)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005303 / 0.011353 (-0.006050)	0.003162 / 0.011008 (-0.007846)	0.048177 / 0.038508 (0.009669)	0.048708 / 0.023109 (0.025599)	0.271663 / 0.275898 (-0.004235)	0.289948 / 0.323480 (-0.033532)	0.003955 / 0.007986 (-0.004030)	0.002616 / 0.004328 (-0.001713)	0.047510 / 0.004250 (0.043260)	0.039938 / 0.037052 (0.002886)	0.277449 / 0.258489 (0.018960)	0.300315 / 0.293841 (0.006474)	0.029263 / 0.128546 (-0.099283)	0.010403 / 0.075646 (-0.065244)	0.056682 / 0.419271 (-0.362590)	0.032757 / 0.043533 (-0.010776)	0.273291 / 0.255139 (0.018152)	0.289023 / 0.283200 (0.005824)	0.017843 / 0.141683 (-0.123840)	1.124762 / 1.452155 (-0.327393)	1.176646 / 1.492716 (-0.316070)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.004568 / 0.018006 (-0.013438)	0.300715 / 0.000490 (0.300225)	0.000212 / 0.000200 (0.000012)	0.000049 / 0.000054 (-0.000005)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021528 / 0.037411 (-0.015883)	0.068317 / 0.014526 (0.053792)	0.081358 / 0.176557 (-0.095199)	0.119297 / 0.737135 (-0.617838)	0.082445 / 0.296338 (-0.213893)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.289681 / 0.215209 (0.074472)	2.843862 / 2.077655 (0.766208)	1.574257 / 1.504120 (0.070137)	1.454026 / 1.541195 (-0.087169)	1.478379 / 1.468490 (0.009889)	0.558259 / 4.584777 (-4.026518)	2.513261 / 3.745712 (-1.232451)	2.759751 / 5.269862 (-2.510111)	1.730335 / 4.565676 (-2.835341)	0.063805 / 0.424275 (-0.360470)	0.004991 / 0.007607 (-0.002616)	0.346586 / 0.226044 (0.120542)	3.369163 / 2.268929 (1.100234)	1.934734 / 55.444624 (-53.509890)	1.658864 / 6.876477 (-5.217613)	1.645621 / 2.142072 (-0.496452)	0.636633 / 4.805227 (-4.168594)	0.116839 / 6.500664 (-6.383825)	0.040863 / 0.075469 (-0.034606)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.960925 / 1.841788 (-0.880863)	11.769189 / 8.074308 (3.694881)	10.713662 / 10.191392 (0.522270)	0.140510 / 0.680424 (-0.539914)	0.015424 / 0.534201 (-0.518777)	0.288039 / 0.579283 (-0.291244)	0.277623 / 0.434364 (-0.156741)	0.322622 / 0.540337 (-0.217716)	0.539805 / 1.386936 (-0.847131)

src/datasets/utils/hub.py

github-actions · 2023-11-30T18:50:53Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005501 / 0.011353 (-0.005852)	0.003754 / 0.011008 (-0.007254)	0.062628 / 0.038508 (0.024120)	0.059951 / 0.023109 (0.036842)	0.254851 / 0.275898 (-0.021047)	0.272133 / 0.323480 (-0.051347)	0.003962 / 0.007986 (-0.004024)	0.002759 / 0.004328 (-0.001569)	0.048412 / 0.004250 (0.044161)	0.039349 / 0.037052 (0.002297)	0.253093 / 0.258489 (-0.005397)	0.287048 / 0.293841 (-0.006793)	0.027197 / 0.128546 (-0.101349)	0.010828 / 0.075646 (-0.064819)	0.206371 / 0.419271 (-0.212901)	0.035881 / 0.043533 (-0.007652)	0.254905 / 0.255139 (-0.000234)	0.273819 / 0.283200 (-0.009381)	0.018041 / 0.141683 (-0.123642)	1.103970 / 1.452155 (-0.348185)	1.166340 / 1.492716 (-0.326377)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093196 / 0.018006 (0.075190)	0.302690 / 0.000490 (0.302200)	0.000219 / 0.000200 (0.000019)	0.000045 / 0.000054 (-0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019552 / 0.037411 (-0.017860)	0.062337 / 0.014526 (0.047811)	0.074070 / 0.176557 (-0.102486)	0.120998 / 0.737135 (-0.616137)	0.076265 / 0.296338 (-0.220074)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.272637 / 0.215209 (0.057427)	2.693350 / 2.077655 (0.615696)	1.398020 / 1.504120 (-0.106100)	1.285706 / 1.541195 (-0.255488)	1.342810 / 1.468490 (-0.125680)	0.565378 / 4.584777 (-4.019399)	2.390131 / 3.745712 (-1.355581)	2.892137 / 5.269862 (-2.377725)	1.819840 / 4.565676 (-2.745836)	0.062789 / 0.424275 (-0.361486)	0.004920 / 0.007607 (-0.002687)	0.329281 / 0.226044 (0.103237)	3.261664 / 2.268929 (0.992735)	1.775102 / 55.444624 (-53.669523)	1.514341 / 6.876477 (-5.362136)	1.530805 / 2.142072 (-0.611267)	0.641009 / 4.805227 (-4.164218)	0.118626 / 6.500664 (-6.382038)	0.042732 / 0.075469 (-0.032737)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.933179 / 1.841788 (-0.908609)	12.085247 / 8.074308 (4.010939)	10.541596 / 10.191392 (0.350204)	0.140141 / 0.680424 (-0.540283)	0.014646 / 0.534201 (-0.519555)	0.289640 / 0.579283 (-0.289643)	0.281042 / 0.434364 (-0.153322)	0.326462 / 0.540337 (-0.213876)	0.441981 / 1.386936 (-0.944955)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005259 / 0.011353 (-0.006094)	0.003766 / 0.011008 (-0.007242)	0.048782 / 0.038508 (0.010273)	0.064946 / 0.023109 (0.041836)	0.264529 / 0.275898 (-0.011369)	0.289675 / 0.323480 (-0.033805)	0.004057 / 0.007986 (-0.003928)	0.002805 / 0.004328 (-0.001523)	0.047709 / 0.004250 (0.043459)	0.041149 / 0.037052 (0.004096)	0.271254 / 0.258489 (0.012765)	0.296685 / 0.293841 (0.002844)	0.029486 / 0.128546 (-0.099060)	0.010608 / 0.075646 (-0.065038)	0.056392 / 0.419271 (-0.362879)	0.033181 / 0.043533 (-0.010352)	0.267029 / 0.255139 (0.011890)	0.284987 / 0.283200 (0.001787)	0.018045 / 0.141683 (-0.123637)	1.137358 / 1.452155 (-0.314796)	1.184007 / 1.492716 (-0.308709)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.004603 / 0.018006 (-0.013403)	0.303901 / 0.000490 (0.303411)	0.000225 / 0.000200 (0.000025)	0.000055 / 0.000054 (0.000000)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021957 / 0.037411 (-0.015454)	0.069427 / 0.014526 (0.054901)	0.082394 / 0.176557 (-0.094163)	0.120745 / 0.737135 (-0.616390)	0.084571 / 0.296338 (-0.211767)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.292832 / 0.215209 (0.077623)	2.824295 / 2.077655 (0.746640)	1.563273 / 1.504120 (0.059153)	1.440202 / 1.541195 (-0.100992)	1.489810 / 1.468490 (0.021320)	0.561120 / 4.584777 (-4.023657)	2.439045 / 3.745712 (-1.306667)	2.867139 / 5.269862 (-2.402722)	1.793812 / 4.565676 (-2.771865)	0.062797 / 0.424275 (-0.361478)	0.005033 / 0.007607 (-0.002574)	0.343648 / 0.226044 (0.117604)	3.432285 / 2.268929 (1.163357)	1.918175 / 55.444624 (-53.526449)	1.637245 / 6.876477 (-5.239232)	1.709246 / 2.142072 (-0.432826)	0.634744 / 4.805227 (-4.170483)	0.115782 / 6.500664 (-6.384882)	0.041228 / 0.075469 (-0.034241)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.962369 / 1.841788 (-0.879418)	12.750819 / 8.074308 (4.676511)	10.927356 / 10.191392 (0.735964)	0.143454 / 0.680424 (-0.536970)	0.015348 / 0.534201 (-0.518853)	0.291207 / 0.579283 (-0.288076)	0.276924 / 0.434364 (-0.157440)	0.327287 / 0.540337 (-0.213050)	0.577439 / 1.386936 (-0.809497)

lhoestq

LGTM!

github-actions · 2023-12-01T17:57:38Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005070 / 0.011353 (-0.006283)	0.003475 / 0.011008 (-0.007533)	0.061985 / 0.038508 (0.023477)	0.048539 / 0.023109 (0.025430)	0.229935 / 0.275898 (-0.045963)	0.255247 / 0.323480 (-0.068233)	0.003919 / 0.007986 (-0.004066)	0.002664 / 0.004328 (-0.001664)	0.048892 / 0.004250 (0.044642)	0.037381 / 0.037052 (0.000328)	0.238517 / 0.258489 (-0.019972)	0.284069 / 0.293841 (-0.009772)	0.027513 / 0.128546 (-0.101033)	0.010778 / 0.075646 (-0.064868)	0.205004 / 0.419271 (-0.214268)	0.035553 / 0.043533 (-0.007980)	0.230117 / 0.255139 (-0.025022)	0.251150 / 0.283200 (-0.032050)	0.017951 / 0.141683 (-0.123732)	1.145548 / 1.452155 (-0.306607)	1.191659 / 1.492716 (-0.301057)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092335 / 0.018006 (0.074329)	0.300264 / 0.000490 (0.299774)	0.000206 / 0.000200 (0.000006)	0.000050 / 0.000054 (-0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018608 / 0.037411 (-0.018804)	0.060376 / 0.014526 (0.045850)	0.073551 / 0.176557 (-0.103006)	0.118840 / 0.737135 (-0.618295)	0.074447 / 0.296338 (-0.221892)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.287033 / 0.215209 (0.071824)	2.770958 / 2.077655 (0.693303)	1.443986 / 1.504120 (-0.060134)	1.314627 / 1.541195 (-0.226567)	1.342287 / 1.468490 (-0.126203)	0.559607 / 4.584777 (-4.025170)	2.409678 / 3.745712 (-1.336034)	2.772566 / 5.269862 (-2.497295)	1.743511 / 4.565676 (-2.822165)	0.062277 / 0.424275 (-0.361998)	0.004952 / 0.007607 (-0.002655)	0.330581 / 0.226044 (0.104537)	3.280385 / 2.268929 (1.011456)	1.809599 / 55.444624 (-53.635025)	1.532186 / 6.876477 (-5.344290)	1.529689 / 2.142072 (-0.612383)	0.645213 / 4.805227 (-4.160014)	0.117564 / 6.500664 (-6.383100)	0.041657 / 0.075469 (-0.033812)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.943912 / 1.841788 (-0.897876)	11.414317 / 8.074308 (3.340009)	10.394915 / 10.191392 (0.203523)	0.129271 / 0.680424 (-0.551153)	0.013934 / 0.534201 (-0.520267)	0.288217 / 0.579283 (-0.291066)	0.267171 / 0.434364 (-0.167193)	0.327112 / 0.540337 (-0.213225)	0.446680 / 1.386936 (-0.940256)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005200 / 0.011353 (-0.006152)	0.003453 / 0.011008 (-0.007555)	0.048736 / 0.038508 (0.010228)	0.051073 / 0.023109 (0.027964)	0.276591 / 0.275898 (0.000693)	0.294495 / 0.323480 (-0.028985)	0.004069 / 0.007986 (-0.003917)	0.002945 / 0.004328 (-0.001383)	0.047090 / 0.004250 (0.042839)	0.040445 / 0.037052 (0.003393)	0.278464 / 0.258489 (0.019975)	0.304020 / 0.293841 (0.010179)	0.028811 / 0.128546 (-0.099736)	0.010388 / 0.075646 (-0.065259)	0.057214 / 0.419271 (-0.362057)	0.032588 / 0.043533 (-0.010945)	0.277694 / 0.255139 (0.022555)	0.294979 / 0.283200 (0.011779)	0.018384 / 0.141683 (-0.123299)	1.162332 / 1.452155 (-0.289822)	1.188355 / 1.492716 (-0.304361)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.090501 / 0.018006 (0.072495)	0.303122 / 0.000490 (0.302632)	0.000222 / 0.000200 (0.000022)	0.000053 / 0.000054 (-0.000001)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022536 / 0.037411 (-0.014876)	0.068452 / 0.014526 (0.053926)	0.080932 / 0.176557 (-0.095625)	0.119185 / 0.737135 (-0.617950)	0.081513 / 0.296338 (-0.214825)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.291522 / 0.215209 (0.076313)	2.849467 / 2.077655 (0.771812)	1.597395 / 1.504120 (0.093275)	1.512872 / 1.541195 (-0.028323)	1.488144 / 1.468490 (0.019654)	0.572436 / 4.584777 (-4.012341)	2.440129 / 3.745712 (-1.305583)	2.788045 / 5.269862 (-2.481817)	1.754246 / 4.565676 (-2.811430)	0.066706 / 0.424275 (-0.357569)	0.005035 / 0.007607 (-0.002573)	0.336621 / 0.226044 (0.110576)	3.322820 / 2.268929 (1.053891)	1.940494 / 55.444624 (-53.504130)	1.670022 / 6.876477 (-5.206454)	1.666353 / 2.142072 (-0.475720)	0.646180 / 4.805227 (-4.159047)	0.116676 / 6.500664 (-6.383988)	0.040559 / 0.075469 (-0.034910)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.971396 / 1.841788 (-0.870392)	11.782426 / 8.074308 (3.708118)	10.672034 / 10.191392 (0.480642)	0.137658 / 0.680424 (-0.542766)	0.016210 / 0.534201 (-0.517991)	0.288302 / 0.579283 (-0.290981)	0.280775 / 0.434364 (-0.153589)	0.326962 / 0.540337 (-0.213375)	0.558511 / 1.386936 (-0.828425)

mariosasko added 3 commits November 29, 2023 19:19

Retry on 500 server error in push_to_hub

273e075

Nit

523fcbe

Fix

06466bf

Fix tests

07ad81c

mariosasko marked this pull request as ready for review November 30, 2023 15:53

mariosasko requested a review from lhoestq November 30, 2023 15:53

lhoestq reviewed Nov 30, 2023

src/datasets/utils/hub.py

Remove 504 error

544ad95

mariosasko requested a review from lhoestq December 1, 2023 15:31

lhoestq approved these changes Dec 1, 2023

mariosasko merged commit 7602018 into main Dec 1, 2023

mariosasko deleted the upload-hub-retries branch December 1, 2023 17:51
