Create DatasetNotFoundError and DataFilesNotFoundError #6431

albertvillanova · 2023-11-16T16:02:55Z

Create DatasetNotFoundError and DataFilesNotFoundError.

Fix #6397.

CC: @severo

src/datasets/load.py

github-actions · 2023-11-16T16:07:47Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004459 / 0.011353 (-0.006894)	0.002883 / 0.011008 (-0.008125)	0.062434 / 0.038508 (0.023925)	0.030353 / 0.023109 (0.007244)	0.256696 / 0.275898 (-0.019202)	0.280557 / 0.323480 (-0.042923)	0.003903 / 0.007986 (-0.004083)	0.002424 / 0.004328 (-0.001905)	0.048509 / 0.004250 (0.044259)	0.043583 / 0.037052 (0.006531)	0.253900 / 0.258489 (-0.004590)	0.309146 / 0.293841 (0.015305)	0.023253 / 0.128546 (-0.105294)	0.007073 / 0.075646 (-0.068573)	0.204118 / 0.419271 (-0.215154)	0.056429 / 0.043533 (0.012897)	0.247331 / 0.255139 (-0.007808)	0.271581 / 0.283200 (-0.011619)	0.017021 / 0.141683 (-0.124662)	1.115057 / 1.452155 (-0.337098)	1.209947 / 1.492716 (-0.282770)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093141 / 0.018006 (0.075134)	0.295987 / 0.000490 (0.295497)	0.000221 / 0.000200 (0.000021)	0.000048 / 0.000054 (-0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019182 / 0.037411 (-0.018230)	0.062049 / 0.014526 (0.047523)	0.073824 / 0.176557 (-0.102733)	0.120175 / 0.737135 (-0.616960)	0.074700 / 0.296338 (-0.221639)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.280036 / 0.215209 (0.064827)	2.731512 / 2.077655 (0.653857)	1.414606 / 1.504120 (-0.089514)	1.302433 / 1.541195 (-0.238761)	1.313012 / 1.468490 (-0.155478)	0.399722 / 4.584777 (-4.185055)	2.371249 / 3.745712 (-1.374463)	2.582520 / 5.269862 (-2.687342)	1.558505 / 4.565676 (-3.007171)	0.045765 / 0.424275 (-0.378510)	0.004748 / 0.007607 (-0.002859)	0.327623 / 0.226044 (0.101578)	3.258742 / 2.268929 (0.989814)	1.756798 / 55.444624 (-53.687826)	1.494551 / 6.876477 (-5.381925)	1.518161 / 2.142072 (-0.623911)	0.468560 / 4.805227 (-4.336667)	0.101034 / 6.500664 (-6.399630)	0.048259 / 0.075469 (-0.027210)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.938146 / 1.841788 (-0.903642)	11.636387 / 8.074308 (3.562078)	10.638909 / 10.191392 (0.447517)	0.128340 / 0.680424 (-0.552084)	0.015194 / 0.534201 (-0.519007)	0.275961 / 0.579283 (-0.303322)	0.264629 / 0.434364 (-0.169735)	0.308580 / 0.540337 (-0.231758)	0.433658 / 1.386936 (-0.953278)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004797 / 0.011353 (-0.006556)	0.002801 / 0.011008 (-0.008208)	0.048101 / 0.038508 (0.009593)	0.056406 / 0.023109 (0.033296)	0.274966 / 0.275898 (-0.000932)	0.298310 / 0.323480 (-0.025170)	0.004115 / 0.007986 (-0.003871)	0.002437 / 0.004328 (-0.001891)	0.047921 / 0.004250 (0.043671)	0.038812 / 0.037052 (0.001760)	0.279594 / 0.258489 (0.021105)	0.313703 / 0.293841 (0.019862)	0.024485 / 0.128546 (-0.104061)	0.007095 / 0.075646 (-0.068551)	0.053398 / 0.419271 (-0.365874)	0.032306 / 0.043533 (-0.011227)	0.278014 / 0.255139 (0.022875)	0.301156 / 0.283200 (0.017956)	0.017353 / 0.141683 (-0.124330)	1.150168 / 1.452155 (-0.301987)	1.190822 / 1.492716 (-0.301894)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092162 / 0.018006 (0.074156)	0.301031 / 0.000490 (0.300541)	0.000244 / 0.000200 (0.000044)	0.000062 / 0.000054 (0.000008)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.020918 / 0.037411 (-0.016494)	0.072030 / 0.014526 (0.057504)	0.081813 / 0.176557 (-0.094743)	0.120233 / 0.737135 (-0.616903)	0.082874 / 0.296338 (-0.213465)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.291659 / 0.215209 (0.076450)	2.841978 / 2.077655 (0.764323)	1.594207 / 1.504120 (0.090087)	1.473941 / 1.541195 (-0.067254)	1.514393 / 1.468490 (0.045903)	0.393393 / 4.584777 (-4.191384)	2.443663 / 3.745712 (-1.302050)	2.545747 / 5.269862 (-2.724114)	1.521130 / 4.565676 (-3.044546)	0.046246 / 0.424275 (-0.378030)	0.004826 / 0.007607 (-0.002781)	0.340909 / 0.226044 (0.114865)	3.319474 / 2.268929 (1.050546)	1.933110 / 55.444624 (-53.511515)	1.662463 / 6.876477 (-5.214014)	1.670331 / 2.142072 (-0.471742)	0.458062 / 4.805227 (-4.347165)	0.098397 / 6.500664 (-6.402267)	0.041339 / 0.075469 (-0.034130)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.973718 / 1.841788 (-0.868070)	12.095266 / 8.074308 (4.020957)	10.761212 / 10.191392 (0.569820)	0.142352 / 0.680424 (-0.538072)	0.015423 / 0.534201 (-0.518778)	0.270912 / 0.579283 (-0.308371)	0.276618 / 0.434364 (-0.157746)	0.309120 / 0.540337 (-0.231217)	0.415330 / 1.386936 (-0.971606)

src/datasets/exceptions.py

github-actions · 2023-11-17T13:15:44Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004676 / 0.011353 (-0.006677)	0.003101 / 0.011008 (-0.007907)	0.062260 / 0.038508 (0.023752)	0.030012 / 0.023109 (0.006903)	0.253704 / 0.275898 (-0.022194)	0.276404 / 0.323480 (-0.047075)	0.004060 / 0.007986 (-0.003926)	0.002467 / 0.004328 (-0.001861)	0.047921 / 0.004250 (0.043670)	0.045760 / 0.037052 (0.008708)	0.254529 / 0.258489 (-0.003960)	0.286283 / 0.293841 (-0.007558)	0.023301 / 0.128546 (-0.105246)	0.007407 / 0.075646 (-0.068239)	0.204541 / 0.419271 (-0.214730)	0.056387 / 0.043533 (0.012854)	0.252120 / 0.255139 (-0.003019)	0.275795 / 0.283200 (-0.007404)	0.018648 / 0.141683 (-0.123034)	1.113484 / 1.452155 (-0.338671)	1.168685 / 1.492716 (-0.324031)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.098286 / 0.018006 (0.080280)	0.304619 / 0.000490 (0.304129)	0.000225 / 0.000200 (0.000025)	0.000058 / 0.000054 (0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019183 / 0.037411 (-0.018229)	0.062183 / 0.014526 (0.047657)	0.074288 / 0.176557 (-0.102269)	0.120576 / 0.737135 (-0.616560)	0.074833 / 0.296338 (-0.221505)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.280512 / 0.215209 (0.065303)	2.770052 / 2.077655 (0.692397)	1.471234 / 1.504120 (-0.032886)	1.352080 / 1.541195 (-0.189114)	1.374518 / 1.468490 (-0.093973)	0.407108 / 4.584777 (-4.177669)	2.400581 / 3.745712 (-1.345131)	2.677507 / 5.269862 (-2.592355)	1.578042 / 4.565676 (-2.987635)	0.048539 / 0.424275 (-0.375736)	0.004905 / 0.007607 (-0.002703)	0.346676 / 0.226044 (0.120631)	3.367732 / 2.268929 (1.098803)	1.844405 / 55.444624 (-53.600220)	1.576883 / 6.876477 (-5.299594)	1.666986 / 2.142072 (-0.475086)	0.495872 / 4.805227 (-4.309355)	0.103142 / 6.500664 (-6.397522)	0.044037 / 0.075469 (-0.031432)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.980865 / 1.841788 (-0.860923)	12.268525 / 8.074308 (4.194217)	10.756554 / 10.191392 (0.565162)	0.129954 / 0.680424 (-0.550470)	0.013864 / 0.534201 (-0.520337)	0.267653 / 0.579283 (-0.311630)	0.265120 / 0.434364 (-0.169244)	0.309050 / 0.540337 (-0.231288)	0.423877 / 1.386936 (-0.963059)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005074 / 0.011353 (-0.006279)	0.003001 / 0.011008 (-0.008007)	0.048271 / 0.038508 (0.009763)	0.061206 / 0.023109 (0.038097)	0.279268 / 0.275898 (0.003370)	0.302592 / 0.323480 (-0.020888)	0.004177 / 0.007986 (-0.003809)	0.002452 / 0.004328 (-0.001876)	0.048259 / 0.004250 (0.044009)	0.040032 / 0.037052 (0.002979)	0.281398 / 0.258489 (0.022909)	0.314121 / 0.293841 (0.020280)	0.025137 / 0.128546 (-0.103409)	0.007230 / 0.075646 (-0.068416)	0.054537 / 0.419271 (-0.364735)	0.033266 / 0.043533 (-0.010267)	0.277305 / 0.255139 (0.022166)	0.295993 / 0.283200 (0.012794)	0.019278 / 0.141683 (-0.122405)	1.131700 / 1.452155 (-0.320454)	1.183848 / 1.492716 (-0.308868)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092258 / 0.018006 (0.074251)	0.310668 / 0.000490 (0.310178)	0.000219 / 0.000200 (0.000019)	0.000047 / 0.000054 (-0.000008)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021838 / 0.037411 (-0.015574)	0.071382 / 0.014526 (0.056857)	0.081389 / 0.176557 (-0.095168)	0.120389 / 0.737135 (-0.616746)	0.084135 / 0.296338 (-0.212203)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.291676 / 0.215209 (0.076467)	2.840623 / 2.077655 (0.762968)	1.565748 / 1.504120 (0.061628)	1.452529 / 1.541195 (-0.088666)	1.490633 / 1.468490 (0.022143)	0.402878 / 4.584777 (-4.181899)	2.486192 / 3.745712 (-1.259520)	2.520563 / 5.269862 (-2.749299)	1.518550 / 4.565676 (-3.047127)	0.047423 / 0.424275 (-0.376852)	0.004823 / 0.007607 (-0.002784)	0.353122 / 0.226044 (0.127078)	3.452136 / 2.268929 (1.183208)	1.973798 / 55.444624 (-53.470827)	1.669569 / 6.876477 (-5.206907)	1.654910 / 2.142072 (-0.487163)	0.486746 / 4.805227 (-4.318481)	0.097260 / 6.500664 (-6.403404)	0.040608 / 0.075469 (-0.034861)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.989705 / 1.841788 (-0.852083)	12.114386 / 8.074308 (4.040077)	11.284551 / 10.191392 (1.093159)	0.141408 / 0.680424 (-0.539016)	0.015275 / 0.534201 (-0.518926)	0.267407 / 0.579283 (-0.311877)	0.281007 / 0.434364 (-0.153357)	0.309617 / 0.540337 (-0.230720)	0.414033 / 1.386936 (-0.972903)

github-actions · 2023-11-20T17:08:20Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004888 / 0.011353 (-0.006465)	0.002775 / 0.011008 (-0.008233)	0.062000 / 0.038508 (0.023492)	0.050694 / 0.023109 (0.027584)	0.257063 / 0.275898 (-0.018835)	0.282743 / 0.323480 (-0.040736)	0.002862 / 0.007986 (-0.005124)	0.002305 / 0.004328 (-0.002023)	0.049549 / 0.004250 (0.045299)	0.038754 / 0.037052 (0.001701)	0.264047 / 0.258489 (0.005558)	0.310162 / 0.293841 (0.016321)	0.022901 / 0.128546 (-0.105645)	0.006894 / 0.075646 (-0.068752)	0.202467 / 0.419271 (-0.216805)	0.035901 / 0.043533 (-0.007631)	0.262344 / 0.255139 (0.007205)	0.285563 / 0.283200 (0.002364)	0.017070 / 0.141683 (-0.124613)	1.113972 / 1.452155 (-0.338182)	1.176261 / 1.492716 (-0.316455)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092912 / 0.018006 (0.074906)	0.302610 / 0.000490 (0.302120)	0.000204 / 0.000200 (0.000005)	0.000043 / 0.000054 (-0.000012)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018232 / 0.037411 (-0.019179)	0.062367 / 0.014526 (0.047841)	0.074570 / 0.176557 (-0.101987)	0.120468 / 0.737135 (-0.616668)	0.075187 / 0.296338 (-0.221151)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.279760 / 0.215209 (0.064551)	2.715372 / 2.077655 (0.637717)	1.461636 / 1.504120 (-0.042484)	1.324220 / 1.541195 (-0.216975)	1.350724 / 1.468490 (-0.117766)	0.395648 / 4.584777 (-4.189129)	2.376548 / 3.745712 (-1.369164)	2.594662 / 5.269862 (-2.675200)	1.553528 / 4.565676 (-3.012148)	0.047875 / 0.424275 (-0.376400)	0.005287 / 0.007607 (-0.002321)	0.334734 / 0.226044 (0.108689)	3.294753 / 2.268929 (1.025825)	1.797901 / 55.444624 (-53.646724)	1.510907 / 6.876477 (-5.365570)	1.536070 / 2.142072 (-0.606003)	0.474672 / 4.805227 (-4.330555)	0.099323 / 6.500664 (-6.401341)	0.041703 / 0.075469 (-0.033766)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.947441 / 1.841788 (-0.894347)	11.451378 / 8.074308 (3.377070)	10.283213 / 10.191392 (0.091821)	0.131032 / 0.680424 (-0.549392)	0.014423 / 0.534201 (-0.519777)	0.272568 / 0.579283 (-0.306715)	0.267127 / 0.434364 (-0.167237)	0.307361 / 0.540337 (-0.232976)	0.403858 / 1.386936 (-0.983078)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004836 / 0.011353 (-0.006517)	0.002544 / 0.011008 (-0.008464)	0.047979 / 0.038508 (0.009471)	0.052211 / 0.023109 (0.029102)	0.273394 / 0.275898 (-0.002504)	0.291202 / 0.323480 (-0.032277)	0.004094 / 0.007986 (-0.003891)	0.002415 / 0.004328 (-0.001914)	0.048057 / 0.004250 (0.043807)	0.039756 / 0.037052 (0.002703)	0.277301 / 0.258489 (0.018812)	0.297626 / 0.293841 (0.003785)	0.024641 / 0.128546 (-0.103905)	0.006957 / 0.075646 (-0.068690)	0.053574 / 0.419271 (-0.365697)	0.036532 / 0.043533 (-0.007001)	0.273753 / 0.255139 (0.018614)	0.294254 / 0.283200 (0.011054)	0.022252 / 0.141683 (-0.119431)	1.128609 / 1.452155 (-0.323546)	1.217322 / 1.492716 (-0.275394)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.091050 / 0.018006 (0.073044)	0.300089 / 0.000490 (0.299600)	0.000215 / 0.000200 (0.000015)	0.000045 / 0.000054 (-0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021423 / 0.037411 (-0.015988)	0.069892 / 0.014526 (0.055366)	0.081125 / 0.176557 (-0.095432)	0.118725 / 0.737135 (-0.618411)	0.081357 / 0.296338 (-0.214981)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.295046 / 0.215209 (0.079837)	2.868813 / 2.077655 (0.791159)	1.579613 / 1.504120 (0.075493)	1.449308 / 1.541195 (-0.091887)	1.478804 / 1.468490 (0.010314)	0.416916 / 4.584777 (-4.167861)	2.461093 / 3.745712 (-1.284619)	2.449792 / 5.269862 (-2.820070)	1.573930 / 4.565676 (-2.991746)	0.046808 / 0.424275 (-0.377467)	0.004811 / 0.007607 (-0.002796)	0.352805 / 0.226044 (0.126761)	3.495034 / 2.268929 (1.226105)	1.952019 / 55.444624 (-53.492606)	1.642607 / 6.876477 (-5.233869)	1.775235 / 2.142072 (-0.366837)	0.482196 / 4.805227 (-4.323032)	0.099562 / 6.500664 (-6.401102)	0.040709 / 0.075469 (-0.034760)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.972750 / 1.841788 (-0.869038)	11.905172 / 8.074308 (3.830864)	10.613847 / 10.191392 (0.422455)	0.129892 / 0.680424 (-0.550532)	0.015611 / 0.534201 (-0.518590)	0.271884 / 0.579283 (-0.307400)	0.275270 / 0.434364 (-0.159094)	0.303213 / 0.540337 (-0.237125)	0.402338 / 1.386936 (-0.984598)

albertvillanova · 2023-11-20T17:33:39Z

I think this PR can be merged.

severo · 2023-11-20T19:07:59Z

you already have an approval, feel free to merge!

mariosasko · 2023-11-21T01:35:56Z

src/datasets/exceptions.py

+class DatasetNotFoundError(FileNotFoundDatasetsError):
+    """Dataset not found.
+
+    Raised when trying to access:
+    - a missing dataset, or
+    - a private/gated dataset and the user is not authenticated.
+    """


Maybe we could reuse huggingface_hub's RepositoryNotFoundError and GatedRepoError instead of introducing our own (this exception should at least subclass them).

For example, transformers throws an EnvironmentError (with the error description) and chains the caught (huggingface_hub) exception, so consistency with them would be nice.

I agree we could subclass the huggingface_hub errors as well, while at the same time subclassing our datasets DatasetsError base class, so that a user of datasets can catch all datasets errors through this DatasetsError class.
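To make the "catch everything through DatasetsError" point concrete, here is a minimal sketch. It assumes the hierarchy implied by the diff above: a DatasetsError base class and a FileNotFoundDatasetsError that also derives from the built-in FileNotFoundError. The class bodies, the `load` helper, and the `classify` helper are hypothetical and only serve the illustration.

```python
class DatasetsError(Exception):
    """Assumed base class for every error raised by the library."""


class FileNotFoundDatasetsError(DatasetsError, FileNotFoundError):
    """Assumed to combine the library base class with the built-in FileNotFoundError."""


class DatasetNotFoundError(FileNotFoundDatasetsError):
    """Dataset not found (missing, or private/gated without authentication)."""


class DataFilesNotFoundError(FileNotFoundDatasetsError):
    """Hypothetical body: no data files were found for the dataset."""


def load(name):
    # Hypothetical stand-in for a loader that cannot find the dataset.
    raise DatasetNotFoundError(f"Dataset '{name}' doesn't exist or cannot be accessed.")


def classify(name):
    # A single except clause on the base class catches every library error.
    try:
        load(name)
    except DatasetsError as err:
        return type(err).__name__
    return None
```

With this layout, code that previously caught FileNotFoundError keeps working, since both new errors are still instances of it.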

But anyway, we are just replacing the FileNotFound errors we were previously raising from datasets, whereas huggingface_hub's RepositoryNotFoundError subclasses requests.HTTPError (through the class HfHubHTTPError), not FileNotFoundError. I see this as adding unnecessary complexity to our datasets error hierarchy.

I would propose addressing this in a subsequent PR: catch the specific huggingface_hub errors (instead of parsing the HTTP error status code, which huggingface_hub already does) and raise our specific errors accordingly.
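A rough sketch of what such a follow-up could look like, with exception chaining so the original error stays visible in the traceback. To keep it self-contained, `StandInRepositoryNotFoundError` is a hypothetical stand-in for the hub client's "repository not found" error, and `fetch_dataset_info` / `get_dataset_info` are made-up helper names, not functions from either library.

```python
class DatasetsError(Exception):
    """Assumed base class for every error raised by the library."""


class DatasetNotFoundError(DatasetsError, FileNotFoundError):
    """Simplified for this sketch: dataset missing or not accessible."""


class StandInRepositoryNotFoundError(Exception):
    """Hypothetical stand-in for the hub client's 'repository not found' error."""


def fetch_dataset_info(repo_id):
    # Hypothetical hub call that always fails, to exercise the error path.
    raise StandInRepositoryNotFoundError(f"404 Client Error for repo '{repo_id}'")


def get_dataset_info(repo_id):
    try:
        return fetch_dataset_info(repo_id)
    except StandInRepositoryNotFoundError as err:
        # Raise the library-specific error and chain the hub error as its cause,
        # instead of inspecting an HTTP status code here.
        raise DatasetNotFoundError(
            f"Dataset '{repo_id}' doesn't exist on the Hub or cannot be accessed."
        ) from err
```

Chaining with `from err` keeps the datasets hierarchy independent of the hub's HTTP-based hierarchy while still exposing the underlying cause through `__cause__`.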

src/datasets/load.py

github-actions · 2023-11-22T13:57:17Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004826 / 0.011353 (-0.006527)	0.002979 / 0.011008 (-0.008029)	0.062055 / 0.038508 (0.023547)	0.056574 / 0.023109 (0.033465)	0.244342 / 0.275898 (-0.031556)	0.278040 / 0.323480 (-0.045439)	0.004020 / 0.007986 (-0.003965)	0.002474 / 0.004328 (-0.001855)	0.048451 / 0.004250 (0.044200)	0.038633 / 0.037052 (0.001580)	0.251389 / 0.258489 (-0.007100)	0.282739 / 0.293841 (-0.011102)	0.023298 / 0.128546 (-0.105248)	0.007513 / 0.075646 (-0.068134)	0.203014 / 0.419271 (-0.216257)	0.036216 / 0.043533 (-0.007317)	0.250988 / 0.255139 (-0.004151)	0.281228 / 0.283200 (-0.001972)	0.018259 / 0.141683 (-0.123424)	1.121200 / 1.452155 (-0.330955)	1.184298 / 1.492716 (-0.308419)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093730 / 0.018006 (0.075724)	0.301716 / 0.000490 (0.301226)	0.000223 / 0.000200 (0.000023)	0.000051 / 0.000054 (-0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019238 / 0.037411 (-0.018173)	0.064329 / 0.014526 (0.049803)	0.075657 / 0.176557 (-0.100899)	0.122616 / 0.737135 (-0.614519)	0.077459 / 0.296338 (-0.218880)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.280153 / 0.215209 (0.064944)	2.715488 / 2.077655 (0.637833)	1.449666 / 1.504120 (-0.054454)	1.331903 / 1.541195 (-0.209292)	1.396200 / 1.468490 (-0.072290)	0.398861 / 4.584777 (-4.185916)	2.402814 / 3.745712 (-1.342898)	2.664033 / 5.269862 (-2.605829)	1.619589 / 4.565676 (-2.946088)	0.044798 / 0.424275 (-0.379477)	0.004989 / 0.007607 (-0.002618)	0.336822 / 0.226044 (0.110777)	3.245604 / 2.268929 (0.976676)	1.815633 / 55.444624 (-53.628991)	1.557975 / 6.876477 (-5.318501)	1.603655 / 2.142072 (-0.538417)	0.462980 / 4.805227 (-4.342247)	0.098340 / 6.500664 (-6.402324)	0.042750 / 0.075469 (-0.032719)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.973785 / 1.841788 (-0.868003)	12.379356 / 8.074308 (4.305048)	10.540164 / 10.191392 (0.348772)	0.144803 / 0.680424 (-0.535621)	0.013875 / 0.534201 (-0.520326)	0.270192 / 0.579283 (-0.309091)	0.264614 / 0.434364 (-0.169750)	0.313454 / 0.540337 (-0.226883)	0.402310 / 1.386936 (-0.984626)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004987 / 0.011353 (-0.006366)	0.003017 / 0.011008 (-0.007992)	0.048592 / 0.038508 (0.010084)	0.059370 / 0.023109 (0.036261)	0.277536 / 0.275898 (0.001638)	0.300592 / 0.323480 (-0.022888)	0.004870 / 0.007986 (-0.003115)	0.002452 / 0.004328 (-0.001876)	0.047972 / 0.004250 (0.043721)	0.042336 / 0.037052 (0.005283)	0.277570 / 0.258489 (0.019081)	0.304739 / 0.293841 (0.010898)	0.025313 / 0.128546 (-0.103233)	0.007219 / 0.075646 (-0.068427)	0.053967 / 0.419271 (-0.365304)	0.033314 / 0.043533 (-0.010219)	0.273908 / 0.255139 (0.018769)	0.291913 / 0.283200 (0.008713)	0.019440 / 0.141683 (-0.122243)	1.111047 / 1.452155 (-0.341107)	1.191276 / 1.492716 (-0.301440)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093985 / 0.018006 (0.075979)	0.303105 / 0.000490 (0.302615)	0.000235 / 0.000200 (0.000035)	0.000043 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022226 / 0.037411 (-0.015186)	0.072151 / 0.014526 (0.057625)	0.081700 / 0.176557 (-0.094857)	0.121407 / 0.737135 (-0.615729)	0.083217 / 0.296338 (-0.213121)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.297286 / 0.215209 (0.082077)	2.913392 / 2.077655 (0.835738)	1.591758 / 1.504120 (0.087638)	1.463339 / 1.541195 (-0.077856)	1.495095 / 1.468490 (0.026605)	0.414341 / 4.584777 (-4.170436)	2.412438 / 3.745712 (-1.333275)	2.611452 / 5.269862 (-2.658410)	1.658545 / 4.565676 (-2.907132)	0.047269 / 0.424275 (-0.377007)	0.004872 / 0.007607 (-0.002735)	0.350746 / 0.226044 (0.124701)	3.491482 / 2.268929 (1.222554)	1.999009 / 55.444624 (-53.445616)	1.672862 / 6.876477 (-5.203615)	1.863095 / 2.142072 (-0.278977)	0.484746 / 4.805227 (-4.320481)	0.100774 / 6.500664 (-6.399890)	0.042519 / 0.075469 (-0.032950)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.984497 / 1.841788 (-0.857291)	12.972576 / 8.074308 (4.898268)	10.886021 / 10.191392 (0.694629)	0.141639 / 0.680424 (-0.538785)	0.015726 / 0.534201 (-0.518475)	0.284160 / 0.579283 (-0.295123)	0.291437 / 0.434364 (-0.142927)	0.314121 / 0.540337 (-0.226217)	0.420439 / 1.386936 (-0.966497)

github-actions · 2023-11-22T14:10:07Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004881 / 0.011353 (-0.006472)	0.002550 / 0.011008 (-0.008458)	0.062171 / 0.038508 (0.023663)	0.055341 / 0.023109 (0.032232)	0.243132 / 0.275898 (-0.032766)	0.265174 / 0.323480 (-0.058306)	0.002934 / 0.007986 (-0.005052)	0.002233 / 0.004328 (-0.002096)	0.049302 / 0.004250 (0.045052)	0.039491 / 0.037052 (0.002439)	0.252776 / 0.258489 (-0.005713)	0.280923 / 0.293841 (-0.012918)	0.022585 / 0.128546 (-0.105962)	0.006888 / 0.075646 (-0.068759)	0.202751 / 0.419271 (-0.216521)	0.035250 / 0.043533 (-0.008283)	0.251745 / 0.255139 (-0.003394)	0.267431 / 0.283200 (-0.015768)	0.019486 / 0.141683 (-0.122197)	1.161783 / 1.452155 (-0.290372)	1.194254 / 1.492716 (-0.298463)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.097772 / 0.018006 (0.079766)	0.309137 / 0.000490 (0.308647)	0.000225 / 0.000200 (0.000025)	0.000052 / 0.000054 (-0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018719 / 0.037411 (-0.018693)	0.062211 / 0.014526 (0.047686)	0.074291 / 0.176557 (-0.102266)	0.119436 / 0.737135 (-0.617699)	0.075519 / 0.296338 (-0.220820)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.279778 / 0.215209 (0.064569)	2.730678 / 2.077655 (0.653023)	1.413922 / 1.504120 (-0.090198)	1.286747 / 1.541195 (-0.254447)	1.299835 / 1.468490 (-0.168656)	0.392516 / 4.584777 (-4.192261)	2.381816 / 3.745712 (-1.363896)	2.616944 / 5.269862 (-2.652918)	1.606152 / 4.565676 (-2.959525)	0.044867 / 0.424275 (-0.379408)	0.004915 / 0.007607 (-0.002692)	0.334078 / 0.226044 (0.108034)	3.388096 / 2.268929 (1.119167)	1.756666 / 55.444624 (-53.687958)	1.497211 / 6.876477 (-5.379266)	1.496787 / 2.142072 (-0.645285)	0.469145 / 4.805227 (-4.336082)	0.097821 / 6.500664 (-6.402843)	0.041850 / 0.075469 (-0.033619)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.956878 / 1.841788 (-0.884910)	11.520184 / 8.074308 (3.445875)	10.659216 / 10.191392 (0.467824)	0.143687 / 0.680424 (-0.536737)	0.014118 / 0.534201 (-0.520083)	0.270990 / 0.579283 (-0.308293)	0.270057 / 0.434364 (-0.164306)	0.311109 / 0.540337 (-0.229229)	0.407042 / 1.386936 (-0.979894)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004816 / 0.011353 (-0.006537)	0.002898 / 0.011008 (-0.008110)	0.048540 / 0.038508 (0.010032)	0.055286 / 0.023109 (0.032176)	0.279086 / 0.275898 (0.003187)	0.298950 / 0.323480 (-0.024529)	0.004090 / 0.007986 (-0.003896)	0.002497 / 0.004328 (-0.001832)	0.049160 / 0.004250 (0.044910)	0.040612 / 0.037052 (0.003560)	0.287832 / 0.258489 (0.029343)	0.305617 / 0.293841 (0.011776)	0.023936 / 0.128546 (-0.104610)	0.007565 / 0.075646 (-0.068081)	0.054037 / 0.419271 (-0.365235)	0.032389 / 0.043533 (-0.011144)	0.283031 / 0.255139 (0.027892)	0.295411 / 0.283200 (0.012212)	0.018466 / 0.141683 (-0.123217)	1.134660 / 1.452155 (-0.317495)	1.196212 / 1.492716 (-0.296504)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.099961 / 0.018006 (0.081955)	0.310831 / 0.000490 (0.310342)	0.000238 / 0.000200 (0.000038)	0.000045 / 0.000054 (-0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021566 / 0.037411 (-0.015845)	0.070255 / 0.014526 (0.055729)	0.081221 / 0.176557 (-0.095336)	0.119404 / 0.737135 (-0.617732)	0.083005 / 0.296338 (-0.213333)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.302788 / 0.215209 (0.087579)	2.928876 / 2.077655 (0.851221)	1.601221 / 1.504120 (0.097101)	1.485147 / 1.541195 (-0.056047)	1.508698 / 1.468490 (0.040207)	0.402783 / 4.584777 (-4.181994)	2.432151 / 3.745712 (-1.313561)	2.476848 / 5.269862 (-2.793013)	1.585487 / 4.565676 (-2.980189)	0.045965 / 0.424275 (-0.378310)	0.004818 / 0.007607 (-0.002789)	0.354847 / 0.226044 (0.128803)	3.500670 / 2.268929 (1.231742)	1.951904 / 55.444624 (-53.492720)	1.675152 / 6.876477 (-5.201325)	1.795971 / 2.142072 (-0.346101)	0.470625 / 4.805227 (-4.334602)	0.126080 / 6.500664 (-6.374584)	0.040506 / 0.075469 (-0.034963)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.985251 / 1.841788 (-0.856536)	12.316710 / 8.074308 (4.242402)	10.674437 / 10.191392 (0.483045)	0.133622 / 0.680424 (-0.546802)	0.016756 / 0.534201 (-0.517445)	0.269318 / 0.579283 (-0.309965)	0.282258 / 0.434364 (-0.152106)	0.309941 / 0.540337 (-0.230396)	0.403189 / 1.386936 (-0.983747)

albertvillanova · 2023-11-22T15:12:29Z

I am merging this PR because we need it for datasets-server.

github-actions · 2023-11-22T15:18:49Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004935 / 0.011353 (-0.006418)	0.002643 / 0.011008 (-0.008365)	0.064449 / 0.038508 (0.025941)	0.053110 / 0.023109 (0.030001)	0.261576 / 0.275898 (-0.014322)	0.270866 / 0.323480 (-0.052614)	0.002895 / 0.007986 (-0.005091)	0.002349 / 0.004328 (-0.001979)	0.047620 / 0.004250 (0.043370)	0.038699 / 0.037052 (0.001647)	0.246663 / 0.258489 (-0.011826)	0.282021 / 0.293841 (-0.011820)	0.022807 / 0.128546 (-0.105739)	0.007242 / 0.075646 (-0.068404)	0.204236 / 0.419271 (-0.215035)	0.035429 / 0.043533 (-0.008104)	0.241684 / 0.255139 (-0.013455)	0.262343 / 0.283200 (-0.020857)	0.020036 / 0.141683 (-0.121647)	1.112687 / 1.452155 (-0.339467)	1.167086 / 1.492716 (-0.325630)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.107059 / 0.018006 (0.089053)	0.301036 / 0.000490 (0.300546)	0.000224 / 0.000200 (0.000024)	0.000048 / 0.000054 (-0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018464 / 0.037411 (-0.018947)	0.063822 / 0.014526 (0.049296)	0.073562 / 0.176557 (-0.102994)	0.120136 / 0.737135 (-0.616999)	0.074934 / 0.296338 (-0.221405)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.275474 / 0.215209 (0.060265)	2.714239 / 2.077655 (0.636584)	1.455535 / 1.504120 (-0.048585)	1.336530 / 1.541195 (-0.204665)	1.359607 / 1.468490 (-0.108883)	0.396303 / 4.584777 (-4.188474)	2.366076 / 3.745712 (-1.379636)	2.600755 / 5.269862 (-2.669107)	1.572382 / 4.565676 (-2.993294)	0.045795 / 0.424275 (-0.378480)	0.004932 / 0.007607 (-0.002675)	0.332175 / 0.226044 (0.106130)	3.257843 / 2.268929 (0.988915)	1.799021 / 55.444624 (-53.645603)	1.532813 / 6.876477 (-5.343663)	1.552279 / 2.142072 (-0.589794)	0.471369 / 4.805227 (-4.333858)	0.098931 / 6.500664 (-6.401733)	0.042735 / 0.075469 (-0.032734)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.960779 / 1.841788 (-0.881009)	11.741631 / 8.074308 (3.667322)	10.355721 / 10.191392 (0.164329)	0.129025 / 0.680424 (-0.551399)	0.013794 / 0.534201 (-0.520407)	0.267268 / 0.579283 (-0.312015)	0.265582 / 0.434364 (-0.168782)	0.306242 / 0.540337 (-0.234095)	0.400367 / 1.386936 (-0.986569)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004966 / 0.011353 (-0.006387)	0.002846 / 0.011008 (-0.008163)	0.049104 / 0.038508 (0.010596)	0.055436 / 0.023109 (0.032327)	0.273892 / 0.275898 (-0.002006)	0.300207 / 0.323480 (-0.023273)	0.004017 / 0.007986 (-0.003969)	0.002465 / 0.004328 (-0.001863)	0.048088 / 0.004250 (0.043837)	0.040037 / 0.037052 (0.002984)	0.279918 / 0.258489 (0.021429)	0.305378 / 0.293841 (0.011537)	0.024326 / 0.128546 (-0.104220)	0.006992 / 0.075646 (-0.068654)	0.053545 / 0.419271 (-0.365726)	0.032312 / 0.043533 (-0.011221)	0.272899 / 0.255139 (0.017760)	0.289683 / 0.283200 (0.006483)	0.019121 / 0.141683 (-0.122562)	1.133296 / 1.452155 (-0.318858)	1.220989 / 1.492716 (-0.271728)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093193 / 0.018006 (0.075187)	0.307658 / 0.000490 (0.307168)	0.000224 / 0.000200 (0.000024)	0.000045 / 0.000054 (-0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022906 / 0.037411 (-0.014506)	0.080931 / 0.014526 (0.066405)	0.081442 / 0.176557 (-0.095115)	0.121150 / 0.737135 (-0.615986)	0.083387 / 0.296338 (-0.212952)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.294979 / 0.215209 (0.079770)	2.900090 / 2.077655 (0.822435)	1.610061 / 1.504120 (0.105941)	1.455118 / 1.541195 (-0.086077)	1.456599 / 1.468490 (-0.011891)	0.397919 / 4.584777 (-4.186858)	2.421010 / 3.745712 (-1.324702)	2.486527 / 5.269862 (-2.783334)	1.573854 / 4.565676 (-2.991822)	0.046199 / 0.424275 (-0.378076)	0.004888 / 0.007607 (-0.002719)	0.342183 / 0.226044 (0.116139)	3.392068 / 2.268929 (1.123140)	1.963688 / 55.444624 (-53.480936)	1.667611 / 6.876477 (-5.208866)	1.833706 / 2.142072 (-0.308367)	0.509421 / 4.805227 (-4.295806)	0.099669 / 6.500664 (-6.400995)	0.041004 / 0.075469 (-0.034465)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.956314 / 1.841788 (-0.885474)	12.190194 / 8.074308 (4.115886)	10.417839 / 10.191392 (0.226447)	0.144139 / 0.680424 (-0.536285)	0.015841 / 0.534201 (-0.518359)	0.270436 / 0.579283 (-0.308847)	0.273952 / 0.434364 (-0.160412)	0.303018 / 0.540337 (-0.237319)	0.410163 / 1.386936 (-0.976773)

albertvillanova added 5 commits November 16, 2023 16:59

Create DatasetNotFoundError and DataFilesNotFoundError

4ed512a

Raise DatasetNotFoundError and DataFilesNotFoundError

f2b6c3a

Catch and re-raise DatasetNotFoundError and DataFilesNotFoundError

2415efb

Fix docstring

a70812b

Fix style

cf4ba6f
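For readers skimming the commit list, here is a minimal sketch of the idea behind the two new exceptions. It is not the code from this PR: the class names come from the PR title, but the base class (`FileNotFoundError`), the in-memory `FAKE_HUB`, and the helpers `find_data_files` and `describe` are assumptions made up for illustration.

```python
class DatasetNotFoundError(FileNotFoundError):
    """Stand-in: the requested dataset does not exist or is not accessible."""


class DataFilesNotFoundError(FileNotFoundError):
    """Stand-in: the dataset exists but holds no supported data files."""


# Hypothetical in-memory "hub": dataset name -> list of file names.
FAKE_HUB = {"user/good": ["train.csv"], "user/empty": ["README.md"]}
SUPPORTED_EXTENSIONS = (".csv", ".json", ".parquet")


def find_data_files(name):
    """Hypothetical helper that raises one of the two specific errors."""
    if name not in FAKE_HUB:
        raise DatasetNotFoundError(f"Dataset '{name}' doesn't exist or cannot be accessed")
    files = [f for f in FAKE_HUB[name] if f.endswith(SUPPORTED_EXTENSIONS)]
    if not files:
        raise DataFilesNotFoundError(f"No supported data files found in '{name}'")
    return files


def describe(name):
    """Caller-side handling: branch on the specific error type."""
    try:
        return f"ok: {find_data_files(name)}"
    except DatasetNotFoundError:
        return "dataset not found"
    except DataFilesNotFoundError:
        return "no data files"
```

Deriving both stand-ins from `FileNotFoundError` is a design choice in this sketch: callers that already catch `FileNotFoundError` would keep working, while a downstream service can distinguish the two cases.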

albertvillanova commented Nov 16, 2023

src/datasets/load.py (outdated, resolved)

severo reviewed Nov 16, 2023

src/datasets/exceptions.py (resolved)

Re-raise e1

6f3f3e3
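The "catch and re-raise" commits suggest a pattern along the following lines. This is only a sketch under stated assumptions, not the actual code in `src/datasets/load.py`: the name `e1` comes from the commit title, while `load_from_hub`, `load_from_cache`, and `load` are hypothetical functions invented for the example.

```python
class DatasetNotFoundError(FileNotFoundError):
    """Stand-in for the error class named in this PR."""


def load_from_hub(name):
    # Hypothetical primary loader: always reports a missing dataset.
    raise DatasetNotFoundError(f"Dataset '{name}' not found on the hub")


def load_from_cache(name):
    # Hypothetical fallback loader: the cache is empty in this sketch.
    raise KeyError(name)


def load(name):
    try:
        return load_from_hub(name)
    except Exception as e1:
        try:
            return load_from_cache(name)
        except Exception:
            # Surface the specific first error unchanged, without the
            # fallback failure attached as context.
            if isinstance(e1, DatasetNotFoundError):
                raise e1 from None
            # Otherwise wrap the first error in a generic one.
            raise FileNotFoundError(f"Couldn't load '{name}'") from e1
```

Re-raising `e1` itself, rather than a new generic error, is what lets a caller match on the specific exception type even though a fallback was attempted in between.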

severo approved these changes Nov 17, 2023

Fix tests

bf8fa7a

mariosasko reviewed Nov 21, 2023

Merge remote-tracking branch 'upstream/main' into fix-6397

87ad7c7

Remove mention to script from DataFilesNotFoundError

08ceb92

albertvillanova merged commit aa8558f into main Nov 22, 2023

albertvillanova deleted the fix-6397 branch November 22, 2023 15:12

severo mentioned this pull request Nov 23, 2023

Improve error message when no suitable file is found huggingface/dataset-viewer#2082

Closed

lhoestq mentioned this pull request Nov 30, 2023

Missing DatasetNotFoundError #6462

Merged
