Lazy data files resolution #6458

lhoestq · 2023-11-29T13:18:44Z

Related to discussion at #6255

This makes the following code run in about 2 sec instead of >10 sec:

from datasets import load_dataset

ds = load_dataset("glue", "sst2", streaming=True, trust_remote_code=False)

For some datasets with many configs and files it can be up to 100x faster.
This is particularly important now that some datasets will be loaded from the Parquet export instead of the scripts.

The data files are now resolved only in the builder's __init__. To do so, I added DataFilesPatternsList and DataFilesPatternsDict, which have a .resolve() method that returns the resolved DataFilesList and DataFilesDict, respectively.
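To illustrate the idea of deferring resolution to the builder, here is a minimal sketch. It is not the code from this PR: LazyPatternsList, LazyPatternsDict, ToyBuilder and fake_list_files are hypothetical stand-ins, and the file listing is simulated with an in-memory list instead of a real Hub or filesystem call.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import Callable, Dict, List

listing_calls = []  # records every (simulated) expensive file listing


def fake_list_files() -> List[str]:
    """Stand-in for a slow remote listing of repository files (assumption)."""
    listing_calls.append(1)
    return ["data/train-0.parquet", "data/train-1.parquet", "data/test-0.parquet", "README.md"]


@dataclass
class LazyPatternsList:
    """Hypothetical: stores glob patterns and does no I/O until resolve()."""
    patterns: List[str]
    list_files: Callable[[], List[str]]

    def resolve(self) -> List[str]:
        files = self.list_files()
        return [f for f in files if any(fnmatch(f, p) for p in self.patterns)]


class LazyPatternsDict(dict):
    """Hypothetical: maps split name -> LazyPatternsList."""

    def resolve(self) -> Dict[str, List[str]]:
        return {split: patterns.resolve() for split, patterns in self.items()}


class ToyBuilder:
    def __init__(self, data_files: LazyPatternsDict):
        # Patterns are resolved here, only for the config being built.
        self.data_files = data_files.resolve()


# Declaring many configs is cheap: nothing is listed yet.
configs = {
    name: LazyPatternsDict(
        train=LazyPatternsList(["data/train-*"], fake_list_files),
        test=LazyPatternsList(["data/test-*"], fake_list_files),
    )
    for name in ["cola", "sst2", "mrpc"]
}
calls_before_build = len(listing_calls)

# Only the requested config pays the resolution cost.
builder = ToyBuilder(configs["sst2"])
calls_after_build = len(listing_calls)
```

In this sketch, declaring three configs triggers no listing at all, and building one config lists files only for its own splits. With eager resolution, every config would be resolved up front, which is the kind of cost that grows with the number of configs and files.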

github-actions · 2023-11-29T13:23:07Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005097 / 0.011353 (-0.006256)	0.003523 / 0.011008 (-0.007485)	0.062827 / 0.038508 (0.024319)	0.051677 / 0.023109 (0.028568)	0.248919 / 0.275898 (-0.026980)	0.275892 / 0.323480 (-0.047588)	0.003908 / 0.007986 (-0.004077)	0.002622 / 0.004328 (-0.001706)	0.048634 / 0.004250 (0.044383)	0.037903 / 0.037052 (0.000850)	0.255754 / 0.258489 (-0.002735)	0.283343 / 0.293841 (-0.010498)	0.027886 / 0.128546 (-0.100660)	0.010849 / 0.075646 (-0.064797)	0.208255 / 0.419271 (-0.211017)	0.035664 / 0.043533 (-0.007869)	0.254661 / 0.255139 (-0.000478)	0.274366 / 0.283200 (-0.008834)	0.017240 / 0.141683 (-0.124443)	1.092952 / 1.452155 (-0.359203)	1.148373 / 1.492716 (-0.344344)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.091592 / 0.018006 (0.073586)	0.301926 / 0.000490 (0.301436)	0.000207 / 0.000200 (0.000007)	0.000051 / 0.000054 (-0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018525 / 0.037411 (-0.018887)	0.060539 / 0.014526 (0.046014)	0.073812 / 0.176557 (-0.102745)	0.120655 / 0.737135 (-0.616480)	0.076931 / 0.296338 (-0.219407)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.282797 / 0.215209 (0.067588)	2.746573 / 2.077655 (0.668918)	1.477652 / 1.504120 (-0.026468)	1.349922 / 1.541195 (-0.191273)	1.374347 / 1.468490 (-0.094143)	0.574096 / 4.584777 (-4.010681)	2.383317 / 3.745712 (-1.362395)	2.809320 / 5.269862 (-2.460541)	1.758947 / 4.565676 (-2.806729)	0.064029 / 0.424275 (-0.360246)	0.004936 / 0.007607 (-0.002672)	0.331403 / 0.226044 (0.105358)	3.260908 / 2.268929 (0.991980)	1.817670 / 55.444624 (-53.626954)	1.525863 / 6.876477 (-5.350613)	1.542017 / 2.142072 (-0.600055)	0.638900 / 4.805227 (-4.166327)	0.119485 / 6.500664 (-6.381179)	0.042588 / 0.075469 (-0.032881)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.951583 / 1.841788 (-0.890205)	11.621917 / 8.074308 (3.547609)	10.511062 / 10.191392 (0.319670)	0.130137 / 0.680424 (-0.550287)	0.014048 / 0.534201 (-0.520153)	0.290621 / 0.579283 (-0.288662)	0.271665 / 0.434364 (-0.162699)	0.331260 / 0.540337 (-0.209077)	0.441621 / 1.386936 (-0.945316)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005272 / 0.011353 (-0.006081)	0.003656 / 0.011008 (-0.007352)	0.049245 / 0.038508 (0.010737)	0.054130 / 0.023109 (0.031021)	0.274775 / 0.275898 (-0.001123)	0.296664 / 0.323480 (-0.026816)	0.004870 / 0.007986 (-0.003115)	0.002728 / 0.004328 (-0.001601)	0.048087 / 0.004250 (0.043837)	0.041448 / 0.037052 (0.004396)	0.279110 / 0.258489 (0.020621)	0.303660 / 0.293841 (0.009819)	0.029767 / 0.128546 (-0.098779)	0.010799 / 0.075646 (-0.064848)	0.058650 / 0.419271 (-0.360622)	0.033088 / 0.043533 (-0.010445)	0.274456 / 0.255139 (0.019317)	0.290206 / 0.283200 (0.007007)	0.017259 / 0.141683 (-0.124424)	1.176501 / 1.452155 (-0.275654)	1.197552 / 1.492716 (-0.295165)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092865 / 0.018006 (0.074859)	0.302437 / 0.000490 (0.301947)	0.000209 / 0.000200 (0.000009)	0.000048 / 0.000054 (-0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021211 / 0.037411 (-0.016200)	0.068858 / 0.014526 (0.054332)	0.081783 / 0.176557 (-0.094773)	0.120472 / 0.737135 (-0.616663)	0.083900 / 0.296338 (-0.212438)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.295157 / 0.215209 (0.079948)	2.910979 / 2.077655 (0.833324)	1.575772 / 1.504120 (0.071652)	1.456955 / 1.541195 (-0.084239)	1.468982 / 1.468490 (0.000492)	0.560309 / 4.584777 (-4.024468)	2.460171 / 3.745712 (-1.285541)	2.805713 / 5.269862 (-2.464149)	1.754074 / 4.565676 (-2.811603)	0.063333 / 0.424275 (-0.360942)	0.004940 / 0.007607 (-0.002667)	0.346141 / 0.226044 (0.120097)	3.463431 / 2.268929 (1.194502)	1.929135 / 55.444624 (-53.515490)	1.660191 / 6.876477 (-5.216286)	1.668327 / 2.142072 (-0.473746)	0.644183 / 4.805227 (-4.161044)	0.115738 / 6.500664 (-6.384926)	0.041347 / 0.075469 (-0.034122)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.961565 / 1.841788 (-0.880222)	12.232589 / 8.074308 (4.158281)	10.778774 / 10.191392 (0.587382)	0.132709 / 0.680424 (-0.547715)	0.015964 / 0.534201 (-0.518237)	0.286944 / 0.579283 (-0.292340)	0.279740 / 0.434364 (-0.154624)	0.333024 / 0.540337 (-0.207314)	0.438819 / 1.386936 (-0.948117)

HuggingFaceDocBuilderDev · 2023-11-29T13:23:50Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

github-actions · 2023-11-29T15:29:09Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005317 / 0.011353 (-0.006036)	0.003936 / 0.011008 (-0.007072)	0.063122 / 0.038508 (0.024614)	0.061274 / 0.023109 (0.038165)	0.251764 / 0.275898 (-0.024134)	0.274849 / 0.323480 (-0.048631)	0.004059 / 0.007986 (-0.003927)	0.002874 / 0.004328 (-0.001455)	0.048716 / 0.004250 (0.044465)	0.038281 / 0.037052 (0.001228)	0.265224 / 0.258489 (0.006735)	0.285962 / 0.293841 (-0.007878)	0.028522 / 0.128546 (-0.100024)	0.011150 / 0.075646 (-0.064496)	0.208362 / 0.419271 (-0.210910)	0.038900 / 0.043533 (-0.004633)	0.254113 / 0.255139 (-0.001026)	0.276721 / 0.283200 (-0.006478)	0.018372 / 0.141683 (-0.123311)	1.121336 / 1.452155 (-0.330818)	1.189548 / 1.492716 (-0.303168)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.097633 / 0.018006 (0.079627)	0.304443 / 0.000490 (0.303953)	0.000218 / 0.000200 (0.000018)	0.000054 / 0.000054 (-0.000001)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021757 / 0.037411 (-0.015654)	0.061978 / 0.014526 (0.047453)	0.076296 / 0.176557 (-0.100260)	0.122320 / 0.737135 (-0.614816)	0.076738 / 0.296338 (-0.219601)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.284328 / 0.215209 (0.069119)	2.793071 / 2.077655 (0.715417)	1.504768 / 1.504120 (0.000648)	1.386083 / 1.541195 (-0.155111)	1.457593 / 1.468490 (-0.010897)	0.575887 / 4.584777 (-4.008890)	2.419396 / 3.745712 (-1.326316)	2.931305 / 5.269862 (-2.338556)	1.840759 / 4.565676 (-2.724917)	0.063801 / 0.424275 (-0.360474)	0.004966 / 0.007607 (-0.002641)	0.341612 / 0.226044 (0.115568)	3.402842 / 2.268929 (1.133913)	1.860521 / 55.444624 (-53.584103)	1.603156 / 6.876477 (-5.273321)	1.665835 / 2.142072 (-0.476237)	0.655299 / 4.805227 (-4.149929)	0.124527 / 6.500664 (-6.376137)	0.044021 / 0.075469 (-0.031449)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.972068 / 1.841788 (-0.869720)	12.393202 / 8.074308 (4.318894)	10.420876 / 10.191392 (0.229484)	0.140684 / 0.680424 (-0.539740)	0.014442 / 0.534201 (-0.519759)	0.288182 / 0.579283 (-0.291101)	0.265029 / 0.434364 (-0.169334)	0.327133 / 0.540337 (-0.213204)	0.443403 / 1.386936 (-0.943533)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005559 / 0.011353 (-0.005794)	0.004046 / 0.011008 (-0.006962)	0.048991 / 0.038508 (0.010483)	0.059576 / 0.023109 (0.036467)	0.273596 / 0.275898 (-0.002302)	0.296658 / 0.323480 (-0.026822)	0.004089 / 0.007986 (-0.003897)	0.002777 / 0.004328 (-0.001551)	0.048216 / 0.004250 (0.043966)	0.043200 / 0.037052 (0.006148)	0.276815 / 0.258489 (0.018326)	0.300570 / 0.293841 (0.006729)	0.030250 / 0.128546 (-0.098296)	0.011322 / 0.075646 (-0.064324)	0.057843 / 0.419271 (-0.361429)	0.033366 / 0.043533 (-0.010167)	0.275636 / 0.255139 (0.020497)	0.293750 / 0.283200 (0.010550)	0.018551 / 0.141683 (-0.123132)	1.160919 / 1.452155 (-0.291236)	1.214519 / 1.492716 (-0.278197)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.100074 / 0.018006 (0.082068)	0.308434 / 0.000490 (0.307944)	0.000232 / 0.000200 (0.000032)	0.000044 / 0.000054 (-0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022600 / 0.037411 (-0.014811)	0.070506 / 0.014526 (0.055980)	0.081185 / 0.176557 (-0.095371)	0.120688 / 0.737135 (-0.616448)	0.082897 / 0.296338 (-0.213441)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.306661 / 0.215209 (0.091452)	2.989656 / 2.077655 (0.912001)	1.618868 / 1.504120 (0.114749)	1.485045 / 1.541195 (-0.056149)	1.549359 / 1.468490 (0.080869)	0.593596 / 4.584777 (-3.991181)	2.466215 / 3.745712 (-1.279497)	2.956570 / 5.269862 (-2.313292)	1.823160 / 4.565676 (-2.742516)	0.063442 / 0.424275 (-0.360833)	0.004928 / 0.007607 (-0.002679)	0.358464 / 0.226044 (0.132419)	3.566345 / 2.268929 (1.297417)	2.006784 / 55.444624 (-53.437840)	1.687091 / 6.876477 (-5.189386)	1.729464 / 2.142072 (-0.412609)	0.655656 / 4.805227 (-4.149572)	0.119044 / 6.500664 (-6.381620)	0.042782 / 0.075469 (-0.032687)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.974937 / 1.841788 (-0.866850)	12.992888 / 8.074308 (4.918580)	10.893713 / 10.191392 (0.702321)	0.133853 / 0.680424 (-0.546570)	0.016055 / 0.534201 (-0.518145)	0.289342 / 0.579283 (-0.289941)	0.286094 / 0.434364 (-0.148270)	0.328670 / 0.540337 (-0.211667)	0.444605 / 1.386936 (-0.942331)

github-actions · 2023-11-29T15:31:05Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005705 / 0.011353 (-0.005648)	0.003519 / 0.011008 (-0.007489)	0.062009 / 0.038508 (0.023501)	0.053481 / 0.023109 (0.030372)	0.262669 / 0.275898 (-0.013229)	0.280290 / 0.323480 (-0.043189)	0.002957 / 0.007986 (-0.005029)	0.002587 / 0.004328 (-0.001741)	0.047876 / 0.004250 (0.043626)	0.038868 / 0.037052 (0.001815)	0.267854 / 0.258489 (0.009365)	0.290430 / 0.293841 (-0.003411)	0.028120 / 0.128546 (-0.100427)	0.011042 / 0.075646 (-0.064605)	0.206113 / 0.419271 (-0.213158)	0.036039 / 0.043533 (-0.007494)	0.257715 / 0.255139 (0.002576)	0.281279 / 0.283200 (-0.001921)	0.019790 / 0.141683 (-0.121893)	1.114472 / 1.452155 (-0.337683)	1.192219 / 1.492716 (-0.300497)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.091049 / 0.018006 (0.073043)	0.300846 / 0.000490 (0.300356)	0.000208 / 0.000200 (0.000008)	0.000051 / 0.000054 (-0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018569 / 0.037411 (-0.018843)	0.060075 / 0.014526 (0.045549)	0.073877 / 0.176557 (-0.102680)	0.120337 / 0.737135 (-0.616799)	0.075454 / 0.296338 (-0.220884)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.290084 / 0.215209 (0.074875)	2.805712 / 2.077655 (0.728057)	1.459393 / 1.504120 (-0.044727)	1.327356 / 1.541195 (-0.213838)	1.384734 / 1.468490 (-0.083756)	0.574532 / 4.584777 (-4.010245)	2.419696 / 3.745712 (-1.326016)	2.805449 / 5.269862 (-2.464412)	1.764127 / 4.565676 (-2.801549)	0.063256 / 0.424275 (-0.361020)	0.004954 / 0.007607 (-0.002653)	0.344246 / 0.226044 (0.118202)	3.396050 / 2.268929 (1.127121)	1.807621 / 55.444624 (-53.637004)	1.536627 / 6.876477 (-5.339850)	1.552450 / 2.142072 (-0.589623)	0.651156 / 4.805227 (-4.154071)	0.119358 / 6.500664 (-6.381306)	0.042810 / 0.075469 (-0.032660)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.930646 / 1.841788 (-0.911142)	11.830454 / 8.074308 (3.756146)	10.615315 / 10.191392 (0.423923)	0.130617 / 0.680424 (-0.549807)	0.014081 / 0.534201 (-0.520120)	0.285027 / 0.579283 (-0.294256)	0.267296 / 0.434364 (-0.167068)	0.331478 / 0.540337 (-0.208859)	0.442676 / 1.386936 (-0.944260)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005340 / 0.011353 (-0.006013)	0.003745 / 0.011008 (-0.007264)	0.049011 / 0.038508 (0.010503)	0.051342 / 0.023109 (0.028233)	0.272482 / 0.275898 (-0.003416)	0.292816 / 0.323480 (-0.030663)	0.003977 / 0.007986 (-0.004008)	0.002642 / 0.004328 (-0.001687)	0.048213 / 0.004250 (0.043963)	0.040341 / 0.037052 (0.003289)	0.275176 / 0.258489 (0.016687)	0.301098 / 0.293841 (0.007257)	0.029052 / 0.128546 (-0.099495)	0.010796 / 0.075646 (-0.064850)	0.057654 / 0.419271 (-0.361618)	0.032914 / 0.043533 (-0.010619)	0.271235 / 0.255139 (0.016096)	0.289883 / 0.283200 (0.006684)	0.018548 / 0.141683 (-0.123135)	1.134072 / 1.452155 (-0.318083)	1.208228 / 1.492716 (-0.284488)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.094524 / 0.018006 (0.076518)	0.310162 / 0.000490 (0.309672)	0.000237 / 0.000200 (0.000037)	0.000057 / 0.000054 (0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021090 / 0.037411 (-0.016321)	0.068351 / 0.014526 (0.053825)	0.082370 / 0.176557 (-0.094186)	0.121648 / 0.737135 (-0.615487)	0.083433 / 0.296338 (-0.212906)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.294616 / 0.215209 (0.079407)	2.894194 / 2.077655 (0.816539)	1.619739 / 1.504120 (0.115619)	1.492466 / 1.541195 (-0.048729)	1.511662 / 1.468490 (0.043172)	0.557179 / 4.584777 (-4.027597)	2.400669 / 3.745712 (-1.345043)	2.781363 / 5.269862 (-2.488499)	1.769144 / 4.565676 (-2.796533)	0.063996 / 0.424275 (-0.360279)	0.004922 / 0.007607 (-0.002685)	0.354483 / 0.226044 (0.128438)	3.474795 / 2.268929 (1.205867)	1.985743 / 55.444624 (-53.458881)	1.693173 / 6.876477 (-5.183303)	1.695857 / 2.142072 (-0.446216)	0.654800 / 4.805227 (-4.150427)	0.117316 / 6.500664 (-6.383348)	0.040708 / 0.075469 (-0.034761)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.977678 / 1.841788 (-0.864109)	12.214098 / 8.074308 (4.139790)	10.741857 / 10.191392 (0.550465)	0.130308 / 0.680424 (-0.550116)	0.015053 / 0.534201 (-0.519148)	0.295496 / 0.579283 (-0.283787)	0.276348 / 0.434364 (-0.158015)	0.326568 / 0.540337 (-0.213769)	0.441902 / 1.386936 (-0.945034)

github-actions · 2023-11-29T15:35:54Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005218 / 0.011353 (-0.006135)	0.003270 / 0.011008 (-0.007738)	0.062380 / 0.038508 (0.023872)	0.052896 / 0.023109 (0.029787)	0.233060 / 0.275898 (-0.042838)	0.259194 / 0.323480 (-0.064286)	0.002880 / 0.007986 (-0.005106)	0.002643 / 0.004328 (-0.001686)	0.048084 / 0.004250 (0.043833)	0.038807 / 0.037052 (0.001755)	0.244925 / 0.258489 (-0.013564)	0.269619 / 0.293841 (-0.024222)	0.026901 / 0.128546 (-0.101646)	0.010150 / 0.075646 (-0.065497)	0.206854 / 0.419271 (-0.212417)	0.035618 / 0.043533 (-0.007915)	0.239577 / 0.255139 (-0.015562)	0.259684 / 0.283200 (-0.023516)	0.019823 / 0.141683 (-0.121860)	1.074472 / 1.452155 (-0.377682)	1.142911 / 1.492716 (-0.349805)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092616 / 0.018006 (0.074610)	0.301974 / 0.000490 (0.301485)	0.000201 / 0.000200 (0.000002)	0.000048 / 0.000054 (-0.000007)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018864 / 0.037411 (-0.018548)	0.061007 / 0.014526 (0.046481)	0.073228 / 0.176557 (-0.103328)	0.120719 / 0.737135 (-0.616416)	0.075686 / 0.296338 (-0.220653)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.281404 / 0.215209 (0.066195)	2.777671 / 2.077655 (0.700017)	1.464689 / 1.504120 (-0.039431)	1.345357 / 1.541195 (-0.195838)	1.384273 / 1.468490 (-0.084217)	0.560298 / 4.584777 (-4.024479)	2.389877 / 3.745712 (-1.355835)	2.755564 / 5.269862 (-2.514297)	1.737754 / 4.565676 (-2.827922)	0.063025 / 0.424275 (-0.361251)	0.004975 / 0.007607 (-0.002632)	0.346741 / 0.226044 (0.120697)	3.321918 / 2.268929 (1.052989)	1.815700 / 55.444624 (-53.628924)	1.547333 / 6.876477 (-5.329144)	1.564809 / 2.142072 (-0.577263)	0.638645 / 4.805227 (-4.166582)	0.118157 / 6.500664 (-6.382507)	0.041605 / 0.075469 (-0.033864)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.942515 / 1.841788 (-0.899273)	11.400386 / 8.074308 (3.326078)	10.208763 / 10.191392 (0.017370)	0.138144 / 0.680424 (-0.542280)	0.014354 / 0.534201 (-0.519847)	0.288289 / 0.579283 (-0.290994)	0.265973 / 0.434364 (-0.168391)	0.327703 / 0.540337 (-0.212634)	0.435474 / 1.386936 (-0.951462)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005163 / 0.011353 (-0.006190)	0.003307 / 0.011008 (-0.007701)	0.048885 / 0.038508 (0.010377)	0.049044 / 0.023109 (0.025935)	0.261408 / 0.275898 (-0.014490)	0.284625 / 0.323480 (-0.038855)	0.003970 / 0.007986 (-0.004015)	0.002754 / 0.004328 (-0.001575)	0.048271 / 0.004250 (0.044021)	0.039849 / 0.037052 (0.002797)	0.266898 / 0.258489 (0.008409)	0.291445 / 0.293841 (-0.002396)	0.028477 / 0.128546 (-0.100069)	0.010656 / 0.075646 (-0.064990)	0.057732 / 0.419271 (-0.361539)	0.033298 / 0.043533 (-0.010235)	0.297773 / 0.255139 (0.042634)	0.281894 / 0.283200 (-0.001305)	0.018595 / 0.141683 (-0.123088)	1.168849 / 1.452155 (-0.283306)	1.183493 / 1.492716 (-0.309224)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092683 / 0.018006 (0.074677)	0.300387 / 0.000490 (0.299897)	0.000221 / 0.000200 (0.000021)	0.000052 / 0.000054 (-0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021356 / 0.037411 (-0.016055)	0.068095 / 0.014526 (0.053569)	0.079806 / 0.176557 (-0.096750)	0.118965 / 0.737135 (-0.618170)	0.082066 / 0.296338 (-0.214273)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.293105 / 0.215209 (0.077896)	2.842800 / 2.077655 (0.765146)	1.572052 / 1.504120 (0.067932)	1.450156 / 1.541195 (-0.091038)	1.464227 / 1.468490 (-0.004263)	0.561215 / 4.584777 (-4.023562)	2.456117 / 3.745712 (-1.289596)	2.739766 / 5.269862 (-2.530095)	1.730354 / 4.565676 (-2.835323)	0.062636 / 0.424275 (-0.361639)	0.004933 / 0.007607 (-0.002674)	0.345800 / 0.226044 (0.119756)	3.415858 / 2.268929 (1.146929)	1.937288 / 55.444624 (-53.507336)	1.661975 / 6.876477 (-5.214502)	1.660347 / 2.142072 (-0.481726)	0.642780 / 4.805227 (-4.162448)	0.116643 / 6.500664 (-6.384021)	0.041282 / 0.075469 (-0.034187)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.976629 / 1.841788 (-0.865159)	11.900319 / 8.074308 (3.826011)	10.574198 / 10.191392 (0.382806)	0.129689 / 0.680424 (-0.550735)	0.015390 / 0.534201 (-0.518811)	0.286543 / 0.579283 (-0.292741)	0.277676 / 0.434364 (-0.156688)	0.325053 / 0.540337 (-0.215284)	0.439663 / 1.386936 (-0.947274)

github-actions · 2023-11-29T15:55:16Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005382 / 0.011353 (-0.005971)	0.003606 / 0.011008 (-0.007402)	0.063234 / 0.038508 (0.024726)	0.053738 / 0.023109 (0.030629)	0.250405 / 0.275898 (-0.025493)	0.272244 / 0.323480 (-0.051236)	0.002896 / 0.007986 (-0.005090)	0.002684 / 0.004328 (-0.001644)	0.048394 / 0.004250 (0.044143)	0.039017 / 0.037052 (0.001964)	0.259554 / 0.258489 (0.001065)	0.287215 / 0.293841 (-0.006626)	0.028290 / 0.128546 (-0.100257)	0.011482 / 0.075646 (-0.064164)	0.214264 / 0.419271 (-0.205007)	0.036257 / 0.043533 (-0.007276)	0.252873 / 0.255139 (-0.002266)	0.271269 / 0.283200 (-0.011931)	0.017173 / 0.141683 (-0.124510)	1.137474 / 1.452155 (-0.314681)	1.161499 / 1.492716 (-0.331217)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092424 / 0.018006 (0.074418)	0.283703 / 0.000490 (0.283213)	0.000209 / 0.000200 (0.000009)	0.000044 / 0.000054 (-0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018307 / 0.037411 (-0.019105)	0.060780 / 0.014526 (0.046254)	0.073984 / 0.176557 (-0.102573)	0.120824 / 0.737135 (-0.616311)	0.074724 / 0.296338 (-0.221615)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.297682 / 0.215209 (0.082473)	2.853267 / 2.077655 (0.775612)	1.567643 / 1.504120 (0.063523)	1.437218 / 1.541195 (-0.103976)	1.467187 / 1.468490 (-0.001304)	0.560552 / 4.584777 (-4.024225)	2.387848 / 3.745712 (-1.357864)	2.718946 / 5.269862 (-2.550916)	1.724107 / 4.565676 (-2.841570)	0.061923 / 0.424275 (-0.362352)	0.004828 / 0.007607 (-0.002779)	0.353916 / 0.226044 (0.127871)	3.404477 / 2.268929 (1.135548)	1.906078 / 55.444624 (-53.538546)	1.629686 / 6.876477 (-5.246791)	1.640839 / 2.142072 (-0.501233)	0.641082 / 4.805227 (-4.164145)	0.118078 / 6.500664 (-6.382586)	0.041881 / 0.075469 (-0.033588)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.936062 / 1.841788 (-0.905726)	11.397678 / 8.074308 (3.323370)	10.385159 / 10.191392 (0.193766)	0.127337 / 0.680424 (-0.553087)	0.013562 / 0.534201 (-0.520639)	0.290817 / 0.579283 (-0.288466)	0.259377 / 0.434364 (-0.174987)	0.324829 / 0.540337 (-0.215508)	0.434344 / 1.386936 (-0.952592)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005134 / 0.011353 (-0.006219)	0.003404 / 0.011008 (-0.007604)	0.048281 / 0.038508 (0.009772)	0.050952 / 0.023109 (0.027842)	0.277553 / 0.275898 (0.001655)	0.298855 / 0.323480 (-0.024625)	0.003928 / 0.007986 (-0.004058)	0.002642 / 0.004328 (-0.001687)	0.047374 / 0.004250 (0.043123)	0.039883 / 0.037052 (0.002831)	0.279808 / 0.258489 (0.021318)	0.301604 / 0.293841 (0.007763)	0.028708 / 0.128546 (-0.099838)	0.010949 / 0.075646 (-0.064697)	0.057090 / 0.419271 (-0.362181)	0.032438 / 0.043533 (-0.011095)	0.274690 / 0.255139 (0.019551)	0.290912 / 0.283200 (0.007712)	0.017556 / 0.141683 (-0.124127)	1.111091 / 1.452155 (-0.341064)	1.166063 / 1.492716 (-0.326653)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.090557 / 0.018006 (0.072551)	0.298661 / 0.000490 (0.298171)	0.000228 / 0.000200 (0.000028)	0.000045 / 0.000054 (-0.000009)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021712 / 0.037411 (-0.015699)	0.068682 / 0.014526 (0.054156)	0.080108 / 0.176557 (-0.096449)	0.119480 / 0.737135 (-0.617655)	0.082703 / 0.296338 (-0.213636)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.294095 / 0.215209 (0.078886)	2.884758 / 2.077655 (0.807103)	1.598312 / 1.504120 (0.094192)	1.480050 / 1.541195 (-0.061145)	1.488611 / 1.468490 (0.020121)	0.556052 / 4.584777 (-4.028724)	2.435484 / 3.745712 (-1.310228)	2.741592 / 5.269862 (-2.528270)	1.706223 / 4.565676 (-2.859454)	0.062214 / 0.424275 (-0.362061)	0.004901 / 0.007607 (-0.002706)	0.346301 / 0.226044 (0.120257)	3.474516 / 2.268929 (1.205587)	1.995205 / 55.444624 (-53.449419)	1.726349 / 6.876477 (-5.150128)	1.659600 / 2.142072 (-0.482472)	0.643560 / 4.805227 (-4.161667)	0.115222 / 6.500664 (-6.385442)	0.041137 / 0.075469 (-0.034332)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.974566 / 1.841788 (-0.867221)	11.872479 / 8.074308 (3.798171)	10.496919 / 10.191392 (0.305527)	0.129087 / 0.680424 (-0.551337)	0.014627 / 0.534201 (-0.519574)	0.289070 / 0.579283 (-0.290213)	0.269609 / 0.434364 (-0.164755)	0.327785 / 0.540337 (-0.212553)	0.444634 / 1.386936 (-0.942302)

github-actions · 2023-11-29T17:10:08Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005080 / 0.011353 (-0.006273)	0.003782 / 0.011008 (-0.007226)	0.062816 / 0.038508 (0.024308)	0.056338 / 0.023109 (0.033229)	0.251317 / 0.275898 (-0.024581)	0.269414 / 0.323480 (-0.054066)	0.003984 / 0.007986 (-0.004001)	0.002749 / 0.004328 (-0.001580)	0.048126 / 0.004250 (0.043876)	0.038516 / 0.037052 (0.001464)	0.253809 / 0.258489 (-0.004680)	0.283309 / 0.293841 (-0.010532)	0.027015 / 0.128546 (-0.101531)	0.010610 / 0.075646 (-0.065037)	0.213024 / 0.419271 (-0.206247)	0.035734 / 0.043533 (-0.007799)	0.247909 / 0.255139 (-0.007230)	0.263539 / 0.283200 (-0.019660)	0.018408 / 0.141683 (-0.123275)	1.104366 / 1.452155 (-0.347789)	1.169668 / 1.492716 (-0.323048)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.114366 / 0.018006 (0.096360)	0.317674 / 0.000490 (0.317184)	0.000227 / 0.000200 (0.000027)	0.000043 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018955 / 0.037411 (-0.018457)	0.060716 / 0.014526 (0.046190)	0.072963 / 0.176557 (-0.103593)	0.121671 / 0.737135 (-0.615464)	0.073785 / 0.296338 (-0.222554)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.292349 / 0.215209 (0.077140)	2.832049 / 2.077655 (0.754394)	1.504488 / 1.504120 (0.000368)	1.403418 / 1.541195 (-0.137777)	1.449223 / 1.468490 (-0.019267)	0.563846 / 4.584777 (-4.020931)	2.376726 / 3.745712 (-1.368986)	2.823304 / 5.269862 (-2.446558)	1.774858 / 4.565676 (-2.790818)	0.063229 / 0.424275 (-0.361046)	0.004923 / 0.007607 (-0.002684)	0.347240 / 0.226044 (0.121195)	3.486563 / 2.268929 (1.217634)	1.890516 / 55.444624 (-53.554109)	1.570620 / 6.876477 (-5.305857)	1.600842 / 2.142072 (-0.541231)	0.644287 / 4.805227 (-4.160940)	0.116931 / 6.500664 (-6.383733)	0.042068 / 0.075469 (-0.033401)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.935662 / 1.841788 (-0.906126)	11.950247 / 8.074308 (3.875939)	10.636225 / 10.191392 (0.444833)	0.139137 / 0.680424 (-0.541287)	0.014473 / 0.534201 (-0.519728)	0.294213 / 0.579283 (-0.285070)	0.273413 / 0.434364 (-0.160951)	0.325930 / 0.540337 (-0.214407)	0.444265 / 1.386936 (-0.942671)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005448 / 0.011353 (-0.005904)	0.003155 / 0.011008 (-0.007853)	0.048626 / 0.038508 (0.010117)	0.057427 / 0.023109 (0.034318)	0.270412 / 0.275898 (-0.005486)	0.290816 / 0.323480 (-0.032664)	0.004744 / 0.007986 (-0.003241)	0.002776 / 0.004328 (-0.001552)	0.047953 / 0.004250 (0.043703)	0.041126 / 0.037052 (0.004073)	0.276046 / 0.258489 (0.017557)	0.297548 / 0.293841 (0.003707)	0.029308 / 0.128546 (-0.099238)	0.010516 / 0.075646 (-0.065131)	0.056982 / 0.419271 (-0.362290)	0.032922 / 0.043533 (-0.010611)	0.271342 / 0.255139 (0.016203)	0.288963 / 0.283200 (0.005763)	0.019048 / 0.141683 (-0.122635)	1.130453 / 1.452155 (-0.321702)	1.206462 / 1.492716 (-0.286254)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.099249 / 0.018006 (0.081242)	0.312409 / 0.000490 (0.311919)	0.000224 / 0.000200 (0.000024)	0.000044 / 0.000054 (-0.000011)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021992 / 0.037411 (-0.015419)	0.068377 / 0.014526 (0.053851)	0.080749 / 0.176557 (-0.095807)	0.120534 / 0.737135 (-0.616602)	0.082549 / 0.296338 (-0.213790)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.299634 / 0.215209 (0.084425)	2.943496 / 2.077655 (0.865841)	1.602842 / 1.504120 (0.098722)	1.462140 / 1.541195 (-0.079055)	1.511082 / 1.468490 (0.042592)	0.574148 / 4.584777 (-4.010629)	2.492158 / 3.745712 (-1.253554)	2.921695 / 5.269862 (-2.348166)	1.812416 / 4.565676 (-2.753260)	0.064145 / 0.424275 (-0.360130)	0.005133 / 0.007607 (-0.002475)	0.357935 / 0.226044 (0.131891)	3.543728 / 2.268929 (1.274800)	1.948676 / 55.444624 (-53.495948)	1.664960 / 6.876477 (-5.211517)	1.678703 / 2.142072 (-0.463370)	0.645867 / 4.805227 (-4.159360)	0.117671 / 6.500664 (-6.382993)	0.040887 / 0.075469 (-0.034582)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.979127 / 1.841788 (-0.862661)	12.363904 / 8.074308 (4.289596)	10.673725 / 10.191392 (0.482333)	0.143358 / 0.680424 (-0.537066)	0.015375 / 0.534201 (-0.518825)	0.287590 / 0.579283 (-0.291694)	0.284742 / 0.434364 (-0.149622)	0.326901 / 0.540337 (-0.213437)	0.443962 / 1.386936 (-0.942974)

github-actions · 2023-11-30T11:00:34Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004994 / 0.011353 (-0.006359)	0.003368 / 0.011008 (-0.007640)	0.062803 / 0.038508 (0.024295)	0.050778 / 0.023109 (0.027669)	0.255955 / 0.275898 (-0.019943)	0.278215 / 0.323480 (-0.045265)	0.003801 / 0.007986 (-0.004184)	0.002703 / 0.004328 (-0.001626)	0.048369 / 0.004250 (0.044119)	0.037795 / 0.037052 (0.000743)	0.255634 / 0.258489 (-0.002855)	0.284226 / 0.293841 (-0.009615)	0.027252 / 0.128546 (-0.101294)	0.010686 / 0.075646 (-0.064961)	0.206139 / 0.419271 (-0.213133)	0.035543 / 0.043533 (-0.007990)	0.257167 / 0.255139 (0.002028)	0.277784 / 0.283200 (-0.005416)	0.016938 / 0.141683 (-0.124745)	1.108595 / 1.452155 (-0.343560)	1.188542 / 1.492716 (-0.304175)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.090938 / 0.018006 (0.072932)	0.298463 / 0.000490 (0.297973)	0.000203 / 0.000200 (0.000003)	0.000048 / 0.000054 (-0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027762 / 0.037411 (-0.009649)	0.060539 / 0.014526 (0.046014)	0.075986 / 0.176557 (-0.100570)	0.133851 / 0.737135 (-0.603285)	0.074669 / 0.296338 (-0.221670)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.285614 / 0.215209 (0.070405)	2.810529 / 2.077655 (0.732874)	1.537092 / 1.504120 (0.032973)	1.412211 / 1.541195 (-0.128983)	1.446395 / 1.468490 (-0.022095)	0.559008 / 4.584777 (-4.025769)	2.343445 / 3.745712 (-1.402267)	2.748113 / 5.269862 (-2.521748)	1.733593 / 4.565676 (-2.832083)	0.061720 / 0.424275 (-0.362555)	0.004930 / 0.007607 (-0.002677)	0.330646 / 0.226044 (0.104602)	3.314999 / 2.268929 (1.046071)	1.854527 / 55.444624 (-53.590098)	1.605819 / 6.876477 (-5.270657)	1.591406 / 2.142072 (-0.550667)	0.624239 / 4.805227 (-4.180988)	0.115352 / 6.500664 (-6.385312)	0.041600 / 0.075469 (-0.033869)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.933179 / 1.841788 (-0.908608)	11.456372 / 8.074308 (3.382064)	10.578042 / 10.191392 (0.386650)	0.128045 / 0.680424 (-0.552379)	0.014212 / 0.534201 (-0.519989)	0.284795 / 0.579283 (-0.294488)	0.266210 / 0.434364 (-0.168153)	0.344468 / 0.540337 (-0.195869)	0.434414 / 1.386936 (-0.952522)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005142 / 0.011353 (-0.006211)	0.003607 / 0.011008 (-0.007401)	0.048770 / 0.038508 (0.010262)	0.051147 / 0.023109 (0.028038)	0.277329 / 0.275898 (0.001430)	0.300863 / 0.323480 (-0.022617)	0.004005 / 0.007986 (-0.003980)	0.002624 / 0.004328 (-0.001705)	0.047740 / 0.004250 (0.043489)	0.040811 / 0.037052 (0.003759)	0.280020 / 0.258489 (0.021531)	0.303758 / 0.293841 (0.009918)	0.028273 / 0.128546 (-0.100274)	0.010379 / 0.075646 (-0.065267)	0.057503 / 0.419271 (-0.361768)	0.032717 / 0.043533 (-0.010816)	0.277560 / 0.255139 (0.022421)	0.300622 / 0.283200 (0.017422)	0.018142 / 0.141683 (-0.123541)	1.121890 / 1.452155 (-0.330265)	1.251481 / 1.492716 (-0.241235)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.091523 / 0.018006 (0.073517)	0.300173 / 0.000490 (0.299683)	0.000216 / 0.000200 (0.000016)	0.000051 / 0.000054 (-0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.026386 / 0.037411 (-0.011025)	0.078710 / 0.014526 (0.064184)	0.090594 / 0.176557 (-0.085962)	0.130623 / 0.737135 (-0.606512)	0.092637 / 0.296338 (-0.203701)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.299427 / 0.215209 (0.084218)	2.929463 / 2.077655 (0.851808)	1.608905 / 1.504120 (0.104785)	1.490863 / 1.541195 (-0.050331)	1.484286 / 1.468490 (0.015796)	0.568208 / 4.584777 (-4.016569)	2.447081 / 3.745712 (-1.298632)	2.801287 / 5.269862 (-2.468574)	1.744449 / 4.565676 (-2.821227)	0.064222 / 0.424275 (-0.360053)	0.004959 / 0.007607 (-0.002648)	0.350207 / 0.226044 (0.124162)	3.471944 / 2.268929 (1.203016)	1.951715 / 55.444624 (-53.492909)	1.668764 / 6.876477 (-5.207713)	1.675322 / 2.142072 (-0.466751)	0.642217 / 4.805227 (-4.163011)	0.116776 / 6.500664 (-6.383888)	0.040812 / 0.075469 (-0.034658)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.996478 / 1.841788 (-0.845310)	12.090647 / 8.074308 (4.016339)	10.723688 / 10.191392 (0.532296)	0.141770 / 0.680424 (-0.538653)	0.015578 / 0.534201 (-0.518623)	0.288236 / 0.579283 (-0.291047)	0.278542 / 0.434364 (-0.155822)	0.327411 / 0.540337 (-0.212927)	0.450309 / 1.386936 (-0.936627)

github-actions · 2023-11-30T18:36:47Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.004967 / 0.011353 (-0.006385)	0.003382 / 0.011008 (-0.007627)	0.063436 / 0.038508 (0.024928)	0.050769 / 0.023109 (0.027659)	0.254214 / 0.275898 (-0.021684)	0.272076 / 0.323480 (-0.051404)	0.003815 / 0.007986 (-0.004170)	0.002618 / 0.004328 (-0.001711)	0.049021 / 0.004250 (0.044771)	0.037329 / 0.037052 (0.000277)	0.261112 / 0.258489 (0.002623)	0.284133 / 0.293841 (-0.009708)	0.026828 / 0.128546 (-0.101719)	0.010757 / 0.075646 (-0.064889)	0.208047 / 0.419271 (-0.211225)	0.035061 / 0.043533 (-0.008472)	0.250896 / 0.255139 (-0.004243)	0.273038 / 0.283200 (-0.010162)	0.016559 / 0.141683 (-0.125124)	1.128899 / 1.452155 (-0.323255)	1.188857 / 1.492716 (-0.303860)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.100121 / 0.018006 (0.082114)	0.298427 / 0.000490 (0.297937)	0.000218 / 0.000200 (0.000018)	0.000043 / 0.000054 (-0.000012)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018369 / 0.037411 (-0.019042)	0.060425 / 0.014526 (0.045899)	0.073501 / 0.176557 (-0.103055)	0.120254 / 0.737135 (-0.616881)	0.074889 / 0.296338 (-0.221450)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.287153 / 0.215209 (0.071944)	2.797036 / 2.077655 (0.719382)	1.446216 / 1.504120 (-0.057904)	1.336015 / 1.541195 (-0.205179)	1.369841 / 1.468490 (-0.098650)	0.559424 / 4.584777 (-4.025353)	2.361344 / 3.745712 (-1.384368)	2.766619 / 5.269862 (-2.503243)	1.747235 / 4.565676 (-2.818441)	0.066243 / 0.424275 (-0.358032)	0.004974 / 0.007607 (-0.002633)	0.333565 / 0.226044 (0.107520)	3.319877 / 2.268929 (1.050948)	1.798024 / 55.444624 (-53.646601)	1.495896 / 6.876477 (-5.380580)	1.529243 / 2.142072 (-0.612830)	0.636609 / 4.805227 (-4.168618)	0.116151 / 6.500664 (-6.384514)	0.041779 / 0.075469 (-0.033690)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.952176 / 1.841788 (-0.889611)	11.559160 / 8.074308 (3.484852)	10.556771 / 10.191392 (0.365379)	0.127118 / 0.680424 (-0.553306)	0.014142 / 0.534201 (-0.520059)	0.286585 / 0.579283 (-0.292698)	0.260233 / 0.434364 (-0.174131)	0.324012 / 0.540337 (-0.216326)	0.435131 / 1.386936 (-0.951805)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005171 / 0.011353 (-0.006182)	0.003402 / 0.011008 (-0.007607)	0.048826 / 0.038508 (0.010318)	0.050455 / 0.023109 (0.027346)	0.272120 / 0.275898 (-0.003778)	0.290404 / 0.323480 (-0.033076)	0.003986 / 0.007986 (-0.003999)	0.002569 / 0.004328 (-0.001760)	0.047845 / 0.004250 (0.043595)	0.040203 / 0.037052 (0.003150)	0.278263 / 0.258489 (0.019774)	0.299255 / 0.293841 (0.005414)	0.028643 / 0.128546 (-0.099903)	0.010584 / 0.075646 (-0.065062)	0.056921 / 0.419271 (-0.362351)	0.032362 / 0.043533 (-0.011171)	0.274010 / 0.255139 (0.018871)	0.288601 / 0.283200 (0.005401)	0.017856 / 0.141683 (-0.123827)	1.154112 / 1.452155 (-0.298043)	1.216288 / 1.492716 (-0.276428)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.091399 / 0.018006 (0.073392)	0.299966 / 0.000490 (0.299477)	0.000218 / 0.000200 (0.000018)	0.000054 / 0.000054 (-0.000000)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021728 / 0.037411 (-0.015683)	0.068285 / 0.014526 (0.053759)	0.081767 / 0.176557 (-0.094789)	0.120000 / 0.737135 (-0.617135)	0.082149 / 0.296338 (-0.214189)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.289625 / 0.215209 (0.074416)	2.835114 / 2.077655 (0.757460)	1.583207 / 1.504120 (0.079087)	1.465251 / 1.541195 (-0.075944)	1.480691 / 1.468490 (0.012200)	0.569103 / 4.584777 (-4.015674)	2.416981 / 3.745712 (-1.328731)	2.761746 / 5.269862 (-2.508115)	1.720055 / 4.565676 (-2.845621)	0.063349 / 0.424275 (-0.360926)	0.004931 / 0.007607 (-0.002676)	0.343658 / 0.226044 (0.117614)	3.362996 / 2.268929 (1.094068)	1.948088 / 55.444624 (-53.496536)	1.659504 / 6.876477 (-5.216973)	1.660359 / 2.142072 (-0.481713)	0.647871 / 4.805227 (-4.157356)	0.117395 / 6.500664 (-6.383269)	0.041049 / 0.075469 (-0.034420)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.953971 / 1.841788 (-0.887817)	12.076998 / 8.074308 (4.002690)	10.549021 / 10.191392 (0.357629)	0.130026 / 0.680424 (-0.550398)	0.015697 / 0.534201 (-0.518504)	0.287125 / 0.579283 (-0.292158)	0.298402 / 0.434364 (-0.135962)	0.326005 / 0.540337 (-0.214332)	0.444065 / 1.386936 (-0.942871)

github-actions · 2023-11-30T18:37:27Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005053 / 0.011353 (-0.006300)	0.003537 / 0.011008 (-0.007472)	0.062923 / 0.038508 (0.024415)	0.053796 / 0.023109 (0.030687)	0.242523 / 0.275898 (-0.033375)	0.264014 / 0.323480 (-0.059466)	0.002879 / 0.007986 (-0.005106)	0.003273 / 0.004328 (-0.001055)	0.048735 / 0.004250 (0.044484)	0.037541 / 0.037052 (0.000488)	0.248587 / 0.258489 (-0.009902)	0.275531 / 0.293841 (-0.018310)	0.027215 / 0.128546 (-0.101331)	0.010466 / 0.075646 (-0.065180)	0.206508 / 0.419271 (-0.212763)	0.035606 / 0.043533 (-0.007927)	0.251044 / 0.255139 (-0.004095)	0.267183 / 0.283200 (-0.016016)	0.018357 / 0.141683 (-0.123326)	1.083513 / 1.452155 (-0.368642)	1.152988 / 1.492716 (-0.339728)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.091749 / 0.018006 (0.073742)	0.299946 / 0.000490 (0.299456)	0.000212 / 0.000200 (0.000013)	0.000042 / 0.000054 (-0.000013)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018300 / 0.037411 (-0.019111)	0.060691 / 0.014526 (0.046166)	0.072998 / 0.176557 (-0.103559)	0.120581 / 0.737135 (-0.616554)	0.073912 / 0.296338 (-0.222427)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.277602 / 0.215209 (0.062393)	2.719181 / 2.077655 (0.641526)	1.450894 / 1.504120 (-0.053226)	1.314344 / 1.541195 (-0.226851)	1.351996 / 1.468490 (-0.116494)	0.586231 / 4.584777 (-3.998546)	2.349746 / 3.745712 (-1.395967)	2.810060 / 5.269862 (-2.459802)	1.761362 / 4.565676 (-2.804314)	0.062535 / 0.424275 (-0.361740)	0.004918 / 0.007607 (-0.002689)	0.336091 / 0.226044 (0.110047)	3.238139 / 2.268929 (0.969211)	1.769734 / 55.444624 (-53.674890)	1.505332 / 6.876477 (-5.371145)	1.527875 / 2.142072 (-0.614198)	0.640194 / 4.805227 (-4.165033)	0.116567 / 6.500664 (-6.384097)	0.042464 / 0.075469 (-0.033005)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.930919 / 1.841788 (-0.910869)	11.462498 / 8.074308 (3.388190)	10.575359 / 10.191392 (0.383967)	0.130567 / 0.680424 (-0.549857)	0.014203 / 0.534201 (-0.519998)	0.286944 / 0.579283 (-0.292339)	0.264706 / 0.434364 (-0.169658)	0.324820 / 0.540337 (-0.215517)	0.434579 / 1.386936 (-0.952357)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005164 / 0.011353 (-0.006189)	0.003442 / 0.011008 (-0.007567)	0.050146 / 0.038508 (0.011638)	0.050800 / 0.023109 (0.027691)	0.263405 / 0.275898 (-0.012493)	0.284876 / 0.323480 (-0.038604)	0.004011 / 0.007986 (-0.003975)	0.002602 / 0.004328 (-0.001726)	0.046742 / 0.004250 (0.042491)	0.040393 / 0.037052 (0.003341)	0.265052 / 0.258489 (0.006563)	0.294217 / 0.293841 (0.000377)	0.028429 / 0.128546 (-0.100118)	0.010418 / 0.075646 (-0.065228)	0.057285 / 0.419271 (-0.361987)	0.032137 / 0.043533 (-0.011396)	0.265867 / 0.255139 (0.010728)	0.284764 / 0.283200 (0.001564)	0.017448 / 0.141683 (-0.124235)	1.172830 / 1.452155 (-0.279325)	1.223982 / 1.492716 (-0.268735)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.091859 / 0.018006 (0.073853)	0.285421 / 0.000490 (0.284931)	0.000220 / 0.000200 (0.000020)	0.000049 / 0.000054 (-0.000005)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021620 / 0.037411 (-0.015792)	0.069058 / 0.014526 (0.054532)	0.082560 / 0.176557 (-0.093997)	0.119511 / 0.737135 (-0.617624)	0.082318 / 0.296338 (-0.214021)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.291499 / 0.215209 (0.076290)	2.863352 / 2.077655 (0.785698)	1.557242 / 1.504120 (0.053122)	1.430170 / 1.541195 (-0.111024)	1.432850 / 1.468490 (-0.035640)	0.559716 / 4.584777 (-4.025061)	2.385405 / 3.745712 (-1.360307)	2.748938 / 5.269862 (-2.520924)	1.740802 / 4.565676 (-2.824874)	0.061811 / 0.424275 (-0.362465)	0.005174 / 0.007607 (-0.002433)	0.348687 / 0.226044 (0.122642)	3.420120 / 2.268929 (1.151191)	1.918278 / 55.444624 (-53.526346)	1.631559 / 6.876477 (-5.244918)	1.635850 / 2.142072 (-0.506222)	0.644144 / 4.805227 (-4.161083)	0.115823 / 6.500664 (-6.384841)	0.041255 / 0.075469 (-0.034214)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.960066 / 1.841788 (-0.881722)	12.011372 / 8.074308 (3.937064)	10.580532 / 10.191392 (0.389140)	0.134763 / 0.680424 (-0.545661)	0.017027 / 0.534201 (-0.517174)	0.290484 / 0.579283 (-0.288799)	0.285171 / 0.434364 (-0.149193)	0.322453 / 0.540337 (-0.217884)	0.438088 / 1.386936 (-0.948848)

github-actions · 2023-11-30T18:54:14Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005212 / 0.011353 (-0.006141)	0.003440 / 0.011008 (-0.007568)	0.063612 / 0.038508 (0.025104)	0.049070 / 0.023109 (0.025961)	0.269748 / 0.275898 (-0.006150)	0.283270 / 0.323480 (-0.040210)	0.002892 / 0.007986 (-0.005094)	0.002693 / 0.004328 (-0.001635)	0.049710 / 0.004250 (0.045459)	0.036707 / 0.037052 (-0.000345)	0.299035 / 0.258489 (0.040546)	0.296443 / 0.293841 (0.002602)	0.028095 / 0.128546 (-0.100451)	0.010682 / 0.075646 (-0.064964)	0.213914 / 0.419271 (-0.205358)	0.036210 / 0.043533 (-0.007323)	0.235720 / 0.255139 (-0.019419)	0.252687 / 0.283200 (-0.030512)	0.016985 / 0.141683 (-0.124698)	1.099024 / 1.452155 (-0.353130)	1.162970 / 1.492716 (-0.329746)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093114 / 0.018006 (0.075108)	0.305168 / 0.000490 (0.304678)	0.000216 / 0.000200 (0.000016)	0.000043 / 0.000054 (-0.000012)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018370 / 0.037411 (-0.019041)	0.060534 / 0.014526 (0.046008)	0.073960 / 0.176557 (-0.102596)	0.120325 / 0.737135 (-0.616810)	0.073754 / 0.296338 (-0.222585)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.284244 / 0.215209 (0.069035)	2.756854 / 2.077655 (0.679199)	1.477304 / 1.504120 (-0.026816)	1.374635 / 1.541195 (-0.166560)	1.383284 / 1.468490 (-0.085206)	0.564656 / 4.584777 (-4.020121)	2.361719 / 3.745712 (-1.383993)	2.794822 / 5.269862 (-2.475039)	1.742981 / 4.565676 (-2.822696)	0.063443 / 0.424275 (-0.360832)	0.004952 / 0.007607 (-0.002655)	0.342058 / 0.226044 (0.116014)	3.351093 / 2.268929 (1.082164)	1.857375 / 55.444624 (-53.587250)	1.541680 / 6.876477 (-5.334797)	1.580147 / 2.142072 (-0.561926)	0.645216 / 4.805227 (-4.160012)	0.118768 / 6.500664 (-6.381896)	0.042115 / 0.075469 (-0.033354)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.925845 / 1.841788 (-0.915943)	11.444147 / 8.074308 (3.369839)	10.291297 / 10.191392 (0.099905)	0.128129 / 0.680424 (-0.552295)	0.013774 / 0.534201 (-0.520427)	0.289278 / 0.579283 (-0.290005)	0.262353 / 0.434364 (-0.172011)	0.328517 / 0.540337 (-0.211820)	0.436050 / 1.386936 (-0.950886)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005666 / 0.011353 (-0.005687)	0.003691 / 0.011008 (-0.007318)	0.049361 / 0.038508 (0.010853)	0.054245 / 0.023109 (0.031136)	0.274433 / 0.275898 (-0.001465)	0.285648 / 0.323480 (-0.037832)	0.004080 / 0.007986 (-0.003906)	0.002666 / 0.004328 (-0.001663)	0.047539 / 0.004250 (0.043288)	0.041001 / 0.037052 (0.003948)	0.296018 / 0.258489 (0.037529)	0.294542 / 0.293841 (0.000701)	0.030546 / 0.128546 (-0.098001)	0.010556 / 0.075646 (-0.065090)	0.058146 / 0.419271 (-0.361126)	0.033407 / 0.043533 (-0.010126)	0.263977 / 0.255139 (0.008838)	0.286228 / 0.283200 (0.003028)	0.018088 / 0.141683 (-0.123595)	1.121295 / 1.452155 (-0.330860)	1.182183 / 1.492716 (-0.310533)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.104540 / 0.018006 (0.086534)	0.303494 / 0.000490 (0.303004)	0.000222 / 0.000200 (0.000022)	0.000044 / 0.000054 (-0.000010)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021274 / 0.037411 (-0.016137)	0.070146 / 0.014526 (0.055621)	0.080343 / 0.176557 (-0.096213)	0.120017 / 0.737135 (-0.617119)	0.081303 / 0.296338 (-0.215036)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.294390 / 0.215209 (0.079181)	2.883366 / 2.077655 (0.805711)	1.564629 / 1.504120 (0.060509)	1.432633 / 1.541195 (-0.108562)	1.438786 / 1.468490 (-0.029704)	0.569663 / 4.584777 (-4.015114)	2.448691 / 3.745712 (-1.297021)	2.817010 / 5.269862 (-2.452851)	1.757274 / 4.565676 (-2.808402)	0.064147 / 0.424275 (-0.360129)	0.004910 / 0.007607 (-0.002697)	0.344062 / 0.226044 (0.118018)	3.394223 / 2.268929 (1.125294)	1.927139 / 55.444624 (-53.517485)	1.624983 / 6.876477 (-5.251494)	1.629076 / 2.142072 (-0.512996)	0.654239 / 4.805227 (-4.150988)	0.117309 / 6.500664 (-6.383355)	0.041067 / 0.075469 (-0.034402)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.993184 / 1.841788 (-0.848604)	11.969985 / 8.074308 (3.895677)	10.363356 / 10.191392 (0.171964)	0.130708 / 0.680424 (-0.549716)	0.015577 / 0.534201 (-0.518624)	0.289579 / 0.579283 (-0.289704)	0.274875 / 0.434364 (-0.159488)	0.326736 / 0.540337 (-0.213601)	0.442770 / 1.386936 (-0.944166)

lhoestq · 2023-12-04T11:44:46Z

Getting the same Windows error as in my other PR. I couldn't reproduce it on my Windows machine, though 🧐

mariosasko

Maybe we can avoid adding this much complexity for the YAML case (not used much?) by turning DataFilesList into a lazy iterable that caches its elements as it's being iterated over (we don't need random access, so no need for the list).
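
As an illustration of the suggestion above, here is a minimal sketch of a lazy iterable that caches its elements as it is iterated over. `LazyCachedFiles` and `fake_resolve` are hypothetical names made up for this example; they are not part of `datasets`, and the real `DataFilesList` may differ in many ways.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class LazyCachedFiles:
    """Hypothetical stand-in for a lazy DataFilesList (not the datasets API)."""

    def __init__(self, resolve: Callable[[], Iterable[str]]) -> None:
        self._resolve = resolve  # invoked at most once, on first iteration
        self._source: Optional[Iterator[str]] = None
        self._cache: List[str] = []
        self._exhausted = False

    def __iter__(self) -> Iterator[str]:
        index = 0
        while True:
            if index < len(self._cache):
                # Serve elements that an earlier iteration already produced.
                yield self._cache[index]
                index += 1
                continue
            if self._exhausted:
                return
            if self._source is None:
                self._source = iter(self._resolve())
            try:
                item = next(self._source)
            except StopIteration:
                self._exhausted = True
                return
            self._cache.append(item)


calls = []


def fake_resolve():
    # Stands in for an expensive pattern resolution (e.g. listing a repo).
    calls.append("resolve")
    yield from ["train.csv", "test.csv"]


files = LazyCachedFiles(fake_resolve)
first = next(iter(files))  # resolves only as far as the first element
all_files = list(files)    # reuses the cached element, then continues
```

Note that a `__len__` on such an object would have to exhaust the underlying iterator first, so the length is no longer free.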

lhoestq · 2023-12-08T16:26:02Z

DataFilesList is a list, so we expect to be able to get its length at zero cost, which wouldn't be the case if we made it lazy, no?

mariosasko · 2023-12-08T16:30:44Z

But we don't call len on it, do we? And I couldn't find an instance of DataFilesList being used in GitHub's public repos.

lhoestq · 2023-12-08T16:31:23Z

DataFilesDict is used in some repositories in dataset scripts when people want to list files from a repo using glob patterns

lhoestq · 2023-12-08T16:39:19Z

Also, making DataFilesList lazy would require making the pickling more complex, since we don't want to resolve the data files when pickling. At the same time, we want to get different hashes if the data files and origin metadata are different, so resolving the patterns is needed in that case (we hash the data files when creating the config_id, which is used in the cache).
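
To make the tension concrete, here is a rough sketch under stated assumptions: `LazyFiles`, its `fingerprint` method, and the `lister` callables are hypothetical and only illustrate the idea. The pickled state keeps the patterns only, while the hash used for a cache key has to resolve them so that different files or origin metadata give different hashes.

```python
import hashlib
from typing import Callable, List, Optional, Tuple

# (file path, origin metadata such as an ETag); illustrative shape only
Resolved = List[Tuple[str, str]]


class LazyFiles:
    """Hypothetical lazy file list; not the datasets implementation."""

    def __init__(self, patterns: List[str]) -> None:
        self.patterns = patterns
        self._resolved: Optional[Resolved] = None

    def resolve(self, lister: Callable[[str], Resolved]) -> Resolved:
        # Resolve once and remember the result.
        if self._resolved is None:
            self._resolved = [hit for p in self.patterns for hit in lister(p)]
        return self._resolved

    def __getstate__(self) -> dict:
        # State used by pickle: patterns only, never the resolved files.
        return {"patterns": self.patterns, "_resolved": None}

    def fingerprint(self, lister: Callable[[str], Resolved]) -> str:
        # A cache key must reflect the actual files and their metadata,
        # so hashing has to resolve the patterns first.
        digest = hashlib.sha256()
        for path, origin in self.resolve(lister):
            digest.update(f"{path}\0{origin}\n".encode())
        return digest.hexdigest()


def lister_v1(pattern: str) -> Resolved:
    return [(pattern.replace("*", "0"), "etag-1")]


def lister_v2(pattern: str) -> Resolved:
    # Same file name, different origin metadata (the file changed upstream).
    return [(pattern.replace("*", "0"), "etag-2")]


a = LazyFiles(["data/*.csv"])
b = LazyFiles(["data/*.csv"])
hash_a = a.fingerprint(lister_v1)
hash_b = b.fingerprint(lister_v2)
state_a = a.__getstate__()  # what pickle would store for `a`
```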

mariosasko · 2023-12-08T16:55:31Z

> DataFilesDict is used in some repositories in dataset scripts when people want to list files from a repo using glob patterns

Would be interesting to know how often these scripts call len or do random access on DataFilesList.

Still, I think we should opt for a solution that makes more sense for us. To avoid the breaking change, we can define a BuilderConfig.data_files property that resolves this iterable.

> Also, making DataFilesList lazy would require making the pickling more complex, since we don't want to resolve the data files when pickling. At the same time, we want to get different hashes if the data files and origin metadata are different, so resolving the patterns is needed in that case (we hash the data files when creating the config_id, which is used in the cache).

The BuilderConfig.data_files property suggested above should address this, no?

I think we should be more careful not to make our API needlessly complex because of the YAML README feature. And if this can't be avoided, we should probably refactor the builder API.
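
The `BuilderConfig.data_files` property idea mentioned in this comment could look roughly like the sketch below. `SketchBuilderConfig`, its `data_files_patterns` field, and the `resolver` hook are assumptions made for illustration; the real `BuilderConfig` has different fields and no such hook.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SketchBuilderConfig:
    """Hypothetical config; the real BuilderConfig has different fields."""

    name: str
    data_files_patterns: List[str]
    # Assumed resolver hook: maps patterns to concrete file paths.
    resolver: Callable[[List[str]], List[str]]
    _data_files: Optional[List[str]] = field(default=None, init=False, repr=False)

    @property
    def data_files(self) -> List[str]:
        # Resolve on first access only; callers still receive a plain list,
        # so len() and indexing keep working as before.
        if self._data_files is None:
            self._data_files = self.resolver(self.data_files_patterns)
        return self._data_files


resolve_calls = []


def fake_resolver(patterns: List[str]) -> List[str]:
    # Stands in for the expensive glob resolution.
    resolve_calls.append(tuple(patterns))
    return [p.replace("*", "0000") for p in patterns]


config = SketchBuilderConfig("default", ["data/train-*.parquet"], fake_resolver)
```

Creating the config does not call the resolver; the first read of `config.data_files` does, and later reads reuse the stored list.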

lhoestq · 2023-12-08T16:58:46Z

> The BuilderConfig.data_files property suggested above should address this, no?

That works indeed! Let me try something.

lhoestq · 2023-12-08T20:38:21Z

Implementing a lazy DataFilesList and a .data_files property brings more complexity (less readable code, more bad side effects), so I think the current solution is the best one.

lhoestq · 2023-12-12T23:30:31Z

I opened #6493 to continue this and fix conflicts with #6459

lazy data files resolution

51002cb

lhoestq changed the title from "Llazy data files resolution" to "Lazy data files resolution" Nov 29, 2023

lhoestq added 2 commits November 29, 2023 16:22

fix tests

5a5bb38

minor

214a3e6

lhoestq commented Nov 29, 2023

src/datasets/data_files.py (outdated, resolved)

don't use expand_info=False yet

b7a9674

fix

32e0960

style

68099ca

Merge branch 'main' into lazy-data_files_resolution

5dd4698

lhoestq mentioned this pull request Nov 30, 2023

Missing DatasetNotFoundError #6462

Merged

lhoestq and others added 2 commits November 30, 2023 19:30

Merge branch 'main' into lazy-data_files_resolution

cf86d48

tests

b3fc428

fix win test

796a47e

lhoestq and others added 5 commits December 1, 2023 16:52

Merge branch 'main' into lazy-data_files_resolution

21209d2

fix tests

a924e94

Merge branch 'main' into lazy-data_files_resolution

adc07dd

fix tests again

c9ecfca

remove unused code

ddb488b

lhoestq marked this pull request as ready for review December 4, 2023 11:40

lhoestq requested review from albertvillanova and mariosasko December 4, 2023 12:01

mariosasko reviewed Dec 8, 2023

fix cache on config change

565c294

lhoestq mentioned this pull request Dec 11, 2023

Update builder hash with info #6487

Closed

lhoestq added 2 commits December 11, 2023 12:45

simpler

e01938d

fix tests

09516db

This was referenced Dec 12, 2023

Retrieve cached datasets that were pushed to hub when offline #6459

Closed

Lazy data files resolution and offline cache reload #6493

Merged

lhoestq closed this Feb 8, 2024
