Improve logging #6019

mariosasko · 2023-07-11T18:30:23Z

Adds a `StreamHandler` to the library's logger (as hfh, i.e. `huggingface_hub`, and transformers do) to log INFO messages, and logs the messages about "loading a cached result" (and some other warnings) as INFO.

(Also removes the `leave=False` argument from the progress bars for consistency with hfh and transformers. Progress bars serve as an indicator that a result is not cached, so it makes more sense not to delete them.)
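
For readers unfamiliar with the pattern, here is a minimal sketch of attaching a `StreamHandler` to a library-level logger using only the standard `logging` module. It is not the code from this PR: the logger name `mylib`, the helper `configure_library_logger`, and the `stream` parameter are hypothetical, and disabling propagation is an assumption made for this sketch.

```python
import logging


def configure_library_logger(name="mylib", level=logging.INFO, stream=None):
    """Attach one StreamHandler to a library's top-level logger (illustrative only)."""
    logger = logging.getLogger(name)
    # Add the handler only once so repeated calls do not duplicate output.
    if not any(isinstance(h, logging.StreamHandler) for h in logger.handlers):
        # With stream=None, StreamHandler writes to sys.stderr.
        logger.addHandler(logging.StreamHandler(stream))
    logger.setLevel(level)
    # Assumption for this sketch: keep records from also reaching the root logger.
    logger.propagate = False
    return logger


logger = configure_library_logger()
# A "loading a cached result" style message, emitted as INFO rather than WARNING.
logger.info("Loading cached processed dataset")
```

With the level set to INFO, the message above is written to stderr, while `logger.debug(...)` calls stay silent.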

Fix #2832, fix #1948, fix #5444

HuggingFaceDocBuilderDev · 2023-07-11T18:35:49Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-07-11T18:37:54Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007782 / 0.011353 (-0.003571)	0.004451 / 0.011008 (-0.006557)	0.099928 / 0.038508 (0.061420)	0.081534 / 0.023109 (0.058425)	0.379382 / 0.275898 (0.103484)	0.410652 / 0.323480 (0.087172)	0.005967 / 0.007986 (-0.002019)	0.003702 / 0.004328 (-0.000627)	0.076359 / 0.004250 (0.072109)	0.066721 / 0.037052 (0.029669)	0.383595 / 0.258489 (0.125106)	0.423854 / 0.293841 (0.130013)	0.032796 / 0.128546 (-0.095750)	0.009728 / 0.075646 (-0.065918)	0.344347 / 0.419271 (-0.074925)	0.056320 / 0.043533 (0.012788)	0.379974 / 0.255139 (0.124835)	0.401294 / 0.283200 (0.118094)	0.024110 / 0.141683 (-0.117572)	1.804194 / 1.452155 (0.352039)	1.860240 / 1.492716 (0.367523)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.233803 / 0.018006 (0.215797)	0.506893 / 0.000490 (0.506404)	0.003894 / 0.000200 (0.003694)	0.000090 / 0.000054 (0.000035)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.033328 / 0.037411 (-0.004083)	0.098661 / 0.014526 (0.084136)	0.114971 / 0.176557 (-0.061586)	0.186815 / 0.737135 (-0.550321)	0.115490 / 0.296338 (-0.180848)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.422590 / 0.215209 (0.207381)	4.277189 / 2.077655 (2.199535)	2.095565 / 1.504120 (0.591445)	2.040825 / 1.541195 (0.499630)	2.162562 / 1.468490 (0.694072)	0.578602 / 4.584777 (-4.006175)	4.203474 / 3.745712 (0.457762)	6.674595 / 5.269862 (1.404734)	3.913251 / 4.565676 (-0.652426)	0.067777 / 0.424275 (-0.356498)	0.008716 / 0.007607 (0.001109)	0.548704 / 0.226044 (0.322660)	5.162120 / 2.268929 (2.893192)	2.600250 / 55.444624 (-52.844374)	2.232730 / 6.876477 (-4.643747)	2.485617 / 2.142072 (0.343544)	0.650872 / 4.805227 (-4.154355)	0.148022 / 6.500664 (-6.352642)	0.064795 / 0.075469 (-0.010674)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.399439 / 1.841788 (-0.442349)	22.438959 / 8.074308 (14.364651)	16.447831 / 10.191392 (6.256439)	0.202003 / 0.680424 (-0.478421)	0.026200 / 0.534201 (-0.508001)	0.472966 / 0.579283 (-0.106317)	0.491621 / 0.434364 (0.057257)	0.551580 / 0.540337 (0.011242)	0.751420 / 1.386936 (-0.635516)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007241 / 0.011353 (-0.004112)	0.004434 / 0.011008 (-0.006574)	0.075872 / 0.038508 (0.037364)	0.080094 / 0.023109 (0.056985)	0.459244 / 0.275898 (0.183346)	0.492482 / 0.323480 (0.169002)	0.005791 / 0.007986 (-0.002194)	0.003657 / 0.004328 (-0.000671)	0.075214 / 0.004250 (0.070964)	0.064208 / 0.037052 (0.027156)	0.464195 / 0.258489 (0.205706)	0.497809 / 0.293841 (0.203968)	0.036301 / 0.128546 (-0.092245)	0.009855 / 0.075646 (-0.065791)	0.080826 / 0.419271 (-0.338445)	0.056700 / 0.043533 (0.013167)	0.452850 / 0.255139 (0.197711)	0.490738 / 0.283200 (0.207538)	0.024145 / 0.141683 (-0.117538)	1.689911 / 1.452155 (0.237757)	1.789803 / 1.492716 (0.297087)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.247741 / 0.018006 (0.229735)	0.486769 / 0.000490 (0.486279)	0.000418 / 0.000200 (0.000218)	0.000060 / 0.000054 (0.000005)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.036317 / 0.037411 (-0.001094)	0.104943 / 0.014526 (0.090417)	0.120972 / 0.176557 (-0.055585)	0.188461 / 0.737135 (-0.548674)	0.120926 / 0.296338 (-0.175412)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.465788 / 0.215209 (0.250579)	4.662369 / 2.077655 (2.584714)	2.442241 / 1.504120 (0.938121)	2.266328 / 1.541195 (0.725133)	2.438998 / 1.468490 (0.970508)	0.531384 / 4.584777 (-4.053393)	4.125286 / 3.745712 (0.379574)	3.920912 / 5.269862 (-1.348950)	2.292149 / 4.565676 (-2.273528)	0.070146 / 0.424275 (-0.354129)	0.008887 / 0.007607 (0.001280)	0.598181 / 0.226044 (0.372137)	5.726454 / 2.268929 (3.457526)	3.081836 / 55.444624 (-52.362788)	2.683508 / 6.876477 (-4.192969)	2.587350 / 2.142072 (0.445278)	0.604736 / 4.805227 (-4.200491)	0.141303 / 6.500664 (-6.359362)	0.065020 / 0.075469 (-0.010449)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.481850 / 1.841788 (-0.359938)	22.259592 / 8.074308 (14.185284)	16.304290 / 10.191392 (6.112898)	0.173514 / 0.680424 (-0.506909)	0.021590 / 0.534201 (-0.512611)	0.471753 / 0.579283 (-0.107531)	0.472132 / 0.434364 (0.037768)	0.563344 / 0.540337 (0.023007)	0.738509 / 1.386936 (-0.648427)

github-actions · 2023-07-11T18:38:43Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005910 / 0.011353 (-0.005443)	0.004372 / 0.011008 (-0.006636)	0.081583 / 0.038508 (0.043075)	0.069598 / 0.023109 (0.046488)	0.346360 / 0.275898 (0.070462)	0.360733 / 0.323480 (0.037254)	0.004725 / 0.007986 (-0.003261)	0.003106 / 0.004328 (-0.001222)	0.059916 / 0.004250 (0.055666)	0.053242 / 0.037052 (0.016189)	0.353551 / 0.258489 (0.095062)	0.373052 / 0.293841 (0.079211)	0.029036 / 0.128546 (-0.099510)	0.007894 / 0.075646 (-0.067753)	0.284131 / 0.419271 (-0.135140)	0.049348 / 0.043533 (0.005815)	0.347409 / 0.255139 (0.092270)	0.355029 / 0.283200 (0.071830)	0.022511 / 0.141683 (-0.119171)	1.454495 / 1.452155 (0.002340)	1.439551 / 1.492716 (-0.053166)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.218889 / 0.018006 (0.200883)	0.478734 / 0.000490 (0.478244)	0.003758 / 0.000200 (0.003558)	0.000083 / 0.000054 (0.000029)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.025759 / 0.037411 (-0.011653)	0.082511 / 0.014526 (0.067985)	0.087578 / 0.176557 (-0.088979)	0.137760 / 0.737135 (-0.599375)	0.093312 / 0.296338 (-0.203027)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.378963 / 0.215209 (0.163754)	3.645846 / 2.077655 (1.568191)	1.741135 / 1.504120 (0.237015)	1.599166 / 1.541195 (0.057972)	1.610817 / 1.468490 (0.142327)	0.459209 / 4.584777 (-4.125568)	3.484857 / 3.745712 (-0.260855)	3.928109 / 5.269862 (-1.341752)	2.419784 / 4.565676 (-2.145892)	0.051987 / 0.424275 (-0.372288)	0.006495 / 0.007607 (-0.001112)	0.427311 / 0.226044 (0.201267)	4.226378 / 2.268929 (1.957450)	2.212331 / 55.444624 (-53.232293)	1.916213 / 6.876477 (-4.960264)	1.978809 / 2.142072 (-0.163263)	0.547351 / 4.805227 (-4.257876)	0.121110 / 6.500664 (-6.379554)	0.054163 / 0.075469 (-0.021306)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.228594 / 1.841788 (-0.613193)	19.410901 / 8.074308 (11.336593)	13.014722 / 10.191392 (2.823330)	0.156449 / 0.680424 (-0.523975)	0.021032 / 0.534201 (-0.513169)	0.403976 / 0.579283 (-0.175307)	0.413885 / 0.434364 (-0.020479)	0.470465 / 0.540337 (-0.069873)	0.641322 / 1.386936 (-0.745614)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007210 / 0.011353 (-0.004143)	0.003824 / 0.011008 (-0.007185)	0.058227 / 0.038508 (0.019719)	0.076211 / 0.023109 (0.053102)	0.336626 / 0.275898 (0.060728)	0.420542 / 0.323480 (0.097062)	0.006178 / 0.007986 (-0.001808)	0.003332 / 0.004328 (-0.000997)	0.058073 / 0.004250 (0.053823)	0.062485 / 0.037052 (0.025432)	0.386175 / 0.258489 (0.127686)	0.415659 / 0.293841 (0.121818)	0.031264 / 0.128546 (-0.097282)	0.007502 / 0.075646 (-0.068144)	0.072079 / 0.419271 (-0.347192)	0.055860 / 0.043533 (0.012327)	0.343508 / 0.255139 (0.088369)	0.437844 / 0.283200 (0.154645)	0.032852 / 0.141683 (-0.108831)	1.409241 / 1.452155 (-0.042913)	1.623949 / 1.492716 (0.131233)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.207511 / 0.018006 (0.189504)	0.464149 / 0.000490 (0.463660)	0.003248 / 0.000200 (0.003048)	0.000226 / 0.000054 (0.000172)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030767 / 0.037411 (-0.006645)	0.079169 / 0.014526 (0.064643)	0.093111 / 0.176557 (-0.083445)	0.153369 / 0.737135 (-0.583767)	0.092939 / 0.296338 (-0.203400)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.375602 / 0.215209 (0.160392)	3.968612 / 2.077655 (1.890957)	2.081749 / 1.504120 (0.577629)	1.899772 / 1.541195 (0.358577)	1.847923 / 1.468490 (0.379433)	0.442867 / 4.584777 (-4.141910)	3.646664 / 3.745712 (-0.099048)	5.870600 / 5.269862 (0.600739)	3.356698 / 4.565676 (-1.208979)	0.051422 / 0.424275 (-0.372853)	0.006006 / 0.007607 (-0.001601)	0.442439 / 0.226044 (0.216395)	4.466256 / 2.268929 (2.197328)	2.483832 / 55.444624 (-52.960792)	2.105612 / 6.876477 (-4.770865)	2.060650 / 2.142072 (-0.081422)	0.531119 / 4.805227 (-4.274108)	0.123436 / 6.500664 (-6.377228)	0.059838 / 0.075469 (-0.015632)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.283042 / 1.841788 (-0.558746)	19.688251 / 8.074308 (11.613943)	13.346386 / 10.191392 (3.154994)	0.197463 / 0.680424 (-0.482961)	0.018484 / 0.534201 (-0.515717)	0.391727 / 0.579283 (-0.187556)	0.425061 / 0.434364 (-0.009303)	0.448177 / 0.540337 (-0.092160)	0.653694 / 1.386936 (-0.733242)

github-actions · 2023-07-11T18:49:11Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.008966 / 0.011353 (-0.002387)	0.005195 / 0.011008 (-0.005813)	0.102879 / 0.038508 (0.064371)	0.090902 / 0.023109 (0.067792)	0.434397 / 0.275898 (0.158498)	0.454013 / 0.323480 (0.130534)	0.008507 / 0.007986 (0.000521)	0.005000 / 0.004328 (0.000671)	0.075789 / 0.004250 (0.071538)	0.067608 / 0.037052 (0.030555)	0.435091 / 0.258489 (0.176602)	0.469411 / 0.293841 (0.175570)	0.050859 / 0.128546 (-0.077687)	0.013560 / 0.075646 (-0.062086)	0.345473 / 0.419271 (-0.073799)	0.094974 / 0.043533 (0.051441)	0.429626 / 0.255139 (0.174487)	0.434290 / 0.283200 (0.151090)	0.052269 / 0.141683 (-0.089413)	1.700549 / 1.452155 (0.248395)	1.890693 / 1.492716 (0.397976)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.296618 / 0.018006 (0.278612)	0.613908 / 0.000490 (0.613419)	0.000484 / 0.000200 (0.000284)	0.000086 / 0.000054 (0.000032)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.034346 / 0.037411 (-0.003065)	0.096836 / 0.014526 (0.082310)	0.113332 / 0.176557 (-0.063224)	0.194464 / 0.737135 (-0.542671)	0.111732 / 0.296338 (-0.184606)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.624954 / 0.215209 (0.409745)	6.442193 / 2.077655 (4.364538)	2.818331 / 1.504120 (1.314211)	2.529607 / 1.541195 (0.988413)	2.549026 / 1.468490 (1.080536)	0.967367 / 4.584777 (-3.617410)	5.446885 / 3.745712 (1.701173)	6.259099 / 5.269862 (0.989237)	3.652936 / 4.565676 (-0.912740)	0.106420 / 0.424275 (-0.317855)	0.011293 / 0.007607 (0.003686)	0.772026 / 0.226044 (0.545982)	7.823986 / 2.268929 (5.555057)	3.725328 / 55.444624 (-51.719297)	2.851489 / 6.876477 (-4.024988)	3.013722 / 2.142072 (0.871649)	1.045090 / 4.805227 (-3.760137)	0.213174 / 6.500664 (-6.287490)	0.077104 / 0.075469 (0.001635)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.657135 / 1.841788 (-0.184652)	24.547604 / 8.074308 (16.473296)	19.989533 / 10.191392 (9.798141)	0.257139 / 0.680424 (-0.423285)	0.028448 / 0.534201 (-0.505753)	0.490801 / 0.579283 (-0.088482)	0.628072 / 0.434364 (0.193708)	0.584873 / 0.540337 (0.044536)	0.825258 / 1.386936 (-0.561678)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009258 / 0.011353 (-0.002095)	0.005660 / 0.011008 (-0.005348)	0.080577 / 0.038508 (0.042069)	0.095786 / 0.023109 (0.072676)	0.473334 / 0.275898 (0.197436)	0.527962 / 0.323480 (0.204482)	0.006537 / 0.007986 (-0.001449)	0.004411 / 0.004328 (0.000083)	0.080702 / 0.004250 (0.076452)	0.077020 / 0.037052 (0.039968)	0.483205 / 0.258489 (0.224716)	0.556916 / 0.293841 (0.263076)	0.047670 / 0.128546 (-0.080877)	0.016647 / 0.075646 (-0.058999)	0.090653 / 0.419271 (-0.328619)	0.062122 / 0.043533 (0.018589)	0.498326 / 0.255139 (0.243187)	0.546572 / 0.283200 (0.263372)	0.037525 / 0.141683 (-0.104157)	1.869520 / 1.452155 (0.417365)	1.915335 / 1.492716 (0.422619)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.248287 / 0.018006 (0.230281)	0.611440 / 0.000490 (0.610950)	0.004102 / 0.000200 (0.003902)	0.000132 / 0.000054 (0.000078)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.038228 / 0.037411 (0.000817)	0.103510 / 0.014526 (0.088984)	0.114337 / 0.176557 (-0.062219)	0.189662 / 0.737135 (-0.547473)	0.119078 / 0.296338 (-0.177260)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.606622 / 0.215209 (0.391413)	6.053900 / 2.077655 (3.976246)	2.857972 / 1.504120 (1.353852)	2.549756 / 1.541195 (1.008561)	2.584557 / 1.468490 (1.116067)	0.930431 / 4.584777 (-3.654346)	5.524077 / 3.745712 (1.778365)	7.858406 / 5.269862 (2.588545)	4.890697 / 4.565676 (0.325020)	0.095356 / 0.424275 (-0.328919)	0.008614 / 0.007607 (0.001007)	0.774227 / 0.226044 (0.548182)	7.470215 / 2.268929 (5.201287)	3.784820 / 55.444624 (-51.659805)	3.199364 / 6.876477 (-3.677113)	3.212002 / 2.142072 (1.069929)	1.054104 / 4.805227 (-3.751123)	0.226044 / 6.500664 (-6.274620)	0.092237 / 0.075469 (0.016768)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.801054 / 1.841788 (-0.040734)	24.220404 / 8.074308 (16.146096)	21.652936 / 10.191392 (11.461544)	0.247004 / 0.680424 (-0.433420)	0.029651 / 0.534201 (-0.504550)	0.475702 / 0.579283 (-0.103581)	0.621121 / 0.434364 (0.186757)	0.570489 / 0.540337 (0.030151)	0.768840 / 1.386936 (-0.618096)

github-actions · 2023-07-11T19:37:25Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.009223 / 0.011353 (-0.002130)	0.005750 / 0.011008 (-0.005258)	0.105264 / 0.038508 (0.066756)	0.088478 / 0.023109 (0.065369)	0.461119 / 0.275898 (0.185221)	0.481115 / 0.323480 (0.157636)	0.006366 / 0.007986 (-0.001619)	0.004515 / 0.004328 (0.000186)	0.079296 / 0.004250 (0.075045)	0.063483 / 0.037052 (0.026430)	0.444490 / 0.258489 (0.186001)	0.496474 / 0.293841 (0.202634)	0.048568 / 0.128546 (-0.079978)	0.013574 / 0.075646 (-0.062073)	0.379213 / 0.419271 (-0.040059)	0.086464 / 0.043533 (0.042932)	0.437526 / 0.255139 (0.182387)	0.447117 / 0.283200 (0.163917)	0.049502 / 0.141683 (-0.092180)	1.749146 / 1.452155 (0.296992)	1.831082 / 1.492716 (0.338365)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.268205 / 0.018006 (0.250199)	0.627406 / 0.000490 (0.626917)	0.005439 / 0.000200 (0.005239)	0.000128 / 0.000054 (0.000074)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030564 / 0.037411 (-0.006848)	0.096365 / 0.014526 (0.081840)	0.117484 / 0.176557 (-0.059072)	0.189104 / 0.737135 (-0.548032)	0.118073 / 0.296338 (-0.178266)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.618229 / 0.215209 (0.403019)	6.437853 / 2.077655 (4.360199)	2.789946 / 1.504120 (1.285826)	2.339245 / 1.541195 (0.798050)	2.588779 / 1.468490 (1.120289)	0.921008 / 4.584777 (-3.663769)	5.402940 / 3.745712 (1.657227)	4.818783 / 5.269862 (-0.451078)	3.162259 / 4.565676 (-1.403417)	0.108501 / 0.424275 (-0.315774)	0.009384 / 0.007607 (0.001777)	0.766811 / 0.226044 (0.540766)	7.624629 / 2.268929 (5.355701)	3.442420 / 55.444624 (-52.002204)	2.759967 / 6.876477 (-4.116510)	3.049644 / 2.142072 (0.907572)	1.113308 / 4.805227 (-3.691919)	0.223923 / 6.500664 (-6.276741)	0.079156 / 0.075469 (0.003687)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.683318 / 1.841788 (-0.158470)	25.062141 / 8.074308 (16.987833)	21.777131 / 10.191392 (11.585739)	0.266939 / 0.680424 (-0.413485)	0.029670 / 0.534201 (-0.504531)	0.476761 / 0.579283 (-0.102522)	0.622080 / 0.434364 (0.187716)	0.601781 / 0.540337 (0.061443)	0.785126 / 1.386936 (-0.601811)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.010198 / 0.011353 (-0.001155)	0.005777 / 0.011008 (-0.005231)	0.083003 / 0.038508 (0.044495)	0.093411 / 0.023109 (0.070302)	0.496178 / 0.275898 (0.220280)	0.554670 / 0.323480 (0.231190)	0.008351 / 0.007986 (0.000365)	0.004678 / 0.004328 (0.000350)	0.083631 / 0.004250 (0.079381)	0.075538 / 0.037052 (0.038485)	0.492410 / 0.258489 (0.233921)	0.545209 / 0.293841 (0.251368)	0.048365 / 0.128546 (-0.080181)	0.014219 / 0.075646 (-0.061427)	0.100749 / 0.419271 (-0.318523)	0.063431 / 0.043533 (0.019898)	0.511115 / 0.255139 (0.255976)	0.532965 / 0.283200 (0.249765)	0.037968 / 0.141683 (-0.103715)	1.940268 / 1.452155 (0.488113)	2.032934 / 1.492716 (0.540217)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.238179 / 0.018006 (0.220172)	0.605767 / 0.000490 (0.605277)	0.004033 / 0.000200 (0.003833)	0.000125 / 0.000054 (0.000071)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.036436 / 0.037411 (-0.000975)	0.108034 / 0.014526 (0.093509)	0.118624 / 0.176557 (-0.057933)	0.183079 / 0.737135 (-0.554056)	0.121739 / 0.296338 (-0.174600)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.630538 / 0.215209 (0.415329)	6.552184 / 2.077655 (4.474529)	3.003412 / 1.504120 (1.499292)	2.669026 / 1.541195 (1.127832)	2.791109 / 1.468490 (1.322619)	0.884003 / 4.584777 (-3.700774)	5.538660 / 3.745712 (1.792947)	5.126708 / 5.269862 (-0.143154)	3.120825 / 4.565676 (-1.444852)	0.101178 / 0.424275 (-0.323097)	0.009027 / 0.007607 (0.001420)	0.785914 / 0.226044 (0.559869)	7.994720 / 2.268929 (5.725792)	4.061996 / 55.444624 (-51.382629)	3.263230 / 6.876477 (-3.613247)	3.288622 / 2.142072 (1.146550)	1.141867 / 4.805227 (-3.663360)	0.255287 / 6.500664 (-6.245378)	0.100637 / 0.075469 (0.025168)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.769821 / 1.841788 (-0.071967)	24.994008 / 8.074308 (16.919700)	21.765971 / 10.191392 (11.574579)	0.268493 / 0.680424 (-0.411931)	0.028047 / 0.534201 (-0.506154)	0.489472 / 0.579283 (-0.089811)	0.594809 / 0.434364 (0.160445)	0.613578 / 0.540337 (0.073241)	0.879360 / 1.386936 (-0.507576)

github-actions · 2023-07-11T20:10:41Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006003 / 0.011353 (-0.005350)	0.003590 / 0.011008 (-0.007418)	0.084657 / 0.038508 (0.046149)	0.057884 / 0.023109 (0.034775)	0.318347 / 0.275898 (0.042449)	0.345976 / 0.323480 (0.022496)	0.004706 / 0.007986 (-0.003279)	0.002921 / 0.004328 (-0.001407)	0.061850 / 0.004250 (0.057600)	0.050558 / 0.037052 (0.013505)	0.320877 / 0.258489 (0.062388)	0.356062 / 0.293841 (0.062222)	0.027511 / 0.128546 (-0.101035)	0.007954 / 0.075646 (-0.067693)	0.260290 / 0.419271 (-0.158981)	0.051207 / 0.043533 (0.007674)	0.334423 / 0.255139 (0.079284)	0.338575 / 0.283200 (0.055375)	0.022330 / 0.141683 (-0.119353)	1.445446 / 1.452155 (-0.006709)	1.500626 / 1.492716 (0.007910)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.192440 / 0.018006 (0.174433)	0.428455 / 0.000490 (0.427965)	0.000318 / 0.000200 (0.000118)	0.000056 / 0.000054 (0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022933 / 0.037411 (-0.014478)	0.072795 / 0.014526 (0.058269)	0.081149 / 0.176557 (-0.095407)	0.142941 / 0.737135 (-0.594195)	0.082410 / 0.296338 (-0.213928)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.405220 / 0.215209 (0.190011)	4.048585 / 2.077655 (1.970931)	2.027908 / 1.504120 (0.523788)	1.887828 / 1.541195 (0.346633)	2.131780 / 1.468490 (0.663290)	0.502847 / 4.584777 (-4.081930)	3.069498 / 3.745712 (-0.676215)	4.094774 / 5.269862 (-1.175088)	2.544004 / 4.565676 (-2.021673)	0.059540 / 0.424275 (-0.364735)	0.006501 / 0.007607 (-0.001106)	0.477218 / 0.226044 (0.251173)	4.764961 / 2.268929 (2.496032)	2.434594 / 55.444624 (-53.010030)	2.104833 / 6.876477 (-4.771644)	2.263059 / 2.142072 (0.120987)	0.591755 / 4.805227 (-4.213472)	0.131167 / 6.500664 (-6.369497)	0.061808 / 0.075469 (-0.013661)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.345364 / 1.841788 (-0.496424)	18.122584 / 8.074308 (10.048276)	13.318689 / 10.191392 (3.127297)	0.144526 / 0.680424 (-0.535898)	0.016997 / 0.534201 (-0.517204)	0.336036 / 0.579283 (-0.243247)	0.359532 / 0.434364 (-0.074832)	0.386945 / 0.540337 (-0.153392)	0.538659 / 1.386936 (-0.848277)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006088 / 0.011353 (-0.005265)	0.003684 / 0.011008 (-0.007324)	0.062340 / 0.038508 (0.023832)	0.058461 / 0.023109 (0.035352)	0.360134 / 0.275898 (0.084236)	0.393298 / 0.323480 (0.069818)	0.004664 / 0.007986 (-0.003322)	0.002909 / 0.004328 (-0.001420)	0.062668 / 0.004250 (0.058418)	0.050145 / 0.037052 (0.013092)	0.361897 / 0.258489 (0.103408)	0.402008 / 0.293841 (0.108167)	0.027491 / 0.128546 (-0.101055)	0.008113 / 0.075646 (-0.067534)	0.068114 / 0.419271 (-0.351157)	0.043303 / 0.043533 (-0.000230)	0.360569 / 0.255139 (0.105430)	0.387144 / 0.283200 (0.103944)	0.020194 / 0.141683 (-0.121489)	1.418066 / 1.452155 (-0.034089)	1.475640 / 1.492716 (-0.017076)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.200291 / 0.018006 (0.182285)	0.432298 / 0.000490 (0.431809)	0.003303 / 0.000200 (0.003103)	0.000075 / 0.000054 (0.000020)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027749 / 0.037411 (-0.009662)	0.081890 / 0.014526 (0.067364)	0.094319 / 0.176557 (-0.082238)	0.148646 / 0.737135 (-0.588490)	0.091830 / 0.296338 (-0.204509)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.433546 / 0.215209 (0.218337)	4.326855 / 2.077655 (2.249200)	2.230186 / 1.504120 (0.726066)	2.052524 / 1.541195 (0.511329)	2.117270 / 1.468490 (0.648779)	0.500331 / 4.584777 (-4.084446)	3.113662 / 3.745712 (-0.632050)	2.931540 / 5.269862 (-2.338322)	1.853615 / 4.565676 (-2.712062)	0.058250 / 0.424275 (-0.366025)	0.006546 / 0.007607 (-0.001061)	0.508850 / 0.226044 (0.282806)	5.081809 / 2.268929 (2.812880)	2.687037 / 55.444624 (-52.757588)	2.369317 / 6.876477 (-4.507160)	2.383549 / 2.142072 (0.241477)	0.587039 / 4.805227 (-4.218188)	0.125858 / 6.500664 (-6.374806)	0.062522 / 0.075469 (-0.012947)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.294929 / 1.841788 (-0.546858)	18.056312 / 8.074308 (9.982004)	13.755117 / 10.191392 (3.563725)	0.132037 / 0.680424 (-0.548387)	0.016866 / 0.534201 (-0.517335)	0.339040 / 0.579283 (-0.240243)	0.364371 / 0.434364 (-0.069993)	0.399533 / 0.540337 (-0.140804)	0.564524 / 1.386936 (-0.822412)

lhoestq

Cool!

(nit) There is one progress bar without a description in load_dataset. When running

ds = load_dataset("lhoestq/demo1")

I get

Downloading data files: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1440.10it/s]
Extracting data files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 1706.39it/s]
Generating train split: 5 examples [00:00, 253.97 examples/s]
Generating test split: 5 examples [00:00, 2493.64 examples/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 572.29it/s]

mariosasko · 2023-07-12T15:07:46Z

@lhoestq This bar comes from:

datasets/src/datasets/builder.py

Lines 1156 to 1166 in b8067c0

    
    datasets = map_nested(
        partial(
            self._build_single_dataset,
            run_post_process=run_post_process,
            verification_mode=verification_mode,
            in_memory=in_memory,
        ),
        split,
        map_tuple=True,
        disable_tqdm=not logging.is_progress_bar_enabled(),
    )

Do you prefer not showing it or, e.g., having desc="Generating splits"?
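To make the two options concrete, here is a rough standard-library sketch of how a description changes the rendered line. `render_bar` is a hypothetical stand-in written for illustration; it is not tqdm and not what `map_nested` uses internally:

```python
from typing import Optional


def render_bar(done: int, total: int, desc: Optional[str] = None, width: int = 10) -> str:
    """Render a one-line text progress bar, prefixed by `desc` when given.

    Hypothetical helper for illustration only; assumes total > 0.
    """
    filled = width * done // total
    bar = "#" * filled + "-" * (width - filled)
    percent = 100 * done // total
    prefix = f"{desc}: " if desc else ""
    return f"{prefix}{percent}%|{bar}| {done}/{total}"


# Without a description the line starts directly with the percentage,
# like the last bar in the output quoted above.
unlabeled = render_bar(2, 2)

# With a description the reader can tell which step the bar belongs to.
labeled = render_bar(2, 2, desc="Generating splits")
```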

lhoestq · 2023-07-12T15:14:16Z

No strong opinion. Since there is a "Generating" progress bar already, maybe it can be "Preparing splits" (ref to download_and_prepare)

github-actions · 2023-07-12T15:57:09Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006348 / 0.011353 (-0.005005)	0.003721 / 0.011008 (-0.007287)	0.084039 / 0.038508 (0.045531)	0.067627 / 0.023109 (0.044517)	0.308372 / 0.275898 (0.032474)	0.335131 / 0.323480 (0.011652)	0.005157 / 0.007986 (-0.002829)	0.003266 / 0.004328 (-0.001062)	0.065374 / 0.004250 (0.061124)	0.055550 / 0.037052 (0.018498)	0.314001 / 0.258489 (0.055512)	0.350510 / 0.293841 (0.056669)	0.030859 / 0.128546 (-0.097688)	0.008286 / 0.075646 (-0.067361)	0.287122 / 0.419271 (-0.132149)	0.051494 / 0.043533 (0.007961)	0.309868 / 0.255139 (0.054729)	0.325845 / 0.283200 (0.042645)	0.022622 / 0.141683 (-0.119061)	1.468730 / 1.452155 (0.016575)	1.547871 / 1.492716 (0.055155)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.202763 / 0.018006 (0.184757)	0.456403 / 0.000490 (0.455914)	0.003116 / 0.000200 (0.002916)	0.000079 / 0.000054 (0.000024)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.027297 / 0.037411 (-0.010114)	0.081204 / 0.014526 (0.066678)	0.094274 / 0.176557 (-0.082282)	0.154391 / 0.737135 (-0.582744)	0.094312 / 0.296338 (-0.202026)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.387382 / 0.215209 (0.172173)	3.865597 / 2.077655 (1.787943)	1.855959 / 1.504120 (0.351839)	1.685411 / 1.541195 (0.144216)	1.732127 / 1.468490 (0.263637)	0.482230 / 4.584777 (-4.102547)	3.664947 / 3.745712 (-0.080765)	5.114379 / 5.269862 (-0.155482)	3.102803 / 4.565676 (-1.462873)	0.056509 / 0.424275 (-0.367766)	0.007230 / 0.007607 (-0.000377)	0.456788 / 0.226044 (0.230744)	4.575831 / 2.268929 (2.306902)	2.335249 / 55.444624 (-53.109375)	2.003805 / 6.876477 (-4.872672)	2.141788 / 2.142072 (-0.000285)	0.577501 / 4.805227 (-4.227726)	0.130264 / 6.500664 (-6.370400)	0.058889 / 0.075469 (-0.016580)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.252673 / 1.841788 (-0.589115)	18.676897 / 8.074308 (10.602589)	13.988101 / 10.191392 (3.796709)	0.151376 / 0.680424 (-0.529048)	0.018104 / 0.534201 (-0.516097)	0.388413 / 0.579283 (-0.190870)	0.414841 / 0.434364 (-0.019523)	0.456078 / 0.540337 (-0.084259)	0.641715 / 1.386936 (-0.745221)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006315 / 0.011353 (-0.005038)	0.003847 / 0.011008 (-0.007162)	0.063989 / 0.038508 (0.025481)	0.068244 / 0.023109 (0.045135)	0.416201 / 0.275898 (0.140303)	0.438446 / 0.323480 (0.114966)	0.005820 / 0.007986 (-0.002166)	0.003165 / 0.004328 (-0.001163)	0.064143 / 0.004250 (0.059892)	0.056529 / 0.037052 (0.019477)	0.414916 / 0.258489 (0.156427)	0.450771 / 0.293841 (0.156930)	0.030611 / 0.128546 (-0.097935)	0.008289 / 0.075646 (-0.067357)	0.070725 / 0.419271 (-0.348546)	0.047998 / 0.043533 (0.004465)	0.405609 / 0.255139 (0.150470)	0.421895 / 0.283200 (0.138696)	0.022135 / 0.141683 (-0.119548)	1.444238 / 1.452155 (-0.007916)	1.515823 / 1.492716 (0.023107)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.227043 / 0.018006 (0.209037)	0.439732 / 0.000490 (0.439242)	0.001267 / 0.000200 (0.001067)	0.000070 / 0.000054 (0.000016)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029082 / 0.037411 (-0.008329)	0.086201 / 0.014526 (0.071675)	0.098653 / 0.176557 (-0.077903)	0.152574 / 0.737135 (-0.584561)	0.100696 / 0.296338 (-0.195642)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.411243 / 0.215209 (0.196034)	4.100170 / 2.077655 (2.022515)	2.118310 / 1.504120 (0.614190)	1.935646 / 1.541195 (0.394451)	1.970798 / 1.468490 (0.502307)	0.478635 / 4.584777 (-4.106142)	3.589396 / 3.745712 (-0.156316)	3.312462 / 5.269862 (-1.957399)	1.963081 / 4.565676 (-2.602595)	0.056392 / 0.424275 (-0.367883)	0.007134 / 0.007607 (-0.000473)	0.485131 / 0.226044 (0.259086)	4.838946 / 2.268929 (2.570017)	2.624550 / 55.444624 (-52.820075)	2.223046 / 6.876477 (-4.653431)	2.230642 / 2.142072 (0.088570)	0.594892 / 4.805227 (-4.210335)	0.130523 / 6.500664 (-6.370141)	0.059585 / 0.075469 (-0.015884)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.329941 / 1.841788 (-0.511847)	19.199057 / 8.074308 (11.124748)	14.166009 / 10.191392 (3.974617)	0.190595 / 0.680424 (-0.489829)	0.018419 / 0.534201 (-0.515782)	0.392031 / 0.579283 (-0.187252)	0.409395 / 0.434364 (-0.024969)	0.475930 / 0.540337 (-0.064408)	0.654412 / 1.386936 (-0.732524)

github-actions · 2023-07-12T16:26:44Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.007500 / 0.011353 (-0.003853)	0.004328 / 0.011008 (-0.006681)	0.086718 / 0.038508 (0.048209)	0.098638 / 0.023109 (0.075529)	0.335308 / 0.275898 (0.059409)	0.369163 / 0.323480 (0.045683)	0.005733 / 0.007986 (-0.002253)	0.003738 / 0.004328 (-0.000590)	0.066452 / 0.004250 (0.062202)	0.066245 / 0.037052 (0.029192)	0.337609 / 0.258489 (0.079120)	0.388584 / 0.293841 (0.094744)	0.031742 / 0.128546 (-0.096804)	0.008721 / 0.075646 (-0.066925)	0.290820 / 0.419271 (-0.128452)	0.053323 / 0.043533 (0.009790)	0.329192 / 0.255139 (0.074053)	0.350560 / 0.283200 (0.067360)	0.025402 / 0.141683 (-0.116281)	1.476174 / 1.452155 (0.024020)	1.578194 / 1.492716 (0.085478)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.256160 / 0.018006 (0.238154)	0.560315 / 0.000490 (0.559825)	0.005287 / 0.000200 (0.005088)	0.000094 / 0.000054 (0.000040)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.029164 / 0.037411 (-0.008247)	0.084881 / 0.014526 (0.070356)	0.100979 / 0.176557 (-0.075577)	0.156539 / 0.737135 (-0.580597)	0.101510 / 0.296338 (-0.194828)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.381138 / 0.215209 (0.165929)	3.791573 / 2.077655 (1.713918)	1.841954 / 1.504120 (0.337834)	1.672463 / 1.541195 (0.131268)	1.785769 / 1.468490 (0.317279)	0.483263 / 4.584777 (-4.101514)	3.617391 / 3.745712 (-0.128322)	5.607794 / 5.269862 (0.337933)	3.359530 / 4.565676 (-1.206147)	0.056826 / 0.424275 (-0.367449)	0.007375 / 0.007607 (-0.000232)	0.455853 / 0.226044 (0.229809)	4.548965 / 2.268929 (2.280037)	2.412716 / 55.444624 (-53.031908)	1.991456 / 6.876477 (-4.885021)	2.242851 / 2.142072 (0.100778)	0.573070 / 4.805227 (-4.232157)	0.134658 / 6.500664 (-6.366006)	0.061539 / 0.075469 (-0.013930)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.278306 / 1.841788 (-0.563481)	20.634317 / 8.074308 (12.560009)	15.164246 / 10.191392 (4.972854)	0.167487 / 0.680424 (-0.512937)	0.019006 / 0.534201 (-0.515195)	0.394617 / 0.579283 (-0.184666)	0.423385 / 0.434364 (-0.010979)	0.469968 / 0.540337 (-0.070370)	0.630058 / 1.386936 (-0.756878)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006793 / 0.011353 (-0.004559)	0.004260 / 0.011008 (-0.006748)	0.065398 / 0.038508 (0.026890)	0.077850 / 0.023109 (0.054741)	0.371754 / 0.275898 (0.095855)	0.400652 / 0.323480 (0.077172)	0.005729 / 0.007986 (-0.002256)	0.003660 / 0.004328 (-0.000669)	0.065119 / 0.004250 (0.060869)	0.060714 / 0.037052 (0.023661)	0.384592 / 0.258489 (0.126103)	0.412806 / 0.293841 (0.118965)	0.031865 / 0.128546 (-0.096681)	0.008807 / 0.075646 (-0.066839)	0.071156 / 0.419271 (-0.348115)	0.049571 / 0.043533 (0.006038)	0.367381 / 0.255139 (0.112242)	0.386713 / 0.283200 (0.103513)	0.024838 / 0.141683 (-0.116845)	1.492986 / 1.452155 (0.040831)	1.559243 / 1.492716 (0.066526)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.269737 / 0.018006 (0.251730)	0.565177 / 0.000490 (0.564687)	0.000404 / 0.000200 (0.000204)	0.000060 / 0.000054 (0.000006)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.031631 / 0.037411 (-0.005780)	0.087289 / 0.014526 (0.072764)	0.102798 / 0.176557 (-0.073759)	0.158977 / 0.737135 (-0.578158)	0.105495 / 0.296338 (-0.190843)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.425067 / 0.215209 (0.209858)	4.243121 / 2.077655 (2.165466)	2.234567 / 1.504120 (0.730447)	2.070810 / 1.541195 (0.529615)	2.176802 / 1.468490 (0.708312)	0.484987 / 4.584777 (-4.099790)	3.647000 / 3.745712 (-0.098712)	3.574843 / 5.269862 (-1.695019)	2.092581 / 4.565676 (-2.473095)	0.057299 / 0.424275 (-0.366976)	0.007480 / 0.007607 (-0.000128)	0.507838 / 0.226044 (0.281794)	5.076594 / 2.268929 (2.807666)	2.718858 / 55.444624 (-52.725766)	2.362793 / 6.876477 (-4.513684)	2.451962 / 2.142072 (0.309890)	0.581355 / 4.805227 (-4.223872)	0.133723 / 6.500664 (-6.366941)	0.061896 / 0.075469 (-0.013573)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.325814 / 1.841788 (-0.515974)	20.614502 / 8.074308 (12.540194)	14.769422 / 10.191392 (4.578029)	0.193797 / 0.680424 (-0.486627)	0.018379 / 0.534201 (-0.515822)	0.394153 / 0.579283 (-0.185130)	0.409585 / 0.434364 (-0.024779)	0.479107 / 0.540337 (-0.061231)	0.668397 / 1.386936 (-0.718539)

mariosasko · 2023-07-12T17:19:20Z

In the end, I decided to remove the progress bar to avoid having it displayed when loading a cached dataset.
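The idea behind that choice can be sketched with the standard library alone: a cached result is reported through an INFO log message, and the progress indicator only appears when work is actually done. Everything below (the logger name `demo_cache`, the `_cache` dict, and `load_or_prepare`) is a hypothetical stand-in and not the code in this PR:

```python
import io
import logging

# Hypothetical logger and in-memory cache, stand-ins for the library's own.
logger = logging.getLogger("demo_cache")
logger.setLevel(logging.INFO)
logger.propagate = False
stream = io.StringIO()  # capture log output so it can be inspected
logger.addHandler(logging.StreamHandler(stream))

_cache = {}


def load_or_prepare(key, prepare, show_progress=print):
    """Return the cached value for `key`, or prepare it while reporting progress."""
    if key in _cache:
        # Cached: log at INFO and skip the progress indicator entirely.
        logger.info("Loading cached result for %s", key)
        return _cache[key]
    # Not cached: the progress indicator signals that real work is happening.
    show_progress(f"Preparing {key}...")
    _cache[key] = prepare()
    return _cache[key]


progress_lines = []
first = load_or_prepare("train", lambda: [1, 2, 3], show_progress=progress_lines.append)
second = load_or_prepare("train", lambda: [1, 2, 3], show_progress=progress_lines.append)
```

Only the first call records a progress line; the second call hits the cache and leaves just the INFO message in `stream`.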

github-actions · 2023-07-12T17:28:25Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006673 / 0.011353 (-0.004680)	0.004162 / 0.011008 (-0.006846)	0.084017 / 0.038508 (0.045509)	0.079536 / 0.023109 (0.056426)	0.313594 / 0.275898 (0.037695)	0.349200 / 0.323480 (0.025720)	0.005544 / 0.007986 (-0.002441)	0.003472 / 0.004328 (-0.000857)	0.064742 / 0.004250 (0.060491)	0.056857 / 0.037052 (0.019805)	0.318635 / 0.258489 (0.060146)	0.354378 / 0.293841 (0.060537)	0.030856 / 0.128546 (-0.097690)	0.008759 / 0.075646 (-0.066887)	0.287760 / 0.419271 (-0.131511)	0.052307 / 0.043533 (0.008775)	0.316396 / 0.255139 (0.061257)	0.351408 / 0.283200 (0.068208)	0.024914 / 0.141683 (-0.116769)	1.484592 / 1.452155 (0.032437)	1.560662 / 1.492716 (0.067945)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.280938 / 0.018006 (0.262932)	0.580236 / 0.000490 (0.579747)	0.003369 / 0.000200 (0.003169)	0.000090 / 0.000054 (0.000036)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.028736 / 0.037411 (-0.008675)	0.082916 / 0.014526 (0.068390)	0.097761 / 0.176557 (-0.078796)	0.153515 / 0.737135 (-0.583620)	0.099282 / 0.296338 (-0.197057)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.401244 / 0.215209 (0.186035)	4.019866 / 2.077655 (1.942211)	2.029642 / 1.504120 (0.525522)	1.849591 / 1.541195 (0.308396)	1.946829 / 1.468490 (0.478339)	0.479750 / 4.584777 (-4.105027)	3.482822 / 3.745712 (-0.262890)	3.955859 / 5.269862 (-1.314003)	2.370747 / 4.565676 (-2.194930)	0.056905 / 0.424275 (-0.367370)	0.007319 / 0.007607 (-0.000288)	0.485310 / 0.226044 (0.259266)	4.858228 / 2.268929 (2.589299)	2.500476 / 55.444624 (-52.944148)	2.171156 / 6.876477 (-4.705320)	2.427266 / 2.142072 (0.285194)	0.570199 / 4.805227 (-4.235029)	0.130855 / 6.500664 (-6.369809)	0.060269 / 0.075469 (-0.015200)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.258044 / 1.841788 (-0.583743)	20.218657 / 8.074308 (12.144349)	13.597970 / 10.191392 (3.406578)	0.167656 / 0.680424 (-0.512768)	0.018137 / 0.534201 (-0.516064)	0.395309 / 0.579283 (-0.183975)	0.406325 / 0.434364 (-0.028039)	0.467457 / 0.540337 (-0.072880)	0.613636 / 1.386936 (-0.773300)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.006846 / 0.011353 (-0.004507)	0.004207 / 0.011008 (-0.006802)	0.064525 / 0.038508 (0.026017)	0.081329 / 0.023109 (0.058220)	0.399838 / 0.275898 (0.123940)	0.431305 / 0.323480 (0.107825)	0.005859 / 0.007986 (-0.002127)	0.003568 / 0.004328 (-0.000760)	0.065262 / 0.004250 (0.061011)	0.064796 / 0.037052 (0.027744)	0.406858 / 0.258489 (0.148369)	0.440971 / 0.293841 (0.147130)	0.031421 / 0.128546 (-0.097125)	0.008777 / 0.075646 (-0.066870)	0.071418 / 0.419271 (-0.347853)	0.049263 / 0.043533 (0.005730)	0.384279 / 0.255139 (0.129140)	0.410745 / 0.283200 (0.127546)	0.024467 / 0.141683 (-0.117216)	1.522379 / 1.452155 (0.070224)	1.581636 / 1.492716 (0.088920)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.276161 / 0.018006 (0.258155)	0.548842 / 0.000490 (0.548352)	0.004523 / 0.000200 (0.004324)	0.000098 / 0.000054 (0.000043)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.030747 / 0.037411 (-0.006664)	0.087493 / 0.014526 (0.072967)	0.106563 / 0.176557 (-0.069993)	0.162949 / 0.737135 (-0.574186)	0.105303 / 0.296338 (-0.191036)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.425854 / 0.215209 (0.210645)	4.244797 / 2.077655 (2.167142)	2.269006 / 1.504120 (0.764886)	2.097428 / 1.541195 (0.556234)	2.181038 / 1.468490 (0.712548)	0.477286 / 4.584777 (-4.107491)	3.591452 / 3.745712 (-0.154260)	3.481281 / 5.269862 (-1.788580)	2.066895 / 4.565676 (-2.498782)	0.056576 / 0.424275 (-0.367699)	0.007409 / 0.007607 (-0.000199)	0.498411 / 0.226044 (0.272367)	4.994873 / 2.268929 (2.725945)	2.749148 / 55.444624 (-52.695476)	2.378544 / 6.876477 (-4.497932)	2.452859 / 2.142072 (0.310786)	0.571340 / 4.805227 (-4.233887)	0.132174 / 6.500664 (-6.368490)	0.061507 / 0.075469 (-0.013962)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	1.370773 / 1.841788 (-0.471015)	20.493342 / 8.074308 (12.419034)	14.809886 / 10.191392 (4.618494)	0.175730 / 0.680424 (-0.504693)	0.018617 / 0.534201 (-0.515583)	0.393808 / 0.579283 (-0.185476)	0.416419 / 0.434364 (-0.017945)	0.477183 / 0.540337 (-0.063155)	0.668060 / 1.386936 (-0.718876)

davidgilbertson · 2023-07-12T19:34:14Z

Nice one :)

mariosasko added 2 commits July 11, 2023 20:27

Improve logging

1cb7ae5

Typo

0160475

Nit

b2fc21e

Fix tests

b85b115

Fix

64b811c

mariosasko marked this pull request as ready for review July 12, 2023 11:41

mariosasko requested a review from lhoestq July 12, 2023 11:41

lhoestq approved these changes Jul 12, 2023

Nit

42fdfbd

Remove bar

b2d8922

mariosasko merged commit 2de7a2a into main Jul 12, 2023

mariosasko deleted the improve-logging branch July 12, 2023 17:19

mariosasko mentioned this pull request Jul 12, 2023

Modify levels of some logging messages #5934

Closed

mariosasko mentioned this pull request Jul 21, 2023

Make all print statements optional #5647

Closed
