Refactor `dill` logic #6454

mariosasko · 2023-11-27T20:01:25Z

Refactor the dill logic to make it easier to maintain (and fix some issues along the way)

It makes the following improvements to the serialization API:

- consistent ordering of a dict's keys
- support for hashing `torch.compile`-ed modules and functions
- deprecation of `datasets.fingerprint.hashregister`, since the reducers registered with `hashregister` are never invoked anyway (it does not support nested data the way pickle/dill do)
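To illustrate the first point, here is a minimal sketch of why key order matters for hashing. It is not the code from this PR: `_normalize` and `stable_hash` are hypothetical helpers that use only the standard library, and they assume plain, picklable data.

```python
import hashlib
import pickle


def _normalize(obj):
    """Recursively rewrite containers so dict insertion order cannot reach the bytes."""
    if isinstance(obj, dict):
        # Sort items by the repr of the key so mixed key types still sort.
        items = sorted(obj.items(), key=lambda kv: repr(kv[0]))
        return ("dict", [(_normalize(k), _normalize(v)) for k, v in items])
    if isinstance(obj, (list, tuple)):
        return (type(obj).__name__, [_normalize(x) for x in obj])
    return obj


def stable_hash(obj) -> str:
    """Hash the pickled, order-normalized form of obj (hypothetical helper)."""
    payload = pickle.dumps(_normalize(obj), protocol=4)
    return hashlib.sha256(payload).hexdigest()


first = {"b": 2, "a": [1, 2]}
second = {"a": [1, 2], "b": 2}

# Plain pickling keeps insertion order, so equal dicts can serialize differently.
print(pickle.dumps(first) == pickle.dumps(second))
# After normalization both dicts produce the same digest.
print(stable_hash(first) == stable_hash(second))
```

Sorting inside the serializer, rather than asking callers to build dicts in a fixed order, keeps the fingerprint independent of how the dict was constructed.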

~~TODO: optimize hashing of `pa.Table` and `datasets.table.Table`~~ The `pa_array.to_string` approach is faster for large arrays because it outputs only the first 10 and last 10 elements (by default). The problem is that it can produce identical hashes for non-identical arrays when their differing elements are elided...
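The collision risk can be reproduced without PyArrow. The sketch below is only a stand-in: `truncated_preview` is a hypothetical function that mimics a printer showing the first and last `window` elements, and it makes no claim about the exact text that `to_string` emits.

```python
import hashlib


def truncated_preview(values, window=10):
    """Render a sequence like a truncating printer: first and last `window` items only."""
    if len(values) <= 2 * window:
        shown = [str(v) for v in values]
    else:
        head = [str(v) for v in values[:window]]
        tail = [str(v) for v in values[-window:]]
        shown = head + ["..."] + tail
    return "[" + ", ".join(shown) + "]"


def preview_hash(values, window=10):
    """Hash only the truncated preview (fast, but lossy)."""
    return hashlib.sha256(truncated_preview(values, window).encode()).hexdigest()


def full_hash(values):
    """Hash a rendering of every element (slower, but sensitive to any change)."""
    return hashlib.sha256(repr(list(values)).encode()).hexdigest()


original = list(range(100))
modified = list(original)
modified[50] = -1  # the change falls inside the elided middle section

print(preview_hash(original) == preview_hash(modified))  # the preview cannot see the change
print(full_hash(original) == full_hash(modified))        # the full rendering can
```

Any difference that falls between the two windows is invisible to the preview-based hash, which is the trade-off described above.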

Fix #6440, fix #5839


github-actions · 2023-11-27T20:15:31Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005490 / 0.011353 (-0.005863)	0.003554 / 0.011008 (-0.007454)	0.062183 / 0.038508 (0.023675)	0.053093 / 0.023109 (0.029984)	0.245370 / 0.275898 (-0.030528)	0.271637 / 0.323480 (-0.051842)	0.002997 / 0.007986 (-0.004989)	0.002811 / 0.004328 (-0.001517)	0.047874 / 0.004250 (0.043623)	0.039673 / 0.037052 (0.002620)	0.253219 / 0.258489 (-0.005271)	0.280438 / 0.293841 (-0.013403)	0.028393 / 0.128546 (-0.100153)	0.010914 / 0.075646 (-0.064732)	0.207491 / 0.419271 (-0.211781)	0.037565 / 0.043533 (-0.005968)	0.252382 / 0.255139 (-0.002757)	0.272204 / 0.283200 (-0.010995)	0.019007 / 0.141683 (-0.122676)	1.099767 / 1.452155 (-0.352388)	1.173220 / 1.492716 (-0.319496)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.098777 / 0.018006 (0.080771)	0.325912 / 0.000490 (0.325422)	0.000214 / 0.000200 (0.000014)	0.000051 / 0.000054 (-0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018815 / 0.037411 (-0.018596)	0.070031 / 0.014526 (0.055506)	0.075395 / 0.176557 (-0.101162)	0.122633 / 0.737135 (-0.614502)	0.077621 / 0.296338 (-0.218718)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.290830 / 0.215209 (0.075621)	2.869214 / 2.077655 (0.791559)	1.507337 / 1.504120 (0.003217)	1.351391 / 1.541195 (-0.189804)	1.386642 / 1.468490 (-0.081848)	0.570318 / 4.584777 (-4.014459)	2.423442 / 3.745712 (-1.322270)	2.897812 / 5.269862 (-2.372050)	1.796458 / 4.565676 (-2.769219)	0.063649 / 0.424275 (-0.360626)	0.005038 / 0.007607 (-0.002570)	0.357819 / 0.226044 (0.131774)	3.535478 / 2.268929 (1.266549)	1.831764 / 55.444624 (-53.612861)	1.545035 / 6.876477 (-5.331442)	1.585919 / 2.142072 (-0.556154)	0.643333 / 4.805227 (-4.161894)	0.120319 / 6.500664 (-6.380345)	0.043031 / 0.075469 (-0.032438)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.981155 / 1.841788 (-0.860633)	12.136069 / 8.074308 (4.061760)	10.579923 / 10.191392 (0.388531)	0.152963 / 0.680424 (-0.527461)	0.014783 / 0.534201 (-0.519418)	0.289177 / 0.579283 (-0.290106)	0.271784 / 0.434364 (-0.162580)	0.322381 / 0.540337 (-0.217956)	0.420034 / 1.386936 (-0.966902)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005315 / 0.011353 (-0.006038)	0.003584 / 0.011008 (-0.007424)	0.048596 / 0.038508 (0.010088)	0.055940 / 0.023109 (0.032830)	0.277687 / 0.275898 (0.001789)	0.301545 / 0.323480 (-0.021935)	0.004150 / 0.007986 (-0.003836)	0.002699 / 0.004328 (-0.001629)	0.047661 / 0.004250 (0.043410)	0.040618 / 0.037052 (0.003565)	0.279173 / 0.258489 (0.020684)	0.306105 / 0.293841 (0.012264)	0.030099 / 0.128546 (-0.098447)	0.010784 / 0.075646 (-0.064862)	0.057418 / 0.419271 (-0.361853)	0.032632 / 0.043533 (-0.010901)	0.276064 / 0.255139 (0.020925)	0.307194 / 0.283200 (0.023995)	0.017416 / 0.141683 (-0.124267)	1.107749 / 1.452155 (-0.344406)	1.161104 / 1.492716 (-0.331612)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.102395 / 0.018006 (0.084389)	0.316933 / 0.000490 (0.316443)	0.000246 / 0.000200 (0.000046)	0.000042 / 0.000054 (-0.000012)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022833 / 0.037411 (-0.014579)	0.069372 / 0.014526 (0.054846)	0.082139 / 0.176557 (-0.094418)	0.121666 / 0.737135 (-0.615469)	0.084039 / 0.296338 (-0.212300)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.298775 / 0.215209 (0.083566)	2.973898 / 2.077655 (0.896244)	1.614436 / 1.504120 (0.110316)	1.476112 / 1.541195 (-0.065083)	1.502031 / 1.468490 (0.033541)	0.580626 / 4.584777 (-4.004151)	2.493428 / 3.745712 (-1.252285)	2.931050 / 5.269862 (-2.338811)	1.823603 / 4.565676 (-2.742073)	0.064736 / 0.424275 (-0.359539)	0.004963 / 0.007607 (-0.002644)	0.355096 / 0.226044 (0.129052)	3.522801 / 2.268929 (1.253872)	1.968690 / 55.444624 (-53.475935)	1.698624 / 6.876477 (-5.177853)	1.714166 / 2.142072 (-0.427906)	0.681734 / 4.805227 (-4.123493)	0.118940 / 6.500664 (-6.381724)	0.041960 / 0.075469 (-0.033509)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.985311 / 1.841788 (-0.856476)	12.785393 / 8.074308 (4.711085)	11.289459 / 10.191392 (1.098067)	0.145297 / 0.680424 (-0.535127)	0.016125 / 0.534201 (-0.518076)	0.289445 / 0.579283 (-0.289838)	0.278974 / 0.434364 (-0.155390)	0.322456 / 0.540337 (-0.217881)	0.418218 / 1.386936 (-0.968718)

HuggingFaceDocBuilderDev · 2023-11-27T20:35:55Z

The documentation is not available anymore as the PR was closed or merged.

github-actions · 2023-11-28T00:53:46Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005142 / 0.011353 (-0.006211)	0.004180 / 0.011008 (-0.006829)	0.062647 / 0.038508 (0.024139)	0.055072 / 0.023109 (0.031962)	0.254681 / 0.275898 (-0.021217)	0.282650 / 0.323480 (-0.040830)	0.003950 / 0.007986 (-0.004035)	0.002862 / 0.004328 (-0.001466)	0.048420 / 0.004250 (0.044170)	0.038447 / 0.037052 (0.001394)	0.258160 / 0.258489 (-0.000329)	0.288596 / 0.293841 (-0.005245)	0.027898 / 0.128546 (-0.100648)	0.011165 / 0.075646 (-0.064482)	0.206844 / 0.419271 (-0.212427)	0.036312 / 0.043533 (-0.007221)	0.257957 / 0.255139 (0.002819)	0.277387 / 0.283200 (-0.005812)	0.018205 / 0.141683 (-0.123478)	1.109870 / 1.452155 (-0.342284)	1.175005 / 1.492716 (-0.317712)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.096692 / 0.018006 (0.078686)	0.307463 / 0.000490 (0.306973)	0.000218 / 0.000200 (0.000018)	0.000042 / 0.000054 (-0.000012)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018602 / 0.037411 (-0.018809)	0.061489 / 0.014526 (0.046964)	0.072936 / 0.176557 (-0.103620)	0.119863 / 0.737135 (-0.617272)	0.073983 / 0.296338 (-0.222355)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.291444 / 0.215209 (0.076235)	2.849024 / 2.077655 (0.771369)	1.533121 / 1.504120 (0.029001)	1.402148 / 1.541195 (-0.139046)	1.406397 / 1.468490 (-0.062094)	0.564241 / 4.584777 (-4.020536)	2.402052 / 3.745712 (-1.343660)	2.772639 / 5.269862 (-2.497223)	1.732342 / 4.565676 (-2.833334)	0.062361 / 0.424275 (-0.361914)	0.004945 / 0.007607 (-0.002662)	0.355841 / 0.226044 (0.129797)	3.426931 / 2.268929 (1.158003)	1.865412 / 55.444624 (-53.579212)	1.592628 / 6.876477 (-5.283849)	1.662364 / 2.142072 (-0.479708)	0.653278 / 4.805227 (-4.151949)	0.118626 / 6.500664 (-6.382038)	0.042961 / 0.075469 (-0.032508)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.956279 / 1.841788 (-0.885509)	11.635540 / 8.074308 (3.561232)	10.719590 / 10.191392 (0.528198)	0.130015 / 0.680424 (-0.550409)	0.014424 / 0.534201 (-0.519777)	0.288135 / 0.579283 (-0.291148)	0.270819 / 0.434364 (-0.163545)	0.320238 / 0.540337 (-0.220099)	0.421044 / 1.386936 (-0.965892)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005201 / 0.011353 (-0.006152)	0.003467 / 0.011008 (-0.007541)	0.048939 / 0.038508 (0.010431)	0.051841 / 0.023109 (0.028732)	0.273708 / 0.275898 (-0.002190)	0.293491 / 0.323480 (-0.029988)	0.004830 / 0.007986 (-0.003156)	0.002696 / 0.004328 (-0.001632)	0.047727 / 0.004250 (0.043476)	0.041319 / 0.037052 (0.004266)	0.273837 / 0.258489 (0.015348)	0.309860 / 0.293841 (0.016019)	0.029054 / 0.128546 (-0.099492)	0.010410 / 0.075646 (-0.065237)	0.058139 / 0.419271 (-0.361133)	0.032682 / 0.043533 (-0.010850)	0.273244 / 0.255139 (0.018105)	0.291579 / 0.283200 (0.008380)	0.018262 / 0.141683 (-0.123421)	1.144590 / 1.452155 (-0.307565)	1.202474 / 1.492716 (-0.290243)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.097110 / 0.018006 (0.079104)	0.307344 / 0.000490 (0.306854)	0.000229 / 0.000200 (0.000029)	0.000045 / 0.000054 (-0.000009)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.022263 / 0.037411 (-0.015148)	0.070140 / 0.014526 (0.055614)	0.081251 / 0.176557 (-0.095306)	0.120839 / 0.737135 (-0.616297)	0.083312 / 0.296338 (-0.213026)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.297381 / 0.215209 (0.082172)	2.895530 / 2.077655 (0.817875)	1.608442 / 1.504120 (0.104322)	1.476237 / 1.541195 (-0.064958)	1.491306 / 1.468490 (0.022816)	0.567272 / 4.584777 (-4.017505)	2.463543 / 3.745712 (-1.282170)	2.814764 / 5.269862 (-2.455098)	1.725845 / 4.565676 (-2.839831)	0.064149 / 0.424275 (-0.360126)	0.004953 / 0.007607 (-0.002654)	0.359629 / 0.226044 (0.133585)	3.482414 / 2.268929 (1.213486)	1.949897 / 55.444624 (-53.494727)	1.677383 / 6.876477 (-5.199094)	1.683655 / 2.142072 (-0.458418)	0.645671 / 4.805227 (-4.159557)	0.115612 / 6.500664 (-6.385053)	0.041013 / 0.075469 (-0.034456)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.967843 / 1.841788 (-0.873945)	12.376877 / 8.074308 (4.302569)	10.988174 / 10.191392 (0.796782)	0.134660 / 0.680424 (-0.545764)	0.015801 / 0.534201 (-0.518400)	0.288699 / 0.579283 (-0.290584)	0.284887 / 0.434364 (-0.149477)	0.322000 / 0.540337 (-0.218337)	0.412360 / 1.386936 (-0.974576)


github-actions · 2023-11-28T01:28:31Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005407 / 0.011353 (-0.005946)	0.003496 / 0.011008 (-0.007512)	0.062730 / 0.038508 (0.024222)	0.051882 / 0.023109 (0.028773)	0.244766 / 0.275898 (-0.031132)	0.257963 / 0.323480 (-0.065516)	0.002894 / 0.007986 (-0.005092)	0.002567 / 0.004328 (-0.001761)	0.048756 / 0.004250 (0.044506)	0.039024 / 0.037052 (0.001971)	0.247303 / 0.258489 (-0.011186)	0.278341 / 0.293841 (-0.015500)	0.026725 / 0.128546 (-0.101821)	0.010577 / 0.075646 (-0.065069)	0.210483 / 0.419271 (-0.208789)	0.035230 / 0.043533 (-0.008303)	0.246125 / 0.255139 (-0.009014)	0.264039 / 0.283200 (-0.019160)	0.019881 / 0.141683 (-0.121802)	1.113475 / 1.452155 (-0.338679)	1.149606 / 1.492716 (-0.343110)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092946 / 0.018006 (0.074940)	0.299985 / 0.000490 (0.299495)	0.000215 / 0.000200 (0.000016)	0.000050 / 0.000054 (-0.000005)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018421 / 0.037411 (-0.018991)	0.060531 / 0.014526 (0.046005)	0.074459 / 0.176557 (-0.102098)	0.120369 / 0.737135 (-0.616766)	0.075505 / 0.296338 (-0.220833)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.289497 / 0.215209 (0.074288)	2.783139 / 2.077655 (0.705485)	1.482533 / 1.504120 (-0.021587)	1.371013 / 1.541195 (-0.170182)	1.379114 / 1.468490 (-0.089376)	0.563953 / 4.584777 (-4.020824)	2.389996 / 3.745712 (-1.355716)	2.788067 / 5.269862 (-2.481795)	1.751772 / 4.565676 (-2.813904)	0.062680 / 0.424275 (-0.361595)	0.004901 / 0.007607 (-0.002706)	0.365193 / 0.226044 (0.139149)	3.389181 / 2.268929 (1.120252)	1.861659 / 55.444624 (-53.582965)	1.558899 / 6.876477 (-5.317577)	1.591079 / 2.142072 (-0.550993)	0.648300 / 4.805227 (-4.156927)	0.117486 / 6.500664 (-6.383178)	0.041961 / 0.075469 (-0.033508)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.944391 / 1.841788 (-0.897396)	11.500823 / 8.074308 (3.426515)	10.580430 / 10.191392 (0.389038)	0.142845 / 0.680424 (-0.537579)	0.014305 / 0.534201 (-0.519896)	0.290723 / 0.579283 (-0.288560)	0.266206 / 0.434364 (-0.168158)	0.325482 / 0.540337 (-0.214856)	0.416224 / 1.386936 (-0.970712)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005363 / 0.011353 (-0.005990)	0.003548 / 0.011008 (-0.007460)	0.048704 / 0.038508 (0.010196)	0.051025 / 0.023109 (0.027916)	0.273037 / 0.275898 (-0.002861)	0.297148 / 0.323480 (-0.026332)	0.003985 / 0.007986 (-0.004001)	0.002739 / 0.004328 (-0.001590)	0.048108 / 0.004250 (0.043857)	0.040244 / 0.037052 (0.003191)	0.277825 / 0.258489 (0.019336)	0.303704 / 0.293841 (0.009863)	0.029460 / 0.128546 (-0.099086)	0.010428 / 0.075646 (-0.065218)	0.057022 / 0.419271 (-0.362249)	0.032711 / 0.043533 (-0.010822)	0.274462 / 0.255139 (0.019323)	0.293499 / 0.283200 (0.010299)	0.018266 / 0.141683 (-0.123417)	1.158049 / 1.452155 (-0.294106)	1.170097 / 1.492716 (-0.322620)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093412 / 0.018006 (0.075406)	0.301538 / 0.000490 (0.301049)	0.000222 / 0.000200 (0.000022)	0.000051 / 0.000054 (-0.000003)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021698 / 0.037411 (-0.015713)	0.068735 / 0.014526 (0.054209)	0.083010 / 0.176557 (-0.093546)	0.127491 / 0.737135 (-0.609644)	0.083005 / 0.296338 (-0.213333)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.298299 / 0.215209 (0.083090)	2.894209 / 2.077655 (0.816554)	1.597455 / 1.504120 (0.093335)	1.472953 / 1.541195 (-0.068241)	1.491553 / 1.468490 (0.023063)	0.556566 / 4.584777 (-4.028211)	2.419429 / 3.745712 (-1.326283)	2.788706 / 5.269862 (-2.481156)	1.759888 / 4.565676 (-2.805789)	0.062535 / 0.424275 (-0.361740)	0.004959 / 0.007607 (-0.002648)	0.345226 / 0.226044 (0.119182)	3.438539 / 2.268929 (1.169611)	1.943842 / 55.444624 (-53.500782)	1.661080 / 6.876477 (-5.215397)	1.687632 / 2.142072 (-0.454440)	0.639971 / 4.805227 (-4.165256)	0.116012 / 6.500664 (-6.384652)	0.041723 / 0.075469 (-0.033746)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.965143 / 1.841788 (-0.876645)	12.086547 / 8.074308 (4.012238)	10.708787 / 10.191392 (0.517395)	0.129506 / 0.680424 (-0.550918)	0.015254 / 0.534201 (-0.518947)	0.288326 / 0.579283 (-0.290957)	0.271976 / 0.434364 (-0.162388)	0.328402 / 0.540337 (-0.211936)	0.418102 / 1.386936 (-0.968834)

github-actions · 2023-11-28T14:41:08Z

Show benchmarks

PyArrow==8.0.0

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005375 / 0.011353 (-0.005978)	0.003530 / 0.011008 (-0.007478)	0.062521 / 0.038508 (0.024013)	0.051514 / 0.023109 (0.028405)	0.241623 / 0.275898 (-0.034275)	0.269054 / 0.323480 (-0.054426)	0.002877 / 0.007986 (-0.005109)	0.002724 / 0.004328 (-0.001605)	0.049045 / 0.004250 (0.044794)	0.038560 / 0.037052 (0.001507)	0.248437 / 0.258489 (-0.010052)	0.276762 / 0.293841 (-0.017079)	0.027522 / 0.128546 (-0.101024)	0.010817 / 0.075646 (-0.064829)	0.208686 / 0.419271 (-0.210585)	0.035818 / 0.043533 (-0.007715)	0.249398 / 0.255139 (-0.005741)	0.268288 / 0.283200 (-0.014911)	0.019039 / 0.141683 (-0.122644)	1.135115 / 1.452155 (-0.317040)	1.195531 / 1.492716 (-0.297185)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093126 / 0.018006 (0.075120)	0.301028 / 0.000490 (0.300539)	0.000222 / 0.000200 (0.000023)	0.000062 / 0.000054 (0.000007)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.018385 / 0.037411 (-0.019027)	0.060902 / 0.014526 (0.046376)	0.073168 / 0.176557 (-0.103389)	0.119216 / 0.737135 (-0.617919)	0.074225 / 0.296338 (-0.222114)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.283749 / 0.215209 (0.068540)	2.741609 / 2.077655 (0.663954)	1.483439 / 1.504120 (-0.020681)	1.352896 / 1.541195 (-0.188299)	1.378824 / 1.468490 (-0.089667)	0.548731 / 4.584777 (-4.036046)	2.342717 / 3.745712 (-1.402995)	2.791592 / 5.269862 (-2.478269)	1.740605 / 4.565676 (-2.825071)	0.062059 / 0.424275 (-0.362216)	0.005028 / 0.007607 (-0.002579)	0.339205 / 0.226044 (0.113161)	3.353386 / 2.268929 (1.084458)	1.785717 / 55.444624 (-53.658907)	1.523390 / 6.876477 (-5.353086)	1.556999 / 2.142072 (-0.585073)	0.636745 / 4.805227 (-4.168483)	0.115821 / 6.500664 (-6.384843)	0.042200 / 0.075469 (-0.033269)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.948678 / 1.841788 (-0.893110)	11.588670 / 8.074308 (3.514362)	10.897130 / 10.191392 (0.705738)	0.140068 / 0.680424 (-0.540356)	0.014565 / 0.534201 (-0.519636)	0.286336 / 0.579283 (-0.292947)	0.265292 / 0.434364 (-0.169072)	0.324146 / 0.540337 (-0.216192)	0.413463 / 1.386936 (-0.973473)

PyArrow==latest

Show updated benchmarks!

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005187 / 0.011353 (-0.006165)	0.003471 / 0.011008 (-0.007537)	0.048968 / 0.038508 (0.010460)	0.051285 / 0.023109 (0.028176)	0.283286 / 0.275898 (0.007388)	0.307046 / 0.323480 (-0.016434)	0.004017 / 0.007986 (-0.003969)	0.002655 / 0.004328 (-0.001673)	0.047762 / 0.004250 (0.043512)	0.039855 / 0.037052 (0.002803)	0.283101 / 0.258489 (0.024612)	0.312905 / 0.293841 (0.019064)	0.028188 / 0.128546 (-0.100358)	0.010849 / 0.075646 (-0.064797)	0.058112 / 0.419271 (-0.361159)	0.032163 / 0.043533 (-0.011369)	0.280825 / 0.255139 (0.025686)	0.300946 / 0.283200 (0.017747)	0.017409 / 0.141683 (-0.124274)	1.127360 / 1.452155 (-0.324795)	1.180409 / 1.492716 (-0.312307)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.093186 / 0.018006 (0.075180)	0.300827 / 0.000490 (0.300338)	0.000220 / 0.000200 (0.000020)	0.000052 / 0.000054 (-0.000002)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021560 / 0.037411 (-0.015851)	0.069158 / 0.014526 (0.054632)	0.080953 / 0.176557 (-0.095603)	0.119071 / 0.737135 (-0.618064)	0.082817 / 0.296338 (-0.213521)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.307259 / 0.215209 (0.092050)	2.996058 / 2.077655 (0.918404)	1.627406 / 1.504120 (0.123286)	1.500715 / 1.541195 (-0.040480)	1.524278 / 1.468490 (0.055788)	0.569711 / 4.584777 (-4.015066)	2.436132 / 3.745712 (-1.309580)	2.796995 / 5.269862 (-2.472866)	1.760701 / 4.565676 (-2.804975)	0.063521 / 0.424275 (-0.360754)	0.004909 / 0.007607 (-0.002698)	0.359129 / 0.226044 (0.133085)	3.567278 / 2.268929 (1.298349)	2.013821 / 55.444624 (-53.430804)	1.708021 / 6.876477 (-5.168456)	1.738959 / 2.142072 (-0.403114)	0.648620 / 4.805227 (-4.156607)	0.122016 / 6.500664 (-6.378648)	0.041802 / 0.075469 (-0.033667)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.985208 / 1.841788 (-0.856579)	12.307785 / 8.074308 (4.233477)	10.587262 / 10.191392 (0.395870)	0.130468 / 0.680424 (-0.549956)	0.014912 / 0.534201 (-0.519289)	0.293822 / 0.579283 (-0.285461)	0.283021 / 0.434364 (-0.151343)	0.329560 / 0.540337 (-0.210777)	0.424741 / 1.386936 (-0.962195)
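
For anyone reading the tables above: each cell is `new / old (diff)`, where the number in parentheses is `new - old`, so a negative diff means the new measurement is smaller. Below is a minimal sketch of how a row could be parsed and sanity-checked. `parse_cell` and `parse_row` are hypothetical helper names for illustration, not part of the benchmark tooling, and the sketch assumes the cells are tab-separated as in the bot's comment.

```python
import re

# One cell looks like "0.093186 / 0.018006 (0.075180)".
CELL = re.compile(r"([-\d.]+) / ([-\d.]+) \(([-\d.]+)\)")


def parse_cell(cell):
    """Return (new, old, diff) as floats for one benchmark cell."""
    match = CELL.fullmatch(cell.strip())
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    new, old, diff = (float(group) for group in match.groups())
    return new, old, diff


def parse_row(metrics_line, values_line):
    """Pair each metric name with its parsed (new, old, diff) tuple."""
    names = metrics_line.split("\t")[1:]   # drop the leading "metric" label
    cells = values_line.split("\t")[1:]    # drop the leading "new / old (diff)" label
    return {name: parse_cell(cell) for name, cell in zip(names, cells)}


# Rows copied from the benchmark_getitem_100B.json table above.
metrics = "metric\tget_batch_of_1024_random_rows\tget_batch_of_1024_rows\tget_first_row\tget_last_row"
values = (
    "new / old (diff)\t0.093186 / 0.018006 (0.075180)\t0.300827 / 0.000490 (0.300338)"
    "\t0.000220 / 0.000200 (0.000020)\t0.000052 / 0.000054 (-0.000002)"
)

row = parse_row(metrics, values)
for name, (new, old, diff) in row.items():
    # The reported diff is rounded to 6 decimals, so allow a small tolerance.
    assert abs((new - old) - diff) < 2e-6, name
```

The tolerance is there because the reported values are rounded independently, so `new - old` can differ from the printed diff in the last digit.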

lhoestq

Awesome ! Thanks for cleaning this and for the improvements :)

mariosasko added 4 commits November 21, 2023 19:06

Improve dill API

d3c7e1a

Merge branch 'main' of github.com:huggingface/datasets into dill-refactor

8108c69

Cleaner/faster implementatio

4413b2d

Deprecate hashregister API

ad7447d

mariosasko mentioned this pull request Nov 27, 2023

Add logic for hashing modules/functions optimized with torch.compile #5867

Closed

Import for backward compatibility

66cef09

Fix?

148454d

mariosasko added 2 commits November 28, 2023 02:22

Skip torch.compile tests on windows

0f2b39c

Merge branch 'main' of github.com:huggingface/datasets into dill-refactor

18b6f13

Nit

04426d9

mariosasko marked this pull request as ready for review November 28, 2023 15:16

mariosasko requested a review from lhoestq November 28, 2023 15:17

lhoestq approved these changes Nov 28, 2023

mariosasko merged commit 15b50e9 into main Nov 28, 2023

mariosasko deleted the dill-refactor branch November 28, 2023 16:29

lhoestq mentioned this pull request Dec 19, 2023

Cache backward compatibility with 2.15.0 #6514

Merged

albertvillanova mentioned this pull request Feb 9, 2024

Cannot load the dataset go_emotions #6655

Open
