Remove `Table.__getstate__` and `Table.__setstate__` #6444

LZHgrla · 2023-11-22T17:55:10Z

In distributed training, `os.remove(filename)` may be executed separately by each rank, leading to `FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmprxxxxxxx.arrow'`. This can happen, for example, when a dataset is built on rank 0 and broadcast to the other ranks:

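from torch import distributed as dist

# Build the dataset on rank 0 only, then send it to all other ranks.
if dist.get_rank() == 0:
    dataset = process_dataset(*args, **kwargs)
    objects = [dataset]
else:
    objects = [None]
# The dataset is pickled on rank 0 and unpickled on every receiving rank.
dist.broadcast_object_list(objects, src=0)
dataset = objects[0]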
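
The failure mode can be illustrated without torch. The following is a minimal sketch, assuming a simplified stand-in class (`SpillingTable` is a hypothetical name, not the actual `datasets` code) that writes its payload to a temporary file in `__getstate__` and deletes that file in `__setstate__`. Loading the same pickled bytes twice mimics two receiving ranks on one machine, each unpickling the same broadcast payload.

```python
import os
import pickle
import tempfile


class SpillingTable:
    """Hypothetical stand-in: spills its payload to a temp file when pickled."""

    def __init__(self, payload: bytes):
        self.payload = payload

    def __getstate__(self):
        # Write the payload to disk and pickle only the file path.
        with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".arrow") as tmp:
            tmp.write(self.payload)
        return {"path": tmp.name}

    def __setstate__(self, state):
        filename = state["path"]
        with open(filename, "rb") as f:
            self.payload = f.read()
        # The file is deleted after the first load, so a second load fails.
        os.remove(filename)


# Pickled once, as the source rank would do before broadcasting.
blob = pickle.dumps(SpillingTable(b"rows"))

# First receiver: reads the file, then deletes it.
first = pickle.loads(blob)

# Second receiver: the file referenced by the pickled state is already gone.
try:
    pickle.loads(blob)
    second_error = None
except FileNotFoundError as exc:
    second_error = exc
```

In this sketch the pickled state only points at a file, so it is valid for exactly one load. Any setup where several processes deserialize the same bytes would hit the same error.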

mariosasko · 2023-11-22T18:34:05Z

Thanks for working on this! The issue with pickling objects larger than 4GB seems to be patched in Python 3.8 (the minimal supported version was 3.6 at the time of implementing this), so a simple solution would be removing the Table.__setstate__ and Table.__getstate__ overrides.
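
As a rough illustration of why dropping the overrides helps, here is a minimal sketch using a hypothetical `PlainTable` class with no custom pickling hooks. It is an assumption, not something stated in this thread, that the relevant change in Python 3.8 is pickle protocol 4 (which added support for very large objects) becoming the default protocol. The sketch does not allocate a payload larger than 4GB; it only shows that the state travels inside the pickled bytes and can be loaded any number of times.

```python
import pickle


class PlainTable:
    """Hypothetical stand-in with no __getstate__/__setstate__ overrides."""

    def __init__(self, payload: bytes):
        self.payload = payload


# Protocol 4 added support for very large objects and is the default since Python 3.8.
uses_large_object_protocol = pickle.DEFAULT_PROTOCOL >= 4

# The full state is embedded in the bytes; no temporary file is involved.
blob = pickle.dumps(PlainTable(b"rows"))

# Every receiver can load the same bytes independently.
copies = [pickle.loads(blob) for _ in range(3)]
```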

LZHgrla · 2023-11-23T05:47:26Z

@mariosasko
Cool!
I removed these overrides, and it worked.

All modifications are committed. Ready for review!

HuggingFaceDocBuilderDev · 2023-11-23T05:57:01Z

The documentation is not available anymore as the PR was closed or merged.

mariosasko

LGTM, thanks!

github-actions · 2023-11-23T15:19:42Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005251 / 0.011353 (-0.006102)	0.003804 / 0.011008 (-0.007204)	0.063143 / 0.038508 (0.024635)	0.059409 / 0.023109 (0.036300)	0.255319 / 0.275898 (-0.020579)	0.279194 / 0.323480 (-0.044285)	0.004643 / 0.007986 (-0.003343)	0.002560 / 0.004328 (-0.001768)	0.047490 / 0.004250 (0.043240)	0.039034 / 0.037052 (0.001982)	0.257352 / 0.258489 (-0.001137)	0.293029 / 0.293841 (-0.000812)	0.027548 / 0.128546 (-0.100998)	0.011307 / 0.075646 (-0.064339)	0.210325 / 0.419271 (-0.208946)	0.035161 / 0.043533 (-0.008372)	0.253491 / 0.255139 (-0.001648)	0.272085 / 0.283200 (-0.011115)	0.018924 / 0.141683 (-0.122759)	1.111148 / 1.452155 (-0.341007)	1.178076 / 1.492716 (-0.314641)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.092447 / 0.018006 (0.074441)	0.303680 / 0.000490 (0.303190)	0.000208 / 0.000200 (0.000008)	0.000051 / 0.000054 (-0.000004)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.019087 / 0.037411 (-0.018325)	0.062663 / 0.014526 (0.048137)	0.074651 / 0.176557 (-0.101905)	0.121334 / 0.737135 (-0.615802)	0.076703 / 0.296338 (-0.219636)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.286505 / 0.215209 (0.071295)	2.804942 / 2.077655 (0.727287)	1.481930 / 1.504120 (-0.022190)	1.369485 / 1.541195 (-0.171710)	1.424467 / 1.468490 (-0.044023)	0.556810 / 4.584777 (-4.027967)	2.416338 / 3.745712 (-1.329374)	2.901869 / 5.269862 (-2.367992)	1.827007 / 4.565676 (-2.738669)	0.062252 / 0.424275 (-0.362024)	0.005076 / 0.007607 (-0.002531)	0.343850 / 0.226044 (0.117805)	3.377611 / 2.268929 (1.108683)	1.860214 / 55.444624 (-53.584410)	1.595146 / 6.876477 (-5.281331)	1.627234 / 2.142072 (-0.514838)	0.651027 / 4.805227 (-4.154200)	0.119214 / 6.500664 (-6.381450)	0.043342 / 0.075469 (-0.032127)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.942863 / 1.841788 (-0.898924)	12.484633 / 8.074308 (4.410324)	10.560668 / 10.191392 (0.369276)	0.144647 / 0.680424 (-0.535777)	0.014734 / 0.534201 (-0.519466)	0.286575 / 0.579283 (-0.292708)	0.270913 / 0.434364 (-0.163451)	0.323792 / 0.540337 (-0.216545)	0.419186 / 1.386936 (-0.967750)

PyArrow==latest

Benchmark: benchmark_array_xd.json

metric	read_batch_formatted_as_numpy after write_array2d	read_batch_formatted_as_numpy after write_flattened_sequence	read_batch_formatted_as_numpy after write_nested_sequence	read_batch_unformated after write_array2d	read_batch_unformated after write_flattened_sequence	read_batch_unformated after write_nested_sequence	read_col_formatted_as_numpy after write_array2d	read_col_formatted_as_numpy after write_flattened_sequence	read_col_formatted_as_numpy after write_nested_sequence	read_col_unformated after write_array2d	read_col_unformated after write_flattened_sequence	read_col_unformated after write_nested_sequence	read_formatted_as_numpy after write_array2d	read_formatted_as_numpy after write_flattened_sequence	read_formatted_as_numpy after write_nested_sequence	read_unformated after write_array2d	read_unformated after write_flattened_sequence	read_unformated after write_nested_sequence	write_array2d	write_flattened_sequence	write_nested_sequence
new / old (diff)	0.005315 / 0.011353 (-0.006038)	0.003548 / 0.011008 (-0.007460)	0.049271 / 0.038508 (0.010763)	0.055198 / 0.023109 (0.032089)	0.275940 / 0.275898 (0.000042)	0.307637 / 0.323480 (-0.015843)	0.003997 / 0.007986 (-0.003988)	0.002544 / 0.004328 (-0.001785)	0.050381 / 0.004250 (0.046130)	0.041158 / 0.037052 (0.004105)	0.281519 / 0.258489 (0.023030)	0.308085 / 0.293841 (0.014244)	0.030464 / 0.128546 (-0.098083)	0.010690 / 0.075646 (-0.064957)	0.057458 / 0.419271 (-0.361814)	0.032814 / 0.043533 (-0.010719)	0.282435 / 0.255139 (0.027296)	0.301342 / 0.283200 (0.018142)	0.017556 / 0.141683 (-0.124127)	1.159423 / 1.452155 (-0.292732)	1.177344 / 1.492716 (-0.315372)

Benchmark: benchmark_getitem_100B.json

metric	get_batch_of_1024_random_rows	get_batch_of_1024_rows	get_first_row	get_last_row
new / old (diff)	0.091086 / 0.018006 (0.073079)	0.305316 / 0.000490 (0.304826)	0.000218 / 0.000200 (0.000019)	0.000054 / 0.000054 (-0.000000)

Benchmark: benchmark_indices_mapping.json

metric	select	shard	shuffle	sort	train_test_split
new / old (diff)	0.021832 / 0.037411 (-0.015579)	0.071055 / 0.014526 (0.056529)	0.082982 / 0.176557 (-0.093574)	0.119966 / 0.737135 (-0.617169)	0.083539 / 0.296338 (-0.212800)

Benchmark: benchmark_iterating.json

metric	read 5000	read 50000	read_batch 50000 10	read_batch 50000 100	read_batch 50000 1000	read_formatted numpy 5000	read_formatted pandas 5000	read_formatted tensorflow 5000	read_formatted torch 5000	read_formatted_batch numpy 5000 10	read_formatted_batch numpy 5000 1000	shuffled read 5000	shuffled read 50000	shuffled read_batch 50000 10	shuffled read_batch 50000 100	shuffled read_batch 50000 1000	shuffled read_formatted numpy 5000	shuffled read_formatted_batch numpy 5000 10	shuffled read_formatted_batch numpy 5000 1000
new / old (diff)	0.302501 / 0.215209 (0.087292)	2.936347 / 2.077655 (0.858692)	1.601658 / 1.504120 (0.097538)	1.467267 / 1.541195 (-0.073928)	1.514656 / 1.468490 (0.046166)	0.563934 / 4.584777 (-4.020843)	2.513715 / 3.745712 (-1.231997)	2.813014 / 5.269862 (-2.456847)	1.773243 / 4.565676 (-2.792433)	0.063208 / 0.424275 (-0.361067)	0.004979 / 0.007607 (-0.002628)	0.360694 / 0.226044 (0.134650)	3.520578 / 2.268929 (1.251650)	1.975369 / 55.444624 (-53.469255)	1.691257 / 6.876477 (-5.185220)	1.730872 / 2.142072 (-0.411200)	0.655366 / 4.805227 (-4.149861)	0.146043 / 6.500664 (-6.354621)	0.041386 / 0.075469 (-0.034083)

Benchmark: benchmark_map_filter.json

metric	filter	map fast-tokenizer batched	map identity	map identity batched	map no-op batched	map no-op batched numpy	map no-op batched pandas	map no-op batched pytorch	map no-op batched tensorflow
new / old (diff)	0.979840 / 1.841788 (-0.861948)	12.456924 / 8.074308 (4.382616)	10.938595 / 10.191392 (0.747203)	0.133853 / 0.680424 (-0.546571)	0.015744 / 0.534201 (-0.518457)	0.289585 / 0.579283 (-0.289698)	0.291143 / 0.434364 (-0.143221)	0.328109 / 0.540337 (-0.212228)	0.561897 / 1.386936 (-0.825039)

Enhance the robustness of Table's __setstate__

ee984bb

LZHgrla marked this pull request as draft November 22, 2023 17:55

Update table.py

dc20ad8

LZHgrla marked this pull request as ready for review November 22, 2023 17:59

LZHgrla mentioned this pull request Nov 22, 2023

[Feature] Support LLaVA InternLM/xtuner#196

Merged

16 tasks

fix

57e7e53

LZHgrla changed the title ~~Enhance the robustness of Table's __setstate__~~ Remove Table.__getstate__ and Table.__setstate__ Nov 23, 2023

mariosasko and others added 2 commits November 23, 2023 15:26

Remove unused function and variable

34ca968

fix pre-commit

0c33cbd

mariosasko approved these changes Nov 23, 2023

mariosasko merged commit 05ec66c into huggingface:main Nov 23, 2023
