Add `fsspec` support for `to_json`, `to_csv`, and `to_parquet` #6096

alvarobartt · 2023-07-28T16:36:59Z

Hi to whoever is reading this! 🤗 (Most likely @mariosasko)

What's in this PR?

This PR replaces Python's built-in `open` with `fsspec.open` and adds a `storage_options` argument to the `to_json`, `to_csv`, and `to_parquet` methods, so that users can export any 🤗 Dataset to a file on any fsspec-supported file system, as requested in #6086.
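
For readers unfamiliar with the pattern, here is a minimal sketch of what swapping the built-in `open` for `fsspec.open` looks like. It is not the code from this PR: `write_json_lines` is a hypothetical helper, and the `memory://` URL is only a stand-in for a remote store so that the snippet runs without credentials.

```python
import json

import fsspec


def write_json_lines(records, path, storage_options=None):
    """Write dicts as JSON Lines to any fsspec URL and return the byte count.

    Hypothetical helper for illustration; not the implementation in this PR.
    """
    written = 0
    # fsspec.open selects the backend from the URL protocol and forwards the
    # extra keyword arguments (the storage options) to that file system.
    with fsspec.open(path, "wb", **(storage_options or {})) as f:
        for record in records:
            line = (json.dumps(record) + "\n").encode("utf-8")
            f.write(line)
            written += len(line)
    return written


# The in-memory file system stands in for a remote store such as S3 or GCS.
rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
n_bytes = write_json_lines(rows, "memory://demo/rows.jsonl")

# Read the file back through fsspec to confirm the round trip.
with fsspec.open("memory://demo/rows.jsonl", "rb") as f:
    content = f.read().decode("utf-8")
```

On the user side, the new argument would presumably be passed as `ds.to_json("s3://my-bucket/rows.jsonl", storage_options={...})`; the bucket name is a placeholder, and the accepted option keys depend on the fsspec backend in use.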

What's missing in this PR?

In the `to_json`, `to_csv`, and `to_parquet` docstrings, I've marked the new `storage_options` argument as added in 2.15.0, so we should check that before merging in case we want to target 2.14.2 instead.

Additionally, should we also add fsspec support for the from_csv, from_json, and from_parquet methods? If you want me to do so @mariosasko just let me know and I'll create another PR to support that too!

Fix #6086.

HuggingFaceDocBuilderDev · 2023-07-28T16:43:13Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

alvarobartt · 2024-03-05T08:08:29Z

Hi @lhoestq @mariosasko, I just realised this PR is still open. Should we close it if this isn't something to include in datasets, or should we merge it? Let me know whatever you decide 🤗

lhoestq

Thanks for the ping! It looks good to me, I just added a few suggestions before merging:

src/datasets/arrow_dataset.py

lhoestq

LGTM, thanks!

github-actions · 2024-03-06T11:18:37Z

PyArrow==8.0.0

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005011 / 0.011353 (-0.006342) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003203 / 0.011008 (-0.007806) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.064033 / 0.038508 (0.025524) |
| read_batch_unformated after write_array2d | 0.029152 / 0.023109 (0.006043) |
| read_batch_unformated after write_flattened_sequence | 0.242884 / 0.275898 (-0.033014) |
| read_batch_unformated after write_nested_sequence | 0.263517 / 0.323480 (-0.059963) |
| read_col_formatted_as_numpy after write_array2d | 0.004088 / 0.007986 (-0.003898) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002570 / 0.004328 (-0.001759) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.049061 / 0.004250 (0.044811) |
| read_col_unformated after write_array2d | 0.040170 / 0.037052 (0.003117) |
| read_col_unformated after write_flattened_sequence | 0.263305 / 0.258489 (0.004816) |
| read_col_unformated after write_nested_sequence | 0.286255 / 0.293841 (-0.007586) |
| read_formatted_as_numpy after write_array2d | 0.028206 / 0.128546 (-0.100340) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010337 / 0.075646 (-0.065309) |
| read_formatted_as_numpy after write_nested_sequence | 0.206235 / 0.419271 (-0.213036) |
| read_unformated after write_array2d | 0.038182 / 0.043533 (-0.005351) |
| read_unformated after write_flattened_sequence | 0.246486 / 0.255139 (-0.008653) |
| read_unformated after write_nested_sequence | 0.263077 / 0.283200 (-0.020122) |
| write_array2d | 0.017850 / 0.141683 (-0.123833) |
| write_flattened_sequence | 1.173921 / 1.452155 (-0.278234) |
| write_nested_sequence | 1.255583 / 1.492716 (-0.237133) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.090278 / 0.018006 (0.072272) |
| get_batch_of_1024_rows | 0.298146 / 0.000490 (0.297657) |
| get_first_row | 0.000215 / 0.000200 (0.000015) |
| get_last_row | 0.000043 / 0.000054 (-0.000011) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.018021 / 0.037411 (-0.019390) |
| shard | 0.061434 / 0.014526 (0.046908) |
| shuffle | 0.072617 / 0.176557 (-0.103939) |
| sort | 0.119063 / 0.737135 (-0.618072) |
| train_test_split | 0.073997 / 0.296338 (-0.222341) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.288496 / 0.215209 (0.073287) |
| read 50000 | 2.794943 / 2.077655 (0.717288) |
| read_batch 50000 10 | 1.538299 / 1.504120 (0.034179) |
| read_batch 50000 100 | 1.399164 / 1.541195 (-0.142031) |
| read_batch 50000 1000 | 1.419104 / 1.468490 (-0.049386) |
| read_formatted numpy 5000 | 0.566147 / 4.584777 (-4.018630) |
| read_formatted pandas 5000 | 2.386687 / 3.745712 (-1.359025) |
| read_formatted tensorflow 5000 | 2.723584 / 5.269862 (-2.546278) |
| read_formatted torch 5000 | 1.699161 / 4.565676 (-2.866515) |
| read_formatted_batch numpy 5000 10 | 0.062526 / 0.424275 (-0.361750) |
| read_formatted_batch numpy 5000 1000 | 0.004927 / 0.007607 (-0.002680) |
| shuffled read 5000 | 0.345132 / 0.226044 (0.119087) |
| shuffled read 50000 | 3.389634 / 2.268929 (1.120706) |
| shuffled read_batch 50000 10 | 1.898012 / 55.444624 (-53.546612) |
| shuffled read_batch 50000 100 | 1.599050 / 6.876477 (-5.277427) |
| shuffled read_batch 50000 1000 | 1.614289 / 2.142072 (-0.527783) |
| shuffled read_formatted numpy 5000 | 0.656716 / 4.805227 (-4.148511) |
| shuffled read_formatted_batch numpy 5000 10 | 0.118480 / 6.500664 (-6.382184) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.041913 / 0.075469 (-0.033557) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 0.968676 / 1.841788 (-0.873111) |
| map fast-tokenizer batched | 11.184668 / 8.074308 (3.110360) |
| map identity | 9.249912 / 10.191392 (-0.941480) |
| map identity batched | 0.141139 / 0.680424 (-0.539285) |
| map no-op batched | 0.014207 / 0.534201 (-0.519994) |
| map no-op batched numpy | 0.287603 / 0.579283 (-0.291680) |
| map no-op batched pandas | 0.262792 / 0.434364 (-0.171572) |
| map no-op batched pytorch | 0.340239 / 0.540337 (-0.200099) |
| map no-op batched tensorflow | 0.437471 / 1.386936 (-0.949465) |

PyArrow==latest

Benchmark: benchmark_array_xd.json

| metric | new / old (diff) |
|---|---|
| read_batch_formatted_as_numpy after write_array2d | 0.005268 / 0.011353 (-0.006085) |
| read_batch_formatted_as_numpy after write_flattened_sequence | 0.003142 / 0.011008 (-0.007866) |
| read_batch_formatted_as_numpy after write_nested_sequence | 0.049333 / 0.038508 (0.010825) |
| read_batch_unformated after write_array2d | 0.029558 / 0.023109 (0.006449) |
| read_batch_unformated after write_flattened_sequence | 0.270716 / 0.275898 (-0.005182) |
| read_batch_unformated after write_nested_sequence | 0.293834 / 0.323480 (-0.029646) |
| read_col_formatted_as_numpy after write_array2d | 0.004285 / 0.007986 (-0.003701) |
| read_col_formatted_as_numpy after write_flattened_sequence | 0.002703 / 0.004328 (-0.001626) |
| read_col_formatted_as_numpy after write_nested_sequence | 0.048857 / 0.004250 (0.044607) |
| read_col_unformated after write_array2d | 0.043456 / 0.037052 (0.006404) |
| read_col_unformated after write_flattened_sequence | 0.286058 / 0.258489 (0.027569) |
| read_col_unformated after write_nested_sequence | 0.313491 / 0.293841 (0.019650) |
| read_formatted_as_numpy after write_array2d | 0.029336 / 0.128546 (-0.099210) |
| read_formatted_as_numpy after write_flattened_sequence | 0.010287 / 0.075646 (-0.065360) |
| read_formatted_as_numpy after write_nested_sequence | 0.057753 / 0.419271 (-0.361518) |
| read_unformated after write_array2d | 0.050867 / 0.043533 (0.007334) |
| read_unformated after write_flattened_sequence | 0.271717 / 0.255139 (0.016578) |
| read_unformated after write_nested_sequence | 0.291468 / 0.283200 (0.008268) |
| write_array2d | 0.018668 / 0.141683 (-0.123015) |
| write_flattened_sequence | 1.137399 / 1.452155 (-0.314755) |
| write_nested_sequence | 1.186315 / 1.492716 (-0.306401) |

Benchmark: benchmark_getitem_100B.json

| metric | new / old (diff) |
|---|---|
| get_batch_of_1024_random_rows | 0.090289 / 0.018006 (0.072283) |
| get_batch_of_1024_rows | 0.297987 / 0.000490 (0.297497) |
| get_first_row | 0.000227 / 0.000200 (0.000027) |
| get_last_row | 0.000044 / 0.000054 (-0.000010) |

Benchmark: benchmark_indices_mapping.json

| metric | new / old (diff) |
|---|---|
| select | 0.021012 / 0.037411 (-0.016399) |
| shard | 0.075046 / 0.014526 (0.060520) |
| shuffle | 0.085295 / 0.176557 (-0.091261) |
| sort | 0.123879 / 0.737135 (-0.613257) |
| train_test_split | 0.086572 / 0.296338 (-0.209766) |

Benchmark: benchmark_iterating.json

| metric | new / old (diff) |
|---|---|
| read 5000 | 0.293350 / 0.215209 (0.078141) |
| read 50000 | 2.875958 / 2.077655 (0.798304) |
| read_batch 50000 10 | 1.586460 / 1.504120 (0.082340) |
| read_batch 50000 100 | 1.467950 / 1.541195 (-0.073245) |
| read_batch 50000 1000 | 1.453478 / 1.468490 (-0.015012) |
| read_formatted numpy 5000 | 0.566083 / 4.584777 (-4.018693) |
| read_formatted pandas 5000 | 2.462582 / 3.745712 (-1.283130) |
| read_formatted tensorflow 5000 | 2.609367 / 5.269862 (-2.660495) |
| read_formatted torch 5000 | 1.709691 / 4.565676 (-2.855985) |
| read_formatted_batch numpy 5000 10 | 0.062928 / 0.424275 (-0.361347) |
| read_formatted_batch numpy 5000 1000 | 0.005040 / 0.007607 (-0.002567) |
| shuffled read 5000 | 0.337997 / 0.226044 (0.111952) |
| shuffled read 50000 | 3.347235 / 2.268929 (1.078306) |
| shuffled read_batch 50000 10 | 1.923940 / 55.444624 (-53.520684) |
| shuffled read_batch 50000 100 | 1.657731 / 6.876477 (-5.218746) |
| shuffled read_batch 50000 1000 | 1.747469 / 2.142072 (-0.394604) |
| shuffled read_formatted numpy 5000 | 0.657061 / 4.805227 (-4.148167) |
| shuffled read_formatted_batch numpy 5000 10 | 0.116655 / 6.500664 (-6.384009) |
| shuffled read_formatted_batch numpy 5000 1000 | 0.040363 / 0.075469 (-0.035106) |

Benchmark: benchmark_map_filter.json

| metric | new / old (diff) |
|---|---|
| filter | 1.011171 / 1.841788 (-0.830617) |
| map fast-tokenizer batched | 11.705905 / 8.074308 (3.631597) |
| map identity | 10.064391 / 10.191392 (-0.127001) |
| map identity batched | 0.141681 / 0.680424 (-0.538743) |
| map no-op batched | 0.014763 / 0.534201 (-0.519438) |
| map no-op batched numpy | 0.286425 / 0.579283 (-0.292858) |
| map no-op batched pandas | 0.271036 / 0.434364 (-0.163328) |
| map no-op batched pytorch | 0.321393 / 0.540337 (-0.218944) |
| map no-op batched tensorflow | 0.424539 / 1.386936 (-0.962397) |

albertvillanova · 2024-03-07T07:20:44Z

Thanks @alvarobartt.

I am linking this PR to the corresponding issue (on the right column, under "Development") and closing the issue.

For future contributions, please add to the PR description the word "fix" followed by the issue number, e.g.:

Fix #6086.

I have edited the PR description to add this.

alvarobartt · 2024-03-07T07:58:15Z

> Thanks @alvarobartt.
>
> I am linking this PR to the corresponding issue (on the right column, under "Development") and closing the issue.
>
> For future contributions, please add to the PR description the word "fix" followed by the issue number, e.g.:
>
> Fix #6086.
>
> I have edited the PR description to add this.

Hi @albertvillanova, fair, I missed that, thanks for the edit and the heads up!

alvarobartt added 2 commits July 28, 2023 17:53

Add fsspec in write method

252b9ca

Add fsspec tests for to_csv, to_json, & to_parquet

8b5a01c

alvarobartt added 3 commits July 28, 2023 19:03

Revert iter_csv_file to keep on using open

de035cb

Add storage_options arg in to_* methods

dba3066

Included for `to_json`, `to_parquet`, and `to_csv` only

Add mockfs and rewrite fsspec unit tests

ef7d6f6

alvarobartt marked this pull request as ready for review July 28, 2023 17:10

alvarobartt added 3 commits July 31, 2023 09:41

Fix version of storage_options in docstring

3ebef04

Merge branch 'main' into fsspec-on-write-to-file

dc2e5f9

Merge branch 'main' into fsspec-on-write-to-file

a49e7f5

lhoestq reviewed Mar 5, 2024

alvarobartt and others added 2 commits March 6, 2024 10:25

Apply suggestions from code review

f21af16

Co-authored-by: Quentin Lhoest <[email protected]>

Merge branch 'main' into fsspec-on-write-to-file

c7907f4

lhoestq approved these changes Mar 6, 2024

lhoestq merged commit e52f4d0 into huggingface:main Mar 6, 2024

albertvillanova mentioned this pull request Mar 7, 2024

Support fsspec in Dataset.to_<format> methods #6086

Closed

alvarobartt deleted the fsspec-on-write-to-file branch May 28, 2024 07:40
