Member

@lhoestq lhoestq commented Jun 22, 2021

Continuation of #2247

I added a "parquet" dataset builder, as well as the methods Dataset.from_parquet and Dataset.to_parquet.
As usual, the data is converted to Arrow in batches to avoid loading everything into memory.
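
To illustrate the batching idea, here is a minimal standard-library sketch. It is not the code from this PR: `iter_batches`, `convert_in_batches`, and the `write_batch` callback are hypothetical names, and plain Python lists stand in for Arrow record batches. The point is only that at most `batch_size` rows are held in memory at a time.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `batch_size` rows (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(rows)
    while True:
        # Pull the next chunk lazily; the source is never fully materialized.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def convert_in_batches(
    rows: Iterable[T],
    write_batch: Callable[[List[T]], None],
    batch_size: int = 1000,
) -> int:
    """Pass rows to `write_batch` one chunk at a time and return the row count.

    `write_batch` is an assumption standing in for whatever writes one
    batch to the output file.
    """
    total = 0
    for batch in iter_batches(rows, batch_size):
        write_batch(batch)
        total += len(batch)
    return total
```

Because `rows` can be any iterable, including a generator that reads from disk, peak memory depends on `batch_size` rather than on the size of the input.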

Member Author

lhoestq commented Jun 22, 2021

pyarrow 1.0.0 doesn't support some types in Parquet, so we'll have to bump its minimum version.

Also I still need to add dummy data to test the parquet builder.

@lhoestq lhoestq marked this pull request as ready for review June 23, 2021 16:32
@lhoestq lhoestq requested a review from albertvillanova June 24, 2021 13:11
Member Author

lhoestq commented Jun 24, 2021

I had to bump the minimum pyarrow version to 3.0.0 to properly support parquet.

Everything is ready for review now :)
I reused pretty much the same tests we had for CSV.

Member

@albertvillanova albertvillanova left a comment

Thanks @lhoestq.

Only some small comments/questions...

setup.py Outdated
# Minimum 3.0.0 to support mix of struct and list types in parquet format
# pyarrow 4.0.0 introduced segfault bug, see: https://github.com/huggingface/datasets/pull/2268
- "pyarrow>=1.0.0,!=4.0.0",
+ "pyarrow>=3.0.0,!=4.0.0",
Member

Are we sure that it is a good idea to stop supporting pyarrow < 3.0.0? Just to be sure of this choice... Maybe a softer option would be to set this requirement only for the users that want to use Parquet files...

Member Author

Good point. I just checked, and there are still many projects that use pyarrow 2.0.0 and 1.0.1.
I'll make the change.

Member

@albertvillanova albertvillanova left a comment

Some little fixes.

Member Author

lhoestq commented Jun 30, 2021

Done!
We still allow pyarrow>=1.0.0, but users who want to use the Parquet features are asked to update to pyarrow>=3.0.0.
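
A feature-level check of this kind might look roughly like the sketch below. It is an illustration under stated assumptions, not the check that was merged: `parse_version`, `require_parquet_support`, and the error message are hypothetical, and the version string is passed in explicitly so the example runs without pyarrow installed (in practice it would come from `pyarrow.__version__`).

```python
import re
from typing import Tuple

# Minimum taken from this discussion: Parquet features need pyarrow>=3.0.0.
MIN_PYARROW_FOR_PARQUET = (3, 0, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric components of a version string.

    Simplified on purpose: pre-release suffixes such as "rc1" are ignored,
    and short versions are padded with zeros to three components.
    """
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def require_parquet_support(installed_version: str) -> None:
    """Raise ImportError if the given pyarrow version is too old for Parquet."""
    if parse_version(installed_version) < MIN_PYARROW_FOR_PARQUET:
        raise ImportError(
            "Parquet features require pyarrow>=3.0.0, "
            f"but version {installed_version} is installed. "
            "Please upgrade pyarrow to use them."
        )
```

Calling such a check only from the Parquet code paths keeps the base install requirement at pyarrow>=1.0.0 while still failing early, with a clear message, when a Parquet feature is used on an older version.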

@lhoestq lhoestq merged commit 13434ae into master Jun 30, 2021
@lhoestq lhoestq deleted the parquet branch June 30, 2021 16:31