Skip to content

Conversation

@lhoestq
Copy link
Member

@lhoestq lhoestq commented Jul 13, 2021

It was not using open in the builder. Therefore pyarrow.json.read_json was downloading the full file to start yielding rows.

Moreover, it appeared that pyarrow.json.read_json was not really suited for streaming as it was downloading too much data and failing if block_size was not properly configured (related to #2573).

So I switched to using open which is extended to support reading from remote file progressively, and I removed the pyarrow json reader which was not practical.
Instead, I'm using the classical json.loads from the standard library.

@lhoestq lhoestq requested a review from albertvillanova July 13, 2021 14:37
Copy link
Member

@thomwolf thomwolf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's probably better in the end indeed

@thomwolf
Copy link
Member

A note is that I think we should add a few indicator of status (as mentioned by @stas00 in #2649), probably at the (1) downloading, (2) extracting and (3) reading steps. In particular when loading many very large files it's interesting to know a bit where we are in the process.

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, we should benchmark potential impacts on performance.

Using the json package from the Python Standard Library instead of pyarrow.json might have an impact on performance.

There are benchmarks that compare different JSON packages, with the Standard Library one among the worst performant:

I don't know the compared performance of pyarrow.json, but maybe we should check it.

@lhoestq
Copy link
Member Author

lhoestq commented Jul 16, 2021

I tested locally, and the builtin json loader is 4x slower than pyarrow.json. Thanks for the comment @albertvillanova !

Therefore I switched back to using pyarrow.json, but only on the batch that is read. This way we don't have to deal with its block_size, and it only loads in memory one batch at a time.

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!! 🤗

@lhoestq lhoestq merged commit decea05 into master Jul 16, 2021
@lhoestq lhoestq deleted the streaming-for-json branch July 16, 2021 15:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants