Streaming for the Json loader #2638
thomwolf
left a comment
I think it's probably better in the end indeed
albertvillanova
left a comment
Maybe we should benchmark the potential impact on performance.
Using the `json` package from the Python Standard Library instead of `pyarrow.json` could make loading slower.
There are benchmarks comparing different JSON packages, and the Standard Library one is among the slowest:
- https://github.com/ultrajson/ultrajson#benchmarks
- https://github.com/ijl/orjson#performance
I don't know how the performance of `pyarrow.json` compares, but maybe we should check it.
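A rough way to run such a check is sketched below. It is only an illustration, not a measurement from this PR: the helper names (`make_json_lines`, `parse_with_stdlib`, `parse_with_pyarrow`, `benchmark`) and the synthetic rows are assumptions, the input is assumed to be JSON Lines, and the `pyarrow` timing is skipped when `pyarrow` is not installed.

```python
import io
import json
import timeit

try:
    import pyarrow.json as paj  # optional; only needed for the comparison
except ImportError:
    paj = None


def make_json_lines(n_rows):
    """Build a synthetic JSON Lines payload (hypothetical data) as bytes."""
    rows = ({"id": i, "text": f"row {i}", "score": i / 10} for i in range(n_rows))
    return "\n".join(json.dumps(row) for row in rows).encode("utf-8")


def parse_with_stdlib(data):
    """Parse one JSON object per line with the standard library."""
    return [json.loads(line) for line in data.splitlines()]


def parse_with_pyarrow(data):
    """Parse the same payload in one call with pyarrow.json.read_json."""
    return paj.read_json(io.BytesIO(data))


def benchmark(n_rows=10_000, repeat=3):
    """Return the best wall-clock time in seconds for each available parser."""
    data = make_json_lines(n_rows)
    results = {
        "json.loads": min(
            timeit.repeat(lambda: parse_with_stdlib(data), number=1, repeat=repeat)
        )
    }
    if paj is not None:
        results["pyarrow.json.read_json"] = min(
            timeit.repeat(lambda: parse_with_pyarrow(data), number=1, repeat=repeat)
        )
    return results


if __name__ == "__main__":
    for name, seconds in benchmark().items():
        print(f"{name}: {seconds:.4f} s")
```

The numbers depend heavily on row shape and file size, so a real comparison should use files representative of the datasets being loaded.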
I tested locally, and the builtin Therefore I switched back to using
albertvillanova
left a comment
Thanks!! 🤗
It was not using `open` in the builder. Therefore `pyarrow.json.read_json` was downloading the full file to start yielding rows.

Moreover, it appeared that `pyarrow.json.read_json` was not really suited for streaming as it was downloading too much data and failing if `block_size` was not properly configured (related to #2573).

So I switched to using `open`, which is extended to support reading from a remote file progressively, and I removed the pyarrow json reader, which was not practical.

Instead, I'm using the classical `json.loads` from the standard library.
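
The idea of yielding rows progressively with `json.loads` can be sketched as follows. This is a minimal illustration, not the builder code from this PR: it assumes JSON Lines input (one object per line), `iter_json_rows` is a hypothetical helper name, and an in-memory `io.BytesIO` stands in for the progressively read remote file.

```python
import io
import json


def iter_json_rows(fileobj):
    """Yield one parsed row per non-empty line, reading the file lazily.

    Only the lines consumed so far are read from `fileobj`, so rows can be
    yielded before the rest of the file has been fetched.
    """
    for line in fileobj:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


if __name__ == "__main__":
    # Stand-in for a file object opened on a remote JSON Lines file.
    remote_like = io.BytesIO(b'{"a": 1}\n\n{"a": 2}\n')
    for row in iter_json_rows(remote_like):
        print(row)
```

Because the generator pulls one line at a time, the first row is available as soon as the first line has been read, which is the property the streaming mode relies on.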