@lhoestq lhoestq commented Jul 19, 2021

Currently some files can't be read with the default parameters of the JSON lines reader.
For example this one:
https://huggingface.co/datasets/thomwolf/codeparrot/resolve/main/file-000000000006.json.gz

raises a pyarrow error:

ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)

The block size used is pyarrow's default (related to this jira issue).

To fix this, block_size now increases automatically whenever a straddling error occurs while parsing a batch of JSON lines.

block_size starts at chunksize // 32 in order to leverage multithreading, and it doubles every time a straddling error occurs. It is reset to that starting value for each file.
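The retry logic can be sketched roughly as below. This is a minimal illustration, not the code merged in this PR: the names `read_json_batch` and `parse_with_pyarrow` are hypothetical, the injectable `parse` argument exists only to keep the sketch testable, and the `block_size > len(batch)` guard is an assumption added to avoid retrying once a larger block can no longer help. It relies on `pyarrow.ArrowInvalid` being a subclass of `ValueError`.

```python
import io


def parse_with_pyarrow(batch: bytes, block_size: int):
    """Parse a batch of JSON lines with pyarrow using the given block size."""
    # Imported lazily so the retry helper below has no hard pyarrow dependency.
    import pyarrow.json as paj

    return paj.read_json(
        io.BytesIO(batch),
        read_options=paj.ReadOptions(block_size=block_size),
    )


def read_json_batch(batch: bytes, chunksize: int, parse=parse_with_pyarrow):
    """Parse `batch`, doubling block_size after each straddling error.

    Returns the parsed result and the block size that finally worked.
    (Hypothetical helper; names and return shape are illustrative.)
    """
    # Start small so one chunk spans several blocks.
    block_size = max(chunksize // 32, 1)
    while True:
        try:
            return parse(batch, block_size), block_size
        except ValueError as err:  # pyarrow.ArrowInvalid subclasses ValueError
            # Assumption: only retry on straddling errors, and give up once
            # the block is already larger than the whole batch.
            if "straddling" not in str(err) or block_size > len(batch):
                raise
            block_size *= 2
```

Calling the helper once per batch, with the same `chunksize` each time, restarts from `chunksize // 32`, which mirrors the reset described above.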

cc @thomwolf @albertvillanova

@lhoestq lhoestq merged commit 959fc73 into master Jul 19, 2021
@lhoestq lhoestq deleted the increase-json-reader-block-size-if-needed branch July 19, 2021 17:51