Increase json reader block_size automatically #2676

lhoestq · 2021-07-19T14:51:14Z

Currently some files can't be read with the default parameters of the JSON lines reader.
For example this one:
https://huggingface.co/datasets/thomwolf/codeparrot/resolve/main/file-000000000006.json.gz

raises a pyarrow error:

ArrowInvalid: straddling object straddles two block boundaries (try to increase block size?)

The block size used here is pyarrow's default (related to this jira issue).

To fix this, I made the block_size increase automatically whenever a straddling error occurs while parsing a batch of JSON lines.

The block_size starts at chunksize // 32 in order to leverage multithreading, and it doubles every time a straddling error occurs. The block_size is then reset for each file.
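The retry behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the function names `parse_with_pyarrow` and `read_batch`, the default `chunksize` value, and the `block_size > len(batch)` stop condition are assumptions added for the example. It relies on `pyarrow.json.read_json` and `pyarrow.json.ReadOptions(block_size=...)`, and on `pyarrow.ArrowInvalid` being a subclass of `ValueError`.

```python
import io


def parse_with_pyarrow(batch: bytes, block_size: int):
    """Parse a batch of JSON lines with an explicit pyarrow block size."""
    # Imported lazily so the retry logic below can be exercised without pyarrow.
    import pyarrow.json as paj

    return paj.read_json(
        io.BytesIO(batch),
        read_options=paj.ReadOptions(block_size=block_size),
    )


def read_batch(batch: bytes, chunksize: int = 10 << 20, parse=parse_with_pyarrow):
    """Parse `batch`, doubling block_size after each straddling error.

    Returns (parsed_result, final_block_size). Because block_size is a local
    variable, every call starts again from chunksize // 32.
    The default chunksize here is an assumed value for illustration.
    """
    block_size = chunksize // 32  # small blocks so pyarrow can parse in parallel
    while True:
        try:
            return parse(batch, block_size), block_size
        except ValueError as err:  # pyarrow.ArrowInvalid subclasses ValueError
            # Errors that are not straddling errors are raised directly.
            # Assumed safeguard: stop retrying once a single block already
            # covers the whole batch, since growing further cannot help.
            if "straddling" not in str(err) or block_size > len(batch):
                raise
            block_size *= 2
```

With pyarrow installed, `read_batch(raw_bytes)` would return a `pyarrow.Table` together with the block size that finally worked. Passing a custom `parse` callable makes the doubling logic easy to check in isolation.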

cc @thomwolf @albertvillanova

lhoestq added 2 commits July 19, 2021 16:42

increase json reader block size automatically

4f2d379

directly raise errors that are not straddling errors

e799f64

lhoestq merged commit 959fc73 into master Jul 19, 2021

lhoestq deleted the increase-json-reader-block-size-if-needed branch July 19, 2021 17:51
