Skip to content

Conversation

@alvarobartt
Copy link
Member

As spotted by @cccntu in #4502, there's a logic bug in ArrowWriter.write_batch as the if-statement to handle the empty batches as detailed in the docstrings of the function ("Ignores the batch if it appears to be empty, preventing a potential schema update of unknown types."), the current if-statement is not handling properly writer.write_batch({}) as an error is triggered.

Also, if we add a regression test in test_arrow_writer.py::test_write_batch before applying the fix, the test will fail as when trying to write an empty batch as follows:

=================================================================================== short test summary info ===================================================================================
FAILED tests/test_arrow_writer.py::test_write_batch[None-None] - ValueError: Schema and number of arrays unequal
FAILED tests/test_arrow_writer.py::test_write_batch[None-1] - ValueError: Schema and number of arrays unequal
FAILED tests/test_arrow_writer.py::test_write_batch[None-10] - ValueError: Schema and number of arrays unequal
FAILED tests/test_arrow_writer.py::test_write_batch[fields1-None] - ValueError: Schema and number of arrays unequal
FAILED tests/test_arrow_writer.py::test_write_batch[fields1-1] - ValueError: Schema and number of arrays unequal
FAILED tests/test_arrow_writer.py::test_write_batch[fields1-10] - ValueError: Schema and number of arrays unequal
FAILED tests/test_arrow_writer.py::test_write_batch[fields2-None] - ValueError: Schema and number of arrays unequal
FAILED tests/test_arrow_writer.py::test_write_batch[fields2-1] - ValueError: Schema and number of arrays unequal
FAILED tests/test_arrow_writer.py::test_write_batch[fields2-10] - ValueError: Schema and number of arrays unequal
======================================================================== 9 failed, 73 deselected, 7 warnings in 0.81s =========================================================================

So the batch is not ignored when empty, as batch_examples={} won't match the condition if batch_examples: ....

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Jun 16, 2022

The documentation is not available anymore as the PR was closed or merged.

@alvarobartt alvarobartt changed the title Fix if-statement in ArrowWriter.write_batch to handle empty batches Add regression test for ArrowWriter.write_batch when batch is empty Jun 16, 2022
@alvarobartt
Copy link
Member Author

As mentioned by @lhoestq, the current behavior is correct and we should not expect batches with different columns, in that case, the if should fail, as the values of the batch can be empty, but not the actual batch_examples value.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks !

@lhoestq lhoestq merged commit 4a3c286 into huggingface:master Jun 16, 2022
khushmeeet pushed a commit to khushmeeet/datasets that referenced this pull request Jun 23, 2022
…huggingface#4510)

* Fix if-statement logic in `write_batch`

* Add regression test on `test_write_batch`

* Replace AND-condition with OR in `write_batch`

* Remove `write_batch({})` in `test_write_batch`

* Add valid batch as `write_batch` arg

* Revert if-statement change in `write_batch`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants