-
Notifications
You must be signed in to change notification settings - Fork 1.8k
fix: create file for empty stream #16342
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Maybe add same test for write_parquet and write_json? I think they should have same behavior. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @chenkovsky -- the only question I have about this PR is how it works when writing partitioned output (aka into a directory).
The only thing I think the PR needs is a test showing what happens when writing to a directory.
I am not 100% sure what the expected behavior would be in this case, and I think either one is probably reasonable.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @chenkovsky 🙏
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @chenkovsky
| ) | ||
| .await?; | ||
| if num_rows == 0 { | ||
| // If no rows were written, then no files are output either. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You say now row => no file was created.
But then you say write an empty recordbatch => ensure a file gets created.
Except an empty recordbatch has no rows (at least when written to a parquet file).
Your 2 sentences don't make sense together.
In practice, this PR caused a regression: we cannot write empty recordbatch to parquet anymore, as the code here tries to write it a second time, and we get an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, I didn't reproduce you problem. could you please share your test. I will check it ASAP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb I think we need to define 'empty' in datafusion clearly. currently it's vec![]
| Ok(vec![]) |
vec![empty_record_batch].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, I write one empty RecordBatch. Here is a short repro of the issue: brunal@aed5316.
I don't think num_rows should be used to determine whether a file was created.
|
User facing breakage: one cannot explicitly write an empty recordbatch that has a schema anymore. The tests in the PR don't have a schema so they don't reveal the issue. |
This reverts commit 4084894.
The test fails due to apache#16342: datafusion tries to write the file twice.
This reverts commit 4084894. It adds a test showcasing the functionality that the commit above broke: writing a parquet file from an empty RecordBatch.
This reverts commit 4084894. It adds a test showcasing the functionality that the commit above broke: writing a parquet file from an empty RecordBatch.
This reverts commit 4084894. It adds a test showcasing the functionality that the commit above broke: writing a parquet file from an empty RecordBatch.
* Revert "fix: create file for empty stream (#16342)" This reverts commit 4084894. It adds a test showcasing the functionality that the commit above broke: writing a parquet file from an empty RecordBatch. * Add verification that the schema is correct --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Rationale for this change
file won't be created for empty dataframe
What changes are included in this PR?
create file if row number is zero
Are these changes tested?
UT
Are there any user-facing changes?
No