@chenkovsky
Contributor

Which issue does this PR close?

Rationale for this change

A file isn't created when writing an empty DataFrame.

What changes are included in this PR?

Create the file even if the row count is zero.
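
To illustrate the intended behavior in isolation, here is a minimal std-only sketch, not the DataFusion implementation. `write_csv_always` is a hypothetical helper invented for this example; it assumes simple values that need no CSV quoting. The file is created before any row is written, so it exists even when there are zero rows.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Hypothetical helper (not a DataFusion API): writes `rows` as CSV and
/// returns the number of data rows written. No quoting or escaping is done.
fn write_csv_always(path: &Path, header: &[&str], rows: &[Vec<String>]) -> io::Result<usize> {
    // Creating the file up front means it exists even when `rows` is empty.
    let mut out = BufWriter::new(File::create(path)?);
    writeln!(out, "{}", header.join(","))?;
    for row in rows {
        writeln!(out, "{}", row.join(","))?;
    }
    out.flush()?;
    Ok(rows.len())
}
```

Called with an empty `rows` slice, this returns `Ok(0)` and leaves a file containing only the header line.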

Are these changes tested?

Yes, with unit tests.

Are there any user-facing changes?

No

@github-actions github-actions bot added core Core DataFusion crate datasource Changes to the datasource crate labels Jun 9, 2025
@mmooyyii

Maybe add the same test for write_parquet and write_json? I think they should have the same behavior.

Contributor

@alamb alamb left a comment

Thank you @chenkovsky -- the only question I have about this PR is how it works when writing partitioned output (aka into a directory).

The only thing I think the PR needs is a test showing what happens when writing to a directory.

I am not 100% sure what the expected behavior would be in this case, and I think either one is probably reasonable.

Contributor

@alamb alamb left a comment

Thank you @chenkovsky 🙏

Contributor

@alamb alamb left a comment

Thank you @chenkovsky

@alamb alamb merged commit 4084894 into apache:main Jun 18, 2025
28 checks passed
)
.await?;
if num_rows == 0 {
// If no rows were written, then no files are output either.
Contributor

You say no rows => no file was created.

But then you say write an empty recordbatch => ensure a file gets created.

Except an empty recordbatch has no rows (at least when written to a parquet file).

Your 2 sentences don't make sense together.

In practice, this PR caused a regression: we cannot write an empty RecordBatch to Parquet anymore, because the code here tries to write it a second time and we get an error.

Contributor Author

Hi, I couldn't reproduce your problem. Could you please share your test? I will check it ASAP.

Contributor Author

@alamb I think we need to define 'empty' in DataFusion clearly. Currently it's vec![]; I guess brunal uses vec![empty_record_batch].
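
The two meanings of "empty" can be told apart in a small std-only model. This is only a sketch: `Batch`, `Emptiness`, `classify`, `total_rows`, and `first_schema` are hypothetical names made up for illustration, not Arrow or DataFusion types. Both inputs have a total row count of zero, but in this toy model only the second one carries a schema.

```rust
/// Toy stand-in for a record batch (hypothetical, not arrow's RecordBatch).
struct Batch {
    schema: Vec<String>,
    num_rows: usize,
}

#[derive(Debug, PartialEq)]
enum Emptiness {
    /// `vec![]`: no batches at all.
    NoBatches,
    /// `vec![empty_record_batch]`: batches exist, but none has rows.
    OnlyEmptyBatches,
    HasRows,
}

fn classify(batches: &[Batch]) -> Emptiness {
    if batches.is_empty() {
        Emptiness::NoBatches
    } else if batches.iter().all(|b| b.num_rows == 0) {
        Emptiness::OnlyEmptyBatches
    } else {
        Emptiness::HasRows
    }
}

/// The row count alone cannot distinguish the first two cases.
fn total_rows(batches: &[Batch]) -> usize {
    batches.iter().map(|b| b.num_rows).sum()
}

/// Schema of the first batch, if there is one.
fn first_schema(batches: &[Batch]) -> Option<&[String]> {
    batches.first().map(|b| b.schema.as_slice())
}
```

Because `total_rows` returns 0 for both `NoBatches` and `OnlyEmptyBatches`, any logic keyed on the row count treats them identically.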

Contributor

Indeed, I write one empty RecordBatch. Here is a short repro of the issue: brunal@aed5316.

I don't think num_rows should be used to determine whether a file was created.
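
A simplified model of the concern, under stated assumptions: this is not the DataFusion code, and `Sink`, `Batch`, `write_batches`, `finish_by_row_count`, and `finish_by_file_flag` are hypothetical names. The toy sink rejects a second write to the same path, which stands in for the "written twice" error described above. It contrasts a check based on the row count with one based on whether a file was actually created.

```rust
use std::collections::HashSet;

/// Toy stand-in for a record batch; only the row count matters here.
struct Batch {
    num_rows: usize,
}

/// Toy in-memory sink that refuses to create the same path twice.
#[derive(Default)]
struct Sink {
    created: HashSet<String>,
}

impl Sink {
    fn create(&mut self, path: &str) -> Result<(), String> {
        if self.created.insert(path.to_string()) {
            Ok(())
        } else {
            Err(format!("{} was already written", path))
        }
    }
}

/// Creates the file on the first batch (even an empty one) and returns
/// (total rows, whether a file was created).
fn write_batches(sink: &mut Sink, path: &str, batches: &[Batch]) -> Result<(usize, bool), String> {
    let mut rows = 0;
    let mut created = false;
    for batch in batches {
        if !created {
            sink.create(path)?;
            created = true;
        }
        rows += batch.num_rows;
    }
    Ok((rows, created))
}

/// Row-count check: assumes "zero rows" means "no file yet".
fn finish_by_row_count(sink: &mut Sink, path: &str, batches: &[Batch]) -> Result<(), String> {
    let (rows, _) = write_batches(sink, path, batches)?;
    if rows == 0 {
        // Fails if an empty batch already caused the file to be created.
        sink.create(path)?;
    }
    Ok(())
}

/// File-flag check: writes the placeholder only if no file exists yet.
fn finish_by_file_flag(sink: &mut Sink, path: &str, batches: &[Batch]) -> Result<(), String> {
    let (_, created) = write_batches(sink, path, batches)?;
    if !created {
        sink.create(path)?;
    }
    Ok(())
}
```

In this model, `finish_by_row_count` succeeds for an empty batch list but returns an error for a list holding one zero-row batch, while `finish_by_file_flag` succeeds for both.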

@brunal
Contributor

brunal commented Jul 4, 2025

User-facing breakage: one can no longer explicitly write an empty RecordBatch that has a schema.

The tests in the PR don't have a schema so they don't reveal the issue.

brunal added a commit to brunal/datafusion that referenced this pull request Jul 4, 2025
brunal added a commit to brunal/datafusion that referenced this pull request Jul 5, 2025
The test fails due to apache#16342:
datafusion tries to write the file twice.
brunal added a commit to brunal/datafusion that referenced this pull request Jul 7, 2025
This reverts commit 4084894.

It adds a test showcasing the functionality that the commit above broke:
writing a parquet file from an empty RecordBatch.
brunal added a commit to brunal/datafusion that referenced this pull request Jul 8, 2025
This reverts commit 4084894.

It adds a test showcasing the functionality that the commit above broke:
writing a parquet file from an empty RecordBatch.
alamb added a commit that referenced this pull request Jul 8, 2025
* Revert "fix: create file for empty stream (#16342)"

This reverts commit 4084894.

It adds a test showcasing the functionality that the commit above broke:
writing a parquet file from an empty RecordBatch.

* Add verification that the schema is correct

---------

Co-authored-by: Andrew Lamb <[email protected]>
@alamb
Contributor

alamb commented Jul 8, 2025

Development

Successfully merging this pull request may close these issues.

How to write csv file to disk from a empty dataframe?

4 participants