Skip to content

Conversation

@AdamGS
Copy link
Contributor

@AdamGS AdamGS commented Jun 18, 2025

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate labels Jun 18, 2025
@jonathanc-n
Copy link
Contributor

I think we should run some benchmarks here to see if the regression is gone

@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 18, 2025

Definitely, I'll run our benchmarks once I get all tests passing here.

@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 18, 2025

I'm getting a lot of sqllogictest failures, is there a reason to think there something weird going on? I was somewhat open to the idea its all fine until I ran into the last test in union_by_name.slt suddenly passing.

I'll push all of them soon so others can also take a look.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @AdamGS !

I think the reason that so many of the plans changed is that there is rule that skips repartitioning input files if they are past some threshold (repartition_file_min_size) and now that the statistics are enabled by default this code is triggered

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @AdamGS -- this looks great to me and I think the plan changes are also good / reasonable.

I think we should also add a note to the upgrade guide about this change: https://github.com/apache/datafusion/blob/main/docs/source/library-user-guide/upgrading.md

FYI @davisp


######
# When the setting is true, the statistics are gathered
# When the setting is true, statistics are gathered
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to other reviewers, the negative is still tested below:

When the setting is false, the statistics are NOT gathered

@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 18, 2025

Added a short upgrade note

@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 18, 2025

Got a similar test failure to #16448 (issue filed in #16452). I have to conclude its personal at this point, I'll try and find some time to dig into it.

@AdamGS
Copy link
Contributor Author

AdamGS commented Jun 19, 2025

Our benchmarks show this change fixes the performance regression we saw - vortex-data/vortex#3567

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR changes the default of datafusion.execution.collect_statistics from false to true, updating documentation, configuration defaults, upgrade notes, and test expectations to match the new behavior.

  • Update user‐guide and config docs to reflect that statistics are now gathered by default.
  • Amend upgrade guide and example override to the new default.
  • Adjust code comments, configuration struct, and tests (unit and SQL logic tests) for default‐true behavior.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Show a summary per file
File Description
docs/source/user-guide/sql/ddl.md Note updated to describe reading files by default
docs/source/user-guide/configs.md Default value for collect_statistics set to true
docs/source/library-user-guide/upgrading.md Upgrade notes and example override updated
datafusion/sqllogictest/test_files/repartition.slt Physical plan ordering updated
datafusion/sqllogictest/test_files/parquet_statistics.slt Expected stats presence in plan updated
datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt Removed redundant nodes, fixed spelling
datafusion/sqllogictest/test_files/limit.slt Stripped byte-range suffixes from file paths
datafusion/sqllogictest/test_files/information_schema.slt Schema output default toggled to true
datafusion/sqllogictest/test_files/explain_tree.slt Regenerated explain output formatting
datafusion/core/tests/parquet/row_group_pruning.rs Adjusted expected pruned_by_stats counts
datafusion/core/src/execution/context/parquet.rs Doc comments and test assertions updated
datafusion/common/src/config.rs Config default and doc comment updated
Comments suppressed due to low confidence (2)

docs/source/library-user-guide/upgrading.md:30

  • [nitpick] The sentence is incomplete. Consider rephrasing to clarify what is 'previous' (e.g., "restores the previous default behavior of ListingTable").
This change also restores the default behavior of `ListingTable` to its previous. If you use it directly

docs/source/library-user-guide/upgrading.md:36

  • Verify that the builder method name matches the actual API. If it's named with_collect_statistics rather than with_collect_stat, update this example accordingly.
    .with_collect_stat(false)

onursatici pushed a commit to vortex-data/vortex that referenced this pull request Jun 19, 2025
This is required for apache/datafusion#16447 to
take effect (once its merged). By default datafusion will take care of
that if you use the SQL interface (and do all the session state setup).

Signed-off-by: Adam Gutglick <[email protected]>
onursatici pushed a commit to vortex-data/vortex that referenced this pull request Jun 19, 2025
This is required for apache/datafusion#16447 to
take effect (once its merged). By default datafusion will take care of
that if you use the SQL interface (and do all the session state setup).

Signed-off-by: Adam Gutglick <[email protected]>
Copy link
Collaborator

@blaginin blaginin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fuzzy failed again, presumably due to #16452 - rerun resolved it

@blaginin blaginin merged commit 2d7ae09 into apache:main Jun 19, 2025
52 of 53 checks passed
blaginin pushed a commit to blaginin/datafusion that referenced this pull request Jul 2, 2025
… `true` (apache#16447)

* fix sqllogicaltests
* Add upgrade note

(cherry picked from commit 2d7ae09)
alamb pushed a commit that referenced this pull request Jul 4, 2025
…tistics to true #16447  (#16659)

* Set the default value of `datafusion.execution.collect_statistics` to `true` (#16447)

* fix sqllogicaltests
* Add upgrade note

(cherry picked from commit 2d7ae09)

* Update row group pruning

---------

Co-authored-by: Adam Gutglick <[email protected]>
@alamb alamb added the performance Make DataFusion faster label Jul 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation performance Make DataFusion faster sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants