
@bharath-techie
Contributor

@bharath-techie bharath-techie commented Nov 27, 2025

Pre-warm the listing file statistics cache when a listing table is created, as suggested in #18952.
This reuses list_files_for_scan to do the pre-warming.

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Yes, unit tested.

Are there any user-facing changes?

@github-actions github-actions bot added the core (Core DataFusion crate) label Nov 27, 2025

// Pre-warm statistics cache if collect_statistics is enabled
if session_state.config().collect_statistics() {
    let _ = table.list_files_for_scan(state, &[], None).await?;
Contributor Author


  1. Is it okay to reuse this method to pre-warm, given that it does a couple more things after collecting the statistics?
  2. Also, is no limit fine? list_file_statistics_cache doesn't seem to have any size limit, unlike the metadata cache.

cc: @alamb

Member


> 2. Also, is no limit fine?

I think it should have a limit.
And maybe it should be done in the background.
If there are many files this may slow down things.

Contributor Author

@bharath-techie bharath-techie Nov 28, 2025


Thanks @martin-g for reviewing.

Agreed on having a limit.

But wouldn't doing it in the background result in inconsistent behavior? The documentation says:

> Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default ListingTableProvider in DataFusion. Defaults to true.

Given that documentation, wouldn't a user expect the statistics to be collected when the table is created, and any query after that to be optimized accordingly?



@bharath-techie
Contributor Author

Hi @martin-g @alamb,

Can you help decide the following to move the PR forward?

  • The limit is based on the number of rows, so we either keep it as None or make it configurable. If we go the configuration route, I'm leaning towards keeping None as the default, since the number of rows will vary with the data: some users might have few columns and lots of rows, and some vice versa.

  • Should this be done in the background? I feel that doing it in the sync path gives deterministic behavior. Otherwise we need to update the documentation to reflect the change.

@martin-g
Member

My concern is that if there are many files, pre-warming the cache may slow down the main operation. I am not sure how many is too many, though.

But I guess you could leave it as is for now and optimize it only once there is evidence that it really causes slowdowns for someone.

@alamb
Contributor

alamb commented Nov 30, 2025

Thank you @bharath-techie -- I am checking this one out

@xudong963
Member

Internally, we do stats collection in a separate task.

A background task is a good idea, but given the current situation, this PR looks like the fastest way to have stats available when creating a table.

@alamb
Contributor

alamb commented Dec 1, 2025

> My concern is that if there are many files, pre-warming the cache may slow down the main operation. I am not sure how many is too many, though.
>
> But I guess you could leave it as is for now and optimize it only once there is evidence that it really causes slowdowns for someone.

Thanks @martin-g -- I do agree that is a concern. However, I think the idea is that any subsequent query is going to collect the statistics anyway, so this PR simply changes which statement triggers the collection.
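
To illustrate the point that pre-warming only changes which statement pays for statistics collection, here is a minimal sketch using only the standard library. It is not DataFusion's code: `Table`, `FileStats`, the `collections` counter, and the synchronous `list_files_for_scan` below are hypothetical stand-ins, and the "statistics" are faked from the path length.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for per-file statistics (not DataFusion's type).
#[derive(Clone, Debug, PartialEq)]
struct FileStats {
    num_rows: usize,
}

/// Hypothetical table that caches statistics per file path.
struct Table {
    files: Vec<String>,
    stats_cache: HashMap<String, FileStats>,
    /// Counts how many times the (pretend) expensive collection ran.
    collections: usize,
}

impl Table {
    /// Creates the table and, if `collect_statistics` is set, pre-warms the
    /// cache by reusing the same method a later scan would call.
    fn create(files: Vec<String>, collect_statistics: bool) -> Self {
        let mut table = Table {
            files,
            stats_cache: HashMap::new(),
            collections: 0,
        };
        if collect_statistics {
            let _ = table.list_files_for_scan();
        }
        table
    }

    /// Returns statistics for every file, collecting only those that are
    /// missing from the cache.
    fn list_files_for_scan(&mut self) -> Vec<FileStats> {
        let mut out = Vec::with_capacity(self.files.len());
        for path in &self.files {
            let stats = match self.stats_cache.get(path).cloned() {
                Some(cached) => cached,
                None => {
                    // Assumption: stands in for reading file metadata.
                    self.collections += 1;
                    let fresh = FileStats { num_rows: path.len() };
                    self.stats_cache.insert(path.clone(), fresh.clone());
                    fresh
                }
            };
            out.push(stats);
        }
        out
    }
}
```

With pre-warming enabled, the work happens inside `create` and the first scan finds every entry cached. With it disabled, the same amount of work happens during the first scan instead. The total cost is identical in both cases.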

@alamb
Contributor

alamb commented Dec 1, 2025

I took the liberty of pushing some commits that fixed the CI failures and merging up from main

Contributor

@alamb alamb left a comment


Thank you @bharath-techie -- this makes sense to me 🙏

@alamb
Contributor

alamb commented Dec 1, 2025

Thank you also @xudong963 and @martin-g for the reviews

> Internally, we do stats collection in a separate task.

@xudong963 is this something we should write down / document / implement? Maybe not the actual behavior of pre-fetching statistics in the background but some method or something to make it easier for others to call?

@bharath-techie
Contributor Author

Thanks @alamb for the review and changes :)

Also, the listing file statistics cache doesn't seem to have any memory limit, unlike the metadata cache, for example.

Is that by design, or do you think we need to add a similar limit for this cache too?

@xudong963
Member

> is this something we should write down / document / implement? Maybe not the actual behavior of pre-fetching statistics in the background but some method or something to make it easier for others to call?

Yeah, I would open an issue for this later

@alamb
Contributor

alamb commented Dec 2, 2025

> Thanks @alamb for the review and changes :)
>
> Also, the listing file statistics cache doesn't seem to have any memory limit, unlike the metadata cache, for example.
>
> Is that by design, or do you think we need to add a similar limit for this cache too?

I think it is an oversight and we should add a similar limit. @nuno-faria actually asked the same question here:

I will file a subsequent ticket to track that as well
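
As a rough illustration of what "a similar limit" could mean, here is a minimal sketch of a byte-budgeted cache with least-recently-used eviction, using only the standard library. Everything in it is an assumption made for illustration: `BoundedStatsCache`, `CachedStats`, and the caller-supplied `size_bytes` are hypothetical names, and this is not how DataFusion's metadata cache or statistics cache is implemented.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical cached entry; `size_bytes` is supplied by the caller.
#[derive(Clone, Debug, PartialEq)]
struct CachedStats {
    num_rows: usize,
    size_bytes: usize,
}

/// Hypothetical statistics cache bounded by a total byte budget.
struct BoundedStatsCache {
    limit_bytes: usize,
    used_bytes: usize,
    entries: HashMap<String, CachedStats>,
    /// Front = least recently used, back = most recently used.
    recency: VecDeque<String>,
}

impl BoundedStatsCache {
    fn new(limit_bytes: usize) -> Self {
        BoundedStatsCache {
            limit_bytes,
            used_bytes: 0,
            entries: HashMap::new(),
            recency: VecDeque::new(),
        }
    }

    /// Looks up an entry and marks it as most recently used.
    fn get(&mut self, path: &str) -> Option<CachedStats> {
        let stats = self.entries.get(path).cloned()?;
        self.recency.retain(|p| p.as_str() != path);
        self.recency.push_back(path.to_string());
        Some(stats)
    }

    /// Inserts an entry, evicting least recently used entries until it fits.
    /// Returns false if the entry alone exceeds the budget.
    fn put(&mut self, path: &str, stats: CachedStats) -> bool {
        if stats.size_bytes > self.limit_bytes {
            return false;
        }
        // Replace any existing entry for the same path.
        if let Some(old) = self.entries.remove(path) {
            self.used_bytes -= old.size_bytes;
            self.recency.retain(|p| p.as_str() != path);
        }
        while self.used_bytes + stats.size_bytes > self.limit_bytes {
            match self.recency.pop_front() {
                Some(oldest) => {
                    if let Some(evicted) = self.entries.remove(&oldest) {
                        self.used_bytes -= evicted.size_bytes;
                    }
                }
                None => break,
            }
        }
        self.used_bytes += stats.size_bytes;
        self.entries.insert(path.to_string(), stats);
        self.recency.push_back(path.to_string());
        true
    }
}
```

The `VecDeque` scan makes each access linear in the number of entries, which keeps the sketch short; a production cache would likely use a dedicated LRU structure instead.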

@alamb
Contributor

alamb commented Dec 2, 2025

Filed a ticket to track limiting the statistics cache:

Merged via the queue into apache:main with commit 4d86ae0 Dec 3, 2025
27 checks passed
@alamb
Contributor

alamb commented Dec 3, 2025

Thanks again @bharath-techie and @martin-g and @xudong963


Development

Successfully merging this pull request may close these issues.

Bug: statistics not collected automatically upon creation of ListingTable
