-
Notifications
You must be signed in to change notification settings - Fork 1.8k
fix: pre-warm listing file statistics cache during listing table creation #18971
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
|
||
| // Pre-warm statistics cache if collect_statistics is enabled | ||
| if session_state.config().collect_statistics() { | ||
| let _ = table.list_files_for_scan(state, &[], None).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Is it okay to reuse this method to pre-warm as we do couple more things post collecting the statistics ?
- Also is no limit fine ? as list_file_statistics_cache doesn't seem to have any size limit unlike metadata cache ?
cc: @alamb
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. Also is no limit fine ?
I think it should have a limit.
And maybe it should be done in the background.
If there are many files this may slow down things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @martin-g for reviewing.
Agree on having limit.
But doing it in background will result in inconsistent behavior ?
Should DataFusion collect statistics when first creating a table. Has no effect after the table is created. Applies to the default ListingTableProvider in DataFusion. Defaults to true.
Will a user not expect the statistics to be collected when creating the table and expect any query post that to be optimized based on the above documentation ?
|
|
||
| // Pre-warm statistics cache if collect_statistics is enabled | ||
| if session_state.config().collect_statistics() { | ||
| let _ = table.list_files_for_scan(state, &[], None).await?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. Also is no limit fine ?
I think it should have a limit.
And maybe it should be done in the background.
If there are many files this may slow down things.
…tion Signed-off-by: bharath-techie <[email protected]>
Signed-off-by: bharath-techie <[email protected]>
8b8a49e to
50eaaf5
Compare
|
Can you help decide following to move the PR forward ?
|
|
My concern is that if there are many files then the pre-warming of the cache may slow down the main operation. I am not sure how many is too many though. But I guess you could leave it as is for now and optimize it only after there is an evidence that it really causes slow downs for someone. |
|
Thank you @bharath-techie -- I am checking this one out |
|
In our internal, we do stats collection in a separate task. Background task is a good idea, but based on the current situation, the PR looks like the fastest way to have stats while creating table. |
Thanks @martin-g -- I do agree that is a concern. However, I think the idea is that any subsequent query is going to collect the statistics anyways, so this PR simply moves which statement triggers the collection |
|
I took the liberty of pushing some commits that fixed the CI failures and merging up from main |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @bharath-techie -- this makes sense to me 🙏
|
Thank you also @xudong963 and @martin-g for the reviews
@xudong963 is this something we should write down / document / implement? Maybe not the actual behavior of pre-fetching statistics in the background but some method or something to make it easier for others to call? |
|
Thanks @alamb for the review and changes :) Also Is that by design , do you think we need to add similar limit for this cache too ? |
Yeah, I would open an issue for this later |
I think it is an oversight and we should add a similar limit. @nuno-faria actually asked the same question here: I will file a subsequent ticket to track that as well |
|
Filed a ticket to track limiting the statistics cache: |
|
Thanks again @bharath-techie and @martin-g and @xudong963 |
Pre-warm listing file statistics cache during create listing table flow as suggested in #18952.
Reused
list_files_for_scanto pre-warm.Which issue does this PR close?
Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes unit tested.
Are there any user-facing changes?