Read optimization using Iceberg metadata #1019
Conversation
TODO:
Optional TODO:
@ianton-ru what's the speedup you are seeing with this optimization? Does it match the query response of a simple SELECT count() as described in #1000?
Yes, partially. The cases with column renames and with a constant NULL are not covered yet; ClickHouse#85829 is required for that. But for constant non-NULL values in non-renamed columns, ClickHouse should take the count and the values of the constant columns from Iceberg metadata.
Test 03413_experimental_settings_cannot_be_enabled_by_default failed...
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
WIP: Read optimization using Iceberg metadata
Documentation entry for user-facing changes
Solves #1000
Iceberg metadata provides min and max values for the columns in each data file. With this information ClickHouse can skip reading a column from a specific file when min = max, and instead fill it as a constant with the value taken from the Iceberg metadata.
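As a rough illustration of the per-file check, the sketch below (hypothetical names and types, not the PR's actual code) shows how a column can be recognized as constant from Iceberg per-file statistics: the lower and upper bounds are equal and the file contains no NULLs for that column.

```cpp
// Hypothetical sketch: decide whether a column can be treated as a constant
// for a given Iceberg data file, based on per-file column statistics
// (lower/upper bounds and null counts) taken from the manifest.
#include <cstddef>
#include <optional>
#include <string>

struct IcebergColumnStats
{
    std::optional<std::string> lower_bound;  // serialized min value for the column in this file
    std::optional<std::string> upper_bound;  // serialized max value for the column in this file
    size_t null_count = 0;                   // number of NULLs for the column in this file
};

/// A column is constant within a file when min == max and there are no NULLs,
/// so its value can be taken from metadata instead of being read from the file.
bool isConstantInFile(const IcebergColumnStats & stats)
{
    return stats.lower_bound.has_value()
        && stats.upper_bound.has_value()
        && *stats.lower_bound == *stats.upper_bound
        && stats.null_count == 0;
}
```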
Main change:
Metadata for each file is passed to StorageObjectStorageSource::createReader, which checks whether some columns are constant in the current file (min value = max value and no NULLs).
These columns are removed from the requested_columns list.
Later, in StorageObjectStorageSource::generate, they are inserted back as constants with the value from the metadata (see the sketch below).
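A minimal sketch of this two-phase flow, using illustrative types rather than the actual StorageObjectStorageSource interfaces: constant columns are pruned from the read request before the reader is created, and re-attached to every produced block with the value taken from metadata.

```cpp
// Hypothetical sketch of the two-phase flow (names are illustrative, not the PR's signatures).
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct ConstantColumn
{
    std::string name;
    std::string value;  // value recovered from the file's min/max statistics
};

struct Block
{
    size_t num_rows = 0;
    std::map<std::string, std::vector<std::string>> columns;
};

/// Phase 1 (around createReader): split the requested columns into those that must
/// actually be read from the file and those that are constant per Iceberg metadata.
std::vector<ConstantColumn> pruneConstantColumns(
    std::vector<std::string> & requested_columns,
    const std::map<std::string, std::string> & constants_from_metadata)
{
    std::vector<ConstantColumn> pruned;
    std::vector<std::string> remaining;
    for (const auto & name : requested_columns)
    {
        auto it = constants_from_metadata.find(name);
        if (it != constants_from_metadata.end())
            pruned.push_back({name, it->second});
        else
            remaining.push_back(name);
    }
    requested_columns = std::move(remaining);
    return pruned;
}

/// Phase 2 (around generate): materialize the pruned columns as constants and
/// attach them back to the block produced by the reader.
void restoreConstantColumns(Block & block, const std::vector<ConstantColumn> & pruned)
{
    for (const auto & col : pruned)
        block.columns[col.name] = std::vector<std::string>(block.num_rows, col.value);
}
```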
Other important change:
This metadata is sent together with the file name to other hosts during cluster requests. For this, the class CommandInTaskResponse from PR #866 is reused. Serialization/deserialization is ugly for now, but it works.
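A rough sketch of what the per-file payload could look like (illustrative only; the actual wire format is whatever CommandInTaskResponse carries): the file name packed together with its constant-column values, serialized to a string and parsed back on the receiving host.

```cpp
// Hypothetical sketch: ship a file name plus its constant-column metadata
// between hosts as a simple newline-delimited string. Not the real format.
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

struct FileTask
{
    std::string file_name;
    std::map<std::string, std::string> constant_columns;  // column name -> constant value
};

std::string serializeFileTask(const FileTask & task)
{
    std::ostringstream out;
    out << task.file_name << '\n' << task.constant_columns.size() << '\n';
    for (const auto & [name, value] : task.constant_columns)
        out << name << '\n' << value << '\n';
    return out.str();
}

FileTask deserializeFileTask(const std::string & data)
{
    std::istringstream in(data);
    FileTask task;
    std::getline(in, task.file_name);
    size_t count = 0;
    in >> count;
    in.ignore();  // skip the newline after the count
    for (size_t i = 0; i < count; ++i)
    {
        std::string name, value;
        std::getline(in, name);
        std::getline(in, value);
        task.constant_columns[name] = value;
    }
    return task;
}
```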
Tests are coming soon.
Exclude tests: