-
Notifications
You must be signed in to change notification settings - Fork 11
25.8 Antalya ports: Read optimization using Iceberg metadata #1069
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
25.8 Antalya ports: Read optimization using Iceberg metadata #1069
Conversation
| /// Delta lake related object metadata. | ||
| std::optional<DataLakeObjectMetadata> data_lake_metadata; | ||
| /// Information about columns | ||
| std::optional<DataFileMetaInfoPtr> file_meta_info; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why optional shared_ptr? What would set null-value mean in this case and how is it different from unset value?
| std::map<size_t, ConstColumnWithValue> constant_columns_with_values; | ||
| std::unordered_set<String> constant_columns; | ||
|
|
||
| NamesAndTypesList requested_columns_copy = read_from_format_info.requested_columns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we avoid making a copy here, and use original columns for building requested_columns_list ? I'm Ok with moving that copying down, right before we actually modify it
| requested_columns_list[column.getNameInStorage()] = std::make_pair(column_index++, column); | ||
| } | ||
|
|
||
| if (context_->getSettingsRef()[Setting::allow_experimental_iceberg_read_optimization]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to have a general description of this optimization, as a comment to the whole section.
| if (context_->getSettingsRef()[Setting::allow_experimental_iceberg_read_optimization]) | ||
| { | ||
| auto file_meta_data = object_info->getFileMetaInfo(); | ||
| if (file_meta_data.has_value()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, this is an excellent example: file_meta_data.value() can be null here. I know that there are precautions now, but this is fragile, IMO simple std::shared_ptr would be more robust here.
| if (file_meta_data.has_value()) | |
| if (file_meta_data) |
and
file_meta_data.value()->columns_info
would be replaced by just
file_meta_data->columns_info
Antalya-specific exception: allow_experimental_iceberg_read_optimization
6c8fbaf to
6777073
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code just shot my leg :)
| public: | ||
| DataFileMetaInfo() = default; | ||
|
|
||
| // subset of Iceberg::ColumnInfo now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this? Why cannot we use Iceberg::ColumnInfo? This is ambiguity with no obvious reason. We have local ColumnInfo (with almost the same subset of fields), and forward-declared another ColumnInfo, and then use them all together. Why?
…wipes_my_snot 25.8 Antalya: Fixes for #1069
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Read optimization using Iceberg metadata
Port of #1019 with many changes
Documentation entry for user-facing changes
Diff with #1019
Exclude tests: