@ianton-ru

Changelog category (leave one):

  • Experimental Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Read optimization using Iceberg metadata
Port of #1019 with many changes

Documentation entry for user-facing changes

Diff with #1019

  • support of NULLs
  • support of renamed columns
  • compatibility with ALTER TABLE DELETE
  • changed protocol to send column info on swarm nodes

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

@github-actions

github-actions bot commented Oct 8, 2025

Workflow [PR], commit [83012cb]

/// Delta lake related object metadata.
std::optional<DataLakeObjectMetadata> data_lake_metadata;
/// Information about columns
std::optional<DataFileMetaInfoPtr> file_meta_info;
Member

Why an optional shared_ptr? What would a set-but-null value mean in this case, and how would it differ from an unset value?
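To make the question concrete, here is a minimal sketch of the states each type can represent. `FileMetaInfo`, `describeOptional`, and `describePlain` are hypothetical names used only for illustration; they are not the types or functions from this PR.

```cpp
#include <memory>
#include <optional>
#include <string>

// Hypothetical stand-in for the real metadata type; not the actual ClickHouse class.
struct FileMetaInfo
{
    int column_count = 0;
};
using FileMetaInfoPtr = std::shared_ptr<FileMetaInfo>;

// An optional shared_ptr has three distinguishable states,
// so every caller has to decide what "set but null" means.
std::string describeOptional(const std::optional<FileMetaInfoPtr> & value)
{
    if (!value.has_value())
        return "unset";
    if (*value == nullptr)
        return "set but null";
    return "set and non-null";
}

// A plain shared_ptr has only two states, so a single check is enough.
std::string describePlain(const FileMetaInfoPtr & value)
{
    return value ? "present" : "absent";
}
```

Unless the "unset" and "set but null" states carry different meanings, the extra state only adds a case that callers can forget to handle.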

std::map<size_t, ConstColumnWithValue> constant_columns_with_values;
std::unordered_set<String> constant_columns;

NamesAndTypesList requested_columns_copy = read_from_format_info.requested_columns;
Member

@Enmk Enmk Oct 8, 2025

Can we avoid making a copy here and use the original columns for building requested_columns_list? I'm OK with moving the copy down, right before we actually modify it.
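A rough sketch of what deferring the copy could look like, under simplified assumptions: `Column`, `buildIndex`, and `copyWithoutConstants` are hypothetical stand-ins (the real code works with NamesAndTypesList), and dropping constant columns is only an assumed example of the later modification.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical simplified column type; the real code uses NamesAndTypesList.
struct Column
{
    std::string name;
    std::string type;
};

using ColumnIndex = std::map<std::string, std::pair<std::size_t, Column>>;

// Build the name -> (position, column) index straight from the original list.
// Nothing is modified here, so a const reference is enough.
ColumnIndex buildIndex(const std::vector<Column> & requested_columns)
{
    ColumnIndex index;
    std::size_t position = 0;
    for (const auto & column : requested_columns)
        index[column.name] = std::make_pair(position++, column);
    return index;
}

// Make the copy only where the list is actually changed.
// Assumed modification: leave out columns that are known to be constant.
std::vector<Column> copyWithoutConstants(
    const std::vector<Column> & requested_columns,
    const std::unordered_set<std::string> & constant_columns)
{
    std::vector<Column> remaining;
    remaining.reserve(requested_columns.size());
    for (const auto & column : requested_columns)
        if (constant_columns.count(column.name) == 0)
            remaining.push_back(column);
    return remaining;
}
```

With this split, the original list is never touched, and the copy is paid for only on the path that needs a modified list.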

requested_columns_list[column.getNameInStorage()] = std::make_pair(column_index++, column);
}

if (context_->getSettingsRef()[Setting::allow_experimental_iceberg_read_optimization])
Member

@Enmk Enmk Oct 8, 2025

It would be nice to have a general description of this optimization, as a comment to the whole section.

if (context_->getSettingsRef()[Setting::allow_experimental_iceberg_read_optimization])
{
auto file_meta_data = object_info->getFileMetaInfo();
if (file_meta_data.has_value())
Member

Again, this is an excellent example: file_meta_data.value() can be null here even when has_value() returns true. I know there are precautions against that now, but this is fragile; IMO a simple std::shared_ptr would be more robust here.

Suggested change
- if (file_meta_data.has_value())
+ if (file_meta_data)

and

file_meta_data.value()->columns_info

would be replaced by just

file_meta_data->columns_info
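For illustration, the suggested shape could look roughly like the sketch below. `ObjectInfo`, the payload of `columns_info`, and `knownColumnCount` are assumptions made up for this example; only the names `getFileMetaInfo`, `file_meta_data`, and `columns_info` come from the snippet under review, and their real definitions may differ.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <unordered_map>

// Hypothetical stand-in; the real DataFileMetaInfo has a different layout.
struct DataFileMetaInfo
{
    // Assumed payload for the example: column name -> some per-column number.
    std::unordered_map<std::string, std::size_t> columns_info;
};
using DataFileMetaInfoPtr = std::shared_ptr<DataFileMetaInfo>;

// Hypothetical owner of the metadata pointer.
struct ObjectInfo
{
    DataFileMetaInfoPtr file_meta_info; // null means "no metadata available"

    DataFileMetaInfoPtr getFileMetaInfo() const { return file_meta_info; }
};

// With a plain shared_ptr, the same check that answers "is it set?"
// also guarantees that the dereference below is safe.
std::size_t knownColumnCount(const ObjectInfo & object_info)
{
    auto file_meta_data = object_info.getFileMetaInfo();
    if (file_meta_data)
        return file_meta_data->columns_info.size();
    return 0;
}
```

The point of the design is that there is no second check to forget: a null pointer and "no metadata" are the same state.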

@Enmk Enmk force-pushed the frontport/antalya-25.8/optimize_count_in_datalake branch from 6c8fbaf to 6777073 Compare October 10, 2025 13:59
@Enmk Enmk merged commit 7c5fd55 into antalya-25.8 Oct 12, 2025
126 of 137 checks passed
@ianton-ru ianton-ru mentioned this pull request Oct 14, 2025
25 tasks
Collaborator

@zvonand zvonand left a comment

This code just shot my leg :)

public:
DataFileMetaInfo() = default;

// subset of Iceberg::ColumnInfo now
Collaborator

@zvonand zvonand Oct 14, 2025

Why do we need this? Why can't we use Iceberg::ColumnInfo directly? This creates ambiguity for no obvious reason: we have a local ColumnInfo (with almost the same subset of fields) and another, forward-declared ColumnInfo, and then use both together. Why?

zvonand added a commit that referenced this pull request Oct 14, 2025