@ianton-ru ianton-ru commented Sep 17, 2025

Changelog category (leave one):

  • Experimental Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

WIP: Read optimization using Iceberg metadata

Documentation entry for user-facing changes

Solved #1000
Iceberg returns the min and max values of each column in every data file. With this information ClickHouse can skip reading a column from a specific file when min = max, and instead fill it in as a constant with the value from the Iceberg metadata.
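A minimal sketch of the constant-column check, to make the condition concrete. `ColumnStats` and `constantValue` are hypothetical names for illustration, not ClickHouse or Iceberg types, and values are kept as plain strings for simplicity.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical per-column statistics for one data file, loosely modelled on
// what Iceberg metadata records. Not a ClickHouse type.
struct ColumnStats
{
    std::optional<std::string> min_value;  // lower bound, if recorded
    std::optional<std::string> max_value;  // upper bound, if recorded
    std::optional<int64_t> null_count;     // number of NULLs, if recorded
};

// Returns the constant value when the statistics show that every row holds
// the same non-NULL value; otherwise std::nullopt, meaning the column must
// still be read from the file.
std::optional<std::string> constantValue(const ColumnStats & stats)
{
    if (!stats.min_value || !stats.max_value || !stats.null_count)
        return std::nullopt;  // missing statistics: nothing can be concluded
    if (*stats.null_count != 0)
        return std::nullopt;  // NULLs present: not a single constant value
    if (*stats.min_value != *stats.max_value)
        return std::nullopt;  // more than one distinct value
    return stats.min_value;
}
```

Treating missing statistics as "not constant" keeps the check conservative: a column is skipped only when the metadata shows it holds a single value.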

Main change:
The metadata for each file is passed to StorageObjectStorageSource::createReader, which then checks whether any columns are constant in the current file (min value = max value and no nulls).
These columns are removed from the requested_columns list.
Later, in StorageObjectStorageSource::generate, they are inserted back with the value from the metadata.
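The remove-then-reinsert flow can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the code from this PR: `ReadPlan`, `planRead` and `restoreConstants` are hypothetical names, and columns are modelled as vectors of strings rather than ClickHouse column objects.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Column = std::vector<std::string>;
using Chunk = std::vector<Column>;

// Hypothetical result of the planning step done before the file is opened.
struct ReadPlan
{
    std::vector<std::string> columns_to_read;               // still read from the file
    std::vector<std::pair<size_t, std::string>> constants;  // (original position, value)
};

// Split the requested columns into those that must be read and those whose
// value is already known from the metadata.
ReadPlan planRead(const std::vector<std::string> & requested_columns,
                  const std::map<std::string, std::string> & constant_columns)
{
    ReadPlan plan;
    for (size_t i = 0; i < requested_columns.size(); ++i)
    {
        auto it = constant_columns.find(requested_columns[i]);
        if (it == constant_columns.end())
            plan.columns_to_read.push_back(requested_columns[i]);
        else
            plan.constants.emplace_back(i, it->second);
    }
    return plan;
}

// Put the constant columns back at their original positions. The constants
// are stored in ascending position order, so inserting them one by one
// restores the layout of the original request.
Chunk restoreConstants(const ReadPlan & plan, Chunk chunk, size_t num_rows)
{
    for (const auto & [pos, value] : plan.constants)
        chunk.insert(chunk.begin() + static_cast<std::ptrdiff_t>(pos), Column(num_rows, value));
    return chunk;
}
```

Recording the original position of each removed column is what lets the output keep the column order that the query asked for.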

Other important change:
This metadata is sent along with the file name to other hosts during cluster requests. The CommandInTaskResponse class from PR #866 is reused for this. Serialization/deserialization is ugly for now, but it works.
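To illustrate what shipping per-file metadata next to the file name involves, here is a round-trip sketch with a length-prefixed text encoding. It is not the CommandInTaskResponse format, which is not shown here; `FileTask`, `serialize` and `deserialize` are hypothetical names used only for this example.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical payload sent to a remote host: which file to read, plus the
// columns already known to be constant in that file.
struct FileTask
{
    std::string path;
    std::map<std::string, std::string> constant_columns;
};

// Append one field as "<length>:<bytes>", so the bytes may contain any character.
void putField(std::string & out, const std::string & s)
{
    out += std::to_string(s.size());
    out += ':';
    out += s;
}

// Read one length-prefixed field starting at pos and advance pos past it.
std::string getField(const std::string & in, size_t & pos)
{
    const size_t colon = in.find(':', pos);
    if (colon == std::string::npos)
        throw std::runtime_error("malformed field: no length prefix");
    const size_t len = std::stoul(in.substr(pos, colon - pos));
    if (colon + 1 + len > in.size())
        throw std::runtime_error("malformed field: truncated payload");
    std::string s = in.substr(colon + 1, len);
    pos = colon + 1 + len;
    return s;
}

std::string serialize(const FileTask & task)
{
    std::string out;
    putField(out, task.path);
    putField(out, std::to_string(task.constant_columns.size()));
    for (const auto & [name, value] : task.constant_columns)
    {
        putField(out, name);
        putField(out, value);
    }
    return out;
}

FileTask deserialize(const std::string & in)
{
    size_t pos = 0;
    FileTask task;
    task.path = getField(in, pos);
    const size_t count = std::stoul(getField(in, pos));
    for (size_t i = 0; i < count; ++i)
    {
        std::string name = getField(in, pos);
        std::string value = getField(in, pos);
        task.constant_columns.emplace(std::move(name), std::move(value));
    }
    return task;
}
```

Length prefixes avoid escaping, so paths and values containing separators such as `:` survive the round trip unchanged.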

Tests are coming soon.

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

github-actions bot commented Sep 17, 2025

Workflow [PR], commit [fd23354]

ianton-ru commented Sep 17, 2025

TODO:

  • tests

Optional TODO:

  • refactor to keep structures/classes (at least DataFileInfo) in separate header files to reduce file dependencies

@ianton-ru ianton-ru changed the title WIP: Read optimization using Iceberg metadata Read optimization using Iceberg metadata Sep 18, 2025
@ianton-ru ianton-ru marked this pull request as ready for review September 18, 2025 16:26
hodgesrm (Member) commented

@ianton-ru what's the speedup you are seeing with this optimization? Does it match the query response of a simple SELECT count() as described in #1000?

ianton-ru (Author) commented

> @ianton-ru what's the speedup you are seeing with this optimization? Does it match the query response of a simple SELECT count() as described in #1000?

Yes, partially. Two cases are not covered yet: columns that have been renamed, and columns that are constant NULL. For these, ClickHouse#85829 is required. But for constant non-NULL values in non-renamed columns, ClickHouse should take the count and the values of the constant columns from the Iceberg metadata.
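As a rough illustration of taking the count from metadata: Iceberg manifests record a row count per data file, so a plain count can be summed without opening any file. The sketch assumes there is no WHERE clause and no row-level delete files; `DataFileEntry` and `countFromMetadata` are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical subset of the per-file information kept in Iceberg metadata.
struct DataFileEntry
{
    int64_t record_count;  // number of rows recorded for this data file
};

// Total row count across all data files, computed from metadata only.
int64_t countFromMetadata(const std::vector<DataFileEntry> & files)
{
    int64_t total = 0;
    for (const auto & file : files)
        total += file.record_count;
    return total;
}
```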

ianton-ru (Author) commented

Test 03413_experimental_settings_cannot_be_enabled_by_default failed...

@Enmk Enmk merged commit a1c4e5e into antalya-25.6.5 Oct 1, 2025
111 of 136 checks passed
4 participants