Conversation

@zvonand
Collaborator

@zvonand zvonand commented Oct 20, 2025

Successor of #1008.

Closes #963 and ClickHouse#84609. A PR to upstream will follow.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Allow reading Iceberg data from any location

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@zvonand zvonand added antalya port-antalya PRs to be ported to all new Antalya releases antalya-25.8 labels Oct 20, 2025
@github-actions

github-actions bot commented Oct 20, 2025

Workflow [PR], commit [2e942f9]

@zvonand zvonand force-pushed the bugfix/antalya-25.8/963 branch 3 times, most recently from 71a727b to 8dee385 Compare October 22, 2025 16:13
@zvonand zvonand force-pushed the bugfix/antalya-25.8/963 branch from a5703dd to dad344b Compare October 28, 2025 09:09
@zvonand zvonand force-pushed the bugfix/antalya-25.8/963 branch 6 times, most recently from 84e859c to 9a8abbd Compare November 14, 2025 10:48
@zvonand zvonand force-pushed the bugfix/antalya-25.8/963 branch 11 times, most recently from 95b9526 to 91a7e14 Compare November 21, 2025 09:13
@arthurpassos
Collaborator

@codex review

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 344 to 348
std::cerr << "[MASTER IcebergIterator::next] file_path=" << manifest_file_entry.file_path
          << " resolved_key=" << resolved_key
          << " storage_type=" << (storage_to_use ? toString(storage_to_use->getType()) : "null")
          << " storage_desc=" << (storage_to_use ? storage_to_use->getDescription() : "null")
          << " storage_same_as_base=" << (storage_to_use == object_storage ? "true" : "false")

P2 Badge Avoid uncontrolled stderr logging in iterator

IcebergIterator::next() now prints details of every manifest entry directly to std::cerr rather than through the logger. This runs for every data file yielded by the iterator, so a query that scans many files will spam the server’s stderr with unstructured output, bypassing log levels and incurring extra I/O cost. Please drop these ad‑hoc std::cerr writes or route them through the existing logger with configurable levels.
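
A minimal sketch of the second option suggested above (routing the message through a leveled logger). SimpleLogger, LogLevel and logResolvedEntry are hypothetical stand-ins written for this illustration; they are not ClickHouse's logging API, and the real code would presumably use the project's existing logger instead.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for a leveled logger; NOT ClickHouse's actual logging API.
enum class LogLevel { Error = 0, Info = 1, Debug = 2, Trace = 3 };

class SimpleLogger
{
public:
    SimpleLogger(LogLevel max_level_, std::ostream & out_) : max_level(max_level_), out(out_) {}

    // A message is emitted only if its level does not exceed the configured maximum.
    bool isEnabled(LogLevel level) const { return static_cast<int>(level) <= static_cast<int>(max_level); }

    void log(LogLevel level, const std::string & message)
    {
        if (isEnabled(level))
            out << message << '\n';
    }

private:
    LogLevel max_level;
    std::ostream & out;
};

// Logs one line per resolved file, but only when trace logging is enabled,
// so the message is not even built on the hot path otherwise.
void logResolvedEntry(SimpleLogger & logger, const std::string & file_path, const std::string & resolved_key)
{
    if (!logger.isEnabled(LogLevel::Trace))
        return;

    std::ostringstream message;
    message << "IcebergIterator::next file_path=" << file_path << " resolved_key=" << resolved_key;
    logger.log(LogLevel::Trace, message.str());
}
```

With the level set to Info, a scan over many files produces no output from this helper; raising the level to Trace restores the per-file diagnostics without a code change.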

Collaborator Author

This is a desperate debugging step: the failure does not reproduce locally, not even when running the test suite locally; it occurs only in CI.
These lines will, of course, be removed.

Collaborator

@arthurpassos arthurpassos left a comment

First pass. Will do one more later

    , underlying_format_read_schema_id(data_manifest_file_entry_.schema_id)
    , sequence_number(data_manifest_file_entry_.added_sequence_number)
{
    if (!position_deletes_objects.empty() && toupper(data_manifest_file_entry_.file_format) != "PARQUET")
Collaborator

I thought the PR was about reading data from any location. What does it have to do with "position deletes"?

Collaborator Author

Because we also use paths there: position delete files are referenced by path, so they go through the same resolution.

extern const SettingsString filesystem_cache_name;
extern const SettingsUInt64 filesystem_cache_boundary_alignment;
if (!isAbsolutePath(path))
    return {base_storage, normalizePathToStorageRoot(table_location, path)}; // Relative path definitely goes to base storage
Collaborator

If it is not absolute, it is relative. Why do you need to "normalize" it to root? Is it because you are now making everything relative to the root instead of the data directory?

Collaborator Author

Yes, we normalize it to the storage root here. Previously, paths were (and still can be) relative to the table location.

Collaborator Author

Added a comment.
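
To illustrate the normalization discussed in this thread, here is a rough sketch of turning a table-relative path into a key relative to the storage root. The helpers looksAbsolute and joinToStorageRoot are hypothetical names invented for this example; they are not the PR's isAbsolutePath or normalizePathToStorageRoot, whose actual behavior is not shown in the conversation. The sketch also assumes the table location is already expressed as a key prefix inside the base storage.

```cpp
#include <string>

// Assumption for this sketch: a path is "absolute" if it carries a URI scheme
// (e.g. "s3://...") or starts with '/'. The PR's real check may differ.
bool looksAbsolute(const std::string & path)
{
    return path.find("://") != std::string::npos || (!path.empty() && path.front() == '/');
}

// Joins a table-relative path onto the table's key prefix, producing a key
// relative to the storage root. Redundant slashes at the seam are dropped.
std::string joinToStorageRoot(const std::string & table_prefix, const std::string & relative_path)
{
    std::string prefix = table_prefix;
    while (!prefix.empty() && prefix.back() == '/')
        prefix.pop_back();
    while (!prefix.empty() && prefix.front() == '/')
        prefix.erase(prefix.begin());

    std::string tail = relative_path;
    while (!tail.empty() && tail.front() == '/')
        tail.erase(tail.begin());

    if (prefix.empty())
        return tail;
    return prefix + "/" + tail;
}
```

Under these assumptions, a manifest entry such as data/a.parquet in a table located at warehouse/db/t/ resolves to the key warehouse/db/t/data/a.parquet in the base storage, while a path carrying a scheme would instead go through the absolute-path branch.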

@zvonand zvonand force-pushed the bugfix/antalya-25.8/963 branch from 91a7e14 to 7f3d8dc Compare November 21, 2025 11:55
@zvonand zvonand force-pushed the bugfix/antalya-25.8/963 branch from 891a001 to 7059ca8 Compare November 21, 2025 22:10
@zvonand zvonand requested a review from arthurpassos November 21, 2025 22:11
@arthurpassos
Collaborator

@codex review

maybe I'm onto something

another attempt

polish
@zvonand zvonand force-pushed the bugfix/antalya-25.8/963 branch from 7059ca8 to 2e942f9 Compare November 22, 2025 00:30
@chatgpt-codex-connector

💡 Codex Review

if (target_scheme_normalized == "s3" || target_scheme_normalized == "https" || target_scheme_normalized == "http")
{
    std::string normalized_path = path;
    if (target_decomposed.scheme == "s3a" || target_decomposed.scheme == "s3n")
    {
        normalized_path = "s3://" + target_decomposed.authority + "/" + target_decomposed.key;
    }
    S3::URI s3_uri(normalized_path);
    std::string key_to_use = s3_uri.key;
    bool use_base_storage = false;
    if (base_storage->getType() == ObjectStorageType::S3)
    {
        if (auto s3_storage = std::dynamic_pointer_cast<S3ObjectStorage>(base_storage))
        {
            const std::string base_bucket = s3_storage->getObjectsNamespace();
            const std::string base_endpoint = s3_storage->getDescription();
            if (s3URIMatches(s3_uri, base_bucket, base_endpoint, target_scheme_normalized))
                use_base_storage = true;
        }
    }
    if (!use_base_storage && (base_scheme_normalized == "s3" || base_scheme_normalized == "https" || base_scheme_normalized == "http"))
    {
        std::string normalized_table_location = table_location;
        if (table_location_decomposed.scheme == "s3a" || table_location_decomposed.scheme == "s3n")
        {
            normalized_table_location = "s3://" + table_location_decomposed.authority + "/" + table_location_decomposed.key;
        }
        S3::URI base_s3_uri(normalized_table_location);
        if (s3URIMatches(s3_uri, base_s3_uri.bucket, base_s3_uri.endpoint, target_scheme_normalized))
            use_base_storage = true;
    }
    if (use_base_storage)
        return {base_storage, key_to_use};
    const std::string storage_cache_key = "s3://" + s3_uri.bucket + "@" + (s3_uri.endpoint.empty() ? "amazonaws.com" : s3_uri.endpoint);
    return getOrCreateStorageAndKey(
        storage_cache_key,
        key_to_use,
        "s3",
        secondary_storages,

P1 Badge Avoid forcing all HTTPS paths through S3 resolver

The new resolveObjectStorageForPath treats every https/http URI as S3 and unconditionally returns an S3 storage (lines 333‑379). For Iceberg data or manifests stored on a non‑S3 HTTPS endpoint (e.g., Azure https://account.blob.core.windows.net/... or GCS), this branch will now bypass the Azure/HDFS/file fallbacks and try to instantiate an S3 client with the wrong credentials, causing such tables to become unreadable. Previously those URLs would have been handled by the appropriate storage or the base storage, so this is a regression for any non‑S3 HTTPS paths in metadata.
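
One possible way to address this concern is to look at the host before deciding that an http/https URI is S3. The sketch below is only an illustration of that idea: BackendGuess, extractHost, endsWith and guessBackend are hypothetical names, the Azure host suffix check is a heuristic based on the example URL in the review comment, and treating all remaining HTTP(S) endpoints as S3-compatible is an assumption of this sketch, not something the PR states.

```cpp
#include <string>

// Hypothetical classification used only in this sketch.
enum class BackendGuess { S3Compatible, AzureBlob, Unknown };

// Extracts the host from "scheme://host[:port]/path"; returns an empty string if there is no scheme.
std::string extractHost(const std::string & uri)
{
    const auto scheme_end = uri.find("://");
    if (scheme_end == std::string::npos)
        return {};

    const auto host_begin = scheme_end + 3;
    const auto host_end = uri.find_first_of("/:", host_begin);
    return uri.substr(host_begin, host_end == std::string::npos ? std::string::npos : host_end - host_begin);
}

bool endsWith(const std::string & value, const std::string & suffix)
{
    return value.size() >= suffix.size()
        && value.compare(value.size() - suffix.size(), suffix.size(), suffix) == 0;
}

BackendGuess guessBackend(const std::string & uri)
{
    const auto scheme_end = uri.find("://");
    if (scheme_end == std::string::npos)
        return BackendGuess::Unknown;

    const std::string scheme = uri.substr(0, scheme_end);
    if (scheme == "s3" || scheme == "s3a" || scheme == "s3n")
        return BackendGuess::S3Compatible;

    if (scheme == "http" || scheme == "https")
    {
        // Heuristic: Azure Blob Storage hosts look like <account>.blob.core.windows.net.
        if (endsWith(extractHost(uri), ".blob.core.windows.net"))
            return BackendGuess::AzureBlob;
        // Assumption of this sketch: any other HTTP(S) endpoint speaks the S3 API.
        return BackendGuess::S3Compatible;
    }

    return BackendGuess::Unknown;
}
```

A check of this kind would let the resolver fall through to the Azure branch for the URL quoted in the review, instead of building an S3 client for it.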

Collaborator

@arthurpassos arthurpassos left a comment

In general it looks ok to me. We must push this to upstream as well, otherwise it will be a pain to rebase

: relative_path(std::move(command_or_path))
, metadata(std::move(metadata_))
, command(relative_path)
, absolute_path((absolute_path_.has_value() && !absolute_path_.value().empty()) ? absolute_path_ : std::nullopt)
Collaborator

If empty is being treated the same as std::nullopt, why not just std::string?

Collaborator Author

The initial idea was to make the object reading logic shorter and avoid a lot of if (!absolute_path.empty()) checks.
I agree, maybe it should be refactored. But since this is more or less a style issue, I would prefer to merge it as is for now, so as not to delay the release any further, and improve it when submitting to upstream.
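
For readers following this thread, a small sketch of the trade-off being discussed: collapsing "present but empty" into std::nullopt once, at construction time, so that readers can rely on the optional alone. normalizeOptionalPath and pathToRead are hypothetical helpers written for this example and do not appear in the PR.

```cpp
#include <optional>
#include <string>

// Collapses "present but empty" into "absent", mirroring the constructor
// expression quoted above, so later code needs a single has_value() check.
std::optional<std::string> normalizeOptionalPath(std::optional<std::string> path)
{
    if (path.has_value() && path->empty())
        return std::nullopt;
    return path;
}

// Hypothetical reader-side helper: prefer the absolute path when one is set,
// otherwise fall back to the relative path.
std::string pathToRead(const std::optional<std::string> & absolute_path, const std::string & relative_path)
{
    return absolute_path.value_or(relative_path);
}
```

The reviewer's alternative, a plain std::string where empty means absent, would carry the same information; the optional mainly makes the "no absolute path" case explicit in the type.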

@zvonand
Collaborator Author

zvonand commented Nov 22, 2025

We must push this to upstream as well

This is still my plan. I just avoided submitting a PR that is not ready -- there is then a bigger chance no one will ever notice it when it is ready.

@zvonand zvonand changed the title [WiP] Allow to read Iceberg data from any location Allow to read Iceberg data from any location Nov 22, 2025
@zvonand zvonand merged commit 3841418 into antalya-25.8 Nov 22, 2025
292 of 294 checks passed
@zvonand zvonand mentioned this pull request Nov 23, 2025
25 tasks
zvonand added a commit that referenced this pull request Nov 24, 2025
Labels

antalya antalya-25.8 port-antalya PRs to be ported to all new Antalya releases

3 participants