-
Notifications
You must be signed in to change notification settings - Fork 11
Allow to read Iceberg data from any location #1092
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
71a727b to
8dee385
Compare
a5703dd to
dad344b
Compare
84e859c to
9a8abbd
Compare
95b9526 to
91a7e14
Compare
|
@codex review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| std::cerr << "[MASTER IcebergIterator::next] file_path=" << manifest_file_entry.file_path | ||
| << " resolved_key=" << resolved_key | ||
| << " storage_type=" << (storage_to_use ? toString(storage_to_use->getType()) : "null") | ||
| << " storage_desc=" << (storage_to_use ? storage_to_use->getDescription() : "null") | ||
| << " storage_same_as_base=" << (storage_to_use == object_storage ? "true" : "false") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Avoid uncontrolled stderr logging in iterator
IcebergIterator::next() now prints details of every manifest entry directly to std::cerr rather than through the logger. This runs for every data file yielded by the iterator, so a query that scans many files will spam the server’s stderr with unstructured output, bypassing log levels and incurring extra I/O cost. Please drop these ad‑hoc std::cerr writes or route them through the existing logger with configurable levels.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a desperate step to debug code: the fail does not reproduce locally or when running test suite locally, only in CI
These lines, of course, will be removed
arthurpassos
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass. Will do one more later
src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergDataObjectInfo.cpp
Outdated
Show resolved
Hide resolved
| , underlying_format_read_schema_id(data_manifest_file_entry_.schema_id) | ||
| , sequence_number(data_manifest_file_entry_.added_sequence_number) | ||
| { | ||
| if (!position_deletes_objects.empty() && toupper(data_manifest_file_entry_.file_format) != "PARQUET") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought the PR was about reading data from any location. What does it have to do with "position deletes"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because we also use paths there
src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergIterator.cpp
Outdated
Show resolved
Hide resolved
| extern const SettingsString filesystem_cache_name; | ||
| extern const SettingsUInt64 filesystem_cache_boundary_alignment; | ||
| if (!isAbsolutePath(path)) | ||
| return {base_storage, normalizePathToStorageRoot(table_location, path)}; // Relative path definitely goes to base storage |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is not absolute, it is relative. Why do you need to "normalize" it to root? Is it because you are now making everything relative to the root instead of the data directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we normalize it to storage root here. previously, paths were (and still could be) relative to table location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a comment
91a7e14 to
7f3d8dc
Compare
891a001 to
7059ca8
Compare
|
@codex review |
maybe I'm onto something another attempt polish
7059ca8 to
2e942f9
Compare
💡 Codex ReviewClickHouse/src/Storages/ObjectStorage/Utils.cpp Lines 333 to 379 in 7059ca8
The new ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
arthurpassos
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general it looks ok to me. We must push this to upstream as well, otherwise it will be a pain to rebase
| : relative_path(std::move(command_or_path)) | ||
| , metadata(std::move(metadata_)) | ||
| , command(relative_path) | ||
| , absolute_path((absolute_path_.has_value() && !absolute_path_.value().empty()) ? absolute_path_ : std::nullopt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If empty is being treated the same as std::nullopt, why not just std::string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initial idea was to make the object reading logic shorter and avoid a lot of if (!absolute_path.empty())
I agree, maybe it shall be refactored. But since this is a more or less style isue, I would prefer to merge it as is now not to delay release any further and improve it when submitting to upstream.
This is still my plan. I just avoided submitting a PR that is not ready -- there is then a bigger chance no one will ever notice it when it is ready. |
Successor of #1008.
Closes #963, ClickHouse#84609. PR to upstream is upcoming
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Allow to read Iceberg data from any location
CI/CD Options
Exclude tests:
Regression jobs to run: