-
Notifications
You must be signed in to change notification settings - Fork 9
Glue: Deduce Iceberg table metadata location if metadata_location not specified
#1070
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
bbd2471 to
b1229b7
Compare
88224fa to
53eb8cc
Compare
metadata_location not specified
499df74 to
52ffc6c
Compare
arthurpassos
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall it looks good, just a couple of minor adjustments.
Btw, is it possible to add a test? If it's too hard / we are running against the clock, then it is ok.
| // Construct path to version-hint.text | ||
| String version_hint_path = table_location + "metadata/version-hint.text"; | ||
|
|
||
| DB::ASTStorage * storage = table_engine_definition->as<DB::ASTStorage>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code seems to be duplicated with GlueCatalog::classifyTimestampTZ. Instictively I would ask you to consider creating a function for this, but I am under the impression this PR is not going to upstream, right? Extracting it into a function would make rebasing harder, so it is ok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, why is it not going into upstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I'd keep it as is for now. Later I will bring this PR to upstream, and then a small refactor will be done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, why is it not going into upstream?
Because we want it in the next 25.8 antalya release ASAP
| String version_hint_object_path = version_hint_path; | ||
| if (version_hint_object_path.starts_with("s3://")) | ||
| { | ||
| version_hint_object_path = version_hint_object_path.substr(5); | ||
| // Remove bucket from path | ||
| std::size_t pos = version_hint_object_path.find('/'); | ||
| if (pos != std::string::npos) | ||
| version_hint_object_path = version_hint_object_path.substr(pos + 1); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider using S3::URI, I think it offers the functionality you need. All you gotta do is to instantiate it passing the path
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here again, I am keeping the code that is similar to upstream. This is a good note, I will keep it in mind when making a PR to upstream
|
|
||
| return table_location + "metadata/v" + version_str + "-metadata.json"; | ||
| } | ||
| catch (...) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of nest try-catch blocks, consider creating two auxiliary functions that return a std::optional
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we are trying to read objects that may not exist at all -- for we still need to have try/catch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, but instead of having nested try catch blocks, you can have two different functions that return std::optional
Wouldn't something like the below work?
std::optional<std::string> resolveApproach1(...)
{
try
{
// stuff that may throw
return resolved_path;
}
catch(...)
{
return std::nullopt;
}
}
std::optional<std::string> resolveApproach2(...)
{
try
{
// stuff that may throw
return resolved_path;
}
catch(...)
{
return std::nullopt;
}
}
String GlueCatalog::resolveMetadataPathFromTableLocation(const String & table_location, const TableMetadata & table_metadata) const
{
....
if (const auto path = resolveApproach1())
{
return path;
}
if (const auto path = resolveApproach2())
{
return path;
}
return std::nulopt;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe
But I actually do not think it shall be done here (as it is a style and code beauty issue, though I agree that what you suggest looks better). I will probably refactor some of this code when submitting into upstream (e.g. as suggested in your other comments), so I do not see a good reason to spend time on it now.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
Antalya 25.6.5 Backport of 87733: Fix handling of timestamp(tz) columns in Glue Catalog
Not really. IDK if a good Glue emulator exists. But the thing that existing tests work with != AWS Glue. So, it can only be tested against real AWS. |
03ae62a to
0402d8d
Compare
Also includes #1053 (ClickHouse#87733)
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Deduce Iceberg metadata from Glue's
LocationExclude tests: