Skip to content

feat!: introduce the ability for consumers to add ObjectStore url handlers#873

Merged
rtyler merged 3 commits intodelta-io:mainfrom
buoyant-data:hdfs-optional
Apr 23, 2025
Merged

feat!: introduce the ability for consumers to add ObjectStore url handlers#873
rtyler merged 3 commits intodelta-io:mainfrom
buoyant-data:hdfs-optional

Conversation

@rtyler
Copy link
Copy Markdown
Member

@rtyler rtyler commented Apr 22, 2025

This provides the ability for downstream crates to insert their own URL handlers, and offers a path for delta-kernel-rs to reduce its dependency on hdfs-native-object-store for all consumers of the crate.

I hacked this out modeling off something similar that I did in delta-rs to allow pluggable storage handlers, which is hugely beneficial to reducing the dependency scope of its core module. In discussing arrow-55 upgrade challenges with Zach I proposed this as a way to help simplify the upgrade path around arrow-55 and object-store-0.12. Side benefit of a slightly smaller dependency footprint for downstream consumers :)

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 22, 2025

Codecov Report

Attention: Patch coverage is 82.75862% with 10 lines in your changes missing coverage. Please review.

Project coverage is 85.11%. Comparing base (79dfea5) to head (d62bd77).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
kernel/src/engine/default/storage.rs 82.75% 7 Missing and 3 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #873      +/-   ##
==========================================
+ Coverage   85.05%   85.11%   +0.06%     
==========================================
  Files          85       85              
  Lines       20988    21026      +38     
  Branches    20988    21026      +38     
==========================================
+ Hits        17851    17896      +45     
+ Misses       2242     2233       -9     
- Partials      895      897       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions Bot added the breaking-change Public API change that could cause downstream compilation failures. Requires a major version bump. label Apr 22, 2025
@rtyler rtyler marked this pull request as ready for review April 22, 2025 15:35
Copy link
Copy Markdown
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropping a few comments will come back soon!

Comment thread kernel/Cargo.toml Outdated
/// Insert a new URL handler for [parse_url_opts] with the given `scheme`. This allows users to
/// provide their own custom URL handler to plug new [object_store::ObjectStore] instances into
/// delta-kernel
pub fn insert_url_handler(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should add this to the default StorageHandler? Would it be easier to use there instead of just as a standalone? (could still stay a static function but just be invoked like StorageHandler::insert_url_handler?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't currently expose the storage handler as an arg in new. So we'd have to modify the trait in the core kernel to include this so that you could get_storage_handler and then insert_url_handler, which seems gross.

I'd say like this is okay for now, and we can refactor to pass the StorageHandler as part of the call to new in the future if we feel it's better.

if let Ok(mut lock) = registry.write() {
(*lock).insert(scheme.as_ref().into(), handler_closure);
} else {
panic!("Failed to acquire lock for adding a URL handler!");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we return an error instead of panic!?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am kind of indifferent here, I think failing to insert a URL handler is a pretty fatal error, nothing will work properly if you were expecting to insert a URL handler and that failed

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but given that panics are really a pain, especially over ffi, we aim for a "no panic" policy if possible in kernel, and should rather return an error. The engine gets to decide how fatal it is :)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fatal for Delta queries that need the URL handler perhaps, but surely not the engine as a whole? A panic would bring down the whole engine.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: A quick skim of the code base shows that ~all our lock acquires currently panic/unwrap. So I'm ok letting this PR merge as-is (doesn't make the problem worse), but we should track a TODO to fix them all ASAP.

Comment thread kernel/src/engine/default/storage.rs Outdated
…lers

This provides the ability for downstream crates to insert their own URL
handlers, and offers a path for delta-kernel-rs to reduce its dependency
on hdfs-native-object-store for all consumers of the crate.

I hacked this out modeling off something similar that I did in delta-rs
to allow pluggable storage handlers, which is hugely beneficial to
reducing the dependency scope of its core module. In discussing arrow-55
upgrade challenges with Zach I proposed this as a way to help simplify
the upgrade path around arrow-55 and object-store-0.12. Side benefit of
a slightly smaller dependency footprint for downstream consumers :)

Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
Copy link
Copy Markdown
Member

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, thanks tyler. This looks good, just a couple of nits.

if let Ok(mut lock) = registry.write() {
(*lock).insert(scheme.as_ref().into(), handler_closure);
} else {
panic!("Failed to acquire lock for adding a URL handler!");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, but given that panics are really a pain, especially over ffi, we aim for a "no panic" policy if possible in kernel, and should rather return an error. The engine gets to decide how fatal it is :)

return handler(url, options);
}
}
parse_url_opts_object_store(url, options)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe

Suggested change
parse_url_opts_object_store(url, options)
object_store::parse_url_opts(url, options)

About the same length and we can remove the import.

/// This type alias makes it easier to reference the handler closure(s)
///
/// It uses a HashMap<String, String> which _must_ be converted in our [parse_url_opts] because we
/// cannot use generics in this scenario.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we say why. Specifically I think it's that Into is not "dyn compatible", so we can't just box up the traits because you cannot write:

Box<dyn IntoIterator<Item = dyn AsRef<str>, IntoIter = dyn Into<String>>>

/// Insert a new URL handler for [parse_url_opts] with the given `scheme`. This allows users to
/// provide their own custom URL handler to plug new [object_store::ObjectStore] instances into
/// delta-kernel
pub fn insert_url_handler(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't currently expose the storage handler as an arg in new. So we'd have to modify the trait in the core kernel to include this so that you could get_storage_handler and then insert_url_handler, which seems gross.

I'd say like this is okay for now, and we can refactor to pass the StorageHandler as part of the call to new in the future if we feel it's better.

Copy link
Copy Markdown
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM after we handle those few nits/comments!!

let options: HashMap<String, String> = HashMap::default();
// Currently constructing an [HdfsObjectStore] won't work if there isn't an actual HDFS
// to connect to, so the only way to really verify that we got the object store we
// jxpected is to inspect the `store` on the error v_v
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// jxpected is to inspect the `store` on the error v_v
// expected is to inspect the `store` on the error v_v

if let Ok(mut lock) = registry.write() {
(*lock).insert(scheme.as_ref().into(), handler_closure);
} else {
panic!("Failed to acquire lock for adding a URL handler!");
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: A quick skim of the code base shows that ~all our lock acquires currently panic/unwrap. So I'm ok letting this PR merge as-is (doesn't make the problem worse), but we should track a TODO to fix them all ASAP.

@zachschuermann zachschuermann changed the title feat: introduce the ability for consumers to add ObjectStore url handlers feat!: introduce the ability for consumers to add ObjectStore url handlers Apr 23, 2025
@rtyler rtyler merged commit c036bed into delta-io:main Apr 23, 2025
20 of 21 checks passed
@Kimahriman
Copy link
Copy Markdown

Realized one of the downsides to this is that FFI usages (like duckdb extension) won't get HDFS support anymore because of this, unless they have their own wrapper project. Raised apache/arrow-rs-object-store#424 to hopefully solve all of these issues, but if that doesn't happen might need to add back in a custom feature for adding HDFS support

@rzepinskip
Copy link
Copy Markdown
Contributor

Realized one of the downsides to this is that FFI usages (like duckdb extension) won't get HDFS support anymore because of this, unless they have their own wrapper project. Raised apache/arrow-rs-object-store#424 to hopefully solve all of these issues, but if that doesn't happen might need to add back in a custom feature for adding HDFS support

cc: @samansmink might be of interest to you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Public API change that could cause downstream compilation failures. Requires a major version bump.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants