feat!: introduce the ability for consumers to add ObjectStore url handlers#873
feat!: introduce the ability for consumers to add ObjectStore url handlers#873rtyler merged 3 commits intodelta-io:mainfrom
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #873 +/- ##
==========================================
+ Coverage 85.05% 85.11% +0.06%
==========================================
Files 85 85
Lines 20988 21026 +38
Branches 20988 21026 +38
==========================================
+ Hits 17851 17896 +45
+ Misses 2242 2233 -9
- Partials 895 897 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
zachschuermann
left a comment
There was a problem hiding this comment.
dropping a few comments will come back soon!
| /// Insert a new URL handler for [parse_url_opts] with the given `scheme`. This allows users to | ||
| /// provide their own custom URL handler to plug new [object_store::ObjectStore] instances into | ||
| /// delta-kernel | ||
| pub fn insert_url_handler( |
There was a problem hiding this comment.
I wonder if we should add this to the default StorageHandler? Would it be easier to use there instead of just as a standalone? (could still stay a static function but just be invoked like StorageHandler::insert_url_handler?
There was a problem hiding this comment.
We don't currently expose the storage handler as an arg in new. So we'd have to modify the trait in the core kernel to include this so that you could get_storage_handler and then insert_url_handler, which seems gross.
I'd say like this is okay for now, and we can refactor to pass the StorageHandler as part of the call to new in the future if we feel it's better.
| if let Ok(mut lock) = registry.write() { | ||
| (*lock).insert(scheme.as_ref().into(), handler_closure); | ||
| } else { | ||
| panic!("Failed to acquire lock for adding a URL handler!"); |
There was a problem hiding this comment.
can we return an error instead of panic!?
There was a problem hiding this comment.
I am kind of indifferent here, I think failing to insert a URL handler is a pretty fatal error, nothing will work properly if you were expecting to insert a URL handler and that failed
There was a problem hiding this comment.
Agreed, but given that panics are really a pain, especially over ffi, we aim for a "no panic" policy if possible in kernel, and should rather return an error. The engine gets to decide how fatal it is :)
There was a problem hiding this comment.
Fatal for Delta queries that need the URL handler perhaps, but surely not the engine as a whole? A panic would bring down the whole engine.
There was a problem hiding this comment.
Update: A quick skim of the code base shows that ~all our lock acquires currently panic/unwrap. So I'm ok letting this PR merge as-is (doesn't make the problem worse), but we should track a TODO to fix them all ASAP.
…lers This provides the ability for downstream crates to insert their own URL handlers, and offers a path for delta-kernel-rs to reduce its dependency on hdfs-native-object-store for all consumers of the crate. I hacked this out modeling off something similar that I did in delta-rs to allow pluggable storage handlers, which is hugely beneficial to reducing the dependency scope of its core module. In discussing arrow-55 upgrade challenges with Zach I proposed this as a way to help simplify the upgrade path around arrow-55 and object-store-0.12. Side benefit of a slightly smaller dependency footprint for downstream consumers :) Signed-off-by: R. Tyler Croy <rtyler@brokenco.de>
nicklan
left a comment
There was a problem hiding this comment.
nice, thanks tyler. This looks good, just a couple of nits.
| if let Ok(mut lock) = registry.write() { | ||
| (*lock).insert(scheme.as_ref().into(), handler_closure); | ||
| } else { | ||
| panic!("Failed to acquire lock for adding a URL handler!"); |
There was a problem hiding this comment.
Agreed, but given that panics are really a pain, especially over ffi, we aim for a "no panic" policy if possible in kernel, and should rather return an error. The engine gets to decide how fatal it is :)
| return handler(url, options); | ||
| } | ||
| } | ||
| parse_url_opts_object_store(url, options) |
There was a problem hiding this comment.
nit: maybe
| parse_url_opts_object_store(url, options) | |
| object_store::parse_url_opts(url, options) |
About the same length and we can remove the import.
| /// This type alias makes it easier to reference the handler closure(s) | ||
| /// | ||
| /// It uses a HashMap<String, String> which _must_ be converted in our [parse_url_opts] because we | ||
| /// cannot use generics in this scenario. |
There was a problem hiding this comment.
Can we say why. Specifically I think it's that Into is not "dyn compatible", so we can't just box up the traits because you cannot write:
Box<dyn IntoIterator<Item = dyn AsRef<str>, IntoIter = dyn Into<String>>>| /// Insert a new URL handler for [parse_url_opts] with the given `scheme`. This allows users to | ||
| /// provide their own custom URL handler to plug new [object_store::ObjectStore] instances into | ||
| /// delta-kernel | ||
| pub fn insert_url_handler( |
There was a problem hiding this comment.
We don't currently expose the storage handler as an arg in new. So we'd have to modify the trait in the core kernel to include this so that you could get_storage_handler and then insert_url_handler, which seems gross.
I'd say like this is okay for now, and we can refactor to pass the StorageHandler as part of the call to new in the future if we feel it's better.
zachschuermann
left a comment
There was a problem hiding this comment.
LGTM after we handle those few nits/comments!!
| let options: HashMap<String, String> = HashMap::default(); | ||
| // Currently constructing an [HdfsObjectStore] won't work if there isn't an actual HDFS | ||
| // to connect to, so the only way to really verify that we got the object store we | ||
| // jxpected is to inspect the `store` on the error v_v |
There was a problem hiding this comment.
| // jxpected is to inspect the `store` on the error v_v | |
| // expected is to inspect the `store` on the error v_v |
| if let Ok(mut lock) = registry.write() { | ||
| (*lock).insert(scheme.as_ref().into(), handler_closure); | ||
| } else { | ||
| panic!("Failed to acquire lock for adding a URL handler!"); |
There was a problem hiding this comment.
Update: A quick skim of the code base shows that ~all our lock acquires currently panic/unwrap. So I'm ok letting this PR merge as-is (doesn't make the problem worse), but we should track a TODO to fix them all ASAP.
|
Realized one of the downsides to this is that FFI usages (like duckdb extension) won't get HDFS support anymore because of this, unless they have their own wrapper project. Raised apache/arrow-rs-object-store#424 to hopefully solve all of these issues, but if that doesn't happen might need to add back in a custom feature for adding HDFS support |
cc: @samansmink might be of interest to you. |
This provides the ability for downstream crates to insert their own URL handlers, and offers a path for delta-kernel-rs to reduce its dependency on hdfs-native-object-store for all consumers of the crate.
I hacked this out modeling off something similar that I did in delta-rs to allow pluggable storage handlers, which is hugely beneficial to reducing the dependency scope of its core module. In discussing arrow-55 upgrade challenges with Zach I proposed this as a way to help simplify the upgrade path around arrow-55 and object-store-0.12. Side benefit of a slightly smaller dependency footprint for downstream consumers :)