-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Add comparison support for Union arrays #8838
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
0ceac84 to
87c792b
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @friendlymatthew -- this looks good to me. The only thing I think it needs are some tests of the error cases -- namely
- Compare an out of bounds index (expect panic)
- Try to compare incompatible union types
I also left some other small suggestions but I don't think they are needed
| let left = left.as_union(); | ||
| let right = right.as_union(); | ||
|
|
||
| let (left_fields, left_mode) = match left.data_type() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is weird to have to re-check the DataTypes.
What would you think about adding UnionArray::fields() and UnionArray::mode() methods to make the code easier to work with?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be super quick to review: #8884
Somewhat related but it feels a bit weird that the following works without any notice to the user:
#[test]
fn test_union_fields() {
let ids = vec![0, 1, 2];
let field = Field::new("a", DataType::Binary, true);
// different length of ids and fields (we zip so we truncate the longer vec)
let _out = UnionFields::new(ids.clone(), vec![field.clone()]);
// duplicate fields associated with different type ids!
let _out = UnionFields::new(ids, vec![field.clone(), field]);
}I feel like we could benefit from a bit more validation? We could leave UnionFields::new but also have a UnionFields::try_new that checks the above 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think that sounds like a good idea to me
We can even deprecate UnionFields::new to help people migrate over
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here it is: #8891
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is another minor convenience improvement: #8895
arrow-ord/src/ord.rs
Outdated
|
|
||
| if left_fields != right_fields || left_mode != right_mode { | ||
| return Err(ArrowError::InvalidArgumentError( | ||
| "Cannot compare UnionArrays with different fields or modes".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend adding more details to this message to help when people hit it -- specifically, I recommend
- a separate message for different modes (and include the modes in the error message)
- Add the fields (
{fields:?}style) to the message
|
|
||
| let c_opts = child_opts(opts); | ||
|
|
||
| let mut field_comparators = HashMap::with_capacity(left_fields.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rather than a hash map you could potentially just use a 128 valued Vec<> indexed by the typeids -- since typeid is i8 you know there can be at most 128 values that might be faster to lookup than hashing/hash table
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm so this was my first thought/approach as well, but I decided to use a hashmap because it avoids superfluous memory usage for sparse sets
Plus, I don't think this is a very hot path, so any perf differences wouldn't be super meaningful
779ea23 to
ad9027b
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @friendlymatthew
# Which issue does this PR close? This PR adds another method on the `UnionArray` api that returns a list of `FieldRef`s associated with the union type See: #8838 (comment)
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @friendlymatthew
Which issue does this PR close?
cmpkernel #8837Uniondata types for row format #8828Rationale for this change
This PR implements comparison functionality for Union arrays. This implementation follows a simple ordering strategy where unions are first compared by their type identifier, and only when type identifiers match are the actual values within those types compared
This approach handles both sparse and dense union modes correctly by using offsets when present (dense unions) or direct indices (sparse unions) to locate the appropriate child array values