@friendlymatthew
Contributor

Which issue does this PR close?

Rationale for this change

This PR implements comparison functionality for Union arrays. The implementation follows a simple ordering strategy: unions are first compared by their type identifier, and only when the type identifiers match are the actual values within those types compared.

This approach handles both sparse and dense union modes correctly by using offsets when present (dense unions) or direct indices (sparse unions) to locate the appropriate child array values.
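
The ordering described above can be sketched in a few lines. This is a minimal illustration and not the code from this PR: `ToyUnion`, `child_index`, and `compare_rows` are hypothetical names, and the toy type hard-codes two children (type id 0 for integers, type id 1 for strings) instead of using arrow-rs arrays.

```rust
use std::cmp::Ordering;

/// Toy stand-in for a union array with two children.
/// Hypothetical type for illustration only; not the arrow-rs `UnionArray`.
struct ToyUnion {
    type_ids: Vec<i8>,
    /// `Some` for dense unions, `None` for sparse unions.
    offsets: Option<Vec<i32>>,
    /// Child values for type id 0.
    ints: Vec<i64>,
    /// Child values for type id 1.
    strs: Vec<&'static str>,
}

impl ToyUnion {
    /// Position of logical row `i` inside its child array.
    fn child_index(&self, i: usize) -> usize {
        match &self.offsets {
            Some(offsets) => offsets[i] as usize, // dense: follow the offset
            None => i,                            // sparse: same index as the row
        }
    }
}

/// Compare row `i` of `left` with row `j` of `right`:
/// type ids first, child values only when the type ids are equal.
fn compare_rows(left: &ToyUnion, i: usize, right: &ToyUnion, j: usize) -> Ordering {
    let (l_id, r_id) = (left.type_ids[i], right.type_ids[j]);
    l_id.cmp(&r_id).then_with(|| {
        let (li, rj) = (left.child_index(i), right.child_index(j));
        match l_id {
            0 => left.ints[li].cmp(&right.ints[rj]),
            1 => left.strs[li].cmp(right.strs[rj]),
            other => panic!("unknown type id {other}"),
        }
    })
}
```

Because the child lookup goes through `child_index`, a sparse and a dense union holding the same logical values compare as equal row by row in this sketch.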

@github-actions github-actions bot added the arrow Changes to the arrow crate label Nov 13, 2025
@friendlymatthew
Contributor Author

cc @paddyhoran @alamb

Contributor

@alamb alamb left a comment

Thanks @friendlymatthew -- this looks good to me. The only thing I think it needs is some tests of the error cases, namely:

  1. Compare an out of bounds index (expect panic)
  2. Try to compare incompatible union types

I also left some other small suggestions but I don't think they are needed

let left = left.as_union();
let right = right.as_union();

let (left_fields, left_mode) = match left.data_type() {
Contributor

It is weird to have to re-check the DataTypes here.

What would you think about adding UnionArray::fields() and UnionArray::mode() methods to make the code easier to work with?

Contributor Author

This should be super quick to review: #8884

Somewhat related but it feels a bit weird that the following works without any notice to the user:

#[test]
fn test_union_fields() {
    let ids = vec![0, 1, 2];
    let field = Field::new("a", DataType::Binary, true);

    // different length of ids and fields (we zip so we truncate the longer vec)
    let _out = UnionFields::new(ids.clone(), vec![field.clone()]);

    // duplicate fields associated with different type ids!
    let _out = UnionFields::new(ids, vec![field.clone(), field]);
}

I feel like we could benefit from a bit more validation? We could leave UnionFields::new but also have a UnionFields::try_new that checks the above 🤔
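
A rough sketch of what such a fallible constructor might check, written against a toy type. `ToyUnionFields` and its string "fields" are hypothetical stand-ins, not the arrow-rs `UnionFields`, and the checks that actually landed in #8891 may differ. The duplicate type id check is an extra assumption that the comment above does not ask for.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a list of (type id, field) pairs.
/// Fields are plain names here; this is not the arrow-rs `UnionFields`.
#[derive(Debug, PartialEq)]
struct ToyUnionFields(Vec<(i8, String)>);

impl ToyUnionFields {
    /// Reject mismatched lengths, repeated type ids, and repeated fields
    /// instead of silently zipping and truncating.
    fn try_new(ids: Vec<i8>, fields: Vec<String>) -> Result<Self, String> {
        if ids.len() != fields.len() {
            return Err(format!(
                "got {} type ids but {} fields",
                ids.len(),
                fields.len()
            ));
        }
        {
            // Scoped so the borrowed sets are dropped before `fields` is moved.
            let mut seen_ids = HashSet::new();
            let mut seen_fields = HashSet::new();
            for (id, field) in ids.iter().zip(&fields) {
                if !seen_ids.insert(*id) {
                    return Err(format!("duplicate type id {id}"));
                }
                if !seen_fields.insert(field.as_str()) {
                    return Err(format!("duplicate field {field:?}"));
                }
            }
        }
        Ok(Self(ids.into_iter().zip(fields).collect()))
    }
}
```

With this shape, both calls from the test above would return an `Err` rather than a silently truncated or duplicated set of fields.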

Contributor

Yes, I think that sounds like a good idea to me

We can even deprecate UnionFields::new to help people migrate over

Contributor Author

Here it is: #8891

Contributor Author

Here is another minor convenience improvement: #8895


if left_fields != right_fields || left_mode != right_mode {
    return Err(ArrowError::InvalidArgumentError(
        "Cannot compare UnionArrays with different fields or modes".to_string(),
Contributor

I recommend adding more details to this message to help when people hit it -- specifically, I recommend

  1. a separate message for different modes (and include the modes in the error message)
  2. Add the fields ({fields:?} style) to the message
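
One possible shape for the two suggestions, as a standalone sketch. `ToyMode` and `check_comparable` are hypothetical names, fields are reduced to `(type id, name)` pairs, and errors are plain `String`s rather than `ArrowError`. The exact wording of the messages is also an assumption.

```rust
/// Hypothetical stand-in for the union mode; not the arrow-rs `UnionMode`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ToyMode {
    Sparse,
    Dense,
}

/// Return a distinct, detailed error for a mode mismatch and for a field mismatch.
fn check_comparable(
    left_fields: &[(i8, &str)],
    left_mode: ToyMode,
    right_fields: &[(i8, &str)],
    right_mode: ToyMode,
) -> Result<(), String> {
    if left_mode != right_mode {
        return Err(format!(
            "Cannot compare UnionArrays with different modes: left mode {left_mode:?}, right mode {right_mode:?}"
        ));
    }
    if left_fields != right_fields {
        return Err(format!(
            "Cannot compare UnionArrays with different fields: left fields {left_fields:?}, right fields {right_fields:?}"
        ));
    }
    Ok(())
}
```

Splitting the check means a user who hits the error can tell from the message alone whether the modes or the fields disagree, and what each side contained.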


let c_opts = child_opts(opts);

let mut field_comparators = HashMap::with_capacity(left_fields.len());
Contributor

Rather than a hash map, you could potentially just use a 128-element Vec<> indexed by the type ids. Since a type id is an i8, you know there can be at most 128 values, and that might be faster to look up than hashing into a hash table.
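
For illustration, a table of that kind could look like the sketch below. `TypeIdTable` is a hypothetical helper, not something in arrow-rs, and it assumes type ids are non-negative (0 through 127), so a negative id is treated as absent on lookup.

```rust
/// Hypothetical lookup table keyed by union type id.
/// Assumes valid type ids lie in 0..=127, so 128 slots cover every id.
struct TypeIdTable<T> {
    slots: Vec<Option<T>>,
}

impl<T> TypeIdTable<T> {
    /// Allocate all 128 slots up front, each initially empty.
    fn new() -> Self {
        Self {
            slots: (0..128).map(|_| None).collect(),
        }
    }

    /// Store `value` for `type_id`; panics on a negative id.
    fn insert(&mut self, type_id: i8, value: T) {
        assert!(type_id >= 0, "type id must be non-negative");
        self.slots[type_id as usize] = Some(value);
    }

    /// Direct index lookup, no hashing involved.
    fn get(&self, type_id: i8) -> Option<&T> {
        if type_id < 0 {
            return None;
        }
        self.slots[type_id as usize].as_ref()
    }
}
```

The trade-off is that all 128 slots are allocated even when only a few type ids are in use.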

Contributor Author

Hm so this was my first thought/approach as well, but I decided to use a hashmap because it avoids superfluous memory usage for sparse sets

Plus, I don't think this is a very hot path, so any perf differences wouldn't be super meaningful

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/compare-union branch from 779ea23 to ad9027b Compare November 19, 2025 19:41
Contributor

@alamb alamb left a comment

Thank you @friendlymatthew

friendlymatthew added a commit to pydantic/arrow-rs that referenced this pull request Nov 20, 2025
alamb pushed a commit that referenced this pull request Nov 24, 2025
# Which issue does this PR close?

This PR adds another method on the `UnionArray` api that returns a list
of `FieldRef`s associated with the union type

See: #8838 (comment)
Contributor

@alamb alamb left a comment

@alamb alamb merged commit a8a63c2 into apache:main Nov 24, 2025
17 checks passed
Labels

arrow Changes to the arrow crate

Development

Successfully merging this pull request may close these issues.

Add comparison support for Union arrays in the cmp kernel

3 participants