-
Notifications
You must be signed in to change notification settings - Fork 1.1k
[Variant] Support typed access for numeric types in variant_get #8179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Variant] Support typed access for numeric types in variant_get #8179
Conversation
parquet-variant/src/variant.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needed to support DataType::Float16.
Unrelated to this issue: Seems like it would make sense to also add Variant::as_f16 for symmetry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this variable into the macro itself, although passing this into the macro would also work.
|
FYI @carpecodeum and @scovich do you have some time to review this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if https://github.com/apache/arrow-rs/pull/8105/files#diff-ed469f33b865066aa56a96aa5279ce67d4bf85ea34c7a8c12ffb16afef107b6fR23-R31 merges, this simplifies to:
| for i in 0..$variant_array.len() { | |
| if primitive_array.is_null(i) { | |
| array_builder.append_null(); | |
| } else { | |
| let value = primitive_array.value(i); | |
| array_builder.append_variant(Variant::from(value)); | |
| } | |
| } | |
| primitive_conversion_array!($arrow_type, $typed_value, array_builder); |
TBD whether it's worth defining the macro just to turn 3 lines into one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, re-using that macro after it merges would work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched over to using the merged macro.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: I'm a bit surprised that fmt didn't simplify these match arms:
| DataType::Int8 => { | |
| cast_partially_shredded_primitive!( | |
| typed_value, | |
| variant_array, | |
| Int8Type, | |
| DataType::Int8 | |
| ) | |
| } | |
| DataType::Int8 => cast_partially_shredded_primitive!( | |
| typed_value, | |
| variant_array, | |
| Int8Type, | |
| DataType::Int8 | |
| ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it's having trouble inferring things across the macro boundary, so it doesn't know that it can simplify the match arms. I've noticed that macros sometimes behave strangely when it comes to formatting(and LSP diagnostics).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually suspect fmt just plain isn't running, for some reason? I've seen other places in this file that aren't formatted as I would have expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did an experiment with messing up the formatting of the file and running cargo fmt manually. It modifies the file back to it's current state. I don't have any rustfmt.toml customizations, so I think this is the formatted result.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok... weird, but I guess fmt does what it does
95a4fab to
db5183d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why cargo fmt aligns the comment like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wild guess: it would pass 100 char limit if indented like the others
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, re-using that macro after it merges would work.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @superserious-dev and @scovich -- this PR looks good to me. I think it needs to have some merge conflicts resolved and we'll be ready go merge it.
parquet-variant/src/variant.rs
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that #8208 is done, we could use the ReadOnlyMetadataBuilder (as a follow on PR)
db5183d to
49390f4
Compare
|
Incorporated @scovich's suggestion and rebased on the latest main. One question: This PR introduces a few macros(eg. arrow-rs/parquet-variant-compute/src/variant_get/mod.rs Lines 368 to 387 in 1dacecb
Any suggestions for whether to keep the macros in this PR or switch over to the generic functions? There are also quite a few conversion left, so trying to consider what would work well for those.
|
scovich
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems good enough to merge, possibly with a couple tweaks to tests (see below)
Open questions/follow-ups in my mind:
-
re
Can use
variant_getfor shredded numeric typesWhat happens if
variant_getencounters atyped_value: LONGat the requested path, but the caller requestedSome(Int32)type that doesn't match? Is there code somewhere else to handle the casting? (I don't see any unit tests that cover the case, either) -
this PR modifies
OutputBuilderand the types that implement it, while another open PR aims to delete the trait entirely:But that other PR seems to have stalled due to questions around null mask handling (**), and the tests in this PR will remain useful even if the implementation changes later.
Should we go ahead and merge this PR even knowing a lot of the code it touches will need to change Real Soon?
(**) Honestly, we should probably let the other PR merge without addressing the null mask question, because it's not making the problem worse.
-
re
the macro approach reduces some boilerplate while making it harder to read the tests
Any suggestions for whether to keep the macros in this PR or switch over to the generic functions?
I left a couple comments in the code with my suggestions. Nothing drastic.
| ($test:ident, $data_fn:ident, $primitive_type:ty) => { | ||
| #[test] | ||
| fn $test() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since you're worried about readability -- I'm not sure we gain a lot by pulling the test function definition inside the macro? What if we left that to the caller, and swapped the remaining two args, so that:
numeric_partially_shredded_test!(
get_variant_partially_shredded_int8_as_variant,
partially_shredded_int8_variant_array,
i8
);becomes
#[test]
fn get_variant_partially_shredded_int8_as_variant() {
numeric_partially_shredded_test!(i8, partially_shredded_int8_variant_array);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated. I agree that it looks more readable this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wild guess: it would pass 100 char limit if indented like the others
| use std::sync::Arc; | ||
|
|
||
| macro_rules! cast_partially_shredded_primitive { | ||
| ($typed_value:expr, $variant_array:expr, $arrow_type:ty, $data_type:expr) => {{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
$data_type arg is unused?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch; removed.
|
|
||
| macro_rules! cast_partially_shredded_primitive { | ||
| ($typed_value:expr, $variant_array:expr, $arrow_type:ty, $data_type:expr) => {{ | ||
| let mut array_builder = VariantArrayBuilder::new($variant_array.len()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't look quite right? Seems like we should either
- Propagate the existing
metadatacolumn unchanged, or - Create a new column of empty metadata, since these are all primitives that don't need field names in the first place?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually... the current code is taking approach 2/, and is correct.
It could be optimized to just stamp out the empty metadata directly (or even use a dictionary array to reference the same one over and over).
If we pursued approach 1/, we wouldn't have to do any work at all -- just arc clone the existing column -- but we'd need to create a new kind of variant array builder that doesn't try to create a metadata column.
Either way, it seems like a potential TODO to track, rather than something to pursue in this PR.
| match typed_value.data_type() { | ||
| DataType::Int32 => { | ||
| let primitive_array = typed_value.as_primitive::<Int32Type>(); | ||
| for i in 0..variant_array.len() { | ||
| if variant_array.is_null(i) { | ||
| array_builder.append_null(); | ||
| continue; | ||
| } | ||
|
|
||
| if typed_value.is_null(i) { | ||
| // fall back to the value (variant) field | ||
| // (TODO could copy the variant bytes directly) | ||
| let value = variant_array.value(i); | ||
| array_builder.append_variant(value); | ||
| continue; | ||
| } | ||
|
|
||
| // otherwise we have a typed value, so we can use it directly | ||
| let int_value = primitive_array.value(i); | ||
| array_builder.append_variant(Variant::from(int_value)); | ||
| } | ||
| } | ||
| DataType::Int8 => cast_partially_shredded_primitive!( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like something that should be handled by downcast_primitive_array, except we don't support all primitive types yet?
This match is mostly dealing with arrow data types and primitive downcasting, which that macro should handle automagically?
downcast_primitive_array!(
typed_value => {
let mut array_builder = VariantArrayBuilder::new(variant_array.len());
for i in 0..variant_array.len() {
if variant_array.is_null(i) {
array_builder.append_null();
continue;
}
// We must use the typed_value, if present; otherwise fall back to the value
let value = if typed_value.is_null(i) {
variant_array.value(i) // TODO: copy the variant bytes directly?
} else {
typed_value.value(i).into()
};
array_builder.append_variant(value);
}
Ok(Arc::new(array_builder.build()))
}
dt => {
... same error as today ...
}
}|
re casting, I left a comment here: #8087 (comment). Since casting is an extra step after typed access, I suggested implementing the conversion logic in #8086. |
Any thoughts on this, @superserious-dev ? |
Currently this PR only covers the case where the user has not requested a specific output type. I noticed there was a separate issue/discussion #8086 for casting, so I didn't implement it here. However I'd be fine with amending this PR to include the casting behavior for numeric types, if that is preferred. For reference, the code for determining typed output( arrow-rs/parquet-variant-compute/src/variant_get/output/mod.rs Lines 63 to 87 in 21a9a2a
variant_get here:
There's a set of pre-existing tests that cover the specific case of int32 -> int32. This could be macro-ified for coverage of all the different combinations of numeric types, eg. int32->int64. arrow-rs/parquet-variant-compute/src/variant_get/mod.rs Lines 223 to 259 in 1dacecb
|
Ah, that makes it easy then! Thanks for the clear explanation. |
|
Sorry for the delay here. I was out for the last week but I am back now and catching up on PRs. I plan to merge this once CI passes |
|
Thanks again @superserious-dev and @scovich |
# Which issue does this PR close? - Closes #8150 - closes #8166 # Rationale for this change Add support for extracting fields from both shredded and non-shredded variant arrays at any depth (like "x", "a.x", "a.b.x") and casting them to Int32 with proper NULL handling for type mismatches. NOTE: This is a second attempt at * #8166 which suffered strong logical conflicts with * #8179 and which itself grew out of * #8122 See the other two PR for the vast majority of review commentary relating to this change. I started from the original PR commits (first three), performed the merge, and fixed up a bunch of issues. Manually diffing the before ([76b75ee..882aa4d](https://github.com/apache/arrow-rs/compare/76b75eebc..882aa4d69)) and after ([0ba91ae..f6fd915](https://github.com/apache/arrow-rs/compare/0ba91aed9..f6fd91583)) diffs gives the following non-trivial differences vs. the original PR * Ran `cargo fmt` * `typed_value_to_variant` now supports all primitive numeric types (previously only int16) * cast options plumbed through and respected * Fix a null buffer bug in `shredded_get_path` -- the original code was wrongly unioning in the null buffer from `typed_value` column: ```patch // Path exhausted! Create a new `VariantArray` for the location we landed on. - // Also union nulls from the final typed_value field we landed on - if let Some(typed_value) = shredding_state.typed_value_field() { - accumulated_nulls = arrow::buffer::NullBuffer::union( - accumulated_nulls.as_ref(), - typed_value.nulls(), - ); - } let target = make_target_variant( shredding_state.value_field().cloned(), shredding_state.typed_value_field().cloned(), accumulated_nulls, ); ``` * Remove the `get_variant_perfectly_shredded_int32_as_variant` test case, because #8179 introduced a battery of unit tests that cover the same functionality. * Remove now-unnecessary `.unwrap()` calls from object builder `finish` calls in unit tests * Fixed broken test code in `create_depth_1_shredded_test_data_working`, which captured the return value of a nested builder's `finish` (`()`) instead of the return value of the top-level builder. I'm not quite sure what this code was trying to do, but I changed it to just create a nested builder instead of a second top-level builder: ```patch fn create_depth_1_shredded_test_data_working() -> ArrayRef { // Create metadata following the working pattern from shredded_object_with_x_field_variant_array let (metadata, _) = { - let a_variant = { - let mut a_builder = parquet_variant::VariantBuilder::new(); - let mut a_obj = a_builder.new_object(); - a_obj.insert("x", Variant::Int32(55)); // "a.x" field (shredded when possible) - a_obj.finish().unwrap() - }; let mut builder = parquet_variant::VariantBuilder::new(); let mut obj = builder.new_object(); - obj.insert("a", a_variant); + + // Create the nested "a" object + let mut a_obj = obj.new_object("a"); + a_obj.insert("x", Variant::Int32(55)); + a_obj.finish(); + obj.finish().unwrap(); builder.finish() }; ``` * Similar fix (twice, `a_variant` and `b_variant`) for `create_depth_2_shredded_test_data_working` * `make_shredding_row_builder` now supports signed int and float types (unsigned int not supported yet) * A new `get_type_name` helper in row_builder.rs that gives human-readable data type names. I'm not convinced it's necessary (and the code is in the wrong spot, jammed in the middle of `VariantAsPrimitive` code. * `impl VariantAsPrimitive` for all signed int and float types * `PrimitiveVariantShreddingRowBuilder` now has a lifetime param because it takes a reference to cast options (it now respects unsafe vs. safe casting) # What changes are included in this PR? Everything in the original PR, plus merge in the main branch, fix logical conflicts and fix various broken tests. # Are these changes tested? All unit tests now pass. # Are there any user-facing changes? No (variant is not public yet) --------- Co-authored-by: carpecodeum <[email protected]> Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Are these changes tested?
Yes
Are there any user-facing changes?
Can use
variant_getfor shredded numeric types