
Conversation

@harshmotw-db (Contributor) commented Jun 25, 2025

Which issue does this PR close?

Rationale for this change

Explained in the issue.

What changes are included in this PR?

This PR includes a json_to_variant API to parse JSON strings as Variants. json_to_variant takes the input JSON string and a builder of type VariantBuilder, and builds the variant using the builder. The resulting variant can be extracted with builder.finish(), which consumes the builder and returns the variant buffers.
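A minimal usage sketch based on this description (hedged: `VariantBuilder::new` and the exact return shape of `finish()` are assumptions here; see the PR diff for the precise signatures):

```rust
use parquet_variant::{json_to_variant, VariantBuilder};

fn example() -> Result<(), arrow_schema::ArrowError> {
    let mut builder = VariantBuilder::new();
    json_to_variant(r#"{"a": 1, "b": [true, null]}"#, &mut builder)?;
    // finish() consumes the builder and returns the variant buffers
    // (metadata and value).
    let (_metadata, _value) = builder.finish();
    Ok(())
}
```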

Are these changes tested?

Unit Tests, and an example file.

Are there any user-facing changes?

Yes, the PR introduces the json_to_variant API.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 25, 2025
@harshmotw-db (Contributor Author): cc @scovich

@alamb (Contributor) commented Jun 25, 2025

FYI @carpecodeum

This is amazing @harshmotw-db -- I was just thinking about this.

I started the CI checks and will check this PR more carefully tomorrow

@harshmotw-db harshmotw-db changed the title add json_to_variant [VARIANT] Add support for the json_to_variant API Jun 25, 2025
serde_json = "1.0"
# "arbitrary_precision" allows us to manually parse numbers. "preserve_order" does not automatically
# sort object keys
serde_json = { version = "1.0", features = ["arbitrary_precision", "preserve_order"] }
Contributor:

TIL about the (undocumented) arbitrary_precision feature flag and (undocumented) Number::as_str method it unlocks 🤯

That could solve a lot of problems for variant decimals on the to_json path as well:
https://github.com/apache/arrow-rs/blob/main/parquet-variant/src/to_json.rs#L147-L162

Contributor:

Question tho -- do we need to support both modes of serde_json, given that the user (not arrow-rs) probably decides whether to use that feature flag?

Contributor Author:

It's possible some projects might prioritize performance and might not want the Decimal type at all. I would prefer doing that as a follow up though. If you agree, I'll go ahead and create an issue.

Contributor:

Oh, for sure it's a follow-up. But we probably do need a story for how to handle that feature flag in a dependency (maybe part of the bigger story of how to handle a serde_json dependency in the first place).

Comment on lines 1040 to 1042
if n.is_i64() {
// Find minimum Integer width to fit
let i = n.as_i64().unwrap();
Contributor:

Suggested change
if n.is_i64() {
// Find minimum Integer width to fit
let i = n.as_i64().unwrap();
if let Some(i) = n.as_i64() {
// Find minimum Integer width to fit

Comment on lines 1043 to 1051
if i as i8 as i64 == i {
Ok((i as i8).into())
} else if i as i16 as i64 == i {
Ok((i as i16).into())
} else if i as i32 as i64 == i {
Ok((i as i32).into())
} else {
Ok(i.into())
}
Contributor:

Interesting approach. That double .. as .. as .. definitely caused a double take, but I guess it could be cheaper than TryFrom?

Suggested change
if i as i8 as i64 == i {
Ok((i as i8).into())
} else if i as i16 as i64 == i {
Ok((i as i16).into())
} else if i as i32 as i64 == i {
Ok((i as i32).into())
} else {
Ok(i.into())
}
let value = i8::try_from(i)
.map(Variant::from)
.or_else(|_| i16::try_from(i).map(Variant::from))
.or_else(|_| i32::try_from(i).map(Variant::from))
.unwrap_or_else(|_| Variant::from(i));
Ok(value)
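The two strategies in the snippets above agree on every i64; here is a self-contained sketch comparing the cast round-trip against TryFrom (function names are illustrative, not from the PR):

```rust
// Narrowest integer width that losslessly holds an i64, via cast round-trip:
// casting down and back up is lossless iff the value fits.
fn width_by_cast(i: i64) -> u32 {
    if i as i8 as i64 == i {
        8
    } else if i as i16 as i64 == i {
        16
    } else if i as i32 as i64 == i {
        32
    } else {
        64
    }
}

// Same answer via TryFrom, which fails exactly when the value is out of range.
fn width_by_try_from(i: i64) -> u32 {
    if i8::try_from(i).is_ok() {
        8
    } else if i16::try_from(i).is_ok() {
        16
    } else if i32::try_from(i).is_ok() {
        32
    } else {
        64
    }
}

fn main() {
    for v in [0, 127, 128, -32_768, 40_000, i64::MIN, i64::MAX] {
        assert_eq!(width_by_cast(v), width_by_try_from(v));
    }
    println!("{}", width_by_cast(300)); // 300 overflows i8 but fits i16
}
```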

Comment on lines 36 to 38
pub struct SampleBoxBasedVariantBufferManager {
pub value_buffer: Box<[u8]>,
pub metadata_buffer: Box<[u8]>,
Contributor:

What purpose would this serve? It seems strictly inferior to a vec for both normal and testing uses?

Contributor:

Continuing the above thought about building up an arrow Array... maybe a SampleMultiVariantBufferManager that can stack a bunch of variants end to end, with a len method that lets callers extract offsets?

Contributor Author (@harshmotw-db) commented Jun 26, 2025:

Yes, I do plan on adding an arrow-compliant sample buffer manager. I do not yet fully understand arrow, but I believe the trait is simple enough that it could be extended to such use cases. I will do that as a follow-up if you agree with the general design.

@@ -0,0 +1,124 @@
use arrow_schema::ArrowError;

pub trait VariantBufferManager {
Contributor:

I'm not quite sure I understand this trait.

It almost seems like a good building block for eventual columnar writes, if an implementation took a mutable reference to a vec, where each VariantBufferManager you build with it is only capable of appending new bytes to the end (leaving existing bytes unchanged). That way, one could build up an offset array along the way as part of building up arrow binary arrays for the metadata and value columns?

But I don't understand the whole "may be called several times" part, which seems to return the whole slice on every call? I guess that's because the variant builder doesn't do its work all at once, but still expects to receive the overall slice?

Finally, why split the ensure and borrow methods, when both take &mut self and it's almost certainly dangerous to "borrow" without an "ensure" first (can't guess the size in advance, it's variant data)? Why not just a pair of ensure_and_borrow_XXX_buffer methods that return a slice of exactly the requested size (no larger, no smaller)?

Contributor Author:

I think it's reasonable to return the whole slice, and have the Variant builder choose to write data wherever it chooses to. The trait was originally created for this PR where bytes are often being shifted to make way for headers etc. Let me know if you disagree or would recommend a different simple alternative.

As for why the methods are separate: I think you are generally correct and they could be combined. In the original PR, I called ensure_value_buffer_size only occasionally, because I was performing size checks in the variant library before calling the method; that check could be moved into the method.

Contributor Author:

I have consolidated the two methods.
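A hypothetical shape of the consolidated trait, with a Vec-backed implementation; the error type is simplified to String for self-containment (the real trait uses ArrowError), and all names here are stand-ins for the PR's:

```rust
// One ensure-and-borrow call per buffer: grow if needed, then hand back a
// slice of exactly the requested size.
trait VariantBufferManager {
    fn ensure_and_borrow_value_buffer(&mut self, size: usize) -> Result<&mut [u8], String>;
    fn ensure_and_borrow_metadata_buffer(&mut self, size: usize) -> Result<&mut [u8], String>;
}

struct VecBufferManager {
    value: Vec<u8>,
    metadata: Vec<u8>,
}

impl VariantBufferManager for VecBufferManager {
    fn ensure_and_borrow_value_buffer(&mut self, size: usize) -> Result<&mut [u8], String> {
        if self.value.len() < size {
            self.value.resize(size, 0);
        }
        Ok(&mut self.value[..size])
    }

    fn ensure_and_borrow_metadata_buffer(&mut self, size: usize) -> Result<&mut [u8], String> {
        if self.metadata.len() < size {
            self.metadata.resize(size, 0);
        }
        Ok(&mut self.metadata[..size])
    }
}

fn main() {
    let mut mgr = VecBufferManager { value: Vec::new(), metadata: Vec::new() };
    let buf = mgr.ensure_and_borrow_value_buffer(4).unwrap();
    buf.copy_from_slice(&[1, 2, 3, 4]);
    assert_eq!(mgr.value, vec![1, 2, 3, 4]);
}
```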

Contributor (@alamb) commented Jun 26, 2025:

Given that the VariantBuilders currently need to write into a memory buffer anyway, I wonder what this trait is trying to abstract?

Like maybe the json_to_variant should take a VariantBuilder to write to and not try to manage buffers itself 🤔

Contributor Author (@harshmotw-db) commented Jun 26, 2025:

Oh, my opinion is that the variant builders should write to buffers owned by the caller. So no more ValueBuffer and MetadataBuffer - just VariantBufferManager.
delta-io/delta-kernel-rs#1034
This PR just lays the groundwork for that.

Contributor:

I think having VariantBuilder write to buffers managed elsewhere is a good idea, and it is useful more generally than json_to_variant.

Perhaps we can make the json_to_variant API be in terms of the VariantBuilder and then modify the VariantBuilder in a separate PR to handle buffers owned by the caller.

Contributor:

I filed a ticket to track the idea of writing to buffers owned by the caller here

Contributor Author (@harshmotw-db) commented Jun 27, 2025:

Wouldn't it be better if the VariantBuilder dependency was abstracted away? So, right now the functionality looks like: (1) here's the JSON string, (2) Here are my buffers (variant_buffer_manager), write the variant to the buffers.

The alternative is: (1) Here's the JSON string, (2) Here are my buffers, (3) this is the VariantBuilder built using my buffers (which I don't know why you need) => write the variants to these buffers.

I think the current VariantBufferManager approach should be reasonably extensible to every use case, and it also supports resizing on the fly. The example solution described in this ticket would likely be retry-based, which would have higher performance overhead.

Contributor Author:

Actually, yeah, in the short run I think it is good not to have a dependency on VariantBufferManager, since there are unanswered questions. I will implement the VariantBuilder-based API today.

Contributor Author (@harshmotw-db) commented Jun 27, 2025:

@alamb I have removed the dependency of the current PR on custom buffer management and implemented what you suggested.

use arrow_schema::ArrowError;
use parquet_variant::{json_to_variant, VariantBufferManager};

pub struct SampleVariantBufferManager {
Contributor:

How is this different from the box-based one above, and why not just use the vec-based one above instead?

Contributor Author:

Good catch! I forgot to replace this

@alamb (Contributor) commented Jun 26, 2025

Here is some parallel art from @zeroshade in the go implementation:

Contributor (@alamb) left a comment:

Thanks again @harshmotw-db -- I took a quick look -- this is a great start. I'll look more carefully tomorrow

impl TryFrom<&Number> for Variant<'_, '_> {
type Error = ArrowError;

fn try_from(n: &Number) -> Result<Self, Self::Error> {
Contributor:

follow up sounds like a good plan to me

mod metadata;
mod object;
use rust_decimal::prelude::*;
use serde_json::Number;
Contributor:

I think we should eventually plan to avoid all use of serde_json for parsing / serializing.

It is fine to use serde_json for the initial version / getting things functionally working, but it is terribly inefficient compared to more optimized implementations like the tape decoder

Thus can we please try and keep anything serde_json specific out of Variant (like this From impl)?

Contributor Author:

Oh, I'll move much of this code from variant.rs to from_json.rs if that's what you meant.

Contributor Author:

Done

@harshmotw-db harshmotw-db requested review from alamb and scovich June 26, 2025 08:10
Contributor (@scovich) left a comment:

LGTM. A few small cleanups to consider before merge.

Comment on lines 52 to 53
let new_len = size.next_power_of_two();
self.value_buffer.resize(new_len, 0);
Contributor:

I'm not sure we need to do this -- the underlying vec is guaranteed to have reasonable amortized allocation costs:

Vec does not guarantee any particular growth strategy when reallocating when full, nor when reserve is called. ... Whatever strategy is used will of course guarantee O(1) amortized push.

Contributor Author:

I did this for demonstration purposes, so that custom non-Vec implementations allocate reasonably sized buffers and don't run into O(n^2) complexity. Maybe we should request power-of-two sizes from the library itself when we integrate this deeper into construction.
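The growth strategy under discussion can be sketched in a few lines: rounding requested sizes up to the next power of two keeps the number of reallocations logarithmic, so even a backing store without Vec's built-in amortized growth avoids O(n^2) total copying (function name is illustrative):

```rust
// Grow `buf` so it can hold at least `size` bytes, doubling capacity so
// repeated small growth requests trigger only O(log n) reallocations.
fn grow_to(buf: &mut Vec<u8>, size: usize) {
    if buf.len() < size {
        buf.resize(size.next_power_of_two(), 0);
    }
}

fn main() {
    let mut buf = Vec::new();
    grow_to(&mut buf, 5);
    assert_eq!(buf.len(), 8); // rounded up to a power of two
    grow_to(&mut buf, 8); // already large enough: no reallocation
    assert_eq!(buf.len(), 8);
    grow_to(&mut buf, 9);
    assert_eq!(buf.len(), 16);
}
```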

Comment on lines 108 to 115
fn build_list(arr: &[Value], builder: &mut ListBuilder) -> Result<(), ArrowError> {
for val in arr {
append_json(val, builder)?;
}
Ok(())
}

fn build_object<'a, 'b>(
Contributor:

These two functions only have one caller each now -- should we just fold their 3-4 lines of code into their call sites?

Contributor Author:

I have changed it to try_fold. Lmk if you meant something else.
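For illustration, the try_fold pattern mentioned here folds a fallible append over a slice and short-circuits on the first error; the types below are stand-ins, not the PR's builders:

```rust
// A fallible "append": rejects negatives, otherwise pushes the doubled value.
fn append(mut out: Vec<i64>, v: &i64) -> Result<Vec<i64>, String> {
    if *v < 0 {
        return Err(format!("negative value: {v}"));
    }
    out.push(*v * 2);
    Ok(out)
}

fn main() {
    // try_fold threads the accumulator through each call and stops early
    // on Err, exactly like the explicit loop with `?` that it replaced.
    let ok = [1i64, 2, 3].iter().try_fold(Vec::new(), append);
    assert_eq!(ok, Ok(vec![2, 4, 6]));

    let err = [1i64, -2, 3].iter().try_fold(Vec::new(), append);
    assert!(err.is_err());
}
```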

Contributor:

Oh sorry, I just meant to get rid of the functions and move their code directly where the function used to be called from.

Contributor Author:

Done

@harshmotw-db harshmotw-db requested a review from alamb July 2, 2025 21:30
@harshmotw-db (Contributor Author) commented Jul 2, 2025

@alamb I have now removed all dependence on the custom serde_json features and ignored all decimal tests. See if the PR can be merged now.

@scovich (Contributor) commented Jul 2, 2025

Just to clarify, the number parsing was never a hard problem... But I guess the tape decoder PR implements all of JSON parsing internally so that's great.

The tape decoder presents JSON numeric values as strings. But they still need to be parsed. Meanwhile, I had a little too much fun playing with parsing code (playground).

It's a lot more efficient than the "try it and see" approach my pathfinding PR took. Probably a little too fancy tho (code size could cause instruction cache and branch prediction problems that actually hurt performance).

Contributor (@scovich) left a comment:

LGTM, but please make the lifetimes more readable?

.map_err(|e| ArrowError::InvalidArgumentError(format!("UTF-8 conversion error: {e}")))
}

fn variant_decimal_to_json_value(decimal: &impl VariantDecimal) -> Result<Value, ArrowError> {
Contributor:

Seems like a macro could avoid both overhead and duplication? The only part that needs to change (besides the data types) is that Decimal16 has a more complicated way of handling the integer value produced at the end.

JsonToVariantTest {
json: "0.79228162514264337593543950335",
expected: Variant::Double(0.792_281_625_142_643_3_f64),
expected: Variant::Double(0.792_281_625_142_643_4_f64),
Contributor:

what made this one change in the latest commit, out of curiosity?

Contributor Author:

I didn't investigate too much, but I suppose the arbitrary_precision flag was causing slightly different results when parsing doubles. I didn't care because doubles lose precision anyway.
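The precision point is easy to check in isolation: the 29-digit decimal in the test is far beyond f64's roughly 15-17 significant decimal digits, so any parser must round it to the nearest representable double, and the two expected values in the diff differ only past that limit:

```rust
fn main() {
    // The literal from the test case, parsed with the standard library.
    let x: f64 = "0.79228162514264337593543950335".parse().unwrap();
    // The parsed value lands (up to the last ulp) on the double the updated
    // test expects; the pre-update expectation differed only in the 16th digit.
    assert!((x - 0.792_281_625_142_643_4_f64).abs() < 1e-15);
    println!("{x}");
}
```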

Comment on lines 132 to 134
struct ObjectFieldBuilder<'a, 'b, 'c> {
key: &'a str,
builder: &'b mut ObjectBuilder<'c, 'a>,
Contributor:

This is hard to interpret... can we use 'm and 'v?

Suggested change
struct ObjectFieldBuilder<'a, 'b, 'c> {
key: &'a str,
builder: &'b mut ObjectBuilder<'c, 'a>,
struct ObjectFieldBuilder<'m, 'v, 'r> {
key: &'r str,
builder: &'r mut ObjectBuilder<'m, 'v>,

(here, 'r is the lifetime of the references used to construct the field builder)

Contributor Author:

I've changed it to 's, 'o, and 'v, where 's is the lifetime of the string, 'o is the lifetime of the ObjectBuilder, and 'v is the lifetime of the variant buffers.
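A self-contained sketch of that naming scheme; the builder types here are stand-ins for the PR's, with only enough fields to make the lifetimes concrete:

```rust
// 'v: the variant value buffer; 'o: the borrow of the object builder;
// 's: the string key held by the field builder.
struct ObjectBuilder<'v> {
    value: &'v mut Vec<u8>,
}

struct ObjectFieldBuilder<'s, 'o, 'v> {
    key: &'s str,
    builder: &'o mut ObjectBuilder<'v>,
}

fn main() {
    let mut value = Vec::new();
    let mut ob = ObjectBuilder { value: &mut value };
    let fb = ObjectFieldBuilder { key: "a", builder: &mut ob };
    // The field builder writes through two levels of borrow.
    fb.builder.value.extend_from_slice(fb.key.as_bytes());
    assert_eq!(value, vec![b'a']);
}
```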

@alamb (Contributor) commented Jul 3, 2025

I merged up from main and fixed a clippy error to get a clean CI run

Given how long this PR has been outstanding and that it blocks a bunch of other work, I think we should try to merge it and file follow-ons ASAP. I am giving it another review now.

}

#[test]
fn test_json_to_variant_object_very_large() -> Result<(), ArrowError> {
Contributor:

this single test currently takes over 22 seconds to complete


I will see if I can make it faster

Contributor:

I wasn't able to make it faster with a small effort, but I did briefly look at some profiling and it is spending a very large amount of time validating the offsets in the variant

That is likely something we can improve on over time

Contributor (@alamb) left a comment:

I agree with @scovich this PR is ready to go -- thank you for the work in this @harshmotw-db

I think the follow on items are:

  1. Support decimals better (use the smallest variant type, like Int8 or Int64, where possible)
  2. Support round-tripping Variants --> JSON --> Variants

@alamb (Contributor) commented Jul 3, 2025

@harshmotw-db can you please file a follow on ticket that describes the remaining work to support decimal in your mind?

mod builder;
mod from_json;
mod to_json;
#[allow(dead_code)]
Contributor:

This allow was bothering me, so here is a PR to clean it up:

@alamb alamb merged commit 81ab147 into apache:main Jul 3, 2025
12 checks passed
@alamb (Contributor) commented Jul 3, 2025

Thanks again @harshmotw-db and @scovich

@alamb (Contributor) commented Jul 3, 2025

Here is a follow on PR to break out the json functions:

@harshmotw-db (Contributor Author) commented Jul 3, 2025

@alamb Do you think it is worth creating an arrow utility that abstracts away string to variant conversion? i.e. a function that takes in a StringArray (array of JSON strings some of which could be null), and returns a StructArray (array of corresponding variants where nulls are propagated directly)? If so, where should such functionality be in the codebase?

Adding this high-level functionality would also abstract away many of the changes we make to the lower-level library. If we change VariantBuilder to require buffers from the caller, we could simply modify this high-level function and users' workflows would stay the same.

Edit: I have prototyped this separately. I am just looking for a home for this function.

@alamb (Contributor) commented Jul 3, 2025

> @alamb Do you think it is worth creating an arrow utility that abstracts away string to variant conversion? i.e. a function that takes in a StringArray (array of JSON strings some of which could be null), and returns a StructArray (array of corresponding variants where nulls are propagated directly)? If so, where should such functionality be in the codebase?

Yes I think that is an important function

I suggest calling it a kernel and putting it in a parquet-variant-compute crate

Perhaps an API like this:

fn to_variant(array: &ArrayRef) -> StructArray {
...
}

> Also, adding this high-level functionality would also abstract away many of the changes we make to the lower-level library. If we make changes to VariantBuilder to require buffers from the caller, we could simply modify this high-level function and users' workflows would stay the same.

This is a great idea

@alamb (Contributor) commented Jul 3, 2025

I envision the reverse cast as well

fn variant_to_string<O: OffsetSize>(array: &ArrayRef) -> GenericStringArray<O> {
...
}

fn variant_to_string_view(array: &ArrayRef) -> StringViewArray {
...
}

@alamb (Contributor) commented Jul 3, 2025

(I recommend a new ticket to track this idea BTW)

@harshmotw-db (Contributor Author):
@alamb #7883
I am working on something to resolve this.

alamb added a commit that referenced this pull request Jul 8, 2025
# Which issue does this PR close?


- Closes #7800
- Closes #7845
- Closes #7775

# Rationale for this change

Now that we have the basic json conversion functionality complete thanks
to @harshmotw-db ❤️ ❤️ ❤️ in
- #7783

I would like to move all the json related code into its own crate to
keep the functionality clean and clear

# What changes are included in this PR?

1. Move to_json and from_json into a new `parquet-variant-json` crate
where we can continue to iterate
2. Added new crate to CI jobs


# Are these changes tested?

CI
# Are there any user-facing changes?


Labels: parquet (Changes to the parquet crate)

Successfully merging this pull request may close these issues: Variant: Read/Parse JSON value as Variant

3 participants