@friendlymatthew
Contributor

Which issue does this PR close?

Rationale for this change

This PR adds a feature to Variant::ObjectBuilder that enables constructing nested objects and objects with lists.

Are there any user-facing changes?

Adds two public methods to the ObjectBuilder API: new_list and new_object.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 25, 2025
@friendlymatthew
Contributor Author

This is the second half of the work required to fully support nested object and list building. I'd appreciate your thoughts!

After this, I'll follow up with PRs that address @scovich's comments in #7740.

cc @alamb @scovich

@friendlymatthew friendlymatthew changed the title [Variant] Support nested object and object with list building [Variant] Support nested objects and object with lists Jun 25, 2025
@alamb alamb changed the title [Variant] Support nested objects and object with lists [Variant] Support creating nested objects and object with lists Jun 25, 2025
Contributor

@alamb alamb left a comment

Thank you @friendlymatthew -- I went through this carefully and I think it looks great

As you might expect by now I had some suggestions, and I coded them up as another PR for your consideration:

inner_object_builder.finish();
}

outer_object_builder.finish();

Looking at these examples, it seems to me that the current API requires calling finish. It might be a nicer API if that just happened on drop(), so you could use the scope of the example, like you have here, and dropping the builder would complete the object.

This would be a great follow-on PR.
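
A minimal sketch of the finish-on-drop idea, for illustration only. `ToyParent` and `ToyNested` are hypothetical toy types, not the parquet-variant builders, and the string "encoding" is made up; the point is only that a `Drop` implementation can complete a nested object when its scope ends.

```rust
/// Toy parent that collects finished nested objects as strings.
/// (Hypothetical type for illustration; not the parquet-variant API.)
#[derive(Default)]
struct ToyParent {
    finished: Vec<String>,
}

/// Toy nested builder that mutably borrows its parent while it is alive.
struct ToyNested<'a> {
    parent: &'a mut ToyParent,
    fields: Vec<(String, i64)>,
}

impl ToyParent {
    fn new_object(&mut self) -> ToyNested<'_> {
        ToyNested { parent: self, fields: Vec::new() }
    }
}

impl<'a> ToyNested<'a> {
    fn insert(&mut self, key: &str, value: i64) {
        self.fields.push((key.to_string(), value));
    }
}

impl<'a> Drop for ToyNested<'a> {
    // Finalize on drop: serialize the fields and hand them to the parent.
    fn drop(&mut self) {
        let body: Vec<String> = self
            .fields
            .iter()
            .map(|(k, v)| format!("{}={}", k, v))
            .collect();
        self.parent.finished.push(format!("{{{}}}", body.join(",")));
    }
}

fn build_with_scopes() -> Vec<String> {
    let mut parent = ToyParent::default();
    {
        let mut inner = parent.new_object();
        inner.insert("a", 1);
        inner.insert("b", 2);
    } // `inner` is dropped here, which completes the object
    parent.finished
}
```

One trade-off of this design: `Drop::drop` cannot return a value, so a builder whose completion step can fail would still need an explicit, fallible finish method alongside it.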

Contributor

@alamb alamb left a comment

Given that this PR basically follows the existing pattern, I am going to merge it in to keep the code flowing.

@PinkCrow007 @scovich @mkarbo @carpecodeum or @Weijun-H -- please feel free to review this PR and comment (or make a PR) on anything else you find or see that could be improved

@alamb alamb merged commit 7d3a25a into apache:main Jun 25, 2025
12 checks passed
@harshmotw-db
Contributor

@friendlymatthew Why are objects and lists treated as "pending" and not eagerly added? I am adding a function to parse JSON into Variant, and because of this, I am seeing a behavioral difference between Spark's parse_json expression and the Rust implementation.

@friendlymatthew
Contributor Author

> @friendlymatthew Why are objects and lists treated as "pending" and not eagerly added? I am adding a function to parse JSON into Variant, and because of this, I am seeing a behavioral difference between Spark's parse_json expression and the Rust implementation.

Hi, by eagerly added do you mean to the parent builder? We can't add an object or list to the parent builder's buffer until the object/list itself has been fully created, because each nested object/list has its own header and we can't know what that header will be until then.

What sort of behavioral differences are you seeing? If you call .finish() on the nested builder before operating on the parent builder again, the object should be exactly the same as if it had been added eagerly.
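
To make the header constraint concrete, here is a small sketch under a made-up encoding. It assumes a list is written as `[count: u8][one byte per element]`, which is not the Variant binary format, and `ToyListBuilder` is a hypothetical type. Because the count comes first, the nested builder has to buffer its elements and can only write into the parent once it is finished.

```rust
/// Toy encoding (an assumption for illustration, not the Variant spec):
/// a list is `[count: u8][elements...]`, where each element is one byte.
struct ToyListBuilder {
    body: Vec<u8>,
}

impl ToyListBuilder {
    fn new() -> Self {
        Self { body: Vec::new() }
    }

    /// Elements go into the builder's own buffer, not the parent's.
    fn append(&mut self, value: u8) {
        self.body.push(value);
    }

    /// Only now is the element count known, so only now can the header
    /// be written, followed by the buffered elements.
    fn finish(self, parent: &mut Vec<u8>) {
        parent.push(self.body.len() as u8); // header: element count
        parent.extend_from_slice(&self.body);
    }
}

fn encode_two_lists() -> Vec<u8> {
    let mut parent = Vec::new();

    let mut first = ToyListBuilder::new();
    first.append(10);
    first.append(20);
    first.finish(&mut parent);

    let mut second = ToyListBuilder::new();
    second.append(30);
    second.finish(&mut parent);

    parent
}
```

In this toy, `encode_two_lists()` produces the bytes `[2, 10, 20, 1, 30]`: each header byte precedes its elements even though it was computed last.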

@harshmotw-db
Contributor

harshmotw-db commented Jun 25, 2025

The difference I am seeing is that when creating a Variant out of the JSON string {"numbers": [4, -3e0, 1.001], "null": null, "booleans": [true, false]}, I am seeing "null" get an ID of 0 and "numbers" get an ID of 1. In Spark, I would expect "numbers" to get an ID of zero since it is the first key.

Adding self.check_pending_field() to the beginning of insert(...) fixes this. Is the current behavior intentional?

This is not a logical issue by any means though.
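
One way the reported ordering could arise, sketched with hypothetical toy types. This is a guess at the mechanism based on the description above, not the crate's code: `ToyDict`, `ToyObjectBuilder`, `flush_pending`, and the `flush_on_insert` switch are all invented for illustration. The toy registers a nested field's key lazily, so a plain `insert` that does not first flush the pending key gets the lower ID.

```rust
/// Toy dictionary: a key's ID is its position in first-registration order.
#[derive(Default)]
struct ToyDict {
    keys: Vec<String>,
}

impl ToyDict {
    fn register(&mut self, key: &str) -> usize {
        if let Some(id) = self.keys.iter().position(|k| k == key) {
            return id;
        }
        self.keys.push(key.to_string());
        self.keys.len() - 1
    }
}

/// Toy object builder with at most one pending nested field.
#[derive(Default)]
struct ToyObjectBuilder {
    dict: ToyDict,
    pending: Option<String>,
    flush_on_insert: bool, // hypothetical switch modelling the proposed fix
}

impl ToyObjectBuilder {
    fn flush_pending(&mut self) {
        if let Some(key) = self.pending.take() {
            self.dict.register(&key);
        }
    }

    /// Start a nested list: its key stays pending until the next flush.
    fn new_list(&mut self, key: &str) {
        self.flush_pending();
        self.pending = Some(key.to_string());
    }

    /// Insert a primitive field; optionally flush the pending key first.
    fn insert(&mut self, key: &str) {
        if self.flush_on_insert {
            self.flush_pending();
        }
        self.dict.register(key);
    }

    fn finish(mut self) -> Vec<String> {
        self.flush_pending();
        self.dict.keys
    }
}

/// Replays the key sequence from the JSON example and returns keys in ID order.
fn key_order(flush_on_insert: bool) -> Vec<String> {
    let mut builder = ToyObjectBuilder { flush_on_insert, ..Default::default() };
    builder.new_list("numbers");
    builder.insert("null");
    builder.new_list("booleans");
    builder.finish()
}
```

In this toy, `key_order(false)` yields `null, numbers, booleans`, matching the reported IDs, while `key_order(true)` yields `numbers, null, booleans`, the order expected from Spark.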

@friendlymatthew
Contributor Author

> The difference I am seeing is that when creating a Variant out of the JSON string {"numbers": [4, -3e0, 1.001], "null": null, "booleans": [true, false]}, I am seeing "null" get an ID of 0 and "numbers" get an ID of 1. In Spark, I would expect "numbers" to get an ID of zero since it is the first key.
>
> Adding self.check_pending_field() to the beginning of insert(...) fixes this. Is the current behavior intentional?
>
> This is not a logical issue by any means though.

Good catch! I just put up a PR that should fix the issue above. #7786

alamb pushed a commit that referenced this pull request Jun 26, 2025
# Rationale for this change

A follow-up from #7778: we should make sure to check for pending fields before calling `ObjectBuilder::insert`.


Development

Successfully merging this pull request may close these issues.

[Variant] Support Nested Data in VariantBuilder

3 participants