Skip to content
Merged
Show file tree
Hide file tree
Changes from 41 commits
Commits
Show all changes
44 commits
Select commit Hold shift + click to select a range
81e3847
feat: Initial READ Api for Variant in parquet-variant crate
mkarbo May 21, 2025
9075dd4
chore: API refinements, questions for discussions
mkarbo May 21, 2025
1ebbad7
chore: comments & cleanup
mkarbo May 21, 2025
b3ecdd9
chore: typo
mkarbo May 21, 2025
9688fa4
feat: buffer access validations
mkarbo May 21, 2025
5705687
chore: remove useless API
mkarbo May 21, 2025
d80d1d8
Test out avoiding strum and Cow
alamb May 21, 2025
9aa5d3b
Merge pull request #1 from alamb/alamb/no_strum
mkarbo May 22, 2025
696fa9b
chore: comments
mkarbo May 22, 2025
798b627
chore: rename total to dict_len for consistency
mkarbo May 22, 2025
8bea9b4
chore: remove unnecessary masks
mkarbo May 22, 2025
2e262f1
chore: Start work on PR comments, validating header, refactor
mkarbo May 22, 2025
715081b
chore: minor cleanups and utils
mkarbo May 22, 2025
b774b38
chore: remove redundant comments
mkarbo May 22, 2025
5c668b4
chore: fix int_8 to use array_from_slice
mkarbo May 22, 2025
0ceaedc
chore: revert slice_from_slice due to saturation issues at max usize
mkarbo May 22, 2025
130e113
chore: comment on memoize with TODO
mkarbo May 22, 2025
45a71c6
chore: clean-up and fix some things in the variant API
mkarbo May 22, 2025
032a9d6
chore: Iterators, and fix get_by methods and some cleanups
mkarbo May 22, 2025
fe2bf45
chore: 3rd branch of unpack, docs + some tests
mkarbo May 22, 2025
10fe6bf
chore: cleanup old comments
mkarbo May 22, 2025
992a35d
chore: use unreachable!() and add comment on why it's sound
mkarbo May 22, 2025
c2c6bfd
chore: remove note per PR feedback
mkarbo May 22, 2025
695f49d
feat: impl TryFrom for VariantPrimitiveType
mkarbo May 22, 2025
fd055ef
chore: cleanup iterators
mkarbo May 23, 2025
be0a001
Update parquet-variant/src/variant.rs
mkarbo May 23, 2025
ed77dbf
chore: cleanup lifetimes
mkarbo May 23, 2025
6549631
Update parquet-variant/src/variant.rs
mkarbo May 23, 2025
dafc028
Apply suggestions from code review
mkarbo May 23, 2025
88137b0
chore(refactor): non_empty_byte util -> first_byte_from_slice
mkarbo May 23, 2025
5b4115b
chore(validation): validate version in try_new
mkarbo May 23, 2025
362341e
chore(validation): Add final checks to try_new for VariantMetadata
mkarbo May 23, 2025
6946299
feat(test): tests for try_new
mkarbo May 23, 2025
d3426ab
chore: memoize dictionary byte offset start
mkarbo May 23, 2025
c507efb
chore(refactor): simpliy str from slice to util (feedback)
mkarbo May 23, 2025
7d2b06d
chore(docstring): forgot to update docstring
mkarbo May 23, 2025
a9354c4
chore: minor comment + clearer offset api
mkarbo May 23, 2025
a68828b
chore: test fields iterator
mkarbo May 23, 2025
f4b8f58
feat: initial get method for array (will refactor)
mkarbo May 23, 2025
1358eed
Merge remote-tracking branch 'apache/main' into mkar/7423-read-varian…
alamb May 23, 2025
1273a6d
Add header to parquet files
alamb May 23, 2025
3fad3eb
chore: various PR comments
mkarbo May 30, 2025
e203359
chore: PR feedback and better monotone check based on scovich
mkarbo May 30, 2025
026774d
chore: metadata return struct not bytes
mkarbo May 30, 2025
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions parquet-variant/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@ edition = { workspace = true }
rust-version = { workspace = true }

[dependencies]
arrow-schema = "55.1.0"

[lib]

Expand Down
158 changes: 158 additions & 0 deletions parquet-variant/src/decoder.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,158 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
use arrow_schema::ArrowError;
use std::array::TryFromSliceError;

use crate::utils::{array_from_slice, first_byte_from_slice, slice_from_slice, string_from_slice};

#[derive(Debug, Clone, Copy)]
pub enum VariantBasicType {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem pretty basic, should they be in the top-level lib.rs instead of here in decoder.rs?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, let's move them in a follow on PR

Primitive = 0,
ShortString = 1,
Object = 2,
Array = 3,
}

#[derive(Debug, Clone, Copy)]
pub enum VariantPrimitiveType {
Null = 0,
BooleanTrue = 1,
BooleanFalse = 2,
Int8 = 3,
// TODO: Add types for the rest of primitives, once API is agreed upon
String = 16,
}

/// Extracts the basic type from a header byte
pub(crate) fn get_basic_type(header: u8) -> Result<VariantBasicType, ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Do we ever expect to return Err? Seems infallible?

Also, out of curiosity, why is one a function and the other an impl [Try]From with a function as well?

// See https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#value-encoding
let basic_type = header & 0x03; // Basic type is encoded in the first 2 bits
let basic_type = match basic_type {
0 => VariantBasicType::Primitive,
1 => VariantBasicType::ShortString,
2 => VariantBasicType::Object,
3 => VariantBasicType::Array,
_ => {
//NOTE: A 2-bit value has a max of 4 different values (0-3), hence this is unreachable as we
// masked `basic_type` with 0x03 above.
unreachable!();
}
};
Ok(basic_type)
}

impl TryFrom<u8> for VariantPrimitiveType {
type Error = ArrowError;

fn try_from(value: u8) -> Result<Self, Self::Error> {
match value {
0 => Ok(VariantPrimitiveType::Null),
1 => Ok(VariantPrimitiveType::BooleanTrue),
2 => Ok(VariantPrimitiveType::BooleanFalse),
3 => Ok(VariantPrimitiveType::Int8),
// TODO: Add types for the rest, once API is agreed upon
16 => Ok(VariantPrimitiveType::String),
_ => Err(ArrowError::InvalidArgumentError(format!(
"unknown primitive type: {}",
value
))),
}
}
}
/// Extract the primitive type from a Variant value-header byte
pub(crate) fn get_primitive_type(header: u8) -> Result<VariantPrimitiveType, ArrowError> {
// last 6 bits contain the primitive-type, see spec
VariantPrimitiveType::try_from(header >> 2)
}

/// To be used in `map_err` when unpacking an integer from a slice of bytes.
fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError {
ArrowError::InvalidArgumentError(e.to_string())
}
Comment on lines +82 to +85
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not impl From so ? Just Works? Is it because trait definitions are global and we don't want this to implicitly become part of the public API?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better actually to improve this conversion a bit to explain the error comes from trying to parse variants. If we just add a blanket From impl the code will just work, but users will see messages like Invalid slice which will be pretty unspecific


/// Decodes an Int8 from the value section of a variant.
pub(crate) fn decode_int8(value: &[u8]) -> Result<i8, ArrowError> {
let value = i8::from_le_bytes(array_from_slice(value, 1)?);
Ok(value)
}

/// Decodes a long string from the value section of a variant.
pub(crate) fn decode_long_string(value: &[u8]) -> Result<&str, ArrowError> {
let len = u32::from_le_bytes(array_from_slice(value, 1)?) as usize;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
let len = u32::from_le_bytes(array_from_slice(value, 1)?) as usize;
let len = OffsetSizeBytes::Four.unpack_usize(value, 1, 0)?;

let string = string_from_slice(value, 5..5 + len)?;
Ok(string)
}

/// Decodes a short string from the value section of a variant.
pub(crate) fn decode_short_string(value: &[u8]) -> Result<&str, ArrowError> {
let len = (first_byte_from_slice(value)? >> 2) as usize;

let string = string_from_slice(value, 1..1 + len)?;
Comment on lines +102 to +104
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
let len = (first_byte_from_slice(value)? >> 2) as usize;
let string = string_from_slice(value, 1..1 + len)?;
let len = (first_byte_from_slice(value)? >> 2) as usize;
let string = string_from_slice(value, 1..1 + len)?;

Ok(string)
}

#[cfg(test)]
mod tests {
use super::*;

#[test]
fn test_i8() -> Result<(), ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To really write tests like this it will be nice / easier if we have the variant builder API that @PinkCrow007 is working on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let me write these out a little more. I think having some hand-crafted tests (e.g., also for object) where we see how the bytes look will be helpful for readers as documentation, not just in validating. We can add test variants using both the builder & manual process.

I personally learnt a lot by writing out the details with pen and paper for the metadata and value bytes for various variant data such as 42, <short string>, <longer string> {'a': 1, 'b': 2, 'c': <uuid>} and {'a':1, 'b': 'hello 💯', 'c': {'x': 99, 'y': true, 'z': <uuid>}} and I think that could help future maintainers get onboarded (would update the tests with more in-depth docs)

However, we can also skip them if you guys prefer, lmk.

let value = [
0 | 3 << 2, // Primitive type for i8
42,
];
let result = decode_int8(&value)?;
assert_eq!(result, 42);
Ok(())
}

#[test]
fn test_short_string() -> Result<(), ArrowError> {
let value = [
1 | 5 << 2, // Basic type for short string | length of short string
'H' as u8,
'e' as u8,
'l' as u8,
'l' as u8,
'o' as u8,
'o' as u8,
];
let result = decode_short_string(&value)?;
assert_eq!(result, "Hello");
Ok(())
}

#[test]
fn test_string() -> Result<(), ArrowError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I don't think these unit tests are necessary given we have code that is reading the vaues from parquet-testing

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let me write these out a little more. I think having some hand-crafted tests (e.g., also for object) where we see how the bytes look will be helpful for readers as documentation, not just in validating. We can add test variants using both the builder & manual process.

I personally learnt a lot by writing out the details with pen and paper for the metadata and value bytes for various variant data such as 42, <short string>, <longer string> {'a': 1, 'b': 2, 'c': <uuid>} and {'a':1, 'b': 'hello 💯', 'c': {'x': 99, 'y': true, 'z': <uuid>}} and I think that could help future maintainers get onboarded (would update the tests with more in-depth docs)

However, we can also skip them if you guys prefer, lmk.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think they are ok, maybe we just shouldn't go too crazy.

The other thing we could do is to make them examples in the documentation -- which would help both users and maintainers.

let value = [
0 | 16 << 2, // Basic type for short string | length of short string
5,
0,
0,
0, // Length of string
'H' as u8,
'e' as u8,
'l' as u8,
'l' as u8,
'o' as u8,
'o' as u8,
];
let result = decode_long_string(&value)?;
assert_eq!(result, "Hello");
Ok(())
}
}
13 changes: 13 additions & 0 deletions parquet-variant/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -26,3 +26,16 @@
//! If you are interested in helping, you can find more information on the GitHub [Variant issue]
//!
//! [Variant issue]: https://github.com/apache/arrow-rs/issues/6736

// TODO: dead code removal
#[allow(dead_code)]
mod decoder;
// TODO: dead code removal
#[allow(dead_code)]
mod variant;
// TODO: dead code removal
#[allow(dead_code)]
mod utils;

#[cfg(test)]
mod test_variant;
100 changes: 100 additions & 0 deletions parquet-variant/src/test_variant.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.

//! End-to-end check: (almost) every sample from apache/parquet-testing/variant
//! can be parsed into our `Variant`.

// NOTE: We keep this file separate rather than a test mod inside variant.rs because it should be
// moved to the test folder later
use std::fs;
use std::path::{Path, PathBuf};

use crate::variant::{Variant, VariantMetadata};
use arrow_schema::ArrowError;

fn cases_dir() -> PathBuf {
Path::new(env!("CARGO_MANIFEST_DIR"))
.join("..")
.join("parquet-testing")
.join("variant")
}

fn load_case(name: &str) -> Result<(Vec<u8>, Vec<u8>), ArrowError> {
let root = cases_dir();
let meta = fs::read(root.join(format!("{name}.metadata")))?;
let val = fs::read(root.join(format!("{name}.value")))?;
Ok((meta, val))
}

fn get_primitive_cases() -> Vec<(&'static str, Variant<'static, 'static>)> {
vec![
("primitive_boolean_false", Variant::BooleanFalse),
("primitive_boolean_true", Variant::BooleanTrue),
("primitive_int8", Variant::Int8(42)),
// Using the From<String> trait
("primitive_string", Variant::from("This string is longer than 64 bytes and therefore does not fit in a short_string and it also includes several non ascii characters such as 🐢, 💖, ♥\u{fe0f}, 🎣 and 🤦!!")),
// Using the From<String> trait
("short_string", Variant::from("Less than 64 bytes (❤\u{fe0f} with utf8)")),
// TODO Reenable when https://github.com/apache/parquet-testing/issues/81 is fixed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// ("primitive_null", Variant::Null),
]
}

fn get_non_primitive_cases() -> Vec<&'static str> {
vec!["object_primitive", "array_primitive"]
}

#[test]
fn variant_primitive() -> Result<(), ArrowError> {
let cases = get_primitive_cases();
for (case, want) in cases {
let (metadata_bytes, value) = load_case(case)?;
let metadata = VariantMetadata::try_new(&metadata_bytes)?;
let got = Variant::try_new(&metadata, &value)?;
assert_eq!(got, want);
}
Ok(())
}

#[test]
fn variant_non_primitive() -> Result<(), ArrowError> {
let cases = get_non_primitive_cases();
for case in cases {
let (metadata, value) = load_case(case)?;
let metadata = VariantMetadata::try_new(&metadata)?;
let variant = Variant::try_new(&metadata, &value)?;
match case {
"object_primitive" => {
assert!(matches!(variant, Variant::Object(_)));
assert_eq!(metadata.dictionary_size(), 7);
let dict_val = metadata.get_field_by(0)?;
assert_eq!(dict_val, "int_field");
}
"array_primitive" => match variant {
Variant::Array(arr) => {
let v = arr.get(0)?;
assert!(matches!(v, Variant::Int8(2)));
let v = arr.get(1)?;
assert!(matches!(v, Variant::Int8(1)));
}
_ => panic!("expected an array"),
},
_ => unreachable!(),
}
}
Ok(())
}
55 changes: 55 additions & 0 deletions parquet-variant/src/utils.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,55 @@
// Licensed to the Apache Software Foundation (ASF) under one
// or more contributor license agreements. See the NOTICE file
// distributed with this work for additional information
// regarding copyright ownership. The ASF licenses this file
// to you under the Apache License, Version 2.0 (the
// "License"); you may not use this file except in compliance
// with the License. You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing,
// software distributed under the License is distributed on an
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, either express or implied. See the License for the
// specific language governing permissions and limitations
// under the License.
use std::{array::TryFromSliceError, ops::Range, str};

use arrow_schema::ArrowError;

#[inline]
pub(crate) fn slice_from_slice(bytes: &[u8], range: Range<usize>) -> Result<&[u8], ArrowError> {
bytes.get(range.clone()).ok_or_else(|| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: Strange that Range lacks an impl<Idx: Copy> Copy for Range<Idx>

ArrowError::InvalidArgumentError(format!(
"Tried to extract {} bytes at offset {} from {}-byte buffer",
range.end - range.start,
range.start,
bytes.len(),
))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The start and end usually come from variant data. If the caller didn't validate them, we could have received a start larger than end which would cause the slice::get call to return None and subsequently cause a panic here due to integer underflow. We should probably do range.end.checked_sub(range.start) here, and return a different error in case of invalid range?

Suggested change
bytes.get(range.clone()).ok_or_else(|| {
ArrowError::InvalidArgumentError(format!(
"Tried to extract {} bytes at offset {} from {}-byte buffer",
range.end - range.start,
range.start,
bytes.len(),
))
bytes.get(range.clone()).ok_or_else(|| match range.end.checked_sub(range.start) {
Some(len) => ArrowError::InvalidArgumentError(format!(
"Tried to extract {len} bytes at offset {} from {}-byte buffer",
range.start,
bytes.len(),
)),
None => ArrowError::InvalidArgumentError(format!(
"Invalid range: {}..{}", range.start, range.end,
)),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: If we go for the SliceIndex approach that just prints {index:?} in the error message, we no longer need to worry about overflow.

})
}
pub(crate) fn array_from_slice<const N: usize>(
bytes: &[u8],
offset: usize,
) -> Result<[u8; N], ArrowError> {
let bytes = slice_from_slice(bytes, offset..offset + N)?;
bytes.try_into().map_err(map_try_from_slice_error)
}

/// To be used in `map_err` when unpacking an integer from a slice of bytes.
pub(crate) fn map_try_from_slice_error(e: TryFromSliceError) -> ArrowError {
ArrowError::InvalidArgumentError(e.to_string())
}

pub(crate) fn first_byte_from_slice(slice: &[u8]) -> Result<&u8, ArrowError> {
slice
.get(0)
.ok_or_else(|| ArrowError::InvalidArgumentError("Received empty bytes".to_string()))
}

/// Helper to get a &str from a slice based on range, if it's valid or an error otherwise
pub(crate) fn string_from_slice(slice: &[u8], range: Range<usize>) -> Result<&str, ArrowError> {
str::from_utf8(slice_from_slice(slice, range)?)
.map_err(|_| ArrowError::InvalidArgumentError("invalid UTF-8 string".to_string()))
}
Loading
Loading