Merged
Changes from 76 commits
Commits
84 commits
95f3b03
wip: errors, but compiles
balbok0 Dec 19, 2024
48ae672
works!
balbok0 Dec 20, 2024
9c4ec12
fix linting issues in dtype
balbok0 Dec 20, 2024
88e789e
Support multi-dimensional arrays
balbok0 Dec 25, 2024
e73c97c
remove commented out code
balbok0 Dec 25, 2024
432b563
lint python
balbok0 Dec 25, 2024
12ddfb9
clippy
balbok0 Dec 25, 2024
acff78e
rust lint
balbok0 Dec 25, 2024
e9d7fe6
ruff again
balbok0 Dec 25, 2024
820e44b
more clippy/ruff
balbok0 Dec 25, 2024
ead6777
Merge remote-tracking branch 'balbok0/add-binary-as-numerical-array' …
pythonspeed May 20, 2025
21365e2
Reformat.
pythonspeed May 20, 2025
f6babe7
Correct to match actual behavior.
pythonspeed May 20, 2025
643bc8d
If size is wrong, return null.
pythonspeed May 20, 2025
efdc01f
Bit packed, so not clear what it means.
pythonspeed May 20, 2025
3b44de8
Multidimensional arrays now preserve validity.
pythonspeed May 21, 2025
ca00660
More accurate names
pythonspeed May 21, 2025
a43dca4
More tests.
pythonspeed May 21, 2025
d8a5cc2
Get rid of todo!.
pythonspeed May 21, 2025
1d56bd3
Start with capacity.
pythonspeed May 21, 2025
4a9bf98
Delete confusing comment.
pythonspeed May 21, 2025
13ee4a4
Better handling and testing for things we don't support.
pythonspeed May 21, 2025
d5037f1
Lint fix.
pythonspeed May 21, 2025
3301451
Fix typo
pythonspeed May 21, 2025
81628bc
Merge remote-tracking branch 'origin/main' into add-binary-as-numeric…
pythonspeed May 21, 2025
9a94d4f
Conditional compilation.
pythonspeed May 21, 2025
438169a
Simplify
pythonspeed May 21, 2025
d142b65
Conditional compilation.
pythonspeed May 21, 2025
0c08b17
Merge remote-tracking branch 'origin/main' into add-binary-as-numeric…
pythonspeed May 23, 2025
03fe56d
Update py-polars/polars/series/binary.py
itamarst Jun 4, 2025
60e9b9b
Only support one-dimensional Array.
pythonspeed Jun 4, 2025
ce5a9b3
Some cleanups.
pythonspeed Jun 4, 2025
2a7a588
More testing
pythonspeed Jun 4, 2025
99839b9
Mention destination dtype
pythonspeed Jun 4, 2025
18e456f
Optimization: don't need allocations
pythonspeed Jun 4, 2025
00d7cf2
Fix docstring
pythonspeed Jun 4, 2025
e267b72
Pacify mypy
pythonspeed Jun 4, 2025
7409bd4
Get rid of byte_size().
pythonspeed Jun 5, 2025
78216c7
Bit more testing.
pythonspeed Jun 5, 2025
dc68c2d
remove get_shape() allocation
nameexhaustion Jun 6, 2025
616eb3f
Merge remote-tracking branch 'origin/main' into add-binary-as-numeric…
pythonspeed Jun 25, 2025
05b3817
Add type checking on conversion to IR.
pythonspeed Jun 30, 2025
9aae0d0
Minimize the error checking, since user facing was done already
pythonspeed Jun 30, 2025
4cbfbf6
Nit from review comment
pythonspeed Jun 30, 2025
b3c7cd8
Docs
pythonspeed Jun 30, 2025
741cffa
Document the issues with a potential fast path.
pythonspeed Jun 30, 2025
a108baf
Better name.
pythonspeed Jun 30, 2025
2067a17
Support reinterpreting to non-primitive types that map to primitive n…
pythonspeed Jun 30, 2025
842802f
Merge remote-tracking branch 'origin/main' into add-binary-as-numeric…
pythonspeed Jul 2, 2025
f9da0e8
Rename to match new name.
pythonspeed Jul 2, 2025
e2278b2
Pacify mypy
pythonspeed Jul 3, 2025
d860f1f
Simplify, and improve the error message
pythonspeed Jul 3, 2025
66d0f31
Reformat
pythonspeed Jul 3, 2025
865b2b5
Document the change.
pythonspeed Jul 3, 2025
138c49c
Correct verb
pythonspeed Jul 3, 2025
006b79f
Document in both locations and try to pacify ruff.
pythonspeed Jul 3, 2025
74fe457
Merge remote-tracking branch 'origin/main' into add-binary-as-numeric…
pythonspeed Jul 3, 2025
a77a001
Fix cargo fmt
pythonspeed Jul 3, 2025
502cc66
Merge remote-tracking branch 'origin/main' into add-binary-as-numeric…
pythonspeed Jul 7, 2025
5c9cc5f
Change the argument so there is less duplication.
pythonspeed Jul 7, 2025
d983de2
Switch to a Vec-based implementation.
pythonspeed Jul 7, 2025
760864c
Some optimizations, hopefully.
pythonspeed Jul 7, 2025
9d34386
Tweak from clippy
pythonspeed Jul 7, 2025
e7c8d80
Merge remote-tracking branch 'origin/main' into add-binary-as-numeric…
pythonspeed Jul 8, 2025
cabd0f2
Merge remote-tracking branch 'origin/main' into add-binary-as-numeric…
pythonspeed Jul 8, 2025
b2b51fb
Get rid of function pointers (even if the compiler optimizes them away).
pythonspeed Jul 8, 2025
c64280d
Optimized fast path for little endian data on little endian CPUs.
pythonspeed Jul 8, 2025
989c351
Another optimization.
pythonspeed Jul 9, 2025
766872a
Less data type conversion.
pythonspeed Jul 9, 2025
1815625
Rename the variable.
pythonspeed Jul 10, 2025
88ff748
More informative error.
pythonspeed Jul 10, 2025
0d37de8
Match type-checking logic in the underlying operation, more accurate …
pythonspeed Jul 10, 2025
0994bd5
Reformat.
pythonspeed Jul 10, 2025
586817c
Fix typo.
pythonspeed Jul 10, 2025
3f3ad44
Ruff doesn't link perfectly reasonable unicode characters.
pythonspeed Jul 10, 2025
eaba025
Make sure the sizes aren't too big.
pythonspeed Jul 11, 2025
1b32a87
Don't limit to IdxSize.
pythonspeed Jul 14, 2025
c6572fc
Better name.
pythonspeed Jul 14, 2025
1db40b0
More safety checks, remove possibility of uninitialized memory.
pythonspeed Jul 14, 2025
482858a
Handle edge case where the length of binary data is zero.
pythonspeed Jul 14, 2025
596209e
Another test.
pythonspeed Jul 14, 2025
244aae3
Pacify clippy.
pythonspeed Jul 14, 2025
7781235
Slightly clearer assert.
pythonspeed Jul 14, 2025
8377805
Minor cleanups.
pythonspeed Jul 15, 2025
144 changes: 142 additions & 2 deletions crates/polars-compute/src/cast/binview_to.rs
@@ -1,12 +1,16 @@
use std::ptr::copy_nonoverlapping;

use arrow::array::*;
use arrow::bitmap::MutableBitmap;
#[cfg(feature = "dtype-decimal")]
use arrow::compute::decimal::deserialize_decimal;
use arrow::datatypes::{ArrowDataType, TimeUnit};
use arrow::datatypes::{ArrowDataType, Field, TimeUnit};
use arrow::offset::Offset;
use arrow::types::NativeType;
use chrono::Datelike;
use num_traits::FromBytes;
use polars_error::PolarsResult;
use polars_error::{PolarsResult, polars_err};
use polars_utils::IdxSize;

use super::CastOptionsImpl;
use super::binary_to::Parse;
@@ -191,3 +195,139 @@ where
is_little_endian,
)))
}

/// Casts a [`BinaryViewArray`] to a [`FixedSizeListArray`], making any un-castable value a Null.
///
/// # Arguments
///
/// * `from`: The array to reinterpret.
/// * `array_width`: The number of items in each `Array`.
pub(super) fn try_binview_to_fixed_size_list<T, const IS_LITTLE_ENDIAN: bool>(
    from: &BinaryViewArray,
    array_width: usize,
) -> PolarsResult<FixedSizeListArray>
where
    T: FromBytes + NativeType,
    for<'a> &'a <T as FromBytes>::Bytes: TryFrom<&'a [u8]>,
{
    let element_size = std::mem::size_of::<T>();
    // The maximum number of primitives in the result:
    let primitive_length = (from.len() as IdxSize)
        .checked_mul(array_width as IdxSize)
        .ok_or_else(|| {
            polars_err!(
                InvalidOperation:
                "array chunk length * number of items ({} * {}) is too large",
                from.len(),
                array_width
            )
        })? as usize;
    // The size of each array, in bytes:
    let array_bytes_size = (element_size as IdxSize)
        .checked_mul(array_width as IdxSize)
        .ok_or_else(|| {
            polars_err!(
                InvalidOperation:
                "array size in bytes ({} * {}) is too large",
                element_size,
                array_width
            )
        })? as usize;
    let mut out: Vec<T> = Vec::with_capacity(primitive_length);
    let mut validity = MutableBitmap::from_len_set(from.len());

    for (index, value) in from.iter().enumerate() {
        if let Some(value) = value
            && value.len() == array_bytes_size
        {
            if cfg!(target_endian = "little") && IS_LITTLE_ENDIAN {
                // Fast path, we can just copy the data with no need to
                // reinterpret.
                let write_index = array_width * index;
                debug_assert!(write_index < primitive_length);
                debug_assert!((write_index + (array_width - 1)) < primitive_length);
                // # Safety
                // - The target index is smaller than the vector's pre-allocated
                //   capacity.
                // - We made sure `value` has byte length
                //   `array_width * element_size`.
                unsafe {
                    copy_nonoverlapping(
                        value.as_ptr(),
                        out.as_mut_ptr().add(write_index) as *mut u8,
                        value.len(),
                    );
                }
            } else {
                // Slow path, reinterpret items one by one.
                for j in 0..array_width {
                    let jth_range = (j * element_size)..((j + 1) * element_size);
                    debug_assert!(value.get(jth_range.clone()).is_some());
                    // # Safety
                    // We made sure the range is smaller than `value` length.
                    let jth_bytes = unsafe { value.get_unchecked(jth_range) };
                    // # Safety
                    // We just made sure that the slice has length `element_size`
                    let byte_array = unsafe { jth_bytes.try_into().unwrap_unchecked() };
                    let jth_value = if IS_LITTLE_ENDIAN {
                        <T as FromBytes>::from_le_bytes(byte_array)
                    } else {
                        <T as FromBytes>::from_be_bytes(byte_array)
                    };

                    let write_index = array_width * index + j;
                    debug_assert!(write_index < primitive_length);
                    // # Safety
                    // - The target index is smaller than the vector's pre-allocated capacity.
                    unsafe {
                        std::ptr::write(out.as_mut_ptr().add(write_index), jth_value);
                    }
                }
            }
        } else {
            validity.set(index, false);
        };
    }

    // # Safety
    // `out` was created with capacity primitive_length.
    unsafe { out.set_len(primitive_length) };

    FixedSizeListArray::try_new(
        ArrowDataType::FixedSizeList(
            Box::new(Field::new("".into(), T::PRIMITIVE.into(), true)),
            array_width,
        ),
        from.len(),
        Box::new(PrimitiveArray::<T>::from_vec(out)),
        validity.into(),
    )
}

/// Casts a `dyn` [`Array`] to a [`FixedSizeListArray`], making any un-castable value a Null.
///
/// # Arguments
///
/// * `from`: The array to reinterpret.
/// * `array_width`: The number of items in each `Array`.
///
/// # Panics
/// Panics if `from` is not `BinaryViewArray`.
pub fn binview_to_fixed_size_list_dyn<T>(
    from: &dyn Array,
    array_width: usize,
    is_little_endian: bool,
) -> PolarsResult<Box<dyn Array>>
where
    T: FromBytes + NativeType,
    for<'a> &'a <T as FromBytes>::Bytes: TryFrom<&'a [u8]>,
{
    let from = from.as_any().downcast_ref().unwrap();

    let result = if is_little_endian {
        try_binview_to_fixed_size_list::<T, true>(from, array_width)
    } else {
        try_binview_to_fixed_size_list::<T, false>(from, array_width)
    }?;
    Ok(Box::new(result))
}
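
To make the length check and null handling in `try_binview_to_fixed_size_list` easier to follow, here is a minimal safe-Rust sketch of the same idea. It is not the code from this PR. It assumes a hypothetical helper named `reinterpret_rows` that is fixed to `u32` elements, takes `Option<&[u8]>` rows in place of a `BinaryViewArray`, and returns a `Vec<bool>` in place of a `MutableBitmap`. Null slots are zero-filled so that every element of the output is initialized, and the size arithmetic is checked on `usize` directly.

```rust
/// Hypothetical helper (not part of Polars): reinterpret each row as
/// `array_width` values of type `u32`.
///
/// A row that is `None`, or whose byte length is not exactly
/// `array_width * size_of::<u32>()`, becomes a null entry.
/// Returns `None` if the size arithmetic would overflow `usize`.
fn reinterpret_rows(
    rows: &[Option<&[u8]>],
    array_width: usize,
    little_endian: bool,
) -> Option<(Vec<u32>, Vec<bool>)> {
    const ELEMENT_SIZE: usize = std::mem::size_of::<u32>();
    // Size of one output array in bytes, and total number of primitives.
    let row_bytes = ELEMENT_SIZE.checked_mul(array_width)?;
    let total = rows.len().checked_mul(array_width)?;

    let mut values: Vec<u32> = Vec::with_capacity(total);
    let mut validity: Vec<bool> = Vec::with_capacity(rows.len());

    for row in rows {
        match row {
            Some(bytes) if bytes.len() == row_bytes => {
                // Decode the row one element at a time.
                for chunk in bytes.chunks_exact(ELEMENT_SIZE) {
                    let item = [chunk[0], chunk[1], chunk[2], chunk[3]];
                    values.push(if little_endian {
                        u32::from_le_bytes(item)
                    } else {
                        u32::from_be_bytes(item)
                    });
                }
                validity.push(true);
            }
            _ => {
                // Null or wrong-sized row: keep the slot, filled with zeros.
                values.resize(values.len() + array_width, 0);
                validity.push(false);
            }
        }
    }
    Some((values, validity))
}
```

With `array_width = 2`, an 8-byte row decodes to two values, while a 3-byte row or a missing row yields a null entry whose two slots hold zeros.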
2 changes: 1 addition & 1 deletion crates/polars-compute/src/cast/mod.rs
@@ -21,11 +21,11 @@ use arrow::array::*;
use arrow::datatypes::*;
use arrow::match_integer_type;
use arrow::offset::{Offset, Offsets};
pub use binview_to::binview_to_primitive_dyn;
use binview_to::{
binview_to_dictionary, utf8view_to_date32_dyn, utf8view_to_dictionary,
utf8view_to_naive_timestamp_dyn, view_to_binary,
};
pub use binview_to::{binview_to_fixed_size_list_dyn, binview_to_primitive_dyn};
use dictionary_to::*;
use polars_error::{PolarsResult, polars_bail, polars_ensure, polars_err};
use polars_utils::IdxSize;
71 changes: 53 additions & 18 deletions crates/polars-ops/src/chunked_array/binary/namespace.rs
@@ -1,13 +1,14 @@
#[cfg(feature = "binary_encoding")]
use std::borrow::Cow;

use arrow::with_match_primitive_type;
#[cfg(feature = "binary_encoding")]
use arrow::array::Array;
#[cfg(feature = "binary_encoding")]
use base64::Engine as _;
#[cfg(feature = "binary_encoding")]
use base64::engine::general_purpose;
use memchr::memmem::find;
use polars_compute::cast::binview_to_primitive_dyn;
use polars_compute::cast::{binview_to_fixed_size_list_dyn, binview_to_primitive_dyn};
use polars_compute::size::binary_size_bytes;
use polars_core::prelude::arity::{broadcast_binary_elementwise_values, unary_elementwise_values};

@@ -156,29 +157,63 @@ pub trait BinaryNameSpaceImpl: AsBinary {

#[cfg(feature = "binary_encoding")]
fn reinterpret(&self, dtype: &DataType, is_little_endian: bool) -> PolarsResult<Series> {
unsafe {
Ok(Series::from_chunks_and_dtype_unchecked(
self.as_binary().name().clone(),
self._reinterpret_inner(dtype, is_little_endian)?,
dtype,
))
}
}

#[cfg(feature = "binary_encoding")]
fn _reinterpret_inner(
&self,
dtype: &DataType,
is_little_endian: bool,
) -> PolarsResult<Vec<Box<dyn Array>>> {
use polars_core::with_match_physical_numeric_polars_type;

let ca = self.as_binary();
let arrow_type = dtype.to_arrow(CompatLevel::newest());

match arrow_type.to_physical_type() {
arrow::datatypes::PhysicalType::Primitive(ty) => {
with_match_primitive_type!(ty, |$T| {
match dtype {
dtype if dtype.is_primitive_numeric() || dtype.is_temporal() => {
let dtype = dtype.to_physical();
let arrow_data_type = dtype
.to_arrow(CompatLevel::newest())
.underlying_physical_type();
with_match_physical_numeric_polars_type!(dtype, |$T| {
unsafe {
Ok(Series::from_chunks_and_dtype_unchecked(
ca.name().clone(),
ca.chunks().iter().map(|chunk| {
binview_to_primitive_dyn::<$T>(
&**chunk,
&arrow_type,
is_little_endian,
)
}).collect::<PolarsResult<Vec<_>>>()?,
dtype
))
ca.chunks().iter().map(|chunk| {
binview_to_primitive_dyn::<<$T as PolarsNumericType>::Native>(
&**chunk,
&arrow_data_type,
is_little_endian,
)
}).collect()
}
})
},
#[cfg(feature = "dtype-array")]
DataType::Array(inner_dtype, array_width)
if inner_dtype.is_primitive_numeric() || inner_dtype.is_temporal() =>
{
let inner_dtype = inner_dtype.to_physical();
let result: Vec<ArrayRef> = with_match_physical_numeric_polars_type!(inner_dtype, |$T| {
unsafe {
ca.chunks().iter().map(|chunk| {
binview_to_fixed_size_list_dyn::<<$T as PolarsNumericType>::Native>(
&**chunk,
*array_width,
is_little_endian
)
}).collect::<Result<Vec<ArrayRef>, _>>()
}
})?;
Ok(result)
},
_ => Err(
polars_err!(InvalidOperation:"unsupported data type in reinterpret. Only numerical types are allowed."),
polars_err!(InvalidOperation: "unsupported data type {:?} in reinterpret. Only numeric or temporal types, or Arrays of those, are allowed.", dtype),
),
}
}
2 changes: 2 additions & 0 deletions crates/polars-plan/src/dsl/function_expr/binary.rs
@@ -20,6 +20,8 @@ pub enum BinaryFunction {
Base64Encode,
Size,
#[cfg(feature = "binary_encoding")]
/// The parameters are destination type, and whether to use little endian
/// encoding.
Reinterpret(DataTypeExpr, bool),
}
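
The boolean carried by `Reinterpret` selects the byte order used when decoding. As a small illustration, assuming a hypothetical helper `decode_u16` that is not part of this PR, the same two bytes decode to different values depending on that flag:

```rust
/// Hypothetical helper (not part of Polars): decode two bytes as a `u16`
/// using the requested byte order.
fn decode_u16(bytes: [u8; 2], little_endian: bool) -> u16 {
    if little_endian {
        // Least significant byte comes first.
        u16::from_le_bytes(bytes)
    } else {
        // Most significant byte comes first.
        u16::from_be_bytes(bytes)
    }
}
```

For the bytes `[0x01, 0x02]`, little endian gives `0x0201` and big endian gives `0x0102`.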

15 changes: 13 additions & 2 deletions crates/polars-plan/src/plans/conversion/dsl_to_ir/functions.rs
@@ -112,8 +117,19 @@ pub(super) fn convert_functions(
B::Base64Encode => IB::Base64Encode,
B::Size => IB::Size,
#[cfg(feature = "binary_encoding")]
B::Reinterpret(data_type, v) => {
IB::Reinterpret(data_type.into_datatype(ctx.schema)?, v)
B::Reinterpret(dtype_expr, v) => {
let dtype = dtype_expr.into_datatype(ctx.schema)?;
let can_reinterpret_to =
|dt: &DataType| dt.is_primitive_numeric() || dt.is_temporal();
polars_ensure!(
can_reinterpret_to(&dtype) || (
dtype.is_array() && dtype.inner_dtype().map(can_reinterpret_to) == Some(true)
),
InvalidOperation:
"cannot reinterpret binary to dtype {:?}. Only numeric or temporal dtype, or Arrays of these, are supported. Hint: To reinterpret to a nested Array, first reinterpret to a linear Array, and then use reshape",
dtype
);
IB::Reinterpret(dtype, v)
},
})
},
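
The rule enforced by the `polars_ensure!` above can be sketched on a toy type. The enum `DType` and the functions `is_scalar_target` and `check_reinterpret_target` below are hypothetical stand-ins, not Polars' `DataType` API. The sketch only illustrates the shape of the check: a numeric or temporal dtype is accepted, a one-dimensional `Array` of such a dtype is accepted, and anything else (including a nested `Array`) is rejected.

```rust
/// Hypothetical stand-in for a dtype; not Polars' `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum DType {
    Int32,
    Float64,
    Date,
    Utf8,
    Array(Box<DType>, usize),
}

/// Plays the role of `is_primitive_numeric() || is_temporal()`.
fn is_scalar_target(dt: &DType) -> bool {
    matches!(dt, DType::Int32 | DType::Float64 | DType::Date)
}

/// Accept scalar targets and one-dimensional arrays of scalar targets.
fn check_reinterpret_target(dt: &DType) -> Result<(), String> {
    let ok = is_scalar_target(dt)
        || matches!(dt, DType::Array(inner, _) if is_scalar_target(inner));
    if ok {
        Ok(())
    } else {
        Err(format!("cannot reinterpret binary to dtype {:?}", dt))
    }
}
```

A nested array is rejected because its inner dtype is itself an `Array`, which matches the hint in the error message about reinterpreting to a linear `Array` first and reshaping afterwards.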
10 changes: 7 additions & 3 deletions py-polars/polars/expr/binary.py
@@ -301,7 +301,10 @@ def reinterpret(
self, *, dtype: PolarsDataType | DataTypeExpr, endianness: Endianness = "little"
) -> Expr:
r"""
Interpret a buffer as a numerical Polars type.
Interpret bytes as another type.

Supported types are numerical or temporal dtypes, or an ``Array`` of
these dtypes.

Parameters
----------
@@ -314,8 +317,9 @@ def reinterpret(
-------
Expr
Expression of data type `dtype`.
Note that if binary array is too short value will be null.
If binary array is too long, remainder will be ignored.
Note that rows of the binary array where the length does not match
the size in bytes of the output array (number of items * byte size
of item) will become NULL.

Examples
--------
10 changes: 7 additions & 3 deletions py-polars/polars/series/binary.py
@@ -220,7 +220,10 @@ def reinterpret(
self, *, dtype: PolarsDataType, endianness: Endianness = "little"
) -> Series:
r"""
Interpret a buffer as a numerical polars type.
Interpret bytes as another type.

Supported types are numerical or temporal dtypes, or an ``Array`` of
these dtypes.

Parameters
----------
@@ -233,8 +236,9 @@ def reinterpret(
-------
Series
Series of data type `dtype`.
Note that if binary array is too short value will be null.
If binary array is too long, remainder will be ignored.
Note that rows of the binary array where the length does not match
the size in bytes of the output array (number of items * byte size
of item) will become NULL.

Examples
--------