-
Notifications
You must be signed in to change notification settings - Fork 3k
Fix array cast/embed with null values #6283
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
|
CI failures are unrelated |
lhoestq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix !
|
I also plan to address #6280 (comment) in this PR :). |
|
Oh ok, ping me again whenever you want another review :) |
|
Have you had a chance to continue this ? I can also take a look if you want |
|
Yes, I'll finish it next week :). |
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
This reverts commit 12c4c57.
|
@lhoestq Feel free to review this again. I've bumped PyArrow to 12.0.0 to simplify the implementation (no need for custom |
array.values handling in array cast/embed
lhoestq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice ! Handling extension types can be quite complicated
Btw if you have some pyarrow issues we can link to this PR feel free to add them, this way we can follow the advancements and maybe later simplify this code
| if array_type != storage_type: | ||
| # Temporarily convert to the storage type to support extension types in the slice operation | ||
| array = _c(array, storage_type) | ||
| array = pc.list_slice(array, 0, pa_type.list_size, return_fixed_size_list=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this may bring the data in memory no ?
maybe it's fine for now though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, pc.list_slice brings the array in memory. Unfortunately, I don't think it's possible to avoid this as we need to "expand" the null values in the .values array to prepare them for the ListArray -> FixedSizeListArray cast, which requires a memory allocation.
We (usually) run these casts on subtables before writing them to disk, so this solution should be fine for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Casting ListArray -> FixedSizeListArray using PyArrow's ListArray.cast also allocates memory if the array contains null lists, so indeed there isn't much we can do about this due to the difference in the null values storage layout.
Co-authored-by: Quentin Lhoest <[email protected]>
Show benchmarksPyArrow==8.0.0 Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
Show updated benchmarks!Benchmark: benchmark_array_xd.json
Benchmark: benchmark_getitem_100B.json
Benchmark: benchmark_indices_mapping.json
Benchmark: benchmark_iterating.json
Benchmark: benchmark_map_filter.json
|
|
Still the problem is occured. |


Fixes issues with casting/embedding PyArrow list arrays with null values. It also bumps the required PyArrow version to 12.0.0 (over 9 months old) to simplify the implementation.
Fix #6280, fix #6311, fix #6360
(Also fixes #5430 to make Beam compatible with PyArrow>=12.0.0)