@lhoestq lhoestq commented Jun 8, 2021

pyarrow currently cannot cast a sliced list array (one with a non-zero offset):

import pyarrow as pa

arr = pa.array([[i * 10] for i in range(4)])
arr.cast(pa.list_(pa.int32()))  # works

arr = arr.slice(1)
arr.cast(pa.list_(pa.int32()))  # fails
# ArrowNotImplementedError("Casting sliced lists (non-zero offset) not yet implemented")

However, Dataset.cast slices tables before casting their types (casting is memory intensive), so it runs into the same issue.
Because of this, it is currently not possible to cast a Dataset with a Sequence feature type, unless the table is small enough that it does not get sliced.

This PR fixes it by resetting the offset of every pyarrow.ListArray in the table to zero before casting.
The offsets of each ListArray are updated with the pyarrow.compute.subtract function.

cc @abhi1thakur @SBrandeis

@lhoestq lhoestq merged commit a7fd3e5 into master Jun 8, 2021
@lhoestq lhoestq deleted the support-sliced-list-arrays-in-cast branch June 8, 2021 17:56