Batched dataset map throws exception that cannot cast fixed length array to Sequence #6654

@keesjandevries

Describe the bug

I encountered a TypeError when running a batched map over a dataset with a fixed-length Sequence feature in datasets version 2.16.1. The error is raised while casting a fixed-size list array back to the Sequence feature during the map call. Debugging points to an if-statement on line 2093 of datasets/table.py, where the sequence-length check does not pass.

Steps to reproduce the bug

Create a virtual environment and activate it

virtualenv venv
source venv/bin/activate

Then install the datasets package (I'm using the latest version)

pip install datasets==2.16.1

Then run

# bug.py
from datasets import Dataset
from datasets.features import Features, Sequence, Value

data = {
    "num": [[1, 2], [3, 4]],
}
features = Features({'num': Sequence(feature=Value(dtype='int32'), length=2)})
dataset = Dataset.from_dict(data, features=features)
dataset.map(lambda x: x, batched=True, batch_size=1)

Expected behavior

The map call should complete without raising an error. Instead, I get the following stack trace

Map:  50%|█████     | 1/2 [00:00<00:00, 423.92 examples/s]
Traceback (most recent call last):
  File "/PATH/TO/BUG_PORT/bug.py", line 9, in <module>
    dataset.map(lambda x: x, batched=True, batch_size=1)
  File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
  File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3093, in map
    for rank, done, content in Dataset._map_single(**dataset_kwargs):
  File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3489, in _map_single
    writer.write_batch(batch)
  File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 551, in write_batch
    array = cast_array_to_feature(col_values, col_type) if col_type is not None else col_values
  File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/table.py", line 1797, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/table.py", line 1797, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/table.py", line 2111, in cast_array_to_feature
    raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
fixed_size_list<item: int32>[2]
to
Sequence(feature=Value(dtype='int32', id=None), length=2, id=None)

After some debugging, I found that the if-statement that actually fails is on line 2093 of datasets/table.py

# datasets/table.py
                ...
2093                if feature.length * len(array) == len(array_values):
2094                    return pa.FixedSizeListArray.from_arrays(_c(array_values, feature.feature), feature.length)
                ...
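To illustrate how that comparison can evaluate to false, here is a minimal pure-Python sketch of the arithmetic. It does not call datasets or pyarrow. The helper name `fixed_length_check` is hypothetical, and the second case rests on an unverified assumption: that for a one-row batch the flat values array still spans more than that one row. The report does not confirm that this is the actual cause.

```python
def fixed_length_check(feature_length: int, num_rows: int, num_values: int) -> bool:
    """Mirror the comparison on line 2093:
    feature.length * len(array) == len(array_values).
    Hypothetical helper for illustration only, not part of datasets.
    """
    return feature_length * num_rows == num_values


# Whole column from the reproduction: 2 rows of length 2, so 4 flat values.
whole_column_ok = fixed_length_check(feature_length=2, num_rows=2, num_values=4)

# ASSUMPTION (not verified against the library): a one-row batch whose
# flat values still cover the full 4-value column.
one_row_batch_ok = fixed_length_check(feature_length=2, num_rows=1, num_values=4)

print(whole_column_ok, one_row_batch_ok)
```

Under that assumption the check passes for the full column but not for the one-row batch, which would let execution fall through to the TypeError at the end of cast_array_to_feature.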

Environment info

Platform: macOS
Datasets version: datasets==2.16.1
Python version: 3.9.6
