-
Notifications
You must be signed in to change notification settings - Fork 3k
Closed
Description
Describe the bug
I encountered a TypeError when batch processing a dataset with Sequence features in datasets package version 2.16.1. The error arises from a mismatch in handling fixed-size list arrays during the map function execution. Debugging pinpoints the issue to an if-statement in datasets/table.py, line 2093, failing to correctly process sequence lengths.
Steps to reproduce the bug
Create virtual environment and activate
virtualenv venv
source venv/bin/activate
Then install the datasets package (I'm using the latest version)
pip install datasets==2.16.1
Then run
# bug.py
from datasets import Dataset
from datasets.features import Features, Sequence, Value
data = {
"num": [[1, 2], [3, 4]],
}
features = Features({'num': Sequence(feature=Value(dtype='int32'), length=2)})
dataset = Dataset.from_dict(data, features=features)
dataset.map(lambda x: x, batched=True, batch_size=1)Expected behavior
I get the following stack trace
Map: 50%|█████ | 1/2 [00:00<00:00, 423.92 examples/s]
Traceback (most recent call last):
File "/PATH/TO/BUG_PORT/bug.py", line 9, in <module>
dataset.map(lambda x: x, batched=True, batch_size=1)
File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 592, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3093, in map
for rank, done, content in Dataset._map_single(**dataset_kwargs):
File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3489, in _map_single
writer.write_batch(batch)
File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/arrow_writer.py", line 551, in write_batch
array = cast_array_to_feature(col_values, col_type) if col_type is not None else col_values
File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/table.py", line 1797, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/table.py", line 1797, in <listcomp>
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
File "/PATH/TO/BUG_PORT/venv/lib/python3.9/site-packages/datasets/table.py", line 2111, in cast_array_to_feature
raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
fixed_size_list<item: int32>[2]
to
Sequence(feature=Value(dtype='int32', id=None), length=2, id=None)
After some debugging, I found that the if-statement that is actually failing is line 2093 in datasets/table.py
# datasets/table.py
...
2093 if feature.length * len(array) == len(array_values):
2094 return pa.FixedSizeListArray.from_arrays(_c(array_values, feature.feature), feature.length)
...Environment info
Platform: MacOS
Datasets version: datasets==2.16.1
Python version: 3.9.6
Metadata
Metadata
Assignees
Labels
No labels