Closed
Description
Describe the bug
I have a dataset with an embedding column. When I try to map that dataset, I get the following exception:
Traceback (most recent call last):
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1387, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1387, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
TypeError: Couldn't cast array of type
fixed_size_list<item: float>[2]
to
Sequence(feature=Value(dtype='float32', id=None), length=2, id=None)
Steps to reproduce the bug
Here's a simple repro script:
from datasets import Features, Value, Sequence, ClassLabel, Dataset

dataset_features = Features({
    'text': Value('string'),
    'embedding': Sequence(Value('double'), length=2),
    'categories': Sequence(ClassLabel(names=sorted([
        'one',
        'two',
        'three'
    ]))),
})

dataset = Dataset.from_dict(
    {
        'text': ['A'] * 10000,
        'embedding': [[0.0, 0.1]] * 10000,
        'categories': [[0]] * 10000,
    },
    features=dataset_features
)

def test_mapper(r):
    r['text'] = list(map(lambda t: t + ' b', r['text']))
    return r

dataset = dataset.map(test_mapper, batched=True, batch_size=10, features=dataset_features, num_proc=2)
Removing the embedding column fixes the issue!
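Since the failure depends on the embedding column, one thing worth ruling out is ragged input, because a `Sequence` with `length=2` expects every row to hold exactly two values. The sketch below is a plain-Python sanity check; `rows_with_wrong_length` is a hypothetical helper written for this illustration, not part of `datasets`.

```python
def rows_with_wrong_length(rows, expected_length):
    """Return the indices of rows whose length differs from expected_length."""
    return [i for i, row in enumerate(rows) if len(row) != expected_length]


# Same embedding data as in the repro script above.
embeddings = [[0.0, 0.1]] * 10000

bad_rows = rows_with_wrong_length(embeddings, expected_length=2)
print(f"rows with unexpected length: {len(bad_rows)}")
```

For the repro data every row has two values, so the input itself satisfies the declared length and the problem appears to lie in the cast rather than in the data.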
Expected behavior
The mapping completes successfully.
Environment info
- datasets version: 2.14.4
- Platform: macOS-14.0-arm64-arm-64bit
- Python version: 3.10.12
- Huggingface_hub version: 0.17.1
- PyArrow version: 13.0.0
- Pandas version: 2.0.3