
Couldn't cast array of type fixed_size_list to Sequence(Value(float64)) #6280

@jmif

Describe the bug

I have a dataset with an embedding column. When I try to map that dataset, I get the following exception:

Traceback (most recent call last):
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3189, in map
    for rank, done, content in iflatmap_unordered(
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1387, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 1387, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/Users/jmif/.virtualenvs/llm-training/lib/python3.10/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
TypeError: Couldn't cast array of type
fixed_size_list<item: float>[2]
to
Sequence(feature=Value(dtype='float32', id=None), length=2, id=None)

Steps to reproduce the bug

Here's a simple repro script:

from datasets import Features, Value, Sequence, ClassLabel, Dataset

dataset_features = Features({
    'text': Value('string'),
    'embedding': Sequence(Value('double'), length=2),
    'categories': Sequence(ClassLabel(names=sorted([
        'one',
        'two',
        'three'
    ]))),
})

dataset = Dataset.from_dict(
    {
        'text': ['A'] * 10000,
        'embedding': [[0.0, 0.1]] * 10000,
        'categories': [[0]] * 10000,
    },
    features=dataset_features
)

def test_mapper(r):
    r['text'] = list(map(lambda t: t + ' b', r['text']))
    return r


dataset = dataset.map(test_mapper, batched=True, batch_size=10, features=dataset_features, num_proc=2)

Removing the embedding column fixes the issue!

Expected behavior

The mapping completes successfully.

Environment info

  • datasets version: 2.14.4
  • Platform: macOS-14.0-arm64-arm-64bit
  • Python version: 3.10.12
  • Huggingface_hub version: 0.17.1
  • PyArrow version: 13.0.0
  • Pandas version: 2.0.3
