Batch map raises TypeError: '>=' not supported between instances of 'NoneType' and 'int' #6022

@codingl2k1

Describe the bug

When mapping some datasets with batched=True, datasets may raise an exception:

Traceback (most recent call last):
  File "/Users/codingl2k1/Work/datasets/venv/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/utils/py_utils.py", line 1328, in _write_generator_to_queue
    for i, result in enumerate(func(**kwargs)):
  File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_dataset.py", line 3483, in _map_single
    writer.write_batch(batch)
  File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_writer.py", line 549, in write_batch
    array = cast_array_to_feature(col_values, col_type) if col_type is not None else col_values
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/table.py", line 1831, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/table.py", line 1831, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/table.py", line 2063, in cast_array_to_feature
    return feature.cast_storage(array)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/features/features.py", line 1098, in cast_storage
    if min_max["max"] >= self.num_classes:
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/Users/codingl2k1/Work/datasets/t1.py", line 33, in <module>
    ds = ds.map(transforms, num_proc=14, batched=True, batch_size=5)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/dataset_dict.py", line 850, in map
    {
  File "/Users/codingl2k1/Work/datasets/src/datasets/dataset_dict.py", line 851, in <dictcomp>
    k: dataset.map(
       ^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_dataset.py", line 577, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_dataset.py", line 542, in wrapper
    out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_dataset.py", line 3179, in map
    for rank, done, content in iflatmap_unordered(
  File "/Users/codingl2k1/Work/datasets/src/datasets/utils/py_utils.py", line 1368, in iflatmap_unordered
    [async_result.get(timeout=0.05) for async_result in async_results]
  File "/Users/codingl2k1/Work/datasets/src/datasets/utils/py_utils.py", line 1368, in <listcomp>
    [async_result.get(timeout=0.05) for async_result in async_results]
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/codingl2k1/Work/datasets/venv/lib/python3.11/site-packages/multiprocess/pool.py", line 774, in get
    raise self._value
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
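
The inner traceback ends at a comparison between min_max["max"] and self.num_classes, and the error message says the left-hand side was None. One possible explanation, which the traceback alone does not confirm, is that a batch contained no non-null label values, so there was no maximum to compute. The sketch below imitates that situation in plain Python. The names `min_max`, `check_unguarded` and `check_guarded` are hypothetical stand-ins rather than functions from datasets, and the guard shown is only one way such a check could be made null-safe.

```python
from typing import Dict, Optional, Sequence


def min_max(values: Sequence[Optional[int]]) -> Dict[str, Optional[int]]:
    """Hypothetical stand-in for a min/max aggregation that skips nulls.

    Returns None for both entries when there are no non-null values,
    which is the assumed condition behind the reported error.
    """
    present = [v for v in values if v is not None]
    if not present:
        return {"min": None, "max": None}
    return {"min": min(present), "max": max(present)}


def check_unguarded(values: Sequence[Optional[int]], num_classes: int) -> None:
    """Same shape of comparison as the failing line in the traceback."""
    stats = min_max(values)
    # Raises TypeError when stats["max"] is None (empty or all-null input).
    if stats["max"] >= num_classes:
        raise ValueError(f"Class label {stats['max']} >= num_classes {num_classes}")


def check_guarded(values: Sequence[Optional[int]], num_classes: int) -> None:
    """Illustrative null-safe variant: skip the range check when there is no max."""
    stats = min_max(values)
    if stats["max"] is not None and stats["max"] >= num_classes:
        raise ValueError(f"Class label {stats['max']} >= num_classes {num_classes}")
```

Under these assumptions, `check_unguarded([None, None], 150)` fails with the same kind of TypeError as in the report, while `check_guarded` accepts the all-null batch and still rejects out-of-range labels.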

Steps to reproduce the bug

  1. Check out the latest main branch of datasets.
  2. Run the code:
from datasets import load_dataset

def transforms(examples):
    # examples["pixel_values"] = [image.convert("RGB").resize((100, 100)) for image in examples["image"]]
    return examples

ds = load_dataset("scene_parse_150")
ds = ds.map(transforms, num_proc=14, batched=True, batch_size=5)
print(ds)

Expected behavior

map completes without raising an exception.

Environment info

Datasets: b8067c0
Python: 3.11.4
System: macOS
