Closed
Description
Describe the bug
When mapping some datasets with batched=True, datasets may raise an exception:
Traceback (most recent call last):
File "/Users/codingl2k1/Work/datasets/venv/lib/python3.11/site-packages/multiprocess/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
^^^^^^^^^^^^^^^^^^^
File "/Users/codingl2k1/Work/datasets/src/datasets/utils/py_utils.py", line 1328, in _write_generator_to_queue
for i, result in enumerate(func(**kwargs)):
File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_dataset.py", line 3483, in _map_single
writer.write_batch(batch)
File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_writer.py", line 549, in write_batch
array = cast_array_to_feature(col_values, col_type) if col_type is not None else col_values
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/codingl2k1/Work/datasets/src/datasets/table.py", line 1831, in wrapper
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/codingl2k1/Work/datasets/src/datasets/table.py", line 1831, in <listcomp>
return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/codingl2k1/Work/datasets/src/datasets/table.py", line 2063, in cast_array_to_feature
return feature.cast_storage(array)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/codingl2k1/Work/datasets/src/datasets/features/features.py", line 1098, in cast_storage
if min_max["max"] >= self.num_classes:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/codingl2k1/Work/datasets/t1.py", line 33, in <module>
ds = ds.map(transforms, num_proc=14, batched=True, batch_size=5)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/codingl2k1/Work/datasets/src/datasets/dataset_dict.py", line 850, in map
{
File "/Users/codingl2k1/Work/datasets/src/datasets/dataset_dict.py", line 851, in <dictcomp>
k: dataset.map(
^^^^^^^^^^^^
File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_dataset.py", line 577, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_dataset.py", line 542, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/codingl2k1/Work/datasets/src/datasets/arrow_dataset.py", line 3179, in map
for rank, done, content in iflatmap_unordered(
File "/Users/codingl2k1/Work/datasets/src/datasets/utils/py_utils.py", line 1368, in iflatmap_unordered
[async_result.get(timeout=0.05) for async_result in async_results]
File "/Users/codingl2k1/Work/datasets/src/datasets/utils/py_utils.py", line 1368, in <listcomp>
[async_result.get(timeout=0.05) for async_result in async_results]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/codingl2k1/Work/datasets/venv/lib/python3.11/site-packages/multiprocess/pool.py", line 774, in get
raise self._value
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
Steps to reproduce the bug
- Check out the latest main branch of datasets.
- Run the code:
from datasets import load_dataset

def transforms(examples):
    # examples["pixel_values"] = [image.convert("RGB").resize((100, 100)) for image in examples["image"]]
    return examples

ds = load_dataset("scene_parse_150")
ds = ds.map(transforms, num_proc=14, batched=True, batch_size=5)
print(ds)
Expected behavior
map completes without raising an exception.
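The traceback ends at the comparison min_max["max"] >= self.num_classes, where the left-hand side is None. One possible explanation, which is an assumption and not something confirmed in this report, is that a batch contains only null labels, so no maximum exists. The sketch below illustrates a null-safe range check in plain Python. The helpers min_max_ignoring_nulls and check_label_range are hypothetical names made up for illustration and are not part of datasets.

```python
from typing import Dict, Optional, Sequence


def min_max_ignoring_nulls(values: Sequence[Optional[int]]) -> Dict[str, Optional[int]]:
    """Hypothetical stand-in for a min/max aggregation that skips nulls.

    An empty or all-null input yields None for both bounds.
    """
    present = [v for v in values if v is not None]
    if not present:
        return {"min": None, "max": None}
    return {"min": min(present), "max": max(present)}


def check_label_range(values: Sequence[Optional[int]], num_classes: int) -> None:
    """Raise ValueError if the largest non-null label is >= num_classes."""
    min_max = min_max_ignoring_nulls(values)
    # Check for None first: comparing None >= int raises the TypeError
    # seen in the traceback.
    if min_max["max"] is not None and min_max["max"] >= num_classes:
        raise ValueError(
            f"Label {min_max['max']} is out of range for {num_classes} classes"
        )


# An all-null batch passes the check instead of raising TypeError.
check_label_range([None, None, None], num_classes=150)
```

With a guard like this, an all-null batch is accepted, while a genuinely out-of-range label still raises an error.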
Environment info
Datasets: b8067c0
Python: 3.11.4
System: macOS