Error when saving a dataset of images to disk #5717

@jplu

Describe the bug

Hello!

I get an error when I try to save my dataset of images to disk:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1442, in save_to_disk
    for job_id, done, content in Dataset._save_to_disk_single(**kwargs):
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 1473, in _save_to_disk_single
    writer.write_table(pa_table)
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/arrow_writer.py", line 570, in write_table
    pa_table = embed_table_storage(pa_table)
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 2268, in embed_table_storage
    arrays = [
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 2269, in <listcomp>
    embed_array_storage(table[name], feature) if require_storage_embed(feature) else table[name]
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 1817, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 1817, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/table.py", line 2142, in embed_array_storage
    return feature.embed_storage(array)
  File "/home/jplu/miniconda3/envs/image-xp/lib/python3.10/site-packages/datasets/features/image.py", line 269, in embed_storage
    storage = pa.StructArray.from_arrays([bytes_array, path_array], ["bytes", "path"], mask=bytes_array.is_null())
  File "pyarrow/array.pxi", line 2766, in pyarrow.lib.StructArray.from_arrays
  File "pyarrow/array.pxi", line 2961, in pyarrow.lib.c_mask_inverted_from_obj
TypeError: Mask must be a pyarrow.Array of type boolean

My dataset has around 50K images. Could this error be due to a bad image?

Thanks for the help.
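
To help rule a bad image in or out, here is a minimal sketch that uses only the standard library to flag files that are empty or whose first bytes do not match their extension. It is a coarse first pass: it does not decode the images, so a file that passes may still be corrupt, and it is not confirmed that a bad image is what triggers the TypeError above. The names `SIGNATURES` and `find_suspicious_images` are made up for this example, and the list of formats is an assumption about what the folder contains.

```python
from pathlib import Path

# Leading bytes ("magic numbers") of a few common image formats.
# Assumption: the dataset only uses these extensions; extend as needed.
SIGNATURES = {
    ".jpg": (b"\xff\xd8\xff",),
    ".jpeg": (b"\xff\xd8\xff",),
    ".png": (b"\x89PNG\r\n\x1a\n",),
    ".gif": (b"GIF87a", b"GIF89a"),
    ".bmp": (b"BM",),
}


def find_suspicious_images(data_dir):
    """Return (path, reason) pairs for files that look empty or mislabeled."""
    suspicious = []
    for path in sorted(Path(data_dir).rglob("*")):
        expected = SIGNATURES.get(path.suffix.lower())
        if not path.is_file() or expected is None:
            continue  # skip directories and extensions not listed above
        with path.open("rb") as f:
            header = f.read(16)
        if not header:
            suspicious.append((path, "empty file"))
        elif not header.startswith(expected):
            suspicious.append((path, "unexpected header"))
    return suspicious


if __name__ == "__main__":
    # Same placeholder path as in the reproduction steps.
    for path, reason in find_suspicious_images("/path/to/dataset"):
        print(f"{path}: {reason}")
```

If this reports nothing, the files at least start with plausible headers, and the cause probably lies elsewhere.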

Steps to reproduce the bug

from datasets import load_dataset
dataset = load_dataset("imagefolder", data_dir="/path/to/dataset")
dataset["train"].save_to_disk("./myds", num_shards=40)

Expected behavior

Having my dataset properly saved to disk.

Environment info

  • datasets version: 2.11.0
  • Platform: Linux-5.15.90.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
  • Python version: 3.10.10
  • Huggingface_hub version: 0.13.3
  • PyArrow version: 11.0.0
  • Pandas version: 2.0.0
