Allow .JPEG as a file extension #4514

@DiGyt

Describe the bug

When loading image data, HF datasets seems to recognize the .jpg and .jpeg file extensions, but not uppercase variants such as .JPEG. Since the .JPEG naming convention is used in important datasets such as ImageNet, it would be helpful if uppercase extensions like .JPEG and .JPG were accepted as well.
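To illustrate the requested behavior, here is a minimal sketch of a case-insensitive extension check. It is not the library's actual implementation: `SUPPORTED_EXTENSIONS` and `has_supported_extension` are hypothetical names, and the set holds only a small subset of the extensions listed in the error message further down.

```python
from pathlib import Path

# Hypothetical subset of the supported extensions, for illustration only.
SUPPORTED_EXTENSIONS = {"jpg", "jpeg", "png"}


def has_supported_extension(filename: str) -> bool:
    """Return True if the file extension is supported, ignoring case."""
    suffix = Path(filename).suffix  # e.g. ".JPEG", or "" if there is none
    # Drop the leading dot and lowercase before comparing.
    return suffix[1:].lower() in SUPPORTED_EXTENSIONS
```

With a check like this, `example_img.jpeg` and `example_img.JPEG` would be treated the same way.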

Steps to reproduce the bug

# use bash to create 2 sham datasets with jpeg and JPEG ext
!mkdir dataset_a
!mkdir dataset_b
!wget https://upload.wikimedia.org/wikipedia/commons/7/71/Dsc_%28179253513%29.jpeg -O example_img.jpeg
!cp example_img.jpeg ./dataset_a/
!mv example_img.jpeg ./dataset_b/example_img.JPEG

from datasets import load_dataset

# working
df1 = load_dataset("./dataset_a", ignore_verifications=True)

# not working
df2 = load_dataset("./dataset_b", ignore_verifications=True)

# show
print(df1, df2)

Expected results

DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 1
    })
}) DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 1
    })
})

Actual results

FileNotFoundError: Unable to resolve any data file that matches '['**']' at /..PATH../dataset_b with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'blp', 'bmp', 'dib', 'bufr', 'cur', 'pcx', 'dcx', 'dds', 'ps', 'eps', 'fit', 'fits', 'fli', 'flc', 'ftc', 'ftu', 'gbr', 'gif', 'grib', 'h5', 'hdf', 'png', 'apng', 'jp2', 'j2k', 'jpc', 'jpf', 'jpx', 'j2c', 'icns', 'ico', 'im', 'iim', 'tif', 'tiff', 'jfif', 'jpe', 'jpg', 'jpeg', 'mpg', 'mpeg', 'msp', 'pcd', 'pxr', 'pbm', 'pgm', 'ppm', 'pnm', 'psd', 'bw', 'rgb', 'rgba', 'sgi', 'ras', 'tga', 'icb', 'vda', 'vst', 'webp', 'wmf', 'emf', 'xbm', 'xpm', 'zip']

I know it can be annoying to support a seemingly arbitrary number of file extensions, but I think this one would be really welcome.
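In the meantime, one possible workaround is to rename the files so their extensions are lowercase before calling `load_dataset`. This is only a sketch under the assumption that renaming the files on disk is acceptable; `lowercase_extensions` is a hypothetical helper, not part of the datasets library.

```python
from pathlib import Path


def lowercase_extensions(root: str) -> list[Path]:
    """Rename files under `root` so their extensions are lowercase.

    Returns the new paths of the files that were renamed.
    """
    renamed = []
    # Materialize the listing first so renames do not disturb the traversal.
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix == path.suffix.lower():
            continue
        target = path.with_suffix(path.suffix.lower())
        # Do not overwrite a different file that already has the lowercase name.
        if target.exists() and not target.samefile(path):
            continue
        path.rename(target)
        renamed.append(target)
    return renamed
```

For the reproduction above, calling `lowercase_extensions("./dataset_b")` would rename `example_img.JPEG` to `example_img.jpeg`, after which the folder has the same layout as `dataset_a`.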

Labels: bug (Something isn't working)
