-
Notifications
You must be signed in to change notification settings - Fork 3k
Closed
Labels
bugSomething isn't workingSomething isn't working
Description
Describe the bug
When loading image data, HF datasets seems to recognize .jpg and .jpeg file extensions, but not e.g. .JPEG. As the naming convention .JPEG is used in important datasets such as imagenet, I would welcome if according extensions like .JPEG or .JPG would be allowed.
Steps to reproduce the bug
# use bash to create 2 sham datasets with jpeg and JPEG ext
!mkdir dataset_a
!mkdir dataset_b
!wget https://upload.wikimedia.org/wikipedia/commons/7/71/Dsc_%28179253513%29.jpeg -O example_img.jpeg
!cp example_img.jpeg ./dataset_a/
!mv example_img.jpeg ./dataset_b/example_img.JPEG
from datasets import load_dataset
# working
df1 = load_dataset("./dataset_a", ignore_verifications=True)
#not working
df2 = load_dataset("./dataset_b", ignore_verifications=True)
# show
print(df1, df2)Expected results
DatasetDict({
train: Dataset({
features: ['image', 'label'],
num_rows: 1
})
}) DatasetDict({
train: Dataset({
features: ['image', 'label'],
num_rows: 1
})
})
Actual results
FileNotFoundError: Unable to resolve any data file that matches '['**']' at /..PATH../dataset_b with any supported extension ['csv', 'tsv', 'json', 'jsonl', 'parquet', 'txt', 'blp', 'bmp', 'dib', 'bufr', 'cur', 'pcx', 'dcx', 'dds', 'ps', 'eps', 'fit', 'fits', 'fli', 'flc', 'ftc', 'ftu', 'gbr', 'gif', 'grib', 'h5', 'hdf', 'png', 'apng', 'jp2', 'j2k', 'jpc', 'jpf', 'jpx', 'j2c', 'icns', 'ico', 'im', 'iim', 'tif', 'tiff', 'jfif', 'jpe', 'jpg', 'jpeg', 'mpg', 'mpeg', 'msp', 'pcd', 'pxr', 'pbm', 'pgm', 'ppm', 'pnm', 'psd', 'bw', 'rgb', 'rgba', 'sgi', 'ras', 'tga', 'icb', 'vda', 'vst', 'webp', 'wmf', 'emf', 'xbm', 'xpm', 'zip']
I know that it can be annoying to allow seemingly arbitrary numbers of file extensions. But I think this one would be really welcome.
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working