Summary
When iterating over a streaming IterableDataset that contains images with corrupted EXIF metadata (e.g., a TIFF rational tag with denominator=0), PIL raises ZeroDivisionError inside exif_transpose. This exception propagates through the HuggingFace datasets pipeline and terminates the streaming iterator — the caller cannot catch it and resume from the next sample.
Environment
- datasets version: 4.4.1
- Pillow version: 10.x
- Dataset: CC12M (WebDataset format, loaded as a streaming IterableDataset)
Minimal Reproducer

```python
from datasets import load_dataset

ds = load_dataset("pixparse/cc12m-wds", split="train", streaming=True)
it = iter(ds)
for i in range(10_000):
    try:
        sample = next(it)
    except ZeroDivisionError as e:
        print(f"sample {i}: {e}")
        # Try to resume — this raises StopIteration immediately,
        # because the internal generator's frame is already closed.
        next(it)
        break
```
Expected: skip the bad sample and continue.
Actual: StopIteration on the next call — iterator is dead.
Traceback

```
File "PIL/ImageOps.py", line 711, in exif_transpose
File "PIL/TiffImagePlugin.py", line 297, in _limit_rational
ZeroDivisionError: division by zero
```
Root Cause
PIL.ImageOps.exif_transpose does not guard against rational EXIF tags with denominator=0, which exist in real-world web-crawled datasets. Inside datasets, the call chain is roughly:
```
IterableDataset.__iter__ → _iter_arrow / _iter_pytorch →
  Image.decode_example() (in datasets/features/image.py) →
    PIL.Image.open(...) + PIL.ImageOps.exif_transpose(image)  ← raises here
```
Because the exception is raised inside the internal generator, the generator's frame is closed when it propagates out of __next__, so calling next() again raises StopIteration.
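This is plain Python generator semantics and can be demonstrated without datasets or PIL at all:

```python
# Once an exception escapes a generator's frame, the generator is
# finished, and every subsequent next() raises StopIteration.
def stream():
    yield "sample-0"
    raise ZeroDivisionError("division by zero")  # simulates the EXIF failure
    yield "sample-1"  # unreachable: the frame closes when the raise escapes

it = iter(stream())
first = next(it)  # "sample-0"
try:
    next(it)  # ZeroDivisionError propagates out of __next__ ...
except ZeroDivisionError:
    pass
try:
    next(it)  # ... and now the generator is dead
except StopIteration:
    print("iterator closed; cannot resume")  # this branch runs
```

This is exactly why no amount of caller-side try/except can recover the stream: by the time the exception reaches the caller, the generator frame is already gone.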
Impact on Training Frameworks
For training frameworks using streaming datasets, none of the current workarounds is acceptable:
- Recreate the iterator after the exception → restarts from sample 0, causing silent data repetition (the entire dataset is re-looped from the beginning).
- Continue without recreating → raises StopIteration immediately, ending iteration.
- Monkey-patch PIL globally → invasive and hides data quality issues from the user.
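To make the last point concrete, here is a self-contained sketch of the monkey-patch workaround. `exif_transpose` below is a stand-in for `PIL.ImageOps.exif_transpose` so the example runs without Pillow; in real code you would reassign the attribute on the `PIL.ImageOps` module itself.

```python
def exif_transpose(image):
    # Stand-in for PIL.ImageOps.exif_transpose failing on a corrupt tag.
    raise ZeroDivisionError("division by zero")

_original = exif_transpose

def _tolerant_exif_transpose(image):
    try:
        return _original(image)
    except ZeroDivisionError:
        # Silently returns the image untransposed: every PIL caller in the
        # process is affected, and the data-quality problem is hidden.
        return image

exif_transpose = _tolerant_exif_transpose  # globally swaps the function
result = exif_transpose("raw-image")  # no error, but no warning either
```

The patch works, but it changes behavior for every PIL user in the process and swallows the signal that the dataset contains corrupt samples.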
This was discussed in pytorch/torchtitan#2550, where the reviewer (@wwwjn) explicitly recommended that the proper fix should live in datasets, not in the training framework.
Requested Fix
Add a skip_corrupted_images: bool = False option (or similar) to load_dataset() / IterableDataset.
- skip_corrupted_images=False (default): preserve current behavior — let the exception propagate so users are aware of data quality issues.
- skip_corrupted_images=True: catch decode errors inside the image feature decoder, emit a UserWarning (or expose a counter), and continue to the next sample without terminating the iterator.
Note: the skip_corrupted_images=True behavior requires the catch to happen inside the image decoder. Without catching the error before it propagates out of the generator, there is no way to skip the corrupted sample and resume — the iterator is already terminated at that point.
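A hypothetical sketch of that behavior (not the actual datasets implementation): `decode_example` below stands in for `datasets.features.Image.decode_example`, and the key point is that the except clause lives inside the generator, so its frame stays alive and iteration resumes at the next sample.

```python
import warnings

def decode_example(raw):
    # Stand-in decoder: fails on a "corrupt" sample, decodes the rest.
    if raw == "corrupt":
        raise ZeroDivisionError("division by zero")  # e.g. bad EXIF rational
    return raw.upper()

def iter_decoded(raw_samples, skip_corrupted_images=False):
    for i, raw in enumerate(raw_samples):
        try:
            yield decode_example(raw)
        except (ZeroDivisionError, OSError) as e:
            if not skip_corrupted_images:
                raise  # default: preserve current behavior
            warnings.warn(f"skipping corrupted sample {i}: {e}", UserWarning)

decoded = list(iter_decoded(["a", "corrupt", "b"], skip_corrupted_images=True))
# decoded == ["A", "B"]: iteration survives the bad sample
```

With the flag off, the same corrupt sample raises as today; with it on, the stream continues and the skip is surfaced as a warning.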
Open design questions for maintainers
A few things I'd like your guidance on before sending a PR:
- API surface: a single skip_corrupted_images: bool flag, or a more general on_decode_error: Literal["raise", "skip"] = "raise" so audio/video features can reuse it later?
- Exception scope: catch only (ZeroDivisionError, PIL.UnidentifiedImageError, OSError), or any Exception raised from decode_example?
- Visibility: UserWarning per skip, or a single warning + a counter exposed on the dataset object (e.g. ds.num_skipped)?
- Layer: implement inside Image.decode_example, or one level up in IterableDataset._iter_* so it works uniformly across feature types?
Happy to follow whichever direction you prefer.
Contribution
I'd be happy to submit a PR implementing this once we've agreed on the design above. Please let me know which option you prefer and I'll get started.