IterableDataset: corrupted EXIF image silently terminates streaming iterator instead of skipping the sample #8165

@LIUYellowBlack

Description

Summary

When iterating over a streaming IterableDataset that contains images with corrupted EXIF metadata (e.g., a TIFF rational tag with denominator=0), PIL raises ZeroDivisionError inside exif_transpose. The exception propagates through the HuggingFace datasets pipeline and terminates the streaming iterator: the caller can catch the exception, but cannot resume from the next sample.

Environment

  • datasets version: 4.4.1
  • Pillow version: 10.x
  • Dataset: CC12M (WebDataset format, loaded as streaming IterableDataset)

Minimal Reproducer

from datasets import load_dataset

ds = load_dataset("pixparse/cc12m-wds", split="train", streaming=True)
it = iter(ds)
for i in range(10_000):
    try:
        sample = next(it)
    except ZeroDivisionError as e:
        print(f"sample {i}: {e}")
        # Try to resume — this raises StopIteration immediately,
        # because the internal generator's frame is already closed.
        next(it)
        break

Expected: skip the bad sample and continue.
Actual: StopIteration on the next call — iterator is dead.

Traceback

File "PIL/ImageOps.py", line 711, in exif_transpose
File "PIL/TiffImagePlugin.py", line 297, in _limit_rational
ZeroDivisionError: division by zero

Root Cause

PIL.ImageOps.exif_transpose does not guard against rational EXIF tags with denominator=0, which exist in real-world web-crawled datasets. Inside datasets, the call chain is roughly:

  • IterableDataset.__iter__ → _iter_arrow / _iter_pytorch →
  • Image.decode_example() (in datasets/features/image.py) →
  • PIL.Image.open(...) + PIL.ImageOps.exif_transpose(image) ← raises here

Because the exception is raised inside the internal generator, the generator's frame is closed when it propagates out of __next__, so calling next() again raises StopIteration.
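This is plain Python generator semantics, independent of datasets. A minimal demonstration with a toy generator standing in for the internal sample generator:

```python
def gen():
    # Stand-in for the datasets internal sample generator.
    yield "sample-0"
    raise ZeroDivisionError("corrupted EXIF: division by zero")
    yield "sample-1"  # never reached

it = gen()
assert next(it) == "sample-0"

try:
    next(it)
except ZeroDivisionError:
    pass  # the exception escapes the generator's frame...

# ...and per PEP 342, a generator that raised is finished for good:
try:
    next(it)
    resumed = True
except StopIteration:
    resumed = False

assert resumed is False  # iterator is dead; no way to skip and resume
```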

Impact on Training Frameworks

For training frameworks using streaming datasets, none of the current workarounds is acceptable:

  1. Recreate the iterator after the exception → restarts from sample 0, causing silent data repetition (the entire dataset is re-looped from the beginning).
  2. Continue without recreating → raises StopIteration immediately, ends iteration.
  3. Monkey-patch PIL globally → invasive and hides data quality issues from the user.

This was discussed in pytorch/torchtitan#2550, where the reviewer (@wwwjn) explicitly recommended that the proper fix should live in datasets, not in the training framework.

Requested Fix

Add a skip_corrupted_images: bool = False option (or similar) to load_dataset() / IterableDataset.

  • skip_corrupted_images=False (default): preserve current behavior — let the exception propagate so users are aware of data quality issues.
  • skip_corrupted_images=True: catch decode errors inside the image feature decoder, emit a UserWarning (or expose a counter), and continue to the next sample without terminating the iterator.

Note: the skip_corrupted_images=True behavior requires the catch to happen inside the image decoder. Without catching the error before it propagates out of the generator, there is no way to skip the corrupted sample and resume — the iterator is already terminated at that point.
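As a rough sketch of the decoder-level catch (function and parameter names here are hypothetical, not the actual signature of decode_example in datasets/features/image.py):

```python
import warnings

def decode_example_safe(decode_fn, value, skip_corrupted_images=False):
    """Hypothetical wrapper around an image decode function.

    With skip_corrupted_images=True, decode errors produce None (to be
    filtered out upstream) instead of propagating out of the generator.
    """
    try:
        return decode_fn(value)
    except (ZeroDivisionError, OSError) as e:
        if not skip_corrupted_images:
            raise  # default: preserve current behavior
        warnings.warn(f"Skipping corrupted image: {e}", UserWarning)
        return None

# Toy decoder standing in for PIL-based decoding:
def toy_decode(value):
    if value == "bad":
        raise ZeroDivisionError("division by zero")
    return f"decoded:{value}"

results = [decode_example_safe(toy_decode, v, skip_corrupted_images=True)
           for v in ["a", "bad", "b"]]
# results == ["decoded:a", None, "decoded:b"]
```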

Open design questions for maintainers

A few things I'd like your guidance on before sending a PR:

  1. API surface: a single skip_corrupted_images: bool flag, or a more general on_decode_error: Literal["raise", "skip"] = "raise" so audio/video features can reuse it later?
  2. Exception scope: catch only (ZeroDivisionError, PIL.UnidentifiedImageError, OSError), or any Exception raised from decode_example?
  3. Visibility: UserWarning per skip, or a single warning + a counter exposed on the dataset object (e.g. ds.num_skipped)?
  4. Layer: implement inside Image.decode_example, or one level up in IterableDataset._iter_* so it works uniformly across feature types?
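To illustrate question 4, here is a sketch of the iterator-level variant (all names illustrative, not the actual datasets internals): the try/except wraps the decode call inside the loop body, so the error never propagates out of the generator's frame.

```python
def iter_decoded(raw_samples, decode, on_decode_error="raise"):
    # Sketch of option 4 at the _iter_* layer: catch per sample,
    # before anything escapes this generator.
    num_skipped = 0
    for raw in raw_samples:
        try:
            decoded = decode(raw)
        except (ZeroDivisionError, OSError):
            if on_decode_error == "raise":
                raise
            num_skipped += 1  # could back a ds.num_skipped counter
            continue
        yield decoded

def toy_decode(value):
    # Stand-in for feature decoding; "bad" mimics a corrupted EXIF tag.
    if value == "bad":
        raise ZeroDivisionError("division by zero")
    return value.upper()

out = list(iter_decoded(["a", "bad", "b"], toy_decode, on_decode_error="skip"))
# out == ["A", "B"]: the bad sample is skipped without killing the stream
```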

Happy to follow whichever direction you prefer.

Contribution

I'd be happy to submit a PR implementing this once we've agreed on the design above. Please let me know which option you prefer and I'll get started.
