IterableDataset: corrupted EXIF image silently terminates streaming iterator instead of skipping the sample #8165

@LIUYellowBlack

Description

Summary

When iterating over a streaming IterableDataset that contains images with corrupted EXIF metadata (e.g., a TIFF rational tag with denominator=0), PIL raises ZeroDivisionError inside exif_transpose. The exception propagates through the HuggingFace datasets pipeline and terminates the streaming iterator: the caller can catch the exception, but cannot resume from the next sample.

Environment

  • datasets version: 4.4.1
  • Pillow version: 10.x
  • Dataset: CC12M (WebDataset format, loaded as streaming IterableDataset)

Minimal Reproducer

from datasets import load_dataset

ds = load_dataset("pixparse/cc12m-wds", split="train", streaming=True)
it = iter(ds)
for i in range(10_000):
    try:
        sample = next(it)
    except ZeroDivisionError as e:
        print(f"sample {i}: {e}")
        # Try to resume — this raises StopIteration immediately,
        # because the internal generator's frame is already closed.
        next(it)
        break

Expected: skip the bad sample and continue.
Actual: StopIteration on the next call — iterator is dead.

Traceback

File "PIL/ImageOps.py", line 711, in exif_transpose
File "PIL/TiffImagePlugin.py", line 297, in _limit_rational
ZeroDivisionError: division by zero

Root Cause

PIL.ImageOps.exif_transpose does not guard against rational EXIF tags with denominator=0, which exist in real-world web-crawled datasets. Inside datasets, the call chain is roughly:

  • IterableDataset.__iter__ → _iter_arrow / _iter_pytorch →
  • Image.decode_example() (in datasets/features/image.py) →
  • PIL.Image.open(...) + PIL.ImageOps.exif_transpose(image) ← raises here

Because the exception is raised inside the internal generator, the generator's frame is closed when it propagates out of __next__, so calling next() again raises StopIteration.
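This is plain Python generator semantics, independent of datasets. A minimal demonstration with a toy generator standing in for the internal sample generator:

```python
def gen():
    # Stand-in for the datasets internal sample generator.
    yield "sample-0"
    raise ZeroDivisionError("corrupted EXIF: division by zero")
    yield "sample-1"  # never reached

it = gen()
assert next(it) == "sample-0"

try:
    next(it)
except ZeroDivisionError:
    pass  # the exception escapes the generator's frame...

# ...and per PEP 342, a generator that raised is finished for good:
try:
    next(it)
    resumed = True
except StopIteration:
    resumed = False

assert resumed is False  # iterator is dead; no way to skip and resume
```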

Impact on Training Frameworks

For training frameworks using streaming datasets, none of the current workarounds is acceptable:

  1. Recreate the iterator after the exception → restarts from sample 0, causing silent data repetition (the entire dataset is re-looped from the beginning).
  2. Continue without recreating → raises StopIteration immediately, ends iteration.
  3. Monkey-patch PIL globally → invasive and hides data quality issues from the user.

This was discussed in pytorch/torchtitan#2550, where the reviewer (@wwwjn) explicitly recommended that the proper fix should live in datasets, not in the training framework.

Requested Fix

Add a skip_corrupted_images: bool = False option (or similar) to load_dataset() / IterableDataset.

  • skip_corrupted_images=False (default): preserve current behavior — let the exception propagate so users are aware of data quality issues.
  • skip_corrupted_images=True: catch decode errors inside the image feature decoder, emit a UserWarning (or expose a counter), and continue to the next sample without terminating the iterator.

Note: the skip_corrupted_images=True behavior requires the catch to happen inside the image decoder. Without catching the error before it propagates out of the generator, there is no way to skip the corrupted sample and resume — the iterator is already terminated at that point.
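As a rough sketch of the decoder-level catch (function and parameter names here are hypothetical, not the actual signature of decode_example in datasets/features/image.py):

```python
import warnings

def decode_example_safe(decode_fn, value, skip_corrupted_images=False):
    """Hypothetical wrapper around an image decode function.

    With skip_corrupted_images=True, decode errors produce None (to be
    filtered out upstream) instead of propagating out of the generator.
    """
    try:
        return decode_fn(value)
    except (ZeroDivisionError, OSError) as e:
        if not skip_corrupted_images:
            raise  # default: preserve current behavior
        warnings.warn(f"Skipping corrupted image: {e}", UserWarning)
        return None

# Toy decoder standing in for PIL-based decoding:
def toy_decode(value):
    if value == "bad":
        raise ZeroDivisionError("division by zero")
    return f"decoded:{value}"

results = [decode_example_safe(toy_decode, v, skip_corrupted_images=True)
           for v in ["a", "bad", "b"]]
# results == ["decoded:a", None, "decoded:b"]
```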

Open design questions for maintainers

A few things I'd like your guidance on before sending a PR:

  1. API surface: a single skip_corrupted_images: bool flag, or a more general on_decode_error: Literal["raise", "skip"] = "raise" so audio/video features can reuse it later?
  2. Exception scope: catch only (ZeroDivisionError, PIL.UnidentifiedImageError, OSError), or any Exception raised from decode_example?
  3. Visibility: UserWarning per skip, or a single warning + a counter exposed on the dataset object (e.g. ds.num_skipped)?
  4. Layer: implement inside Image.decode_example, or one level up in IterableDataset._iter_* so it works uniformly across feature types?
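To illustrate question 4, here is a sketch of the iterator-level variant (all names illustrative, not the actual datasets internals): the try/except wraps the decode call inside the loop body, so the error never propagates out of the generator's frame.

```python
def iter_decoded(raw_samples, decode, on_decode_error="raise"):
    # Sketch of option 4 at the _iter_* layer: catch per sample,
    # before anything escapes this generator.
    num_skipped = 0
    for raw in raw_samples:
        try:
            decoded = decode(raw)
        except (ZeroDivisionError, OSError):
            if on_decode_error == "raise":
                raise
            num_skipped += 1  # could back a ds.num_skipped counter
            continue
        yield decoded

def toy_decode(value):
    # Stand-in for feature decoding; "bad" mimics a corrupted EXIF tag.
    if value == "bad":
        raise ZeroDivisionError("division by zero")
    return value.upper()

out = list(iter_decoded(["a", "bad", "b"], toy_decode, on_decode_error="skip"))
# out == ["A", "B"]: the bad sample is skipped without killing the stream
```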

Happy to follow whichever direction you prefer.

Contribution

I'd be happy to submit a PR implementing this once we've agreed on the design above. Please let me know which option you prefer and I'll get started.
