Transform with columns parameter triggers on non-specified column access #7842

@mr-brobot

Describe the bug

Iterating over a Column iterates through the parent Dataset and applies all formatting/transforms on each row, regardless of which column is being accessed. This causes an error when transforms depend on columns not present in the projection.

Steps to reproduce the bug

Load a dataset with multiple columns

from datasets import load_dataset
ds = load_dataset("mrbrobot/isic-2024", split="train")

Define a transform that specifies an input column

def image_transform(batch):
    batch["image"] = batch["image"]  # KeyError when batch doesn't contain "image"
    return batch

# apply transform only to image column
ds = ds.with_format("torch")
ds = ds.with_transform(image_transform, columns=["image"], output_all_columns=True)

Iterate over non-specified column

# iterate over a different column, triggers the transform on each row, but batch doesn't contain "image"
for t in ds["target"]:  # KeyError: 'image'
    print(t)

Expected behavior

If a user iterates over ds["target"] and the transform specifies columns=["image"], the transform should be skipped.
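Until the transform is skipped automatically, one possible workaround is to guard the transform so it only touches "image" when that key is present. This is a minimal sketch and has not been verified against datasets 4.2.0. The name guarded_image_transform is hypothetical, and plain dicts stand in for the batches the library would pass to the transform.

```python
def guarded_image_transform(batch):
    # Hypothetical workaround: only process "image" when the batch contains it.
    if "image" in batch:
        batch["image"] = batch["image"]  # placeholder for real image processing
    return batch


# Plain dicts stand in for the batches passed to the transform (an assumption).
full_batch = {"image": ["img0", "img1"], "target": [0, 1]}
target_only_batch = {"target": [0, 1]}

# Copies are passed in because the transform mutates its argument in place.
full_result = guarded_image_transform(dict(full_batch))
target_result = guarded_image_transform(dict(target_only_batch))
```

With this guard, a batch that lacks "image" passes through unchanged instead of raising KeyError. It only works around the symptom: the transform is still called once per row when iterating over ds["target"].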

Environment info

datasets: 4.2.0
Python: 3.12.12
Linux: Debian 11.11
