Closed
Description
Describe the bug
Iterating over a Column iterates through the parent Dataset and applies all formatting/transforms on each row, regardless of which column is being accessed. This causes an error when transforms depend on columns not present in the projection.
Steps to reproduce the bug
Load a dataset with multiple columns
from datasets import load_dataset

ds = load_dataset("mrbrobot/isic-2024", split="train")
Define a transform that specifies an input column
def image_transform(batch):
    batch["image"] = batch["image"]  # KeyError when batch doesn't contain "image"
    return batch
# apply transform only to image column
ds = ds.with_format("torch")
ds = ds.with_transform(image_transform, columns=["image"], output_all_columns=True)
Iterate over non-specified column
# iterate over a different column, triggers the transform on each row, but batch doesn't contain "image"
for t in ds["target"]:  # KeyError: 'image'
    print(t)
Expected behavior
If a user iterates over ds["target"] and the transform specifies columns=["image"], the transform should be skipped.
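Until the library skips the transform itself, a user-side guard can approximate the expected behavior. The following is only a minimal sketch, not the library's fix: `skip_if_missing` is a hypothetical helper (it is not part of `datasets`), and it assumes the transform receives a dict-like batch that simply lacks the "image" key when another column is iterated, as described above.

```python
def skip_if_missing(transform, required_columns):
    """Wrap a batch transform so it is a no-op when required columns are absent.

    Hypothetical helper for illustration; not part of the datasets library.
    """
    def wrapped(batch):
        # Return the batch untouched if any required column is missing.
        if not all(column in batch for column in required_columns):
            return batch
        return transform(batch)
    return wrapped


def image_transform(batch):
    # Same shape as the transform in the report: it reads batch["image"].
    batch["image"] = batch["image"]
    return batch


guarded_transform = skip_if_missing(image_transform, ["image"])

# A batch without "image" passes through instead of raising KeyError.
target_only = guarded_transform({"target": [0, 1]})

# A batch with "image" is still handed to the original transform.
with_image = guarded_transform({"image": ["img0", "img1"], "target": [0, 1]})
```

With the dataset from the steps above, `guarded_transform` would presumably be passed to `ds.with_transform(...)` in place of `image_transform`. That has not been verified against datasets 4.2.0, and it works around the error rather than addressing why the transform runs for an unrelated column.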
Environment info
datasets: 4.2.0
Python: 3.12.12
Linux: Debian 11.11