fix(sft): reject transformed datasets during preparation by he-yufeng · Pull Request #6054 · huggingface/trl

he-yufeng · 2026-06-13T23:31:23Z

SFTTrainer currently lets Dataset.with_transform() datasets enter the automatic preparation pipeline. That path calls Dataset.map() for EOS insertion and tokenization, and map() reads rows through the active custom transform. For stateful or random transforms, that can bake one transform realization into the prepared Arrow columns while later accesses still run a different transform.
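The failure mode can be illustrated without the real library. The sketch below is a toy model, not the `datasets` implementation: `ToyDataset`, its `map()`, and `random_suffix` are hypothetical stand-ins. The only assumption is the behavior described above, namely that `map()` reads rows through the active transform and the resulting dataset keeps that transform attached.

```python
import random


class ToyDataset:
    """Hypothetical stand-in for a dataset with a lazy, on-access transform."""

    def __init__(self, rows, transform=None):
        self._rows = list(rows)      # stored columns (the "Arrow" data)
        self._transform = transform  # applied on every read

    def with_transform(self, transform):
        return ToyDataset(self._rows, transform)

    def __getitem__(self, i):
        row = dict(self._rows[i])  # copy so stored data is not mutated
        return self._transform(row) if self._transform else row

    def map(self, fn):
        # Reads each row THROUGH the active transform, stores the result,
        # and keeps the transform attached to the new dataset.
        new_rows = [fn(self[i]) for i in range(len(self._rows))]
        return ToyDataset(new_rows, self._transform)


def random_suffix(row):
    # Stand-in for a random augmentation.
    row["text"] = row["text"] + random.choice([" A", " B", " C"])
    return row


random.seed(0)
ds = ToyDataset([{"text": "hello"}]).with_transform(random_suffix)

# Stand-in for the EOS-insertion step of automatic preparation.
prepared = ds.map(lambda row: {**row, "text": row["text"] + " <eos>"})

stored = prepared._rows[0]["text"]  # one random realization, frozen, plus EOS
served = prepared[0]["text"]        # the transform runs again on top of it
```

In this toy model the stored text already contains one random suffix followed by the EOS marker, and every later read appends a second suffix after the EOS marker. That is the kind of silent corruption the guard is meant to prevent.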

This PR adds a guard before SFT dataset preparation: if the dataset has a custom transform, fail with a clear error explaining why automatic preparation is unsafe and pointing users to either dataset_kwargs={"skip_prepare_dataset": True} with already-tokenized examples, or materializing the transform with Dataset.map() before constructing the trainer.

I kept this as the conservative fix from the issue. It does not attempt lazy transform composition or packing support in this PR.

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case. SFTTrainer silently breaks datasets that use Dataset.with_transform #6039
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

AI writing disclosure

No AI usage: the PR was written entirely by a human.
AI-assisted: some parts were suggested or improved by AI, but the PR was written and reviewed by a human.
AI-generated: the PR was mostly or fully generated by an AI tool.

Tests

python -m ruff check trl\trainer\sft_trainer.py tests\test_sft_trainer.py
.\.venv\Scripts\python.exe -m py_compile trl\trainer\sft_trainer.py tests\test_sft_trainer.py
.\.venv\Scripts\python.exe -m pytest tests\test_sft_trainer.py::TestSFTTrainer::test_dataset_with_transform_requires_skip_prepare_dataset -q

One nearby skip-prepare test was also attempted in this local Windows venv, but it fails before reaching trainer logic because the current CPU-only environment rejects the default bf16 settings. I did not change existing tests to hide that environment issue.

Note

Low Risk
Small, early validation in the dataset prep path; only blocks an unsafe pattern and does not change successful training flows.

Overview
SFTTrainer now refuses datasets created with Dataset.with_transform() when automatic dataset preparation runs. Preparation uses Dataset.map(), which materializes rows through the lazy transform and can freeze one random or stateful augmentation into Arrow columns while later reads still apply a different transform.

At the start of _prepare_dataset, the trainer checks for a custom dataset format and raises ValueError with guidance to use dataset_kwargs={'skip_prepare_dataset': True} with trainer-ready (e.g. tokenized) examples from the transform, or to map() deterministic transforms before building the trainer.

A regression test asserts that SFTTrainer construction fails on a with_transform dataset with the expected message.

Reviewed by Cursor Bugbot for commit 27def35.

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 1c9f468.

cursor · 2026-06-13T23:34:52Z



+def _dataset_has_custom_transform(dataset: Dataset | IterableDataset) -> bool:
+    return getattr(dataset, "format", {}).get("type") == "custom"


Iterable transform guard misses stream

Medium Severity

_dataset_has_custom_transform checks dataset.format via getattr, but elsewhere in TRL, IterableDataset formatting is read from _formatting.format_type in _get_dataset_format. If streaming datasets lack a format dict with type == "custom", _prepare_dataset can still run map() on a lazy custom transform.

albertvillanova

Thanks for addressing the underlying issue. Below are my concerns and suggested changes.

albertvillanova · 2026-06-15T09:36:55Z

    from peft import PeftConfig, PeftModel, PeftType, get_peft_model


+def _dataset_has_custom_transform(dataset: Dataset | IterableDataset) -> bool:


I would inline this function: I think it creates an unnecessary indirection for a one-liner that names a well-understood condition.

albertvillanova · 2026-06-15T09:38:26Z



+def _dataset_has_custom_transform(dataset: Dataset | IterableDataset) -> bool:
+    return getattr(dataset, "format", {}).get("type") == "custom"


This guard is defensive programming and should be removed.

albertvillanova · 2026-06-15T09:42:34Z

+            return batch
+
+        dataset = dataset.with_transform(add_suffix)
+        training_args = SFTConfig(output_dir=self.tmp_dir, report_to="none", use_cpu=True, bf16=False)


The guards for use_cpu and bf16 are not appropriate.

albertvillanova · 2026-06-15T09:46:12Z



+def _dataset_has_custom_transform(dataset: Dataset | IterableDataset) -> bool:
+    return getattr(dataset, "format", {}).get("type") == "custom"


There is no public API path that puts a custom transform on an IterableDataset.

I think this would need an isinstance(dataset, Dataset) guard to make this explicit and close the concern permanently.

he-yufeng · 2026-06-15T12:35:40Z

Addressed the review in cee507b9:

inlined the custom-transform check
limited it explicitly to datasets.Dataset; streaming IterableDataset is unaffected
removed the defensive getattr / .get path
removed the test's use_cpu and bf16 overrides
rebased onto current main
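A minimal sketch of what the revised check plausibly looks like after these changes. The real diff is not shown in this thread, so the exact expression, the function name `check_no_custom_transform`, and the error text below are assumptions. `Dataset` and `IterableDataset` here are hypothetical stand-ins that model only the `format` attribute, so that the example runs without the `datasets` package.

```python
class Dataset:
    """Hypothetical stand-in for datasets.Dataset; only `format` is modeled."""

    def __init__(self, format_type=None):
        self.format = {"type": format_type}


class IterableDataset:
    """Hypothetical stand-in for datasets.IterableDataset (no `format` dict)."""


def check_no_custom_transform(dataset):
    # Inlined condition, restricted to map-style datasets via isinstance,
    # with direct attribute and key access instead of getattr / .get.
    if isinstance(dataset, Dataset) and dataset.format["type"] == "custom":
        # Paraphrased guidance, not the actual message from the PR.
        raise ValueError(
            "Dataset has a custom transform; automatic preparation is unsafe. "
            "Use skip_prepare_dataset=True with a transform that returns "
            "trainer-ready examples, or map() a deterministic transform first."
        )


check_no_custom_transform(Dataset())          # plain dataset: passes
check_no_custom_transform(IterableDataset())  # streaming dataset: not checked

try:
    check_no_custom_transform(Dataset("custom"))
    raised = False
except ValueError:
    raised = True
```

Because `isinstance` short-circuits the condition, the streaming stand-in is never asked for a `format` attribute, which is what makes the defensive `getattr` unnecessary.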

Validation:

python -m ruff check trl/trainer/sft_trainer.py tests/test_sft_trainer.py
python -m ruff format --check trl/trainer/sft_trainer.py tests/test_sft_trainer.py
git diff --check

All passed. The focused pytest now reaches the test but this local Windows machine cannot construct the repository-default SFTConfig because it has no bf16-capable GPU. I left the device-specific overrides removed as requested and will rely on the repository CI environment for that test.

albertvillanova

Thanks for addressing the suggested changes. Just one additional suggestion below.

Additionally, would you mind not force-pushing to a public branch? That makes the review process more difficult because it is not possible to check only the modifications to the previously reviewed PR code (by reading only the specific lines that have been modified), and instead forces a review of the entire PR.

albertvillanova · 2026-06-15T13:54:13Z

+            )
+
        # If the dataset is already preprocessed (tokenized), skip the processing steps.
        column_names = get_dataset_column_names(dataset)


I think the error message is misleading: it says "provide already tokenized examples", which implies the user must pre-tokenize outside the transform. The correct pattern (transform that augments AND tokenizes) should be stated explicitly.

The second suggested workaround ("materialize the transform with Dataset.map() before constructing the trainer") is correct only for deterministic transforms; for the reported use case (random augmentation) it defeats the point and should be qualified or dropped.

Agreed. I updated the message in 70cc219f so it no longer implies users must pre-tokenize outside the transform.

It now says to use skip_prepare_dataset=True and make the transform return trainer-ready examples, including tokenized fields. The Dataset.map() workaround is now qualified to deterministic transforms only.
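The recommended pattern, a transform that both augments and tokenizes, can be sketched as follows. This is an illustration under stated assumptions: `VOCAB`, `toy_tokenize`, and `augment_and_tokenize` are hypothetical names, and the whitespace tokenizer is a placeholder for a real tokenizer. The point is only that the transform returns trainer-ready fields such as `input_ids` on every access, so nothing is frozen ahead of time.

```python
import random

# Toy vocabulary; a real setup would use the model's tokenizer instead.
VOCAB = {"<unk>": 0, "<eos>": 1, "hello": 2, "world": 3, "hi": 4}


def toy_tokenize(text):
    """Whitespace tokenizer that appends an EOS id (placeholder only)."""
    ids = [VOCAB.get(tok, VOCAB["<unk>"]) for tok in text.split()]
    return ids + [VOCAB["<eos>"]]


def augment_and_tokenize(batch):
    """Batched transform: randomly augment, then tokenize, on every access."""
    input_ids = []
    for text in batch["text"]:
        if random.random() < 0.5:
            text = text.replace("hello", "hi")  # stand-in augmentation
        input_ids.append(toy_tokenize(text))
    return {"input_ids": input_ids}


out = augment_and_tokenize({"text": ["hello world"]})
```

With the real libraries, a function of this shape would be attached with `Dataset.with_transform()` and the trainer configured with `dataset_kwargs={"skip_prepare_dataset": True}`, as the error message suggests. Any further fields the trainer needs depend on the configuration and are not covered here.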

Validated locally:

python -m py_compile trl\trainer\sft_trainer.py tests\test_sft_trainer.py
python -m ruff check trl\trainer\sft_trainer.py tests\test_sft_trainer.py
python -m ruff format --check trl\trainer\sft_trainer.py tests\test_sft_trainer.py
git diff --check origin/main..HEAD

The focused pytest still fails before reaching this assertion on my Windows machine because the repository-default SFTConfig requires bf16 GPU support here. I kept the previous review request and did not reintroduce use_cpu / bf16 test overrides.

bot-ci-comment · 2026-06-15T14:57:51Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

albertvillanova

Thanks again for your contribution and for addressing all suggestions. It looks good to me.

cursor Bot reviewed Jun 13, 2026

albertvillanova mentioned this pull request Jun 15, 2026

fix(sft): raise clear ValueError when Dataset.with_transform is passed to SFTTrainer #6061

Closed

8 tasks

albertvillanova requested changes Jun 15, 2026

fix(sft): reject transformed datasets during preparation

cee507b

albertvillanova requested changes Jun 15, 2026

fix: clarify transform dataset error

70cc219

albertvillanova approved these changes Jun 16, 2026

Merge branch 'main' into fix/sft-with-transform-guard

27def35

albertvillanova merged commit f92a846 into huggingface:main Jun 16, 2026
12 checks passed


