UnslothTrainer applies ChatML template although passed train dataset is pre-tokenized and contains 'input_ids' field #1843

@crto

Description

Version: unsloth 2025.2.15, unsloth_zoo 2025.2.7, transformers 4.49.0, trl: 0.15.1

Performing continued pretraining using unsloth/Meta-Llama-3.1-8B-Instruct.
Pre-tokenized dataset that contains 'input_ids' field is passed to UnslothTrainer.
The UnslothTrainer constructor nevertheless starts converting the passed train dataset to ChatML:

"Converting train dataset to ChatML (num_proc=8): ..."

I used the debugger to pinpoint the issue. The problem seems to be in
unsloth_compiled_cache/UnslothSFTTrainer.py, lines 663-670:

        # Convert the dataset to ChatML if needed
        if isinstance(dataset, Dataset):  # `IterableDataset.map` does not support `desc`
            map_kwargs["desc"] = f"Converting {dataset_name} dataset to ChatML"
        dataset = dataset.map(
            maybe_convert_to_chatml,
            remove_columns="conversations" if "conversations" in dataset.column_names else None,
            **map_kwargs,
        )

The conversion is performed even though the is_processed variable on line 626 was set to True:

    is_processed = "input_ids" in column_names

This did not occur with unsloth 2025.1.8, unsloth_zoo 2025.1.5, transformers 4.48.2, trl 0.14.0.

TRL's sft_trainer.py looks OK (line 404), guarding the conversion with:

if not is_processed:

This guard looks to be missing in UnslothSFTTrainer.py?
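To illustrate the expected behavior, here is a minimal, self-contained sketch of the guard TRL applies. The function and parameter names below are illustrative stand-ins, not Unsloth's or TRL's actual API; the point is only that when `input_ids` is already among the dataset's columns, the ChatML conversion step should be skipped entirely.

```python
# Hypothetical minimal reproduction of TRL's guard; names are
# illustrative, not the actual Unsloth/TRL implementation.

def maybe_convert_to_chatml(example):
    # Stand-in for TRL's converter: just mark the example as converted.
    example = dict(example)
    example["converted"] = True
    return example

def prepare_dataset(examples, column_names):
    # Mirrors TRL's check: a dataset exposing 'input_ids' is pre-tokenized.
    is_processed = "input_ids" in column_names
    if not is_processed:  # <-- the guard that seems missing in UnslothSFTTrainer.py
        examples = [maybe_convert_to_chatml(e) for e in examples]
    return examples

# A pre-tokenized dataset should pass through untouched:
pretokenized = [{"input_ids": [1, 2, 3]}]
assert prepare_dataset(pretokenized, ["input_ids"]) == pretokenized
```

With the guard in place, a pre-tokenized dataset is returned unchanged, while a raw conversational dataset would still be converted.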
