UnslothTrainer applies ChatML template although passed train dataset is pre-tokenized and contains 'input_ids' field #1843

@crto

Description

Version: unsloth 2025.2.15, unsloth_zoo 2025.2.7, transformers 4.49.0, trl: 0.15.1

Performing continued pretraining using unsloth/Meta-Llama-3.1-8B-Instruct.
Pre-tokenized dataset that contains 'input_ids' field is passed to UnslothTrainer.
The UnslothTrainer constructor nevertheless starts converting the passed train dataset to ChatML:

"Converting train dataset to ChatML (num_proc=8): ..."

I used the debugger to pinpoint the issue. The problem seems to be in
unsloth_compiled_cache/UnslothSFTTrainer.py, lines 663-670:

        # Convert the dataset to ChatML if needed
        if isinstance(dataset, Dataset):  # `IterableDataset.map` does not support `desc`
            map_kwargs["desc"] = f"Converting {dataset_name} dataset to ChatML"
        dataset = dataset.map(
            maybe_convert_to_chatml,
            remove_columns="conversations" if "conversations" in dataset.column_names else None,
            **map_kwargs,
        )

The conversion is performed even though the is_processed variable on line 626 was set to True:

    is_processed = "input_ids" in column_names

This did not occur with unsloth 2025.1.8, unsloth_zoo 2025.1.5, transformers 4.48.2, trl 0.14.0.

TRL's sft_trainer.py looks OK (line 404), guarding the conversion with:

if not is_processed:

This guard looks to be missing in UnslothSFTTrainer.py?
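To illustrate the expected behavior, here is a minimal, self-contained sketch of the guard TRL applies. The function and parameter names below are illustrative stand-ins, not Unsloth's or TRL's actual API; the point is only that when `input_ids` is already among the dataset's columns, the ChatML conversion step should be skipped entirely.

```python
# Hypothetical minimal reproduction of TRL's guard; names are
# illustrative, not the actual Unsloth/TRL implementation.

def maybe_convert_to_chatml(example):
    # Stand-in for TRL's converter: just mark the example as converted.
    example = dict(example)
    example["converted"] = True
    return example

def prepare_dataset(examples, column_names):
    # Mirrors TRL's check: a dataset exposing 'input_ids' is pre-tokenized.
    is_processed = "input_ids" in column_names
    if not is_processed:  # <-- the guard that seems missing in UnslothSFTTrainer.py
        examples = [maybe_convert_to_chatml(e) for e in examples]
    return examples

# A pre-tokenized dataset should pass through untouched:
pretokenized = [{"input_ids": [1, 2, 3]}]
assert prepare_dataset(pretokenized, ["input_ids"]) == pretokenized
```

With the guard in place, a pre-tokenized dataset is returned unchanged, while a raw conversational dataset would still be converted.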
