-
-
Notifications
You must be signed in to change notification settings - Fork 3.9k
Description
Version: unsloth 2025.2.15, unsloth_zoo 2025.2.7, transformers 4.49.0, trl: 0.15.1
Performing continued pretraining using unsloth/Meta-Llama-3.1-8B-Instruct.
Pre-tokenized dataset that contains 'input_ids' field is passed to UnslothTrainer.
UnslothTrainer contructor starts converting passed train dataset to ChatML:
"Converting train dataset to ChatML (num_proc=8): ..."
Used debugger to pinpoint the issue. It seems that the problem is in
unsloth_compiled_cache/UnslothSFTTrainer.py lines 663-670:
# Convert the dataset to ChatML if needed
if isinstance(dataset, Dataset): # `IterableDataset.map` does not support `desc`
map_kwargs["desc"] = f"Converting {dataset_name} dataset to ChatML"
dataset = dataset.map(
maybe_convert_to_chatml,
remove_columns="conversations" if "conversations" in dataset.column_names else None,
**map_kwargs,
)
Conversion is performed although is_processed variable in line 626 was set to true:
is_processed = "input_ids" in column_names
I
This did not occur unsloth 2025.1.8, unsloth_zoo 2025.1.5, transformers 4.48.2, trl 0.14.0
Trl sft_trainer.py looks ok (line 404):
if not is_processed:
This looks to be missing in UnslothSFTTrainer.py?