
Commit 208e9f7

📏 torch_dtype to dtype everywhere (#4000)
Co-authored-by: Quentin Gallouédec <[email protected]>
1 parent 3bfa981 commit 208e9f7


55 files changed: +200 -222 lines changed

.github/workflows/tests.yml

Lines changed: 1 addition & 1 deletion
@@ -235,7 +235,7 @@ jobs:
           uv pip install ".[dev]"
           uv pip install accelerate==1.4.0
           uv pip install datasets==3.0.0
-          uv pip install transformers==4.55.0
+          uv pip install transformers==4.56.0
 
       - name: Test with pytest
         run: |

docs/source/detoxifying_a_lm.md

Lines changed: 2 additions & 2 deletions
@@ -93,10 +93,10 @@ Our goal is to train models up to 6B parameters, which is about 24GB in float32!
 - Use `bfloat16` precision: Simply load your model in `bfloat16` when calling `from_pretrained` and you can reduce the size of the model by 2:
 
 ```python
-model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", torch_dtype=torch.bfloat16)
+model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", dtype=torch.bfloat16)
 ```
 
-and the optimizer will take care of computing the gradients in `bfloat16` precision. Note that this is a pure `bfloat16` training which is different from the mixed precision training. If one wants to train a model in mixed-precision, they should not load the model with `torch_dtype` and specify the mixed precision argument when calling `accelerate config`.
+and the optimizer will take care of computing the gradients in `bfloat16` precision. Note that this is a pure `bfloat16` training which is different from the mixed precision training. If one wants to train a model in mixed-precision, they should not load the model with `dtype` and specify the mixed precision argument when calling `accelerate config`.
 
 - Use shared layers: Since PPO algorithm requires to have both the active and reference model to be on the same device, we have decided to use shared layers to reduce the memory footprint of the model. This can be achieved by specifying `num_shared_layers` argument when calling the `create_reference_model()` function. For example, if you want to share the first 6 layers of the model, you can do it like this:
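The doc passage above distinguishes pure `bfloat16` training from mixed precision. A hedged sketch of the two setups (the `Accelerator` call is illustrative of what `accelerate config` would set up):

```python
import torch
from accelerate import Accelerator
from transformers import AutoModelForCausalLM

# Pure bfloat16: the weights themselves are stored in bf16, halving memory.
model_bf16 = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B", dtype=torch.bfloat16)

# Mixed precision instead: keep the default fp32 weights (no dtype argument)
# and let the training loop autocast to bf16.
accelerator = Accelerator(mixed_precision="bf16")
model_fp32 = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-j-6B")
```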

docs/source/dpo_trainer.md

Lines changed: 1 addition & 1 deletion
@@ -255,7 +255,7 @@ model = AutoModelForCausalLM.from_pretrained(
     load_in_4bit=True,
     quantization_config=bnb_config,
     attn_implementation="flash_attention_2",
-    torch_dtype=torch.bfloat16,
+    dtype=torch.bfloat16,
     device_map="auto",
 )
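The hunk references a `bnb_config` defined earlier in that doc; a plausible construction (a sketch, the exact options in the doc may differ) pairs 4-bit NF4 quantization with a bf16 compute dtype:

```python
import torch
from transformers import BitsAndBytesConfig

# Assumed, typical QLoRA-style config. Note that `bnb_4bit_compute_dtype`
# keeps its name: only `from_pretrained`'s `torch_dtype` kwarg became `dtype`.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```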

docs/source/grpo_trainer.md

Lines changed: 1 addition & 1 deletion
@@ -573,7 +573,7 @@ accelerate launch \
     --output_dir grpo-Qwen2.5-VL-3B-Instruct \
     --learning_rate 1e-5 \
     --gradient_checkpointing \
-    --torch_dtype bfloat16 \
+    --dtype bfloat16 \
     --max_prompt_length 2048 \
     --max_completion_length 1024 \
     --use_vllm \
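The renamed CLI flag is parsed into TRL's `ModelConfig` dataclass; programmatically the same setting would look roughly like this (a sketch, assuming the post-rename field name):

```python
from trl import ModelConfig

# Rough equivalent of passing `--dtype bfloat16` on the command line (illustrative).
model_args = ModelConfig(
    model_name_or_path="Qwen/Qwen2.5-VL-3B-Instruct",
    dtype="bfloat16",
)
```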

docs/source/iterative_sft_trainer.md

Lines changed: 2 additions & 2 deletions
@@ -79,7 +79,7 @@ from trl import IterativeSFTConfig
 
 config = IterativeSFTConfig(
     # Model initialization parameters
-    model_init_kwargs={"torch_dtype": "bfloat16"},
+    model_init_kwargs={"dtype": "bfloat16"},
 
     # Data preprocessing parameters
     max_length=512,
@@ -104,7 +104,7 @@ You can control how the model is initialized by passing keyword arguments to `mo
 ```python
 config = IterativeSFTConfig(
     model_init_kwargs={
-        "torch_dtype": "bfloat16",
+        "dtype": "bfloat16",
         "device_map": "auto",
         "trust_remote_code": True,
     }
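Anything placed in `model_init_kwargs` is forwarded to `from_pretrained` when the trainer builds the model from a model id. A hedged end-to-end sketch (model id illustrative; the real trainer also expects a tokenizer/processing class and data):

```python
from trl import IterativeSFTConfig, IterativeSFTTrainer

config = IterativeSFTConfig(
    model_init_kwargs={"dtype": "bfloat16"},  # forwarded to from_pretrained
    max_length=512,
)
# Passing a string model id lets the trainer instantiate the model itself,
# applying the kwargs above (sketch only).
trainer = IterativeSFTTrainer(model="Qwen/Qwen3-0.6B", args=config)
```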

docs/source/sft_trainer.md

Lines changed: 3 additions & 3 deletions
@@ -130,16 +130,16 @@ While training and evaluating we record the following reward metrics:
 You can directly pass the kwargs of the [`~transformers.AutoModelForCausalLM.from_pretrained()`] method to the [`SFTConfig`]. For example, if you want to load a model in a different precision, analogous to
 
 ```python
-model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", torch_dtype=torch.bfloat16)
+model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B", dtype=torch.bfloat16)
 ```
 
-you can do so by passing the `model_init_kwargs={"torch_dtype": torch.bfloat16}` argument to the [`SFTConfig`].
+you can do so by passing the `model_init_kwargs={"dtype": torch.bfloat16}` argument to the [`SFTConfig`].
 
 ```python
 from trl import SFTConfig
 
 training_args = SFTConfig(
-    model_init_kwargs={"torch_dtype": torch.bfloat16},
+    model_init_kwargs={"dtype": torch.bfloat16},
 )
 ```
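For comparison, the precision choice routed through `SFTConfig` ends up in the trainer's internal `from_pretrained` call. A minimal hedged sketch (dataset choice is illustrative):

```python
import torch
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(model_init_kwargs={"dtype": torch.bfloat16})
trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",  # a string id: the trainer loads it with the kwargs above
    args=training_args,
    train_dataset=load_dataset("trl-lib/Capybara", split="train"),  # illustrative dataset
)
trainer.train()
```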

examples/research_projects/layer_skip/scripts/benchmark_layer_skip.py

Lines changed: 1 addition & 1 deletion
@@ -40,7 +40,7 @@ def generate_tokens_with_assistance(model, inputs, assistant_early_exit):
 if __name__ == "__main__":
     ckpt = config.hub_model_id
 
-    model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto", torch_dtype=torch.bfloat16)
+    model = AutoModelForCausalLM.from_pretrained(ckpt, device_map="auto", dtype=torch.bfloat16)
     tokenizer = AutoTokenizer.from_pretrained(ckpt)
 
     prompt = "### Instruction: What are my alarms for the rest of the day?\n ### Response: "

examples/research_projects/layer_skip/scripts/layer_skip_sft.py

Lines changed: 1 addition & 1 deletion
@@ -43,7 +43,7 @@ def formatting_prompts_func(example):
 
 # load the model and tokenizer
 print("[INFO] loading the model and tokenizer...")
-model = AutoModelForCausalLM.from_pretrained(config.model_name, device_map="auto", torch_dtype=torch.bfloat16)
+model = AutoModelForCausalLM.from_pretrained(config.model_name, device_map="auto", dtype=torch.bfloat16)
 tokenizer = AutoTokenizer.from_pretrained(config.tokenizer_name, add_eos_token=True)
 
 # adding pad and eos tokens if not provided in the tokenizer

examples/research_projects/stack_llama/scripts/merge_peft_adapter.py

Lines changed: 2 additions & 4 deletions
@@ -42,12 +42,10 @@ class ScriptArguments:
 if peft_config.task_type == "SEQ_CLS":
     # The sequence classification task is used for the reward model in PPO
     model = AutoModelForSequenceClassification.from_pretrained(
-        script_args.base_model_name, num_labels=1, torch_dtype=torch.bfloat16
+        script_args.base_model_name, num_labels=1, dtype=torch.bfloat16
     )
 else:
-    model = AutoModelForCausalLM.from_pretrained(
-        script_args.base_model_name, return_dict=True, torch_dtype=torch.bfloat16
-    )
+    model = AutoModelForCausalLM.from_pretrained(script_args.base_model_name, return_dict=True, dtype=torch.bfloat16)
 
 tokenizer = AutoTokenizer.from_pretrained(script_args.base_model_name)

examples/research_projects/stack_llama/scripts/reward_modeling.py

Lines changed: 1 addition & 3 deletions
@@ -168,9 +168,7 @@ class ScriptArguments:
     lora_dropout=0.1,
 )
 
-model = AutoModelForSequenceClassification.from_pretrained(
-    script_args.model_name, num_labels=1, torch_dtype=torch.bfloat16
-)
+model = AutoModelForSequenceClassification.from_pretrained(script_args.model_name, num_labels=1, dtype=torch.bfloat16)
 model = get_peft_model(model, peft_config)
 model.print_trainable_parameters()
