Commit c38cb69

🧘 Enhance markdown style (#4235)
1 parent: 68ef15c

42 files changed (+450, −512 lines)

CONTRIBUTING.md

Lines changed: 76 additions & 106 deletions
Large diffs are not rendered by default.

docs/source/bco_trainer.md

Lines changed: 8 additions & 8 deletions
````diff
@@ -1,6 +1,6 @@
 # BCO Trainer
 
-[![](https://img.shields.io/badge/All_models-BCO-blue)](https://huggingface.co/models?other=bco,trl)
+[![model badge](https://img.shields.io/badge/All_models-BCO-blue)](https://huggingface.co/models?other=bco,trl)
 
 TRL supports the Binary Classifier Optimization (BCO).
 The [BCO](https://huggingface.co/papers/2404.04656) authors train a binary classifier whose logit serves as a reward so that the classifier maps {prompt, chosen completion} pairs to 1 and {prompt, rejected completion} pairs to 0.
@@ -12,17 +12,16 @@ The [`BCOTrainer`] requires an [unpaired preference dataset](dataset_formats#unp
 The [`BCOTrainer`] supports both [conversational](dataset_formats#conversational) and [standard](dataset_formats#standard) dataset formats. When provided with a conversational dataset, the trainer will automatically apply the chat template to the dataset.
 
 ## Expected model format
+
 The BCO trainer expects a model of `AutoModelForCausalLM`, compared to PPO that expects `AutoModelForCausalLMWithValueHead` for the value function.
 
 ## Using the `BCOTrainer`
 
-For a detailed example have a look at the `examples/scripts/bco.py` script. At a high level we need to initialize the `BCOTrainer` with a `model` we wish to train and a reference `ref_model` which we will use to calculate the implicit rewards of the preferred and rejected response.
+For a detailed example have a look at the `examples/scripts/bco.py` script. At a high level we need to initialize the `BCOTrainer` with a `model` we wish to train and a reference `ref_model` which we will use to calculate the implicit rewards of the preferred and rejected response.
 
 The `beta` refers to the hyperparameter of the implicit reward, and the dataset contains the 3 entries listed above. Note that the `model` and `ref_model` need to have the same architecture (ie decoder only or encoder-decoder).
 
-
-
-```py
+```python
 training_args = BCOConfig(
     beta=0.1,
 )
@@ -35,9 +34,10 @@ bco_trainer = BCOTrainer(
     processing_class=tokenizer,
 )
 ```
+
 After this one can then call:
 
-```py
+```python
 bco_trainer.train()
 ```
 
@@ -49,7 +49,7 @@ If the prompts in your desired and undesired datasets differ a lot, it is useful
 
 Choose an embedding model and tokenizer:
 
-```py
+```python
 embedding_model = AutoModel.from_pretrained(your_model_id)
 embedding_tokenizer = AutoTokenizer.from_pretrained(your_model_id)
 
@@ -64,7 +64,7 @@ embedding_func = partial(embed_prompt, model=embedding_model)
 
 Set `prompt_sample_size` to define how many prompts are selected to train the UDM classifier and start the training with the provided embedding function:
 
-```py
+```python
 training_args = BCOConfig(
     beta=0.1,
     prompt_sample_size=512,
````
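
Pieced together, the BCO snippets touched in this file correspond roughly to the following end-to-end flow. This is a hedged sketch rather than the real `examples/scripts/bco.py`: the model and dataset ids, the mean-pooling inside `embed_prompt`, and the `embedding_func`/`embedding_tokenizer` keyword names are assumptions, not part of the commit.

```python
# Hedged sketch of the BCO flow shown in the hunks above; names marked as
# assumptions are illustrative and not taken from the commit itself.
from functools import partial

from datasets import load_dataset
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer
from trl import BCOConfig, BCOTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: any causal LM
model = AutoModelForCausalLM.from_pretrained(model_id)
ref_model = AutoModelForCausalLM.from_pretrained(model_id)  # same architecture as `model`
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Unpaired preference dataset with prompt/completion/label entries (assumption: dataset id).
train_dataset = load_dataset("trl-lib/kto-mix-14k", split="train")

# Optional UDM pieces from the "Choose an embedding model and tokenizer" hunk.
embedding_model = AutoModel.from_pretrained(model_id)
embedding_tokenizer = AutoTokenizer.from_pretrained(model_id)

def embed_prompt(input_ids, attention_mask, model):
    # Assumption: mean-pool the last hidden state to get one vector per prompt.
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    return outputs.last_hidden_state.mean(dim=1)

embedding_func = partial(embed_prompt, model=embedding_model)

training_args = BCOConfig(
    output_dir="bco-model",
    beta=0.1,                 # implicit reward hyperparameter
    prompt_sample_size=512,   # prompts used to fit the UDM classifier
)

bco_trainer = BCOTrainer(
    model,
    ref_model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    embedding_func=embedding_func,            # assumption: keyword name
    embedding_tokenizer=embedding_tokenizer,  # assumption: keyword name
)
bco_trainer.train()
```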

docs/source/best_of_n.md

Lines changed: 2 additions & 11 deletions
````diff
@@ -1,4 +1,4 @@
-# Best of N sampling: Alternative ways to get better model output without RL based fine-tuning
+# Best of N sampling: Alternative ways to get better model output without RL based fine-tuning
 
 Within the extras module is the `best-of-n` sampler class that serves as an alternative method of generating better model output.
 As to how it fares against the RL based fine-tuning, please look in the `examples` directory for a comparison example
@@ -8,7 +8,6 @@ As to how it fares against the RL based fine-tuning, please look in the `example
 To get started quickly, instantiate an instance of the class with a model, a length sampler, a tokenizer and a callable that serves as a proxy reward pipeline that outputs reward scores for input queries
 
 ```python
-
 from transformers import pipeline, AutoTokenizer
 from trl import AutoModelForCausalLMWithValueHead
 from trl.core import LengthSampler
@@ -19,37 +18,29 @@ reward_pipe = pipeline("sentiment-analysis", model=reward_model, device=device)
 tokenizer = AutoTokenizer.from_pretrained(ref_model_name)
 tokenizer.pad_token = tokenizer.eos_token
 
-
 # callable that takes a list of raw text and returns a list of corresponding reward scores
 def queries_to_scores(list_of_strings):
     return [output["score"] for output in reward_pipe(list_of_strings)]
 
 best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler)
-
-
 ```
 
 And assuming you have a list/tensor of tokenized queries, you can generate better output by calling the `generate` method
 
 ```python
-
 best_of_n.generate(query_tensors, device=device, **gen_kwargs)
-
 ```
+
 The default sample size is 4, but you can change it at the time of instance initialization like so
 
 ```python
-
 best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler, sample_size=8)
-
 ```
 
 The default output is the result of taking the top scored output for each query, but you can change it to top 2 and so on by passing the `n_candidates` argument at the time of instance initialization
 
 ```python
-
 best_of_n = BestOfNSampler(model, tokenizer, queries_to_scores, length_sampler=output_length_sampler, n_candidates=2)
-
 ```
 
 There is the option of setting the generation settings (like `temperature`, `pad_token_id`) at the time of instance creation as opposed to when calling the `generate` method.
````
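
Taken together, the snippets above amount to roughly the following usage. This is a hedged sketch: the model ids, the import path for the extras module, the sampled length range, and the `generation_config` way of fixing generation settings at construction time are assumptions for illustration, not part of the commit.

```python
# Hedged sketch combining the Best-of-N snippets above.
import torch
from transformers import AutoTokenizer, GenerationConfig, pipeline
from trl import AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler
from trl.extras import BestOfNSampler  # assumption: import path for the extras module

ref_model_name = "lvwerra/gpt2-imdb"      # assumption: any causal LM
reward_model = "lvwerra/distilbert-imdb"  # assumption: any text-classification model
device = "cuda" if torch.cuda.is_available() else "cpu"

model = AutoModelForCausalLMWithValueHead.from_pretrained(ref_model_name)
reward_pipe = pipeline("sentiment-analysis", model=reward_model, device=device)
tokenizer = AutoTokenizer.from_pretrained(ref_model_name)
tokenizer.pad_token = tokenizer.eos_token
output_length_sampler = LengthSampler(4, 16)  # assumption: sample completion lengths in [4, 16)

def queries_to_scores(list_of_strings):
    # proxy reward: classifier score for each generated text
    return [output["score"] for output in reward_pipe(list_of_strings)]

# Generation settings fixed at construction time (assumption: passed via GenerationConfig).
generation_config = GenerationConfig(
    do_sample=True, temperature=0.7, pad_token_id=tokenizer.eos_token_id
)

best_of_n = BestOfNSampler(
    model,
    tokenizer,
    queries_to_scores,
    length_sampler=output_length_sampler,
    sample_size=8,    # draw 8 candidates per query instead of the default 4
    n_candidates=2,   # return the top 2 scored candidates per query
    generation_config=generation_config,
)

# query_tensors: a list of tokenized prompts
query_tensors = [torch.tensor(tokenizer.encode("The movie was")).to(device)]
print(best_of_n.generate(query_tensors, device=device))
```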

docs/source/clis.md

Lines changed: 15 additions & 13 deletions
````diff
@@ -2,9 +2,11 @@
 
 TRL provides a powerful command-line interface (CLI) to fine-tune large language models (LLMs) using methods like Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and more. The CLI abstracts away much of the boilerplate, letting you launch training jobs quickly and reproducibly.
 
+## Commands
+
 Currently supported commands are:
 
-#### Training Commands
+### Training Commands
 
 - `trl dpo`: fine-tune a LLM with DPO
 - `trl grpo`: fine-tune a LLM with GRPO
@@ -13,7 +15,7 @@ Currently supported commands are:
 - `trl rloo`: fine-tune a LLM with RLOO
 - `trl sft`: fine-tune a LLM with SFT
 
-#### Other Commands
+### Other Commands
 
 - `trl env`: get the system information
 - `trl vllm-serve`: serve a model with vLLM
@@ -197,22 +199,22 @@ trl reward --config reward_config.yaml
 
 The `--accelerate_config` flag lets you easily configure distributed training with [🤗 Accelerate](https://github.com/huggingface/accelerate). This flag accepts either:
 
-* the name of a predefined config profile (built into TRL), or
-* a path to a custom Accelerate YAML config file.
+- the name of a predefined config profile (built into TRL), or
+- a path to a custom Accelerate YAML config file.
 
 #### Predefined Config Profiles
 
 TRL provides several ready-to-use Accelerate configs to simplify common training setups:
 
-| Name | Description |
-| ------------ | ----------------------------------- |
-| `fsdp1` | Fully Sharded Data Parallel Stage 1 |
-| `fsdp2` | Fully Sharded Data Parallel Stage 2 |
-| `zero1` | DeepSpeed ZeRO Stage 1 |
-| `zero2` | DeepSpeed ZeRO Stage 2 |
-| `zero3` | DeepSpeed ZeRO Stage 3 |
-| `multi_gpu` | Multi-GPU training |
-| `single_gpu` | Single-GPU training |
+| Name | Description |
+| --- | --- |
+| `fsdp1` | Fully Sharded Data Parallel Stage 1 |
+| `fsdp2` | Fully Sharded Data Parallel Stage 2 |
+| `zero1` | DeepSpeed ZeRO Stage 1 |
+| `zero2` | DeepSpeed ZeRO Stage 2 |
+| `zero3` | DeepSpeed ZeRO Stage 3 |
+| `multi_gpu` | Multi-GPU training |
+| `single_gpu` | Single-GPU training |
 
 To use one of these, just pass the name to `--accelerate_config`. TRL will automatically load the corresponding config file from `trl/accelerate_config/`.
 
````
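
For readers who want to try the flag, the pieces above combine into roughly the following launch, wrapped in Python's `subprocess` to keep a single example language. The model and dataset ids are assumptions; `--accelerate_config zero2` refers to the DeepSpeed ZeRO Stage 2 profile from the table.

```python
# Hedged sketch: launching the TRL CLI with a predefined Accelerate profile.
import subprocess

subprocess.run(
    [
        "trl", "sft",
        "--model_name_or_path", "Qwen/Qwen2.5-0.5B",  # assumption: example model id
        "--dataset_name", "trl-lib/Capybara",         # assumption: example dataset id
        "--accelerate_config", "zero2",               # DeepSpeed ZeRO Stage 2 profile from the table
    ],
    check=True,
)
```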

docs/source/community_tutorials.md

Lines changed: 1 addition & 1 deletion
````diff
@@ -18,7 +18,6 @@ Community tutorials are made by active members of the Hugging Face community who
 | Preference Optimization | [`ORPOTrainer`] | Fine-tuning Llama 3 with ORPO combining instruction tuning and preference alignment | [Maxime Labonne](https://huggingface.co/mlabonne) | [Link](https://mlabonne.github.io/blog/posts/2024-04-19_Fine_tune_Llama_3_with_ORPO.html) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1eHNWg9gnaXErdAa8_mcvjMupbSS6rDvi) |
 | Instruction tuning | [`SFTTrainer`] | How to fine-tune open LLMs in 2025 with Hugging Face | [Philipp Schmid](https://huggingface.co/philschmid) | [Link](https://www.philschmid.de/fine-tune-llms-in-2025) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fine-tune-llms-in-2025.ipynb) |
 
-
 ### Videos
 
 | Task | Title | Author | Video |
@@ -32,6 +31,7 @@ Community tutorials are made by active members of the Hugging Face community who
 
 > [!WARNING]
 > The tutorial uses two deprecated features:
+>
 > - `SFTTrainer(..., tokenizer=tokenizer)`: Use `SFTTrainer(..., processing_class=tokenizer)` instead, or simply omit it (it will be inferred from the model).
 > - `setup_chat_format(model, tokenizer)`: Use `SFTConfig(..., chat_template_path="Qwen/Qwen3-0.6B")`, where `chat_template_path` specifies the model whose chat template you want to copy.
 
````
3737

docs/source/cpo_trainer.md

Lines changed: 8 additions & 10 deletions
````diff
@@ -1,6 +1,6 @@
 # CPO Trainer
 
-[![](https://img.shields.io/badge/All_models-CPO-blue)](https://huggingface.co/models?other=cpo,trl)
+[![model badge](https://img.shields.io/badge/All_models-CPO-blue)](https://huggingface.co/models?other=cpo,trl)
 
 ## Overview
 
@@ -98,15 +98,13 @@ To use this loss as described in the paper, we can set the `loss_type="alphapo"`
 
 The CPO algorithm supports several loss functions. The loss function can be set using the `loss_type` parameter in the [`CPOConfig`]. The following loss functions are supported:
 
-| `loss_type=` | Description |
-| -------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------- |
-| `"sigmoid"` (default) | Given the preference data, we can fit a binary classifier according to the Bradley-Terry model, and in fact, the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. |
-| `"hinge"` | The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. In this case, the `beta` is the reciprocal of the margin. |
-| `"ipo"` | The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss. In this case, the `beta` is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair, and thus the smaller the `beta`, the larger this gap is. As per the paper, the loss is averaged over log-likelihoods of the completion (unlike DPO, which is summed only). |
-| `"simpo"` | The [SimPO](https://huggingface.co/papers/2405.14734) method is also implemented in the [`CPOTrainer`]. SimPO is an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization. To use this loss, simply set `loss_type="simpo"` and `cpo_alpha=0.0` in the [`CPOConfig`] and `simpo_gamma` to a recommended value. |
-| `"alphapo"` | The [AlphaPO](https://huggingface.co/papers/2501.03884) method is also implemented in the [`CPOTrainer`]. This is syntactic sugar that automatically sets `loss_type="simpo"` and `cpo_alpha=0.0`. AlphaPO applies a transformation to the reward function shape in the context of SimPO loss when the `alpha` parameter is non-zero. |
-
-
+| `loss_type=` | Description |
+| --- | --- |
+| `"sigmoid"` (default) | Given the preference data, we can fit a binary classifier according to the Bradley-Terry model, and in fact, the [DPO](https://huggingface.co/papers/2305.18290) authors propose the sigmoid loss on the normalized likelihood via the `logsigmoid` to fit a logistic regression. |
+| `"hinge"` | The [RSO](https://huggingface.co/papers/2309.06657) authors propose to use a hinge loss on the normalized likelihood from the [SLiC](https://huggingface.co/papers/2305.10425) paper. In this case, the `beta` is the reciprocal of the margin. |
+| `"ipo"` | The [IPO](https://huggingface.co/papers/2310.12036) authors provide a deeper theoretical understanding of the DPO algorithms and identify an issue with overfitting and propose an alternative loss. In this case, the `beta` is the reciprocal of the gap between the log-likelihood ratios of the chosen vs the rejected completion pair, and thus the smaller the `beta`, the larger this gap is. As per the paper, the loss is averaged over log-likelihoods of the completion (unlike DPO, which is summed only). |
+| `"simpo"` | The [SimPO](https://huggingface.co/papers/2405.14734) method is also implemented in the [`CPOTrainer`]. SimPO is an alternative loss that adds a reward margin, allows for length normalization, and does not use BC regularization. To use this loss, simply set `loss_type="simpo"` and `cpo_alpha=0.0` in the [`CPOConfig`] and `simpo_gamma` to a recommended value. |
+| `"alphapo"` | The [AlphaPO](https://huggingface.co/papers/2501.03884) method is also implemented in the [`CPOTrainer`]. This is syntactic sugar that automatically sets `loss_type="simpo"` and `cpo_alpha=0.0`. AlphaPO applies a transformation to the reward function shape in the context of SimPO loss when the `alpha` parameter is non-zero. |
 
 ### For Mixture of Experts Models: Enabling the auxiliary loss
 
````
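
The table rows above describe configuration switches; selecting one of them (the SimPO variant) might look roughly as follows. This is a hedged sketch: the model and dataset ids and the `simpo_gamma` value are assumptions, while `loss_type="simpo"` and `cpo_alpha=0.0` come from the table itself.

```python
# Hedged sketch: picking a CPO loss variant via CPOConfig.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import CPOConfig, CPOTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: example model id
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # assumption: preference dataset

training_args = CPOConfig(
    output_dir="cpo-simpo-model",
    loss_type="simpo",  # SimPO loss as implemented in CPOTrainer
    cpo_alpha=0.0,      # disable the BC regularizer, as the SimPO row recommends
    simpo_gamma=0.5,    # reward margin; assumption: an illustrative value
)

trainer = CPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```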

docs/source/customization.md

Lines changed: 3 additions & 5 deletions
````diff
@@ -2,8 +2,6 @@
 
 TRL is designed with modularity in mind so that users are able to efficiently customize the training loop for their needs. Below are some examples on how you can apply and test different techniques. Note: Although these examples use the DPOTrainer, the customization applies to most (if not all) trainers.
 
-
-
 ## Use different optimizers and schedulers
 
 By default, the `DPOTrainer` creates a `torch.optim.AdamW` optimizer. You can create and define a different optimizer and pass it to `DPOTrainer` as follows:
@@ -84,11 +82,11 @@ trainer = DPOTrainer(
 trainer.train()
 ```
 
-## Pass 8-bit reference models
-
+## Pass 8-bit reference models
+
 Since `trl` supports all keyword arguments when loading a model from `transformers` using `from_pretrained`, you can also leverage `load_in_8bit` from `transformers` for more memory efficient fine-tuning.
 
-Read more about 8-bit model loading in `transformers` [here](https://huggingface.co/docs/transformers/en/peft#load-in-8bit-or-4bit).
+Read more about 8-bit model loading in `transformers` [Load in 8bit or 4bit](https://huggingface.co/docs/transformers/en/peft#load-in-8bit-or-4bit).
 
 ```python
 from datasets import load_dataset
````
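
The example in this hunk is truncated, but the idea it describes (loading only the reference model in 8-bit) might look roughly like the sketch below. The model and dataset ids are assumptions, and `bitsandbytes` must be installed for 8-bit loading.

```python
# Hedged sketch: 8-bit reference model for DPO, as described in the prose above.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # assumption: example model id
model = AutoModelForCausalLM.from_pretrained(model_id)
ref_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit reference model
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
train_dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # assumption

trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=DPOConfig(output_dir="dpo-8bit-ref"),
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```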
