Commit 190f987

CHORE DOC Migrate tips syntax (#2801)
Discussed internally
1 parent 6030f91 commit 190f987

11 files changed: +102 additions, -177 deletions

docs/source/accelerate/deepspeed.md

Lines changed: 18 additions & 27 deletions
@@ -276,11 +276,8 @@ In the above example, the memory consumed per GPU is **36.6 GB**. Therefore, wha
 # Use PEFT and DeepSpeed with ZeRO3 and CPU Offloading for finetuning large models on a single GPU
 This section of guide will help you learn how to use our DeepSpeed [training script](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py). You'll configure the script to train a large model for conditional generation with ZeRO-3 and CPU Offload.

-<Tip>
-
-💡 To help you get started, check out our example training scripts for [causal language modeling](https://github.com/huggingface/peft/blob/main/examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py) and [conditional generation](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py). You can adapt these scripts for your own applications or even use them out of the box if your task is similar to the one in the scripts.
-
-</Tip>
+> [!TIP]
+> 💡 To help you get started, check out our example training scripts for [causal language modeling](https://github.com/huggingface/peft/blob/main/examples/causal_language_modeling/peft_lora_clm_accelerate_ds_zero3_offload.py) and [conditional generation](https://github.com/huggingface/peft/blob/main/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py). You can adapt these scripts for your own applications or even use them out of the box if your task is similar to the one in the scripts.

 ## Configuration

@@ -338,11 +335,8 @@ Let's dive a little deeper into the script so you can see what's going on, and u

 Within the [`main`](https://github.com/huggingface/peft/blob/2822398fbe896f25d4dac5e468624dc5fd65a51b/examples/conditional_generation/peft_lora_seq2seq_accelerate_ds_zero3_offload.py#L103) function, the script creates an [`~accelerate.Accelerator`] class to initialize all the necessary requirements for distributed training.

-<Tip>
-
-💡 Feel free to change the model and dataset inside the `main` function. If your dataset format is different from the one in the script, you may also need to write your own preprocessing function.
-
-</Tip>
+> [!TIP]
+> 💡 Feel free to change the model and dataset inside the `main` function. If your dataset format is different from the one in the script, you may also need to write your own preprocessing function.

 The script also creates a configuration for the 🤗 PEFT method you're using, which in this case, is LoRA. The [`LoraConfig`] specifies the task type and important parameters such as the dimension of the low-rank matrices, the matrices scaling factor, and the dropout probability of the LoRA layers. If you want to use a different 🤗 PEFT method, make sure you replace `LoraConfig` with the appropriate [class](../package_reference/tuners).

@@ -439,20 +433,17 @@ dataset['train'][label_column][:10]=['no complaint', 'no complaint', 'complaint'
 2. When using CPU offloading, the major gains from using PEFT to shrink the optimizer states and gradients to that of the adapter weights would be realized on CPU RAM and there won't be savings with respect to GPU memory.
 3. DeepSpeed Stage 3 and qlora when used with CPU offloading leads to more GPU memory usage when compared to disabling CPU offloading.

-<Tip>
-
-💡 When you have code that requires merging (and unmerging) of weights, try to manually collect the parameters with DeepSpeed Zero-3 beforehand:
-
-```python
-import deepspeed
-
-is_ds_zero_3 = ... # check if Zero-3
-
-with deepspeed.zero.GatheredParameters(list(model.parameters()), enabled= is_ds_zero_3):
-    model.merge_adapter()
-    # do whatever is needed, then unmerge in the same context if unmerging is required
-    ...
-    model.unmerge_adapter()
-```
-
-</Tip>
+> [!TIP]
+> 💡 When you have code that requires merging (and unmerging) of weights, try to manually collect the parameters with DeepSpeed Zero-3 beforehand:
+>
+> ```python
+> import deepspeed
+>
+> is_ds_zero_3 = ... # check if Zero-3
+>
+> with deepspeed.zero.GatheredParameters(list(model.parameters()), enabled= is_ds_zero_3):
+>     model.merge_adapter()
+>     # do whatever is needed, then unmerge in the same context if unmerging is required
+>     ...
+>     model.unmerge_adapter()
+> ```

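The last hunk above leaves the ZeRO-3 check as a placeholder (`is_ds_zero_3 = ...`). A minimal sketch of one way to fill it in, assuming training is launched through 🤗 Accelerate/Transformers so the active DeepSpeed plugin is visible to `is_deepspeed_zero3_enabled`, and assuming `model` is a hypothetical LoRA `PeftModel` already prepared for training:

```python
import deepspeed
from transformers.integrations import is_deepspeed_zero3_enabled

# assumption: `model` is a PeftModel with a LoRA adapter, prepared by Accelerate with DeepSpeed
is_ds_zero_3 = is_deepspeed_zero3_enabled()  # True when the active DeepSpeed config uses ZeRO stage 3

# gather the sharded parameters so merging sees the full weights, then unmerge in the same context
with deepspeed.zero.GatheredParameters(list(model.parameters()), enabled=is_ds_zero_3):
    model.merge_adapter()
    # ... run merged-weight inference or export here ...
    model.unmerge_adapter()
```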
docs/source/conceptual_guides/adapter.md

Lines changed: 2 additions & 5 deletions
@@ -22,11 +22,8 @@ This guide will give you a brief overview of the adapter methods supported by PE

 ## Low-Rank Adaptation (LoRA)

-<Tip>
-
-LoRA is one of the most popular PEFT methods and a good starting point if you're just getting started with PEFT. It was originally developed for large language models but it is a tremendously popular training method for diffusion models because of its efficiency and effectiveness.
-
-</Tip>
+> [!TIP]
+> LoRA is one of the most popular PEFT methods and a good starting point if you're just getting started with PEFT. It was originally developed for large language models but it is a tremendously popular training method for diffusion models because of its efficiency and effectiveness.

 As mentioned briefly earlier, [LoRA](https://hf.co/papers/2106.09685) is a technique that accelerates finetuning large models while consuming less memory.

docs/source/developer_guides/checkpoint.md

Lines changed: 4 additions & 10 deletions
@@ -129,21 +129,15 @@ Let's break this down:
 - By default, LoRA isn't applied to BERT's embedding layer, so there are _no entries_ for `lora_A_embedding` and `lora_B_embedding`.
 - The keys of the `state_dict` always start with `"base_model.model."`. The reason is that, in PEFT, we wrap the base model inside a tuner-specific model (`LoraModel` in this case), which itself is wrapped in a general PEFT model (`PeftModel`). For this reason, these two prefixes are added to the keys. When converting to the PEFT format, it is required to add these prefixes.

-<Tip>
-
-This last point is not true for prefix tuning techniques like prompt tuning. There, the extra embeddings are directly stored in the `state_dict` without any prefixes added to the keys.
-
-</Tip>
+> [!TIP]
+> This last point is not true for prefix tuning techniques like prompt tuning. There, the extra embeddings are directly stored in the `state_dict` without any prefixes added to the keys.

 When inspecting the parameter names in the loaded model, you might be surprised to find that they look a bit different, e.g. `base_model.model.encoder.layer.0.attention.self.query.lora_A.default.weight`. The difference is the *`.default`* part in the second to last segment. This part exists because PEFT generally allows the addition of multiple adapters at once (using an `nn.ModuleDict` or `nn.ParameterDict` to store them). For example, if you add another adapter called "other", the key for that adapter would be `base_model.model.encoder.layer.0.attention.self.query.lora_A.other.weight`.

 When you call [`~PeftModel.save_pretrained`], the adapter name is stripped from the keys. The reason is that the adapter name is not an important part of the model architecture; it is just an arbitrary name. When loading the adapter, you could choose a totally different name, and the model would still work the same way. This is why the adapter name is not stored in the checkpoint file.

-<Tip>
-
-If you call `save_pretrained("some/path")` and the adapter name is not `"default"`, the adapter is stored in a sub-directory with the same name as the adapter. So if the name is "other", it would be stored inside of `some/path/other`.
-
-</Tip>
+> [!TIP]
+> If you call `save_pretrained("some/path")` and the adapter name is not `"default"`, the adapter is stored in a sub-directory with the same name as the adapter. So if the name is "other", it would be stored inside of `some/path/other`.

 In some circumstances, deciding which values to add to the checkpoint file can become a bit more complicated. For example, in PEFT, DoRA is implemented as a special case of LoRA. If you want to convert a DoRA model to PEFT, you should create a LoRA checkpoint with extra entries for DoRA. You can see this in the `__init__` of the previous `LoraLayer` code:

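To make the key layout and the save-path behavior from this file's hunks concrete, here is a minimal sketch; the BERT checkpoint and the adapter name "other" are illustrative choices rather than something taken from the commit:

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# register the LoRA adapter under a non-default name
peft_model = get_peft_model(base, LoraConfig(task_type="SEQ_CLS"), adapter_name="other")

# keys carry the wrapper prefixes plus the adapter name, e.g.
# base_model.model.bert.encoder.layer.0.attention.self.query.lora_A.other.weight
print([key for key in peft_model.state_dict() if "lora_A" in key][:2])

# because the adapter name is not "default", the files are written to some/path/other/
peft_model.save_pretrained("some/path")
```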
docs/source/developer_guides/custom_models.md

Lines changed: 5 additions & 11 deletions
@@ -48,12 +48,9 @@ class MLP(nn.Module):

 This is a straightforward multilayer perceptron with an input layer, a hidden layer, and an output layer.

-<Tip>
-
-For this toy example, we choose an exceedingly large number of hidden units to highlight the efficiency gains
-from PEFT, but those gains are in line with more realistic examples.
-
-</Tip>
+> [!TIP]
+> For this toy example, we choose an exceedingly large number of hidden units to highlight the efficiency gains
+> from PEFT, but those gains are in line with more realistic examples.

 There are a few linear layers in this model that could be tuned with LoRA. When working with common 🤗 Transformers
 models, PEFT will know which layers to apply LoRA to, but in this case, it is up to us as a user to choose the layers.
@@ -272,11 +269,8 @@ peft_model = get_peft_model(base_model, config)
 # do training
 ```

-<Tip>
-
-When you call [`get_peft_model`], you will see a warning because PEFT does not recognize the targeted module type. In this case, you can ignore this warning.
-
-</Tip>
+> [!TIP]
+> When you call [`get_peft_model`], you will see a warning because PEFT does not recognize the targeted module type. In this case, you can ignore this warning.

 By supplying a custom mapping, PEFT first checks the base model's layers against the custom mapping and dispatches to the custom LoRA layer type if there is a match. If there is no match, PEFT checks the built-in LoRA layer types for a match.

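As a rough, self-contained sketch of the custom-model workflow this file documents (the layer sizes and module names below are illustrative, not taken from the commit), LoRA is pointed at specific `nn.Linear` layers by name:

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

class MLP(nn.Module):
    def __init__(self, num_units_hidden=2000):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Linear(20, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, 2),
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, x):
        return self.seq(x)

# no task_type here: for a custom model we pick the layers to adapt ourselves
config = LoraConfig(
    target_modules=["seq.0", "seq.2"],  # apply LoRA to the first two Linear layers
    modules_to_save=["seq.4"],          # train the output layer fully and keep it in the checkpoint
)
peft_model = get_peft_model(MLP(), config)
peft_model.print_trainable_parameters()
```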
docs/source/developer_guides/lora.md

Lines changed: 43 additions & 55 deletions
@@ -119,11 +119,8 @@ initialize_lora_eva_weights(peft_model, dataloader)
 ```
 EVA works out of the box with bitsandbytes. Simply initialize the model with `quantization_config` and call [`initialize_lora_eva_weights`] as usual.

-<Tip>
-
-For further instructions on using EVA, please refer to our [documentation](https://github.com/huggingface/peft/tree/main/examples/eva_finetuning).
-
-</Tip>
+> [!TIP]
+> For further instructions on using EVA, please refer to our [documentation](https://github.com/huggingface/peft/tree/main/examples/eva_finetuning).

 ### LoftQ

@@ -158,11 +155,8 @@ At the moment, `replace_lora_weights_loftq` has these additional limitations:
 - Model files must be stored as a `safetensors` file.
 - Only bitsandbytes 4bit quantization is supported.

-<Tip>
-
-Learn more about how PEFT works with quantization in the [Quantization](quantization) guide.
-
-</Tip>
+> [!TIP]
+> Learn more about how PEFT works with quantization in the [Quantization](quantization) guide.

 ### Rank-stabilized LoRA

@@ -570,11 +564,8 @@ model.add_weighted_adapter(
 model.set_adapter(weighted_adapter_name)
 ```

-<Tip>
-
-There are several supported methods for `combination_type`. Refer to the [documentation](../package_reference/lora#peft.LoraModel.add_weighted_adapter) for more details. Note that "svd" as the `combination_type` is not supported when using `torch.float16` or `torch.bfloat16` as the datatype.
-
-</Tip>
+> [!TIP]
+> There are several supported methods for `combination_type`. Refer to the [documentation](../package_reference/lora#peft.LoraModel.add_weighted_adapter) for more details. Note that "svd" as the `combination_type` is not supported when using `torch.float16` or `torch.bfloat16` as the datatype.

 Now, perform inference:

@@ -792,43 +783,40 @@ model = create_arrow_model(
 ```
 To encode general knowledge, GenKnowSub subtracts the average of the provided general adapters from each task-specific adapter once, before routing begins. Furthermore, the ability to add or remove adapters after calling ```create_arrow_model``` (as described in the Arrow section) is still supported in this case.

-<Tip>
-
-**Things to keep in mind when using Arrow + GenKnowSub:**
-
-- All LoRA adapters (task-specific and general) must share the same ```rank``` and ```target_modules```.
-
-- Any inconsistency in these settings will raise an error in ```create_arrow_model```.
-
-- Having different scaling factors (```lora_alpha```) across task adapters is supported — Arrow handles them automatically.
-
-- Merging the ```"arrow_router"``` is not supported, due to its dynamic routing behavior.
-
-- In create_arrow_model, task adapters are loaded as ```task_i``` and general adapters as ```gks_j``` (where ```i``` and ```j``` are indices). The function ensures consistency of ```target_modules```, ```rank```, and whether adapters are applied to ```Linear``` or ```Linear4bit``` layers. It then adds the ```"arrow_router"``` module and activates it. Any customization of this process requires overriding ```create_arrow_model```.
-
-- This implementation is compatible with 4-bit quantization (via bitsandbytes):
-
-```py
-from transformers import AutoModelForCausalLM, BitsAndBytesConfig
-import torch
-
-# Quantisation config
-bnb_config = BitsAndBytesConfig(
-    load_in_4bit=True,
-    bnb_4bit_quant_type="nf4",
-    bnb_4bit_compute_dtype=torch.bfloat16,
-    bnb_4bit_use_double_quant=False,
-)
-
-# Loading the model
-base_model = AutoModelForCausalLM.from_pretrained(
-    "microsoft/Phi-3-mini-4k-instruct",
-    torch_dtype=torch.bfloat16,
-    device_map="auto",
-    quantization_config=bnb_config,
-)
-
-# Now call create_arrow_model() as we explained before.
-```
-
-</Tip>
+> [!TIP]
+> **Things to keep in mind when using Arrow + GenKnowSub:**
+>
+> - All LoRA adapters (task-specific and general) must share the same ```rank``` and ```target_modules```.
+>
+> - Any inconsistency in these settings will raise an error in ```create_arrow_model```.
+>
+> - Having different scaling factors (```lora_alpha```) across task adapters is supported — Arrow handles them automatically.
+>
+> - Merging the ```"arrow_router"``` is not supported, due to its dynamic routing behavior.
+>
+> - In create_arrow_model, task adapters are loaded as ```task_i``` and general adapters as ```gks_j``` (where ```i``` and ```j``` are indices). The function ensures consistency of ```target_modules```, ```rank```, and whether adapters are applied to ```Linear``` or ```Linear4bit``` layers. It then adds the ```"arrow_router"``` module and activates it. Any customization of this process requires overriding ```create_arrow_model```.
+>
+> - This implementation is compatible with 4-bit quantization (via bitsandbytes):
+>
+> ```py
+> from transformers import AutoModelForCausalLM, BitsAndBytesConfig
+> import torch
+>
+> # Quantisation config
+> bnb_config = BitsAndBytesConfig(
+>     load_in_4bit=True,
+>     bnb_4bit_quant_type="nf4",
+>     bnb_4bit_compute_dtype=torch.bfloat16,
+>     bnb_4bit_use_double_quant=False,
+> )
+>
+> # Loading the model
+> base_model = AutoModelForCausalLM.from_pretrained(
+>     "microsoft/Phi-3-mini-4k-instruct",
+>     torch_dtype=torch.bfloat16,
+>     device_map="auto",
+>     quantization_config=bnb_config,
+> )
+>
+> # Now call create_arrow_model() as we explained before.
+> ```

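For the `add_weighted_adapter` hunk above, a minimal end-to-end sketch of combining two adapters; the base model id and adapter paths are placeholders, and it assumes both adapters are plain LoRA adapters with the same rank:

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # placeholder base model
model = PeftModel.from_pretrained(base, "path/to/adapter_a", adapter_name="adapter_a")
model.load_adapter("path/to/adapter_b", adapter_name="adapter_b")

# merge the two LoRA adapters into a new, weighted adapter and activate it
model.add_weighted_adapter(
    adapters=["adapter_a", "adapter_b"],
    weights=[0.7, 0.3],
    adapter_name="adapter_mix",
    combination_type="linear",  # per the tip above, avoid "svd" with float16/bfloat16 weights
)
model.set_adapter("adapter_mix")
```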
docs/source/developer_guides/troubleshooting.md

Lines changed: 6 additions & 15 deletions
@@ -71,11 +71,8 @@ trainer = Trainer(model=peft_model, fp16=True, ...)
 trainer.train()
 ```

-<Tip>
-
-Starting from PEFT version v0.12.0, PEFT automatically promotes the dtype of adapter weights from `torch.float16` and `torch.bfloat16` to `torch.float32` where appropriate. To _prevent_ this behavior, you can pass `autocast_adapter_dtype=False` to [`~get_peft_model`], to [`~PeftModel.from_pretrained`], and to [`~PeftModel.load_adapter`].
-
-</Tip>
+> [!TIP]
+> Starting from PEFT version v0.12.0, PEFT automatically promotes the dtype of adapter weights from `torch.float16` and `torch.bfloat16` to `torch.float32` where appropriate. To _prevent_ this behavior, you can pass `autocast_adapter_dtype=False` to [`~get_peft_model`], to [`~PeftModel.from_pretrained`], and to [`~PeftModel.load_adapter`].

 ### Selecting the dtype of the adapter

@@ -137,11 +134,8 @@ You should probably TRAIN this model on a down-stream task to be able to use it

 The mentioned layers should be added to `modules_to_save` in the config to avoid the described problem.

-<Tip>
-
-As an example, when loading a model that is using the DeBERTa architecture for sequence classification, you'll see a warning that the following weights are newly initialized: `['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']`. From this, it follows that the `classifier` and `pooler` layers should be added to: `modules_to_save=["classifier", "pooler"]`.
-
-</Tip>
+> [!TIP]
+> As an example, when loading a model that is using the DeBERTa architecture for sequence classification, you'll see a warning that the following weights are newly initialized: `['classifier.bias', 'classifier.weight', 'pooler.dense.bias', 'pooler.dense.weight']`. From this, it follows that the `classifier` and `pooler` layers should be added to: `modules_to_save=["classifier", "pooler"]`.

 ### Extending the vocabulary

@@ -345,11 +339,8 @@ TunerModelStatus(

 Loading adapters like LoRA weights should generally be fast compared to loading the base model. However, there can be use cases where the adapter weights are quite large or where users need to load a large number of adapters -- the loading time can add up in this case. The reason for this is that the adapter weights are first initialized and then overridden by the loaded weights, which is wasteful. To speed up the loading time, you can pass the `low_cpu_mem_usage=True` argument to [`~PeftModel.from_pretrained`] and [`~PeftModel.load_adapter`].

-<Tip>
-
-If this option works well across different use cases, it may become the default for adapter loading in the future.
-
-</Tip>
+> [!TIP]
+> If this option works well across different use cases, it may become the default for adapter loading in the future.


 ## Reproducibility

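Tying together the two loading-related tips in this file, a minimal sketch of passing both flags when loading an adapter; the model id and adapter path are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("facebook/opt-350m", torch_dtype=torch.bfloat16)

model = PeftModel.from_pretrained(
    base,
    "path/to/lora-adapter",        # placeholder adapter location
    autocast_adapter_dtype=False,  # keep adapter weights in bf16 instead of promoting to fp32 (PEFT >= v0.12.0)
    low_cpu_mem_usage=True,        # skip the throwaway initialization of adapter weights before loading
)
```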
docs/source/quicktour.md

Lines changed: 4 additions & 10 deletions
@@ -36,11 +36,8 @@ from peft import LoraConfig, TaskType
 peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, inference_mode=False, r=8, lora_alpha=32, lora_dropout=0.1)
 ```

-<Tip>
-
-See the [`LoraConfig`] reference for more details about other parameters you can adjust, such as the modules to target or the bias type.
-
-</Tip>
+> [!TIP]
+> See the [`LoraConfig`] reference for more details about other parameters you can adjust, such as the modules to target or the bias type.

 Once the [`LoraConfig`] is setup, create a [`PeftModel`] with the [`get_peft_model`] function. It takes a base model - which you can load from the Transformers library - and the [`LoraConfig`] containing the parameters for how to configure a model for training with LoRA.

@@ -124,11 +121,8 @@ Both methods only save the extra PEFT weights that were trained, meaning it is s

 ## Inference

-<Tip>
-
-Take a look at the [AutoPeftModel](package_reference/auto_class) API reference for a complete list of available `AutoPeftModel` classes.
-
-</Tip>
+> [!TIP]
+> Take a look at the [AutoPeftModel](package_reference/auto_class) API reference for a complete list of available `AutoPeftModel` classes.

 Easily load any PEFT-trained model for inference with the [`AutoPeftModel`] class and the [`~transformers.PreTrainedModel.from_pretrained`] method:

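For the quicktour's inference hunk, a minimal sketch of loading a trained adapter through `AutoPeftModel`; the adapter repository id is a placeholder and is assumed to be a LoRA adapter trained on top of `facebook/opt-350m`:

```python
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM

# AutoPeftModel reads the base model name from the adapter's config and loads both
model = AutoPeftModelForCausalLM.from_pretrained("your-username/opt-350m-lora")  # placeholder adapter id
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

inputs = tokenizer("Preheat the oven to 350 degrees and", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```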