
Commit 99fc3ac

veezbo and stevhliu authored
Modify efficient GPU training doc with now-available adamw_bnb_8bit optimizer (#25807)
* Modify single-GPU efficient training doc with now-available adamw_bnb_8bit optimizer
* Apply suggestions from code review

Co-authored-by: Steven Liu <[email protected]>
1 parent e95bcae commit 99fc3ac

1 file changed

docs/source/en/perf_train_gpu_one.md

Lines changed: 11 additions & 11 deletions
@@ -237,10 +237,11 @@ For example if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed,
 fastest training experience among all supported AdamW optimizers.

 [`Trainer`] integrates a variety of optimizers that can be used out of box: `adamw_hf`, `adamw_torch`, `adamw_torch_fused`,
-`adamw_apex_fused`, `adamw_anyprecision` or `adafactor`. More optimizers can be plugged in via a third-party implementation.
+`adamw_apex_fused`, `adamw_anyprecision`, `adafactor`, or `adamw_bnb_8bit`. More optimizers can be plugged in via a third-party implementation.

-Let's take a closer look at two alternatives to AdamW optimizer - Adafactor (available in Trainer), and 8bit BNB quantized
-optimizer (third-party implementation).
+Let's take a closer look at two alternatives to AdamW optimizer:
+1. `adafactor` which is available in [`Trainer`]
+2. `adamw_bnb_8bit` is also available in Trainer, but a third-party integration is provided below for demonstration.

 For comparison, for a 3B-parameter model, like “t5-3b”:
 * A standard AdamW optimizer will need 24GB of GPU memory because it uses 8 bytes for each parameter (8*3 => 24GB)
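
For a rough sense of the memory comparison above, here is a back-of-envelope sketch. The 8 bytes per parameter for standard AdamW comes from the line above; the roughly 4 and 2 bytes per parameter used for Adafactor and the 8-bit optimizer are illustrative assumptions, not measured numbers.

```py
# Back-of-envelope optimizer-state memory for a 3B-parameter model such as "t5-3b".
num_params = 3e9

bytes_per_param = {
    "adamw (two fp32 states)": 8,          # momentum + variance in fp32
    "adafactor (factored states)": 4,      # assumption: roughly half of AdamW
    "adamw_bnb_8bit (quantized states)": 2,  # assumption: 8-bit momentum + variance
}

for name, nbytes in bytes_per_param.items():
    print(f"{name}: ~{num_params * nbytes / 1e9:.0f} GB of optimizer state")
```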
@@ -269,7 +270,13 @@ Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the ful
 means that it stores the state with lower precision and dequantizes it only for the optimization. This is similar to the
 idea behind mixed precision training.

-To use the 8-bit optimizer, you need to install it separately and then pass it as a custom optimizer to the [`Trainer`].
+To use `adamw_bnb_8bit`, you simply need to set `optim="adamw_bnb_8bit"` in [`TrainingArguments`]:
+
+```py
+training_args = TrainingArguments(per_device_train_batch_size=4, optim="adamw_bnb_8bit", **default_args)
+```
+
+However, we can also use a third-party implementation of the 8-bit optimizer for demonstration purposes to see how that can be integrated.

 First, follow the installation guide in the GitHub [repo](https://github.com/TimDettmers/bitsandbytes) to install the `bitsandbytes` library
 that implements the 8-bit Adam optimizer.
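
Once `bitsandbytes` is installed, a minimal sketch of that third-party integration might look as follows. The model checkpoint and hyperparameter values here are placeholders, and a more complete setup would typically also split parameters into weight-decay and no-weight-decay groups.

```py
import bitsandbytes as bnb
from transformers import AutoModelForSeq2SeqLM

# "t5-small" stands in for a larger model so the sketch stays cheap to run.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# Create the 8-bit Adam optimizer over the model's parameters.
adam_bnb_optim = bnb.optim.Adam8bit(
    model.parameters(),
    lr=2e-5,             # placeholder learning rate
    betas=(0.9, 0.999),
    eps=1e-8,
)
```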
@@ -311,13 +318,6 @@ adam_bnb_optim = bnb.optim.Adam8bit(
 )
 ```

-<Tip>
-
-To use the 8-bit optimizer with an existing pretrained model, you need to make a change to the embedding layer.
-Read [this issue](https://github.com/huggingface/transformers/issues/14819) for more information.
-
-</Tip>
-
 Finally, pass the custom optimizer as an argument to the `Trainer`:

 ```py
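
For reference, a minimal sketch of passing a custom optimizer to [`Trainer`] uses its `optimizers=(optimizer, lr_scheduler)` argument. It continues from the objects created in the sketches above (`model`, `training_args`, `adam_bnb_optim`), and `ds` stands in for an already-prepared training dataset.

```py
from transformers import Trainer

# Hand the custom 8-bit optimizer to Trainer; passing None as the second
# element lets Trainer create its default learning-rate scheduler.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds,  # assumed: an already-prepared training dataset
    optimizers=(adam_bnb_optim, None),
)
result = trainer.train()
```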
