Commit 13b36ee

Multi gpu docs (#6550)
* multi gpu docs
* Update process.mdx
* Update process.mdx
1 parent 9849523 commit 13b36ee

docs/source/process.mdx

Lines changed: 23 additions & 10 deletions

````diff
@@ -351,24 +351,37 @@ Multiprocessing significantly speeds up processing by parallelizing processes on
 The [`~Dataset.map`] also works with the rank of the process if you set `with_rank=True`. This is analogous to the `with_indices` parameter. The `with_rank` parameter in the mapped function goes after the `index` one if it is already present.
 
 ```py
->>> from multiprocess import set_start_method
 >>> import torch
->>> import os
->>>
->>> for i in range(torch.cuda.device_count()):  # send model to every GPU
-...     model.to(f"cuda:{i}")
+>>> from multiprocess import set_start_method
+>>> from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
+>>> from datasets import load_dataset
+>>>
+>>> # Get an example dataset
+>>> dataset = load_dataset("fka/awesome-chatgpt-prompts", split="train")
+>>>
+>>> # Get an example model and its tokenizer
+>>> tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
+>>> model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
 >>>
 >>> def gpu_computation(example, rank):
-...     torch.cuda.set_device(f"cuda:{rank}")  # use one GPU
-...     # Your big GPU call goes here, for example
-...     inputs = tokenizer(texts, truncation=True, return_tensors="pt").to(f"cuda:{rank}")
+...     # Move the model on the right GPU if it's not there already
+...     model.to(f"cuda:{rank or 0}")
+...
+...     # Your big GPU call goes here, for example:
+...     inputs = tokenizer(texts, padding=True, return_tensors="pt").to(f"cuda:{rank or 0}")
 ...     outputs = model.generate(**inputs)
-...     example["generated_text"] = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)
+...     example["translated"] = tokenizer.batch_decode(outputs, skip_special_tokens=True)
 ...     return example
 >>>
 >>> if __name__ == "__main__":
 ...     set_start_method("spawn")
-...     updated_dataset = dataset.map(gpu_computation, with_rank=True, num_proc=torch.cuda.device_count())
+...     updated_dataset = dataset.map(
+...         gpu_computation,
+...         with_rank=True,
+...         num_proc=torch.cuda.device_count(),  # one process per GPU
+...         batched=True,  # optional
+...         batch_size=8,  # optional
+...     )
 ```
 
 The main use-case for rank is to parallelize computation across several GPUs. This requires setting `multiprocess.set_start_method("spawn")`. If you don't you'll receive the following CUDA error:
````
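For reference, below is a self-contained sketch of the example as it reads after this change. It is an approximation rather than the committed snippet verbatim: it assumes the `fka/awesome-chatgpt-prompts` dataset stores its text in a `prompt` column (the committed code references an otherwise undefined `texts` variable), and it keeps the `rank or 0` fallback since `rank` can be `None` when `num_proc` is not set.

```py
# Sketch of the updated multi-GPU example (assumes a "prompt" text column and
# at least one CUDA device; adapt the dataset, column, and model to your setup).
import torch
from multiprocess import set_start_method
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from datasets import load_dataset

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def gpu_computation(batch, rank):
    # Each map worker gets a rank in [0, num_proc); move the model to that GPU.
    device = f"cuda:{rank or 0}"
    model.to(device)
    # With batched=True, batch["prompt"] is a list of strings.
    inputs = tokenizer(batch["prompt"], padding=True, truncation=True, return_tensors="pt").to(device)
    outputs = model.generate(**inputs)
    batch["translated"] = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return batch

if __name__ == "__main__":
    dataset = load_dataset("fka/awesome-chatgpt-prompts", split="train")
    set_start_method("spawn")  # required: CUDA cannot be re-initialized in forked workers
    updated_dataset = dataset.map(
        gpu_computation,
        with_rank=True,
        num_proc=torch.cuda.device_count(),  # one process per GPU
        batched=True,
        batch_size=8,
    )
```

Loading the model once in the parent and calling `model.to(f"cuda:{rank}")` inside the mapped function is what lets each of the `num_proc` spawned processes drive its own GPU.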

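On the parameter-order sentence in the hunk above: when `with_indices=True` and `with_rank=True` are both set, the mapped function receives the example first, then the index, then the rank. A minimal sketch with made-up column names (not part of the commit):

```py
from datasets import Dataset

ds = Dataset.from_dict({"text": ["a", "b", "c"]})

def annotate(example, idx, rank):
    # The index argument comes before the rank argument.
    example["position"] = idx
    example["worker"] = rank or 0  # rank can be None in a single-process map
    return example

ds = ds.map(annotate, with_indices=True, with_rank=True)
print(ds[0])  # e.g. {'text': 'a', 'position': 0, 'worker': 0}
```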