
Commit d09919c

schoi-habana authored and Luca-Calabria committed
Enable DeepSpeed for image-to-text example (huggingface#1455)
1 parent 1cc4511 commit d09919c

2 files changed: 40 additions & 23 deletions


examples/image-to-text/README.md

Lines changed: 31 additions & 0 deletions
````diff
@@ -396,6 +396,37 @@
     --flash_attention_recompute
 ```
 
+## Multi-HPU inference
+
+To enable multi-card inference, you must set the environment variable `PT_HPU_ENABLE_LAZY_COLLECTIVES=true`.
+
+### BF16 Inference with FusedSDPA on 8 HPUs
+
+Use the following command to run Llava-v1.6-mistral-7b BF16 inference with FusedSDPA on 8 HPUs:
+```bash
+PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
+    --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
+    --image_path "https://llava-vl.github.io/static/images/view.jpg" \
+    --use_hpu_graphs \
+    --bf16 \
+    --use_flash_attention \
+    --flash_attention_recompute
+```
+
+### FP8 Inference with FusedSDPA on 8 HPUs
+
+Use the following commands to run Llava-v1.6-mistral-7b FP8 inference with FusedSDPA on 8 HPUs.
+Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b on 8 HPUs:
+```bash
+QUANT_CONFIG=./quantization_config/maxabs_measure.json PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
+    --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
+    --image_path "https://llava-vl.github.io/static/images/view.jpg" \
+    --use_hpu_graphs \
+    --bf16 \
+    --use_flash_attention \
+    --flash_attention_recompute
+```
+
 Here is an example of quantizing the model based on previous measurements for Llava-v1.6-mistral-7b on 8 HPUs:
 ```bash
 QUANT_CONFIG=./quantization_config/maxabs_quant.json PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
````
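For context on the launch mechanics behind these commands: `gaudi_spawn.py --use_deepspeed --world_size 8` starts one process per HPU, and each rank can read the usual distributed environment variables to decide what to log or save. A minimal sketch of that per-rank view, assuming the conventional `RANK`/`WORLD_SIZE`/`LOCAL_RANK` variables exported by DeepSpeed-style launchers (these names are an assumption, not taken from this diff):

```python
# Minimal sketch (not part of the commit): what each process spawned by a
# DeepSpeed-style launcher can inspect. RANK/WORLD_SIZE/LOCAL_RANK are
# conventional launcher variables and assumed here.
import os

rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# Log from rank 0 only so an 8-card run does not print eight copies.
if rank == 0:
    print(f"inference across {world_size} HPUs (this process: local rank {local_rank})")
```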

examples/image-to-text/run_pipeline.py

Lines changed: 9 additions & 23 deletions
````diff
@@ -230,31 +230,17 @@ def main():
 
     htcore.hpu_set_env()
 
+    generator = pipeline(
+        "image-to-text",
+        model=args.model_name_or_path,
+        torch_dtype=model_dtype,
+        device="hpu",
+    )
+
     if args.world_size > 1:
-        import deepspeed
-
-        with deepspeed.OnDevice(dtype=model_dtype, device="cpu"):
-            model = AutoModelForVision2Seq.from_pretrained(args.model_name_or_path, torch_dtype=model_dtype)
-        if model_type == "mllama":
-            model.language_model = initialize_distributed_model(args, model.language_model, logger, model_dtype)
-        else:
-            model = initialize_distributed_model(args, model, logger, model_dtype)
-        generator = pipeline(
-            "image-to-text",
-            model=model,
-            config=args.model_name_or_path,
-            tokenizer=args.model_name_or_path,
-            image_processor=args.model_name_or_path,
-            torch_dtype=model_dtype,
-            device="hpu",
-        )
+        generator.model = initialize_distributed_model(args, generator.model, logger, model_dtype)
+
     else:
-        generator = pipeline(
-            "image-to-text",
-            model=args.model_name_or_path,
-            torch_dtype=model_dtype,
-            device="hpu",
-        )
         if args.use_hpu_graphs:
             from habana_frameworks.torch.hpu import wrap_in_hpu_graph
 
````
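The refactor above collapses two separate `pipeline(...)` construction paths into one: the pipeline is always built from the model name, and only the multi-card path then swaps `generator.model` for a DeepSpeed-initialized copy. A rough sketch of that pattern, assuming `initialize_distributed_model` ultimately defers to `deepspeed.init_inference` (the helper's real body lives elsewhere in the example and is not shown in this diff; the function name and kwargs below are illustrative):

```python
# Rough sketch of the new control flow; initialize_distributed_model_sketch is
# a hypothetical stand-in for the example's helper, not the committed code.
import torch
from transformers import pipeline


def initialize_distributed_model_sketch(model, world_size, dtype):
    # Assumption: the helper shards the already-built model with DeepSpeed
    # inference. deepspeed.init_inference is a real API, but the exact
    # arguments optimum-habana passes are not visible in this diff.
    import deepspeed

    engine = deepspeed.init_inference(model, dtype=dtype, tensor_parallel={"tp_size": world_size})
    return engine.module


world_size = 8  # stands in for args.world_size
model_dtype = torch.bfloat16

# Build the pipeline once, identically for single- and multi-card runs.
generator = pipeline(
    "image-to-text",
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=model_dtype,
    device="hpu",
)

# Only the multi-card path touches the model afterwards.
if world_size > 1:
    generator.model = initialize_distributed_model_sketch(generator.model, world_size, model_dtype)
```

One payoff visible in the diff itself: the DeepSpeed branch no longer has to rebuild the pipeline's preprocessing components by hand (the deleted `config`, `tokenizer`, and `image_processor` arguments), so the single- and multi-card paths cannot drift apart.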

0 commit comments