
Commit d09919c

schoi-habana authored and Luca-Calabria committed
Enable DeepSpeed for image-to-text example (huggingface#1455)
1 parent 1cc4511 commit d09919c

2 files changed: 40 additions & 23 deletions


examples/image-to-text/README.md

Lines changed: 31 additions & 0 deletions
````diff
@@ -396,6 +396,37 @@
     --flash_attention_recompute
 ```
 
+## Multi-HPU inference
+
+To enable multi-card inference, you must set the environment variable `PT_HPU_ENABLE_LAZY_COLLECTIVES=true`.
+
+### BF16 Inference with FusedSDPA on 8 HPUs
+
+Use the following command to run Llava-v1.6-mistral-7b BF16 inference with FusedSDPA on 8 HPUs:
+```bash
+PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
+    --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
+    --image_path "https://llava-vl.github.io/static/images/view.jpg" \
+    --use_hpu_graphs \
+    --bf16 \
+    --use_flash_attention \
+    --flash_attention_recompute
+```
+
+### FP8 Inference with FusedSDPA on 8 HPUs
+
+Use the following commands to run Llava-v1.6-mistral-7b FP8 inference with FusedSDPA on 8 HPUs.
+Here is an example of measuring the tensor quantization statistics on Llava-v1.6-mistral-7b on 8 HPUs:
+```bash
+QUANT_CONFIG=./quantization_config/maxabs_measure.json PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
+    --model_name_or_path llava-hf/llava-v1.6-mistral-7b-hf \
+    --image_path "https://llava-vl.github.io/static/images/view.jpg" \
+    --use_hpu_graphs \
+    --bf16 \
+    --use_flash_attention \
+    --flash_attention_recompute
+```
+
 Here is an example of quantizing the model based on previous measurements for Llava-v1.6-mistral-7b on 8 HPUs:
 ```bash
 QUANT_CONFIG=./quantization_config/maxabs_quant.json PT_HPU_ENABLE_LAZY_COLLECTIVES=true python ../gaudi_spawn.py --use_deepspeed --world_size 8 run_pipeline.py \
````
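For context on the launch mechanics behind these commands: `gaudi_spawn.py --use_deepspeed --world_size 8` starts one process per HPU, and each rank can read the usual distributed environment variables to decide what to log or save. A minimal sketch of that per-rank view, assuming the conventional `RANK`/`WORLD_SIZE`/`LOCAL_RANK` variables exported by DeepSpeed-style launchers (these names are an assumption, not taken from this diff):

```python
# Minimal sketch (not part of the commit): what each process spawned by a
# DeepSpeed-style launcher can inspect. RANK/WORLD_SIZE/LOCAL_RANK are
# conventional launcher variables and assumed here.
import os

rank = int(os.environ.get("RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "1"))
local_rank = int(os.environ.get("LOCAL_RANK", "0"))

# Log from rank 0 only so an 8-card run does not print eight copies.
if rank == 0:
    print(f"inference across {world_size} HPUs (this process: local rank {local_rank})")
```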

examples/image-to-text/run_pipeline.py

Lines changed: 9 additions & 23 deletions
````diff
@@ -230,31 +230,17 @@ def main():
 
     htcore.hpu_set_env()
 
+    generator = pipeline(
+        "image-to-text",
+        model=args.model_name_or_path,
+        torch_dtype=model_dtype,
+        device="hpu",
+    )
+
     if args.world_size > 1:
-        import deepspeed
-
-        with deepspeed.OnDevice(dtype=model_dtype, device="cpu"):
-            model = AutoModelForVision2Seq.from_pretrained(args.model_name_or_path, torch_dtype=model_dtype)
-        if model_type == "mllama":
-            model.language_model = initialize_distributed_model(args, model.language_model, logger, model_dtype)
-        else:
-            model = initialize_distributed_model(args, model, logger, model_dtype)
-        generator = pipeline(
-            "image-to-text",
-            model=model,
-            config=args.model_name_or_path,
-            tokenizer=args.model_name_or_path,
-            image_processor=args.model_name_or_path,
-            torch_dtype=model_dtype,
-            device="hpu",
-        )
+        generator.model = initialize_distributed_model(args, generator.model, logger, model_dtype)
+
     else:
-        generator = pipeline(
-            "image-to-text",
-            model=args.model_name_or_path,
-            torch_dtype=model_dtype,
-            device="hpu",
-        )
         if args.use_hpu_graphs:
             from habana_frameworks.torch.hpu import wrap_in_hpu_graph
 
````
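The refactor above collapses two separate `pipeline(...)` construction paths into one: the pipeline is always built from the model name, and only the multi-card path then swaps `generator.model` for a DeepSpeed-initialized copy. A rough sketch of that pattern, assuming `initialize_distributed_model` ultimately defers to `deepspeed.init_inference` (the helper's real body lives elsewhere in the example and is not shown in this diff; the function name and kwargs below are illustrative):

```python
# Rough sketch of the new control flow; initialize_distributed_model_sketch is
# a hypothetical stand-in for the example's helper, not the committed code.
import torch
from transformers import pipeline


def initialize_distributed_model_sketch(model, world_size, dtype):
    # Assumption: the helper shards the already-built model with DeepSpeed
    # inference. deepspeed.init_inference is a real API, but the exact
    # arguments optimum-habana passes are not visible in this diff.
    import deepspeed

    engine = deepspeed.init_inference(model, dtype=dtype, tensor_parallel={"tp_size": world_size})
    return engine.module


world_size = 8  # stands in for args.world_size
model_dtype = torch.bfloat16

# Build the pipeline once, identically for single- and multi-card runs.
generator = pipeline(
    "image-to-text",
    model="llava-hf/llava-v1.6-mistral-7b-hf",
    torch_dtype=model_dtype,
    device="hpu",
)

# Only the multi-card path touches the model afterwards.
if world_size > 1:
    generator.model = initialize_distributed_model_sketch(generator.model, world_size, model_dtype)
```

One payoff visible in the diff itself: the DeepSpeed branch no longer has to rebuild the pipeline's preprocessing components by hand (the deleted `config`, `tokenizer`, and `image_processor` arguments), so the single- and multi-card paths cannot drift apart.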

0 commit comments