@@ -16,7 +16,7 @@ To input multi-modal data, follow this schema in {class}`vllm.inputs.PromptType`
- `prompt`: The prompt should follow the format that is documented on HuggingFace.
- `multi_modal_data`: This is a dictionary that follows the schema defined in {class}`vllm.multimodal.inputs.MultiModalDataDict`.

-### Image
+### Image Inputs

You can pass a single image to the `'image'` field of the multi-modal dictionary, as shown in the following examples:

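For instance, a minimal offline sketch of this call, assuming a LLaVA-1.5 checkpoint and its documented `USER:`/`ASSISTANT:` prompt format (the image path is a placeholder):

```python
from vllm import LLM
from PIL import Image

# Assumed vision-language model; any supported VLM follows the same pattern.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# The prompt must follow the format documented on the model's HuggingFace page.
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

# Placeholder local file; any PIL image works.
image = Image.open("cherry_blossom.jpg")

outputs = llm.generate({
    "prompt": prompt,
    "multi_modal_data": {"image": image},
})

for o in outputs:
    print(o.outputs[0].text)
```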
@@ -120,20 +120,20 @@ for o in outputs:
    print(generated_text)
```

-### Video
+### Video Inputs

You can pass a list of NumPy arrays directly to the `'video'` field of the multi-modal dictionary
instead of using multi-image input.

Full example: <gh-file:examples/offline_inference/vision_language.py>
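A rough sketch of that call, assuming LLaVA-OneVision and that a single clip can be passed as one stacked `(num_frames, height, width, channels)` array; the prompt template below is illustrative only:

```python
import numpy as np
from vllm import LLM

# Assumed video-capable model; check its model card for the real prompt template.
llm = LLM(model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# Eight dummy RGB frames standing in for a decoded video clip.
frames = np.random.randint(0, 256, size=(8, 384, 384, 3), dtype=np.uint8)

outputs = llm.generate({
    "prompt": "USER: <video>\nDescribe this video.\nASSISTANT:",  # illustrative placeholder format
    "multi_modal_data": {"video": frames},
})

for o in outputs:
    print(o.outputs[0].text)
```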

-### Audio
+### Audio Inputs

You can pass a tuple `(array, sampling_rate)` to the `'audio'` field of the multi-modal dictionary.

Full example: <gh-file:examples/offline_inference/audio_language.py>
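A minimal sketch, assuming an Ultravox checkpoint; `librosa.load` already returns the expected `(array, sampling_rate)` tuple, while the audio placeholder in the prompt is model-specific and shown here only for illustration:

```python
import librosa
from vllm import LLM

# Assumed model id for an Ultravox-style audio model.
llm = LLM(model="fixie-ai/ultravox-v0_5-llama-3_2-1b")

# librosa returns (waveform, sampling_rate); sr=None keeps the native sampling rate.
audio, sampling_rate = librosa.load("speech_sample.wav", sr=None)

outputs = llm.generate({
    "prompt": "<|user|>\n<|audio|>What is being said in this clip?\n<|assistant|>\n",  # illustrative template
    "multi_modal_data": {"audio": (audio, sampling_rate)},
})

for o in outputs:
    print(o.outputs[0].text)
```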

-### Embedding
+### Embedding Inputs

To input pre-computed embeddings belonging to a data type (i.e. image, video, or audio) directly to the language model,
pass a tensor of shape `(num_items, feature_size, hidden_size of LM)` to the corresponding field of the multi-modal dictionary.
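A rough sketch for the image case, assuming LLaVA-1.5; the sizes below (576 visual features, 4096 hidden size) are illustrative and must match the actual vision encoder and language model:

```python
import torch
from vllm import LLM

# Assumed model; the last tensor dimension must equal this LM's hidden size.
llm = LLM(model="llava-hf/llava-1.5-7b-hf")

# One pre-computed image embedding: (num_items, feature_size, hidden_size of LM).
image_embeds = torch.rand(1, 576, 4096)

outputs = llm.generate({
    "prompt": "USER: <image>\nDescribe the image.\nASSISTANT:",
    "multi_modal_data": {"image": image_embeds},
})

for o in outputs:
    print(o.outputs[0].text)
```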
@@ -211,7 +211,7 @@ The chat template can be inferred based on the documentation on the model's Hugg
For example, LLaVA-1.5 (`llava-hf/llava-1.5-7b-hf`) requires a chat template that can be found here: <gh-file:examples/template_llava.jinja>
:::

-### Image
+### Image Inputs

Image input is supported according to [OpenAI Vision API](https://platform.openai.com/docs/guides/vision).
Here is a simple example using Phi-3.5-Vision.
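In outline, the request is a standard Vision API chat completion sent through the official `openai` client; the serve command in the comment and the sample image URL are assumptions for illustration:

```python
from openai import OpenAI

# Assumes a running server, started with something like:
#   vllm serve microsoft/Phi-3.5-vision-instruct --trust-remote-code --max-model-len 4096
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"

chat_response = client.chat.completions.create(
    model="microsoft/Phi-3.5-vision-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)
```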
@@ -293,7 +293,7 @@ export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>

:::

-### Video
+### Video Inputs

Instead of `image_url`, you can pass a video file via `video_url`. Here is a simple example using [LLaVA-OneVision](https://huggingface.co/llava-hf/llava-onevision-qwen2-0.5b-ov-hf).

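In outline, the request mirrors the `image_url` layout with a `video_url` content part instead; the model id matches LLaVA-OneVision above and the video URL is a placeholder:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder URL; the server must be able to fetch it within VLLM_VIDEO_FETCH_TIMEOUT.
video_url = "http://example.com/sample_demo.mp4"

chat_response = client.chat.completions.create(
    model="llava-hf/llava-onevision-qwen2-0.5b-ov-hf",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is happening in this video?"},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)
```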
@@ -356,7 +356,7 @@ export VLLM_VIDEO_FETCH_TIMEOUT=<timeout>

:::

-### Audio
+### Audio Inputs

Audio input is supported according to [OpenAI Audio API](https://platform.openai.com/docs/guides/audio?audio-generation-quickstart-example=audio-in).
Here is a simple example using Ultravox-v0.5-1B.
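In outline, a local clip can be sent inline as base64 using the OpenAI `input_audio` content part; the model id below is an assumed HuggingFace id for Ultravox-v0.5-1B and the file name is a placeholder:

```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Placeholder file; it is sent inline as base64 per the OpenAI Audio API schema.
with open("speech_sample.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode("utf-8")

chat_response = client.chat.completions.create(
    model="fixie-ai/ultravox-v0_5-llama-3_2-1b",  # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this audio?"},
            {"type": "input_audio", "input_audio": {"data": audio_base64, "format": "wav"}},
        ],
    }],
)
print("Chat completion output:", chat_response.choices[0].message.content)
```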
@@ -460,77 +460,6 @@ export VLLM_AUDIO_FETCH_TIMEOUT=<timeout>

:::

-### Embedding
+### Embedding Inputs

-vLLM's Embeddings API is a superset of OpenAI's [Embeddings API](https://platform.openai.com/docs/api-reference/embeddings),
-where a list of chat `messages` can be passed instead of batched `inputs`. This enables multi-modal inputs to be passed to embedding models.
-
-:::{tip}
-The schema of `messages` is exactly the same as in the Chat Completions API.
-You can refer to the above tutorials for more details on how to pass each type of multi-modal data.
-:::
-
-Usually, embedding models do not expect chat-based input, so we need to use a custom chat template to format the text and images.
-Refer to the examples below for illustration.
-
-Here is an end-to-end example using VLM2Vec. To serve the model:
-
-```bash
-vllm serve TIGER-Lab/VLM2Vec-Full --task embed \
-  --trust-remote-code --max-model-len 4096 --chat-template examples/template_vlm2vec.jinja
-```
-
-:::{important}
-Since VLM2Vec has the same model architecture as Phi-3.5-Vision, we have to explicitly pass `--task embed`
-to run this model in embedding mode instead of text generation mode.
-
-The custom chat template is completely different from the original one for this model,
-and can be found here: <gh-file:examples/template_vlm2vec.jinja>
-:::
-
-Since the request schema is not defined by the OpenAI client, we post a request to the server using the lower-level `requests` library:
-
-```python
-import requests
-
-image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
-
-response = requests.post(
-    "http://localhost:8000/v1/embeddings",
-    json={
-        "model": "TIGER-Lab/VLM2Vec-Full",
-        "messages": [{
-            "role": "user",
-            "content": [
-                {"type": "image_url", "image_url": {"url": image_url}},
-                {"type": "text", "text": "Represent the given image."},
-            ],
-        }],
-        "encoding_format": "float",
-    },
-)
-response.raise_for_status()
-response_json = response.json()
-print("Embedding output:", response_json["data"][0]["embedding"])
-```
-
-Below is another example, this time using the `MrLight/dse-qwen2-2b-mrl-v1` model.
-
-```bash
-vllm serve MrLight/dse-qwen2-2b-mrl-v1 --task embed \
-  --trust-remote-code --max-model-len 8192 --chat-template examples/template_dse_qwen2_vl.jinja
-```
-
-:::{important}
-Like with VLM2Vec, we have to explicitly pass `--task embed`.
-
-Additionally, `MrLight/dse-qwen2-2b-mrl-v1` requires an EOS token for embeddings, which is handled
-by a custom chat template: <gh-file:examples/template_dse_qwen2_vl.jinja>
-:::
-
-:::{important}
-Also important, `MrLight/dse-qwen2-2b-mrl-v1` requires a placeholder image of the minimum image size for text query embeddings. See the full code
-example below for details.
-:::
-
-Full example: <gh-file:examples/online_serving/openai_chat_embedding_client_for_multimodal.py>
+TBD