
Commit dff3cc4

DarkLight1337 authored and lulmer committed
[VLM] Generalized prompt updates for multi-modal processor (vllm-project#13964)
Signed-off-by: DarkLight1337 <[email protected]>
Signed-off-by: Louis Ulmer <[email protected]>
1 parent 1f86d56 commit dff3cc4

29 files changed (+635, -492 lines)

docs/source/contributing/model/multimodal.md

Lines changed: 13 additions & 13 deletions
@@ -720,13 +720,13 @@ def _get_mm_fields_config(
 
 :::::
 
-### Prompt replacements
+### Prompt updates
 
-Override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_replacements` to
-return a list of {class}`~vllm.multimodal.processing.PromptReplacement` instances.
+Override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates` to
+return a list of {class}`~vllm.multimodal.processing.PromptUpdate` instances.
 
-Each {class}`~vllm.multimodal.processing.PromptReplacement` instance specifies a find-and-replace
-operation performed by the HF processor.
+Each {class}`~vllm.multimodal.processing.PromptUpdate` instance specifies an update operation
+(e.g.: insertion, replacement) performed by the HF processor.
 
 ::::{tab-set}
 :::{tab-item} Basic example: LLaVA
@@ -743,15 +743,15 @@ for sample in text:
 ```
 
 It simply repeats each input `image_token` a number of times equal to the number of placeholder feature tokens (`num_image_tokens`).
-Based on this, we override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_replacements` as follows:
+Based on this, we override {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates` as follows:
 
 ```python
-def _get_prompt_replacements(
+def _get_prompt_updates(
     self,
     mm_items: MultiModalDataItems,
     hf_processor_mm_kwargs: Mapping[str, object],
     out_mm_kwargs: MultiModalKwargs,
-) -> list[PromptReplacement]:
+) -> Sequence[PromptUpdate]:
     hf_config = self.info.get_hf_config()
     image_token_id = hf_config.image_token_index
 
@@ -859,7 +859,7 @@ prompt_tokens, prompts_length = _tokenize_prompts_with_image_and_batch(
 )
 ```
 
-To accommodate this, instead of a string you can return an instance of `PromptReplacementDetails`
+To accommodate this, instead of a string you can return an instance of `PromptUpdateDetails`
 with different `full` and `feature` attributes:
 
 ```python
@@ -878,7 +878,7 @@ def get_replacement_fuyu(item_idx: int):
     image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
                     [_NEWLINE_TOKEN_ID]) * nrows
 
-    return PromptReplacementDetails(
+    return PromptUpdateDetails(
         full=image_tokens + [bos_token_id],
         features=image_tokens,
     )
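
To make the `full`/`features` distinction in the hunk above concrete, here is a small illustrative sketch. The token IDs and the 3x2 patch grid are made up for the example, and the import path is assumed from the class references used elsewhere on this page; treat it as an illustration, not as code from this commit:

```python
from vllm.multimodal.processing import PromptUpdateDetails  # assumed import path

# Hypothetical 3-column x 2-row patch grid; token IDs are placeholders.
_IMAGE_TOKEN_ID = 71011
_NEWLINE_TOKEN_ID = 71019
bos_token_id = 1

image_tokens = ([_IMAGE_TOKEN_ID] * 3 + [_NEWLINE_TOKEN_ID]) * 2

update = PromptUpdateDetails(
    full=image_tokens + [bos_token_id],  # every token written into the prompt
    features=image_tokens,               # the subset backed by vision features
)
```

The BOS token is part of the text inserted into the prompt (`full`) but is not counted as an image feature token (`features`), which is exactly the mismatch the Fuyu example needs to express.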
@@ -888,12 +888,12 @@ Finally, noticing that the HF processor removes the `|ENDOFTEXT|` token from the
 we can search for it to conduct the replacement at the start of the string:
 
 ```python
-def _get_prompt_replacements(
+def _get_prompt_updates(
     self,
     mm_items: MultiModalDataItems,
     hf_processor_mm_kwargs: Mapping[str, object],
     out_mm_kwargs: MultiModalKwargs,
-) -> list[PromptReplacement]:
+) -> Sequence[PromptUpdate]:
     hf_config = self.info.get_hf_config()
     bos_token_id = hf_config.bos_token_id
     assert isinstance(bos_token_id, int)
@@ -913,7 +913,7 @@ def _get_prompt_replacements(
         image_tokens = ([_IMAGE_TOKEN_ID] * ncols +
                         [_NEWLINE_TOKEN_ID]) * nrows
 
-        return PromptReplacementDetails(
+        return PromptUpdateDetails(
             full=image_tokens + [bos_token_id],
             features=image_tokens,
         )
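
Pieced together from the LLaVA hunks earlier in this file, the renamed override would look roughly like the following sketch. The method body beyond what the diff shows is an assumption: the `ImageProcessorItems` parser class, the `get_num_image_tokens` helper, and the `PromptReplacement(target=..., replacement=...)` arguments follow the surrounding documentation rather than this commit itself.

```python
def _get_prompt_updates(
    self,
    mm_items: MultiModalDataItems,
    hf_processor_mm_kwargs: Mapping[str, object],
    out_mm_kwargs: MultiModalKwargs,
) -> Sequence[PromptUpdate]:
    hf_config = self.info.get_hf_config()
    image_token_id = hf_config.image_token_index

    def get_replacement(item_idx: int):
        # Look up the size of the item_idx-th image to know how many
        # feature placeholder tokens it expands into.
        images = mm_items.get_items("image", ImageProcessorItems)
        image_size = images.get_image_size(item_idx)

        num_image_tokens = self.info.get_num_image_tokens(
            image_width=image_size.width,
            image_height=image_size.height,
        )
        return [image_token_id] * num_image_tokens

    # Replace each single `<image>` token with the full run of feature tokens.
    return [
        PromptReplacement(
            modality="image",
            target=[image_token_id],
            replacement=get_replacement,
        ),
    ]
```

In the Fuyu-style hunks above, the nested replacement function instead returns `PromptUpdateDetails(full=..., features=...)` so that the trailing BOS token is written into the prompt without being counted as part of the image features.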

docs/source/design/mm_processing.md

Lines changed: 14 additions & 9 deletions
@@ -6,11 +6,16 @@ To enable various optimizations in vLLM such as [chunked prefill](#chunked-prefi
 
 Here are the main features of {class}`~vllm.multimodal.processing.BaseMultiModalProcessor`:
 
-## Prompt Replacement Detection
+## Prompt Update Detection
 
-One of the main responsibilities of the HF processor is to replace input placeholder tokens (e.g. `<image>` for a single image) with feature placeholder tokens (e.g. `<image><image>...<image>`, the number of which equals the feature size). The information about which tokens have been replaced is key to finding the correspondence between placeholder feature tokens and multi-modal inputs.
+One of the main responsibilities of the HF processor is to update the prompt with placeholder tokens. For example:
 
-In vLLM, this information is specified using {class}`~vllm.multimodal.processing.PromptReplacement` in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_replacements`. Given this specification, we can automatically detect whether HF has replaced the input placeholder tokens by checking whether the feature placeholder tokens exist in the prompt.
+- Insert feature placeholder tokens (e.g. `<image><image>...<image>`, the number of which equals the feature size) at the start of the string.
+- Replace existing input placeholder tokens (e.g. `<image>` for a single image) with feature placeholder tokens (e.g. `<image><image>...<image>`, the number of which equals the feature size).
+
+The information about which tokens have been updated is key to finding the correspondence between placeholder feature tokens and multi-modal inputs.
+
+In vLLM, this information is specified using {class}`~vllm.multimodal.processing.PromptUpdate` in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`. We can automatically detect whether HF has updated the prompt by checking the existence of the updated tokens.
 
 ## Tokenized Prompt Inputs
 
@@ -22,7 +27,7 @@ Consider that HF processors follow these main steps:
 
 1. Tokenize the text
 2. Process multi-modal inputs
-3. Perform prompt replacement
+3. Perform prompt updates
 
 And we require that:
 
@@ -44,21 +49,21 @@ Moreover, since the tokenized text has not passed through the HF processor, we h
 
 We work around the first issue by requiring each model to define how to generate dummy text based on the number of multi-modal inputs, via {meth}`~vllm.multimodal.profiling.BaseDummyInputsBuilder.get_dummy_processor_inputs`. This lets us generate dummy text corresponding to the multi-modal inputs and input them together to obtain the processed multi-modal data.
 
-(mm-automatic-prompt-replacement)=
+(mm-automatic-prompt-updating)=
 
-### Automatic prompt replacement
+### Automatic prompt updating
 
 We address the second issue by implementing model-agnostic code in
-{meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_replacements` to automatically replace input placeholder tokens with feature placeholder tokens based on the specification outputted by {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_replacements`.
+{meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_prompt_updates` to automatically update the prompt with feature placeholder tokens based on the specification outputted by {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._get_prompt_updates`.
 
 ### Summary
 
-With the help of dummy text and automatic prompt replacement, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main`.
+With the help of dummy text and automatic prompt updating, our multi-modal processor can finally accept both text and token prompts with multi-modal data. The detailed logic is shown in {meth}`~vllm.multimodal.processing.BaseMultiModalProcessor._apply_hf_processor_main`.
 
 ## Processor Output Caching
 
 Some HF processors, such as the one for Qwen2-VL, are [very slow](gh-issue:9238). To alleviate this problem, we cache the multi-modal outputs of HF processor to avoid processing the same multi-modal input (e.g. image) again.
 
 When new data is passed in, we first check which items are in the cache, and which ones are missing. The missing items are passed into the HF processor in a single batch and cached, before being merged with the existing items in the cache.
 
-Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text](#mm-dummy-text) to avoid HF errors. Since this skips HF's prompt replacement code, we apply [automatic prompt replacement](#mm-automatic-prompt-replacement) afterwards to keep the output tokens and multi-modal data consistent with each other.
+Since we only process the missing multi-modal data items, the number of input placeholder tokens no longer corresponds to the number of the multi-modal inputs, so they can't be passed alongside the text prompt to HF processor. Therefore, we process the text and multi-modal inputs separately, using [dummy text](#mm-dummy-text) to avoid HF errors. Since this skips HF's prompt updating code, we apply [automatic prompt updating](#mm-automatic-prompt-updating) afterwards to keep the output tokens and multi-modal data consistent with each other.
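
As a rough mental model of what "detecting whether HF has updated the prompt" and "automatic prompt updating" mean, here is a simplified toy in plain Python. It is not the actual `_apply_prompt_updates` implementation; the real code works on tokenized prompts per modality and item, and also handles insertions, but the detect-then-apply idea is the same:

```python
from collections.abc import Sequence


def find_subsequence(prompt: list[int], target: Sequence[int]) -> int:
    """Return the index of the first occurrence of `target` in `prompt`, or -1."""
    n = len(target)
    for i in range(len(prompt) - n + 1):
        if prompt[i:i + n] == list(target):
            return i
    return -1


def apply_replacement(
    prompt: list[int],
    target: Sequence[int],
    replacement: Sequence[int],
) -> list[int]:
    """Toy version of automatic prompt updating for a single replacement.

    If the replacement tokens are already present, assume the HF processor
    has already updated the prompt and leave it alone; otherwise, replace
    the first match of `target` ourselves.
    """
    if find_subsequence(prompt, replacement) != -1:
        return prompt  # already updated by the HF processor

    start = find_subsequence(prompt, target)
    if start == -1:
        return prompt  # nothing to update

    return prompt[:start] + list(replacement) + prompt[start + len(target):]


# Example: expand a single <image> token (id 32000) into 4 feature tokens.
print(apply_replacement([1, 32000, 306], target=[32000], replacement=[32000] * 4))
```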
