[Feature] Improve embedding merge implementation

### 🚀 The feature, motivation and pitch

Currently, text embeddings and multimodal embeddings are merged by checking `input_ids` in the current batch and scattering the multimodal embeddings into the positions that hold placeholder ids. See more details [here](https://github.com/ywang96/vllm/blob/main/vllm/model_executor/models/utils.py#L478C2-L478C32).

An alternative that should be investigated is to gather this information from the `mm_positions` of the scheduled requests and form a batch-level mask (similar to `grammar_bitmask`).
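To make the idea concrete, here is a rough, framework-free sketch of how such a mask could be built. Everything in it is an assumption for illustration: `PlaceholderRange`, `ScheduledRequest`, `build_mm_mask`, and `merge_embeddings` are hypothetical stand-ins rather than vLLM's actual classes or functions, and plain Python lists stand in for tensors. It assumes each request knows the prompt-relative ranges of its multimodal placeholders and which slice of its tokens is scheduled in the current step.

```python
from dataclasses import dataclass


@dataclass
class PlaceholderRange:
    """Hypothetical: one multimodal item's span within a request's prompt."""
    offset: int  # index of the first placeholder token in the prompt
    length: int  # number of placeholder tokens


@dataclass
class ScheduledRequest:
    """Hypothetical view of one request in the current batch."""
    mm_positions: list[PlaceholderRange]
    num_computed_tokens: int   # tokens already processed in earlier steps
    num_scheduled_tokens: int  # tokens processed in this step


def build_mm_mask(requests: list[ScheduledRequest]) -> list[bool]:
    """Build a flat mask over the batch: True where a multimodal embedding goes."""
    mask: list[bool] = []
    for req in requests:
        start = req.num_computed_tokens
        end = start + req.num_scheduled_tokens
        req_mask = [False] * req.num_scheduled_tokens
        for pos in req.mm_positions:
            # Clip the placeholder range to the window scheduled in this step.
            lo = max(pos.offset, start)
            hi = min(pos.offset + pos.length, end)
            for i in range(lo, hi):
                req_mask[i - start] = True
        mask.extend(req_mask)
    return mask


def merge_embeddings(text_embeds: list, mm_embeds: list, mask: list[bool]) -> list:
    """Replace masked positions with multimodal embeddings, in order."""
    if sum(mask) != len(mm_embeds):
        raise ValueError("mask and multimodal embeddings disagree in size")
    mm_iter = iter(mm_embeds)
    return [next(mm_iter) if m else t for t, m in zip(text_embeds, mask)]


# Request 0 schedules tokens 0..3 with placeholders at 1..2.
# Request 1 schedules tokens 2..3 of a prompt whose placeholders span 0..2.
requests = [
    ScheduledRequest([PlaceholderRange(1, 2)], 0, 4),
    ScheduledRequest([PlaceholderRange(0, 3)], 2, 2),
]
mask = build_mm_mask(requests)
merged = merge_embeddings(
    ["t0", "t1", "t2", "t3", "t4", "t5"], ["m0", "m1", "m2"], mask
)
```

With this approach the mask comes from scheduler metadata alone, so the merge step would not need to compare `input_ids` against placeholder ids. In a tensor setting the same mask could presumably be applied with boolean-mask assignment, but that is left out here to keep the sketch self-contained.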

### Alternatives

_No response_

### Additional context

_No response_

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

[Feature] Improve embedding merge implementation #23891