Conversation

pei0033 (Collaborator) commented Nov 10, 2025

🚀 Summary of Changes

  • Implemented a _pool() method in RBLNModelRunner for pooling-model inference, based on the GPUModelRunner implementation (see the sketch after this list)
  • Adjusted the warmup logic to handle pooling-model initialization
  • Added example scripts for the Qwen3 embedding and reranker models
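
For reference, here is a minimal sketch of the pooling step that _pool() performs, following the pattern of upstream vLLM's GPUModelRunner: the model's flattened hidden states are split back into per-request chunks and pooled on the host. Function and variable names are illustrative, not the exact vllm-rbln code.

```python
from typing import List

import numpy as np
import torch


def pool_hidden_states(
    hidden_states: torch.Tensor,          # [total_tokens, hidden_size]
    num_scheduled_tokens_np: np.ndarray,  # tokens scheduled per request
) -> List[torch.Tensor]:
    # Recover per-request hidden states from the flattened batch.
    per_request = torch.split(hidden_states,
                              num_scheduled_tokens_np.tolist(), dim=0)
    # Last-token pooling, as used by Qwen3-Embedding-style models; a real
    # pooler dispatches on the model's pooling config (mean, CLS, last, ...).
    return [h[-1] for h in per_request]


# Two requests with 3 and 5 scheduled tokens, hidden size 8.
pooled = pool_hidden_states(torch.randn(8, 8), np.array([3, 5]))
assert all(p.shape == torch.Size([8]) for p in pooled)
```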

📌 Related Issues / Tickets


✅ Type of Change

  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (bug-fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

For Qwen3 Embedding

  1. Run the embedding example (sketched below):
     RBLN_PROFILER=0 RBLN_KERNEL_MODE=triton VLLM_RBLN_USE_VLLM_MODEL=1 VLLM_USE_V1=1 python examples/experimental/qwen3_embedding.py
  2. Verify that embedding vectors are generated for input texts
  3. Expected output: Similarity scores between queries and documents.
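
For orientation, a hedged sketch of what an embedding example like examples/experimental/qwen3_embedding.py typically does, using vLLM's standard offline embedding API (exact arguments may vary by vLLM version). The model name, prompts, and exact script contents are assumptions.

```python
import torch
from vllm import LLM

# Illustrative checkpoint; the actual example may load a different model.
llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

queries = ["What is the capital of China?"]
documents = ["The capital of China is Beijing.",
             "Gravity is the curvature of spacetime."]

# One embedding vector per input text.
outputs = llm.embed(queries + documents)
embeds = torch.tensor([o.outputs.embedding for o in outputs])
embeds = torch.nn.functional.normalize(embeds, dim=-1)

# Cosine similarity between each query and each document.
scores = embeds[: len(queries)] @ embeds[len(queries):].T
print(scores)
```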

For Qwen3 Reranker

  1. Run the reranker example (sketched below):
     RBLN_PROFILER=0 RBLN_KERNEL_MODE=triton VLLM_RBLN_USE_VLLM_MODEL=1 VLLM_USE_V1=1 python examples/experimental/qwen3_reranker.py
  2. Verify that relevance scores are computed for query-document pairs
  3. Expected output: Score list showing document relevance.
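
Likewise, a hedged sketch of the reranking flow, using vLLM's generic cross-encoder scoring API. The model name is illustrative, and Qwen3-Reranker may additionally require model-specific configuration (e.g. hf_overrides) that the actual example script handles.

```python
from vllm import LLM

# Illustrative checkpoint; consult the actual example for required overrides.
llm = LLM(model="Qwen/Qwen3-Reranker-0.6B", task="score")

query = "What is the capital of China?"
documents = ["The capital of China is Beijing.",
             "Gravity is the curvature of spacetime."]

# One relevance score per query-document pair.
outputs = llm.score(query, documents)
print([o.outputs.score for o in outputs])
```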

📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes

  • The implementation follows the same pattern as upstream vLLM's V1 engine pooling-model support.
  • The warmup logic has been adjusted to properly initialize pooling models.
  • This is an initial implementation; additional model types (e.g., BERT-based models) may be added in future PRs.
  • The pooling function currently runs on CPU without RBLN compilation. Future optimization may include compiling this operation for better performance.

pei0033 self-assigned this Nov 10, 2025
pei0033 added the torch.compile label Nov 10, 2025
Review thread on the _pool() signature:

    num_scheduled_tokens: int,
    num_scheduled_tokens_np: np.ndarray,
    kv_connector_output: Optional[KVConnectorOutput],
) -> ModelRunnerOutput:

Collaborator:

I’m curious whether any modifications were made from the GPU's _pool().

Collaborator:

If there are any changes, I’d appreciate it if you could leave a comment on the corresponding code snippets.

pei0033 (Author):

The current initial implementation runs the pooler on the CPU, in the same manner as the GPU code.
However, certain poolers require significant computation (e.g., an LM head). Moving these onto the RBLN device will likely be necessary for optimization.
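
(To illustrate the cost, here is a hedged, self-contained sketch of a Qwen3-Reranker-style pooler, which projects the last hidden state through the LM head and reads off the "yes"/"no" token logits, so the pooling step itself contains a large matmul. All shapes and token ids below are illustrative.)

```python
import torch

hidden_size, vocab_size = 1024, 151_936  # illustrative Qwen-like sizes
yes_id, no_id = 9454, 2152               # hypothetical "yes"/"no" token ids

lm_head = torch.nn.Linear(hidden_size, vocab_size, bias=False)
last_hidden = torch.randn(1, hidden_size)  # pooled last-token hidden state

logits = lm_head(last_hidden)              # [1, vocab_size]: the heavy matmul
score = torch.softmax(logits[0, [yes_id, no_id]], dim=-1)[0]  # P("yes")
print(float(score))
```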

rebel-jiwoopark (Collaborator):

Can this implementation also cover pooling models other than Qwen3? (such as BERT, ...)

rebel-jiwoopark (Collaborator):

@huijjj It would be great if you could review the warmup-related code.

pei0033 commented Dec 3, 2025

> Can this implementation also cover pooling models other than Qwen3? (such as BERT, ...)

I tested with other models such as BERT (BAAI/bge-base-en-v1.5) and RoBERTa (sentence-transformers/all-roberta-large-v1), but encountered the error below when launching the engine.
It appears that additional implementation work is needed to support encoder-based models.

Error log
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718] EngineCore failed to start.
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718] Traceback (most recent call last):
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 192, in _initialize_kv_caches
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     kv_cache_configs = [
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 193, in <listcomp>
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1112, in get_kv_cache_config
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     unify_hybrid_kv_cache_specs(kv_cache_spec)
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1051, in unify_hybrid_kv_cache_specs
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     if is_kv_cache_type_uniform(kv_cache_spec):
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 762, in is_kv_cache_type_uniform
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     _ = kv_cache_spec_values[0].merge(kv_cache_spec_values)
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718] IndexError: list index out of range
(EngineCore_DP0 pid=373150) Process EngineCore_DP0:
(EngineCore_DP0 pid=373150) Traceback (most recent call last):
(EngineCore_DP0 pid=373150)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=373150)     self.run()
(EngineCore_DP0 pid=373150)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=373150)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_DP0 pid=373150)     raise e
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=373150)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=373150)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_DP0 pid=373150)     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 192, in _initialize_kv_caches
(EngineCore_DP0 pid=373150)     kv_cache_configs = [
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 193, in <listcomp>
(EngineCore_DP0 pid=373150)     get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1112, in get_kv_cache_config
(EngineCore_DP0 pid=373150)     unify_hybrid_kv_cache_specs(kv_cache_spec)
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1051, in unify_hybrid_kv_cache_specs
(EngineCore_DP0 pid=373150)     if is_kv_cache_type_uniform(kv_cache_spec):
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 762, in is_kv_cache_type_uniform
(EngineCore_DP0 pid=373150)     _ = kv_cache_spec_values[0].merge(kv_cache_spec_values)
(EngineCore_DP0 pid=373150) IndexError: list index out of range

rebel-jaehwang (Contributor):

#167 might have fixed this issue.

pei0033 force-pushed the feat/pooling-model branch from 6884ee8 to 7e4ed8a on December 3, 2025 06:21
pei0033 commented Dec 3, 2025

> #167 might have fixed this issue.

Applying #167 resolved the previous error, but I encountered a new one. It seems to be an issue related to the use of EncoderOnlyAttentionBuilder from upstream vLLM.

Error log
(EngineCore_DP0 pid=471500) Process EngineCore_DP0:
(EngineCore_DP0 pid=471500) Traceback (most recent call last):
(EngineCore_DP0 pid=471500)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=471500)     self.run()
(EngineCore_DP0 pid=471500)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=471500)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_DP0 pid=471500)     raise e
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=471500)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=471500)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_DP0 pid=471500)     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 215, in _initialize_kv_caches
(EngineCore_DP0 pid=471500)     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/abstract.py", line 74, in initialize_from_config
(EngineCore_DP0 pid=471500)     self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_DP0 pid=471500)     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/utils/__init__.py", line 3060, in run_method
(EngineCore_DP0 pid=471500)     return func(*args, **kwargs)
(EngineCore_DP0 pid=471500)   File "/workspace/vllm-rbln/vllm_rbln/v1/worker/rbln_worker.py", line 225, in compile_or_warm_up_model
(EngineCore_DP0 pid=471500)     self.model_runner.warmup_model()
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=471500)     return func(*args, **kwargs)
(EngineCore_DP0 pid=471500)   File "/workspace/vllm-rbln/vllm_rbln/v1/worker/rbln_model_runner.py", line 1433, in warmup_model
(EngineCore_DP0 pid=471500)     self._execute_dummy_requests(dummy_prefill_requests,
(EngineCore_DP0 pid=471500)   File "/workspace/vllm-rbln/vllm_rbln/v1/worker/rbln_model_runner.py", line 1531, in _execute_dummy_requests
(EngineCore_DP0 pid=471500)     self.execute_model(sched_output)
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=471500)     return func(*args, **kwargs)
(EngineCore_DP0 pid=471500)   File "/workspace/vllm-rbln/vllm_rbln/v1/worker/rbln_model_runner.py", line 1688, in execute_model
(EngineCore_DP0 pid=471500)     max_query_len) = self._prepare_inputs(scheduler_output)
(EngineCore_DP0 pid=471500)   File "/workspace/vllm-rbln/vllm_rbln/v1/worker/rbln_model_runner.py", line 1007, in _prepare_inputs
(EngineCore_DP0 pid=471500)     attn_metadata_i = builder.build(
(EngineCore_DP0 pid=471500) TypeError: create_encoder_only_attention_backend.<locals>.EncoderOnlyAttentionBuilder.build() got an unexpected keyword argument 'num_prompt_tokens'

rebel-jaehwang (Contributor):

Added a potential fix. Does it fix the issue?

pei0033 commented Dec 3, 2025

> Added a potential fix. Does it fix the issue?

I am still encountering the same error.

(Pdb) builder
<vllm.attention.layers.encoder_only_attention.create_encoder_only_attention_backend.<locals>.EncoderOnlyAttentionBuilder object at 0x7a29ffa96ad0>
(Pdb) isinstance(builder, RBLNFlashAttentionMetadataBuilder)
True

As shown in the debug output above, the EncoderOnlyAttentionBuilder instance inherits from RBLNFlashAttentionMetadataBuilder.
Even if I manually adjust extra_attn_metadata_args, the upstream EncoderOnlyAttentionBuilder.build() calls super().build(), which inevitably requires "num_prompt_tokens" and "positions".
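
(A minimal, self-contained sketch of the clash, with illustrative class bodies rather than actual vLLM code: the RBLN builder extends build() with backend-specific arguments, while the upstream wrapper pins build() to the generic signature, so the two cannot be reconciled from the plugin side alone.)

```python
class RBLNFlashAttentionMetadataBuilder:
    # RBLN variant: build() requires backend-specific extra arguments.
    def build(self, common_attn_metadata, num_prompt_tokens, positions):
        return ("rbln-metadata", num_prompt_tokens, positions)


def create_encoder_only_attention_backend(underlying_builder_cls):
    # Upstream wrapper: build() only knows the generic signature.
    class EncoderOnlyAttentionBuilder(underlying_builder_cls):
        def build(self, common_attn_metadata):
            # Even if the extra kwargs are no longer passed in, this
            # super().build() call still fails, because the RBLN build()
            # requires num_prompt_tokens and positions.
            return super().build(common_attn_metadata)

    return EncoderOnlyAttentionBuilder


builder = create_encoder_only_attention_backend(
    RBLNFlashAttentionMetadataBuilder)()

try:
    # What the RBLN model runner currently does: pass the extra kwargs.
    builder.build(common_attn_metadata=None,
                  num_prompt_tokens=4, positions=None)
except TypeError as e:
    print(e)  # build() got an unexpected keyword argument 'num_prompt_tokens'
```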

rebel-jaehwang (Contributor):

Hmm, then I guess we need to change the signature of AttentionMetadataBuilder.build in the upstream. GDNAttentionMetadataBuilder would have the same problem in the upstream, though I'm not sure if there's a pooling model with GDNA.
