Conversation

pei0033 (Collaborator) commented Nov 10, 2025

🚀 Summary of Changes

  • Implemented a _pool() method in RBLNModelRunner for pooling-model inference, based on the GPUModelRunner implementation (see the sketch after this list)
  • Adjusted the warmup logic to handle pooling-model initialization
  • Added example scripts for the Qwen3 embedding and reranker models
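
For reference, here is a minimal sketch of the pooling step that _pool() performs, following the pattern of upstream vLLM's GPUModelRunner: the model's flattened hidden states are split back into per-request chunks and pooled on the host. Function and variable names are illustrative, not the exact vllm-rbln code.

```python
from typing import List

import numpy as np
import torch


def pool_hidden_states(
    hidden_states: torch.Tensor,          # [total_tokens, hidden_size]
    num_scheduled_tokens_np: np.ndarray,  # tokens scheduled per request
) -> List[torch.Tensor]:
    # Recover per-request hidden states from the flattened batch.
    per_request = torch.split(hidden_states,
                              num_scheduled_tokens_np.tolist(), dim=0)
    # Last-token pooling, as used by Qwen3-Embedding-style models; a real
    # pooler dispatches on the model's pooling config (mean, CLS, last, ...).
    return [h[-1] for h in per_request]


# Two requests with 3 and 5 scheduled tokens, hidden size 8.
pooled = pool_hidden_states(torch.randn(8, 8), np.array([3, 5]))
assert all(p.shape == torch.Size([8]) for p in pooled)
```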

📌 Related Issues / Tickets


✅ Type of Change

  • ✨ Feature (feature)
  • 🧠 Model support (model)
  • 🧬 Core engine changes (core)
  • 🛠 Bug fix (bug-fix)
  • ⚙️ Performance improvement (perf)
  • 🔁 Refactor or code cleanup (refactor)
  • 📄 Documentation (docs)
  • ❓ Other (other): please describe

🧪 How to Test

For Qwen3 Embedding

  1. Run the embedding example (sketched below):
     RBLN_PROFILER=0 RBLN_KERNEL_MODE=triton VLLM_RBLN_USE_VLLM_MODEL=1 VLLM_USE_V1=1 python examples/experimental/qwen3_embedding.py
  2. Verify that embedding vectors are generated for input texts
  3. Expected output: Similarity scores between queries and documents.
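
For orientation, a hedged sketch of what an embedding example like examples/experimental/qwen3_embedding.py typically does, using vLLM's standard offline embedding API (exact arguments may vary by vLLM version). The model name, prompts, and exact script contents are assumptions.

```python
import torch
from vllm import LLM

# Illustrative checkpoint; the actual example may load a different model.
llm = LLM(model="Qwen/Qwen3-Embedding-0.6B", task="embed")

queries = ["What is the capital of China?"]
documents = ["The capital of China is Beijing.",
             "Gravity is the curvature of spacetime."]

# One embedding vector per input text.
outputs = llm.embed(queries + documents)
embeds = torch.tensor([o.outputs.embedding for o in outputs])
embeds = torch.nn.functional.normalize(embeds, dim=-1)

# Cosine similarity between each query and each document.
scores = embeds[: len(queries)] @ embeds[len(queries):].T
print(scores)
```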

For Qwen3 Reranker

  1. Run the reranker example (sketched below):
     RBLN_PROFILER=0 RBLN_KERNEL_MODE=triton VLLM_RBLN_USE_VLLM_MODEL=1 VLLM_USE_V1=1 python examples/experimental/qwen3_reranker.py
  2. Verify that relevance scores are computed for query-document pairs
  3. Expected output: Score list showing document relevance.
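
Likewise, a hedged sketch of the reranking flow, using vLLM's generic cross-encoder scoring API. The model name is illustrative, and Qwen3-Reranker may additionally require model-specific configuration (e.g. hf_overrides) that the actual example script handles.

```python
from vllm import LLM

# Illustrative checkpoint; consult the actual example for required overrides.
llm = LLM(model="Qwen/Qwen3-Reranker-0.6B", task="score")

query = "What is the capital of China?"
documents = ["The capital of China is Beijing.",
             "Gravity is the curvature of spacetime."]

# One relevance score per query-document pair.
outputs = llm.score(query, documents)
print([o.outputs.score for o in outputs])
```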

📋 Checklist

  • PR title follows Conventional Commits format
  • This PR is linked to an existing issue
  • The test method is described, and the expected result is clearly stated
  • Relevant documentation has been updated (if applicable)

💬 Notes

  • The implementation follows the same pattern as upstream vLLM's V1 engine pooling-model support.
  • The warmup logic has been adjusted to properly initialize pooling models.
  • This is an initial implementation; additional model types (e.g., BERT-based models) may be added in future PRs.
  • The pooling function currently runs on CPU without RBLN compilation. Future optimization may include compiling this operation for better performance.

pei0033 self-assigned this Nov 10, 2025
pei0033 added the torch.compile label Nov 10, 2025
Review thread on the _pool() signature:

    num_scheduled_tokens: int,
    num_scheduled_tokens_np: np.ndarray,
    kv_connector_output: Optional[KVConnectorOutput],
) -> ModelRunnerOutput:

Collaborator:

I’m curious whether any modifications were made from the GPU's _pool().

Collaborator:

If there are any changes, I’d appreciate it if you could leave a comment on the corresponding code snippets.

pei0033 (Author):

The current initial implementation runs the pooler on the CPU, in the same manner as the GPU code.
However, certain poolers require significant computation (e.g., an LM head). Moving these onto the RBLN device will likely be necessary for optimization.
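
(To illustrate the cost, here is a hedged, self-contained sketch of a Qwen3-Reranker-style pooler, which projects the last hidden state through the LM head and reads off the "yes"/"no" token logits, so the pooling step itself contains a large matmul. All shapes and token ids below are illustrative.)

```python
import torch

hidden_size, vocab_size = 1024, 151_936  # illustrative Qwen-like sizes
yes_id, no_id = 9454, 2152               # hypothetical "yes"/"no" token ids

lm_head = torch.nn.Linear(hidden_size, vocab_size, bias=False)
last_hidden = torch.randn(1, hidden_size)  # pooled last-token hidden state

logits = lm_head(last_hidden)              # [1, vocab_size]: the heavy matmul
score = torch.softmax(logits[0, [yes_id, no_id]], dim=-1)[0]  # P("yes")
print(float(score))
```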

rebel-jiwoopark (Collaborator):

Can this implementation also cover pooling models other than Qwen3? (such as BERT, ...)

rebel-jiwoopark (Collaborator):

@huijjj It would be great if you could review the warmup-related code.

pei0033 commented Dec 3, 2025

> Can this implementation also cover pooling models other than Qwen3? (such as BERT, ...)

I tested with other models such as BERT (BAAI/bge-base-en-v1.5) and RoBERTa (sentence-transformers/all-roberta-large-v1), but encountered the error below when launching the engine.
It appears that additional implementation work is needed to support encoder-based models.

Error log
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718] EngineCore failed to start.
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718] Traceback (most recent call last):
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 192, in _initialize_kv_caches
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     kv_cache_configs = [
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 193, in <listcomp>
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1112, in get_kv_cache_config
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     unify_hybrid_kv_cache_specs(kv_cache_spec)
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1051, in unify_hybrid_kv_cache_specs
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     if is_kv_cache_type_uniform(kv_cache_spec):
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 762, in is_kv_cache_type_uniform
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718]     _ = kv_cache_spec_values[0].merge(kv_cache_spec_values)
(EngineCore_DP0 pid=373150) [vllm-rbln] ERROR 12-03 03:43:54 core.py:718] IndexError: list index out of range
(EngineCore_DP0 pid=373150) Process EngineCore_DP0:
(EngineCore_DP0 pid=373150) Traceback (most recent call last):
(EngineCore_DP0 pid=373150)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=373150)     self.run()
(EngineCore_DP0 pid=373150)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=373150)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_DP0 pid=373150)     raise e
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=373150)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=373150)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_DP0 pid=373150)     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 192, in _initialize_kv_caches
(EngineCore_DP0 pid=373150)     kv_cache_configs = [
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 193, in <listcomp>
(EngineCore_DP0 pid=373150)     get_kv_cache_config(vllm_config, kv_cache_spec_one_worker,
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1112, in get_kv_cache_config
(EngineCore_DP0 pid=373150)     unify_hybrid_kv_cache_specs(kv_cache_spec)
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1051, in unify_hybrid_kv_cache_specs
(EngineCore_DP0 pid=373150)     if is_kv_cache_type_uniform(kv_cache_spec):
(EngineCore_DP0 pid=373150)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/core/kv_cache_utils.py", line 762, in is_kv_cache_type_uniform
(EngineCore_DP0 pid=373150)     _ = kv_cache_spec_values[0].merge(kv_cache_spec_values)
(EngineCore_DP0 pid=373150) IndexError: list index out of range

rebel-jaehwang (Contributor):

#167 might have fixed this issue.

pei0033 force-pushed the feat/pooling-model branch from 6884ee8 to 7e4ed8a on December 3, 2025 06:21
pei0033 commented Dec 3, 2025

> #167 might have fixed this issue.

Applying #167 resolved the previous error, but I encountered a new one. It seems to be an issue related to the use of EncoderOnlyAttentionBuilder from upstream vLLM.

Error log
(EngineCore_DP0 pid=471500) Process EngineCore_DP0:
(EngineCore_DP0 pid=471500) Traceback (most recent call last):
(EngineCore_DP0 pid=471500)   File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=471500)     self.run()
(EngineCore_DP0 pid=471500)   File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=471500)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 722, in run_engine_core
(EngineCore_DP0 pid=471500)     raise e
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 709, in run_engine_core
(EngineCore_DP0 pid=471500)     engine_core = EngineCoreProc(*args, **kwargs)
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 505, in __init__
(EngineCore_DP0 pid=471500)     super().__init__(vllm_config, executor_class, log_stats,
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 91, in __init__
(EngineCore_DP0 pid=471500)     self._initialize_kv_caches(vllm_config)
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/engine/core.py", line 215, in _initialize_kv_caches
(EngineCore_DP0 pid=471500)     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/v1/executor/abstract.py", line 74, in initialize_from_config
(EngineCore_DP0 pid=471500)     self.collective_rpc("compile_or_warm_up_model")
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 58, in collective_rpc
(EngineCore_DP0 pid=471500)     answer = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/vllm/utils/__init__.py", line 3060, in run_method
(EngineCore_DP0 pid=471500)     return func(*args, **kwargs)
(EngineCore_DP0 pid=471500)   File "/workspace/vllm-rbln/vllm_rbln/v1/worker/rbln_worker.py", line 225, in compile_or_warm_up_model
(EngineCore_DP0 pid=471500)     self.model_runner.warmup_model()
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=471500)     return func(*args, **kwargs)
(EngineCore_DP0 pid=471500)   File "/workspace/vllm-rbln/vllm_rbln/v1/worker/rbln_model_runner.py", line 1433, in warmup_model
(EngineCore_DP0 pid=471500)     self._execute_dummy_requests(dummy_prefill_requests,
(EngineCore_DP0 pid=471500)   File "/workspace/vllm-rbln/vllm_rbln/v1/worker/rbln_model_runner.py", line 1531, in _execute_dummy_requests
(EngineCore_DP0 pid=471500)     self.execute_model(sched_output)
(EngineCore_DP0 pid=471500)   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore_DP0 pid=471500)     return func(*args, **kwargs)
(EngineCore_DP0 pid=471500)   File "/workspace/vllm-rbln/vllm_rbln/v1/worker/rbln_model_runner.py", line 1688, in execute_model
(EngineCore_DP0 pid=471500)     max_query_len) = self._prepare_inputs(scheduler_output)
(EngineCore_DP0 pid=471500)   File "/workspace/vllm-rbln/vllm_rbln/v1/worker/rbln_model_runner.py", line 1007, in _prepare_inputs
(EngineCore_DP0 pid=471500)     attn_metadata_i = builder.build(
(EngineCore_DP0 pid=471500) TypeError: create_encoder_only_attention_backend.<locals>.EncoderOnlyAttentionBuilder.build() got an unexpected keyword argument 'num_prompt_tokens'

rebel-jaehwang (Contributor):

Added a potential fix. Does it fix the issue?

pei0033 commented Dec 3, 2025

> Added a potential fix. Does it fix the issue?

I am still encountering the same error.

(Pdb) builder
<vllm.attention.layers.encoder_only_attention.create_encoder_only_attention_backend.<locals>.EncoderOnlyAttentionBuilder object at 0x7a29ffa96ad0>
(Pdb) isinstance(builder, RBLNFlashAttentionMetadataBuilder)
True

As shown in the debug output above, the EncoderOnlyAttentionBuilder instance inherits from RBLNFlashAttentionMetadataBuilder.
Even if I manually adjust extra_attn_metadata_args, the upstream EncoderOnlyAttentionBuilder.build() calls super().build(), which inevitably requires "num_prompt_tokens" and "positions".
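
(A minimal, self-contained sketch of the clash, with illustrative class bodies rather than actual vLLM code: the RBLN builder extends build() with backend-specific arguments, while the upstream wrapper pins build() to the generic signature, so the two cannot be reconciled from the plugin side alone.)

```python
class RBLNFlashAttentionMetadataBuilder:
    # RBLN variant: build() requires backend-specific extra arguments.
    def build(self, common_attn_metadata, num_prompt_tokens, positions):
        return ("rbln-metadata", num_prompt_tokens, positions)


def create_encoder_only_attention_backend(underlying_builder_cls):
    # Upstream wrapper: build() only knows the generic signature.
    class EncoderOnlyAttentionBuilder(underlying_builder_cls):
        def build(self, common_attn_metadata):
            # Even if the extra kwargs are no longer passed in, this
            # super().build() call still fails, because the RBLN build()
            # requires num_prompt_tokens and positions.
            return super().build(common_attn_metadata)

    return EncoderOnlyAttentionBuilder


builder = create_encoder_only_attention_backend(
    RBLNFlashAttentionMetadataBuilder)()

try:
    # What the RBLN model runner currently does: pass the extra kwargs.
    builder.build(common_attn_metadata=None,
                  num_prompt_tokens=4, positions=None)
except TypeError as e:
    print(e)  # build() got an unexpected keyword argument 'num_prompt_tokens'
```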

rebel-jaehwang (Contributor):

Hmm, then I guess we need to change the signature of AttentionMetadataBuilder.build in the upstream. GDNAttentionMetadataBuilder would have the same problem in the upstream, though I'm not sure if there's a pooling model with GDNA.
