
Commit 5c46757

RHAIIS 3.3.1 v0.13.0 cherry picks (vllm-project#388)
## Summary

Cherry-pick upstream bug fixes for RHAIIS 3.3.1 onto `rhai/0.13.0`. All fixes are from upstream vLLM `main` and address critical bugs affecting RHAIIS 3.3.0. Other releases (3.2.2, EAx) will be done separately.

**Jira Epic:** [INFERENG-4743](https://issues.redhat.com/browse/INFERENG-4743)

## Cherry-picked commits (chronological order)

| # | Upstream PR | Jira | Summary |
|---|-------------|------|---------|
| 1 | vllm-project#30550 | [INFERENG-5106](https://issues.redhat.com/browse/INFERENG-5106) | Support using chat template as custom score template for reranking models |
| 2 | vllm-project#31406 | [INFERENG-4800](https://issues.redhat.com/browse/INFERENG-4800) | Add encoder-only/cross attention support to Triton Attention backend |
| 3 | vllm-project#34243 | [INFERENG-4746](https://issues.redhat.com/browse/INFERENG-4746) | Fix Llama-4 attn quantization by correctly permuting scales for rope (int8, fp8) |
| 4 | vllm-project#34454 | [INFERENG-5032](https://issues.redhat.com/browse/INFERENG-5032) | Fix structured output in multi-turn GPT-OSS (`content: null` with `json_object`) |
| 5 | vllm-project#34507 | [INFERENG-5038](https://issues.redhat.com/browse/INFERENG-5038) | Fix fused MoE int32 overflow in `stride * offset` for large models |
| 6 | vllm-project#35085 | [INFERENG-5028](https://issues.redhat.com/browse/INFERENG-5028) | Gracefully disable AllReduceFusionPass on GPUs without multicast support |
| 7 | vllm-project#35456 | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Replace assert with ValueError for response_format validation (completions) |
| 8 | vllm-project#35510 | [INFERENG-5035](https://issues.redhat.com/browse/INFERENG-5035) | Add response_format validation to chat completions endpoint |

## Conflict resolutions

<details>
<summary><b>#1 — llama-nemotron-embed / score-template support (vllm-project#30550)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#2 — Triton Attention (vllm-project#31406)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly onto `rhai/0.13.0`.
</details>

<details>
<summary><b>#3 — Llama-4 attn quant (vllm-project#34243)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly. 4 intermediate upstream commits touch `llama4.py`, but the fix targets a self-contained block.
</details>

<details>
<summary><b>#4 — GPT-OSS multi-turn (vllm-project#34454)</b>: Clean cherry-pick, no conflicts</summary>

Applied cleanly despite 3 intermediate upstream commits that refactored imports in `gptoss_reasoning_parser.py`. The fix logic (adding an `eom_token_id` early-exit check in `is_reasoning_end`) was independent of the import changes.
</details>

<details>
<summary><b>#5 — Fused MoE int32 overflow (vllm-project#34507)</b>: Conflicts in 2 files</summary>

**`vllm/model_executor/layers/fused_moe/fused_moe.py`**: ~30 intermediate upstream commits refactored `fused_moe_kernel` with conditional `naive_block_assignment` logic that doesn't exist in `rhai/0.13.0`. Resolved by keeping our simpler code and applying only the int64 cast fix:

- `fused_moe_kernel_gptq_awq`: added `.to(tl.int64)` to the `tl.load()` result
- `fused_moe_kernel`: added `offs_token = offs_token.to(tl.int64)` before `token_mask`

**`tests/kernels/moe/test_moe.py`**: Upstream test changes depend on `make_dummy_moe_config()` from intermediate refactors. Resolved by keeping our existing test code (no test changes).
</details>
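The failure mode behind the int64 cast in #5 is ordinary 32-bit overflow in the kernel's pointer arithmetic. A minimal sketch with illustrative shapes (the numbers below are assumptions, not values taken from the PR):

```python
# 32-bit overflow sketch: once row_index * row_stride exceeds 2**31 - 1, an
# int32 product wraps negative (NumPy emits a RuntimeWarning), producing a
# bogus offset for the kernel's pointer arithmetic. Shapes are illustrative.
import numpy as np

row_index = 160_000   # e.g. a token slot in a very large fused-MoE launch
row_stride = 14_336   # elements per row

exact = row_index * row_stride                        # Python int: exact
wrapped = np.int32(row_index) * np.int32(row_stride)  # wraps past 2**31 - 1

print(exact)          # 2293760000, which does not fit in int32
print(int(wrapped))   # -2001207296 -> out-of-bounds address in the kernel
```

Casting either operand to int64 before the multiply, as the backported fix does with `.to(tl.int64)`, keeps the product exact.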
<details>
<summary><b>#6 — AllReduceFusionPass multicast (vllm-project#35085)</b>: Conflict due to file rename + API change</summary>

Upstream moved `collective_fusion.py` → `compilation/passes/fusion/allreduce_rms_fusion.py` and changed the API from `trtllm_create_ipc_workspace_for_all_reduce_fusion()` to `create_allreduce_fusion_workspace()`. Resolved by applying the try/except wrapper around our existing `trtllm_create_ipc_workspace_for_all_reduce_fusion()` call in `collective_fusion.py`. The error-handling logic (catching `RuntimeError` with "multicast" in the message, logging a warning, and returning early) is identical to upstream.
</details>
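The shape of that backported error handling, as a self-contained sketch (the workspace call is stubbed here so the fallback path is runnable on its own; the real function lives in `collective_fusion.py`):

```python
# Sketch of the graceful fallback described above. The stub stands in for
# trtllm_create_ipc_workspace_for_all_reduce_fusion(), which raises a
# RuntimeError on GPUs without multicast support (e.g. the H200 case in the
# test plan). This is an illustration, not the verbatim vLLM code.
import logging

logger = logging.getLogger("collective_fusion_sketch")


def trtllm_create_ipc_workspace_for_all_reduce_fusion():
    raise RuntimeError("multicast is not supported on this device")  # stub


def init_allreduce_fusion_workspace():
    try:
        return trtllm_create_ipc_workspace_for_all_reduce_fusion()
    except RuntimeError as err:
        if "multicast" in str(err):
            # Graceful degradation: log and skip the fusion pass instead of
            # crashing model load.
            logger.warning("Multicast unsupported; disabling AllReduceFusionPass")
            return None
        raise


init_allreduce_fusion_workspace()
```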
<details>
<summary><b>#7 — response_format validation for completions (vllm-project#35456)</b>: Conflict due to file restructuring</summary>

Upstream split `protocol.py` into `completion/protocol.py` and `chat_completion/protocol.py`. Our branch still has the monolithic `protocol.py`. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/completion/protocol.py`
- Manually adding the `validate_response_format` model_validator to `CompletionRequest` in our `protocol.py`
- Using `ValueError` instead of upstream's `VLLMValidationError` (which doesn't exist in our branch; `ValueError` is already handled as 400 Bad Request in `serving_engine.py`)
- Applying the upstream test additions, which landed cleanly in `test_completion_error.py`
</details>

<details>
<summary><b>#8 — response_format validation for chat completions (vllm-project#35510)</b>: Conflict due to file restructuring</summary>

Same file restructuring issue as #7. Resolved by:

- Removing the non-existent `vllm/entrypoints/openai/chat_completion/protocol.py`
- Manually adding the `validate_response_format` model_validator to `ChatCompletionRequest` in our `protocol.py`
- Accepting only the `test_json_schema_response_format_missing_schema` test from the conflict (discarding ~140 lines of intermediate upstream tests that reference paths that don't exist in our branch)
</details>

## Test plan

- [ ] Verify `llama-nemotron-embed-1b-v2` works correctly with the backported score-template / bidirectional model support
- [ ] Verify Llama-4 quantized models load correctly with int8/fp8 attention quantization
- [ ] Verify GPT-OSS multi-turn chat with `json_object` response_format returns valid content
- [ ] Verify large MoE models (e.g. Qwen3.5-397B) don't crash with int32 overflow
- [ ] Verify MoE model loading on H200 GPUs (without multicast) gracefully falls back
- [ ] Verify `response_format: {type: "json_schema"}` without a `json_schema` field returns 400 (not 500) for both `/v1/completions` and `/v1/chat/completions`
- [ ] Verify encoder models (e.g. Whisper) work with the Triton attention backend on ROCm
2 parents (f4339c5 + 761d079) · commit 5c46757

34 files changed

Lines changed: 1473 additions & 67 deletions

docs/models/supported_models.md

Lines changed: 9 additions & 2 deletions
```diff
@@ -489,6 +489,7 @@ These models primarily support the [`LLM.embed`](./pooling_models.md#llmembed) A
 | `GteNewModel`<sup>C</sup> | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-base`, etc. | | |
 | `ModernBertModel`<sup>C</sup> | ModernBERT-based | `Alibaba-NLP/gte-modernbert-base`, etc. | | |
 | `NomicBertModel`<sup>C</sup> | Nomic BERT | `nomic-ai/nomic-embed-text-v1`, `nomic-ai/nomic-embed-text-v2-moe`, `Snowflake/snowflake-arctic-embed-m-long`, etc. | | |
+| `LlamaBidirectionalModel`<sup>C</sup> | Llama-based with bidirectional attention | `nvidia/llama-nemotron-embed-1b-v2`, etc. | ✅︎ | ✅︎ |
 | `LlamaModel`<sup>C</sup>, `LlamaForCausalLM`<sup>C</sup>, `MistralModel`<sup>C</sup>, etc. | Llama-based | `intfloat/e5-mistral-7b-instruct`, etc. | ✅︎ | ✅︎ |
 | `Qwen2Model`<sup>C</sup>, `Qwen2ForCausalLM`<sup>C</sup> | Qwen2-based | `ssmits/Qwen2-7B-Instruct-embed-base` (see note), `Alibaba-NLP/gte-Qwen2-7B-instruct` (see note), etc. | ✅︎ | ✅︎ |
 | `Qwen3Model`<sup>C</sup>, `Qwen3ForCausalLM`<sup>C</sup> | Qwen3-based | `Qwen/Qwen3-Embedding-0.6B`, etc. | ✅︎ | ✅︎ |
@@ -542,8 +543,9 @@ These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) A
 | `BertForSequenceClassification` | BERT-based | `cross-encoder/ms-marco-MiniLM-L-6-v2`, etc. | | |
 | `GemmaForSequenceClassification` | Gemma-based | `BAAI/bge-reranker-v2-gemma` (see note), etc. | ✅︎ | ✅︎ |
 | `GteNewForSequenceClassification` | mGTE-TRM (see note) | `Alibaba-NLP/gte-multilingual-reranker-base`, etc. | | |
-| `Qwen2ForSequenceClassification` | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ | ✅︎ |
-| `Qwen3ForSequenceClassification` | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ | ✅︎ |
+| `LlamaBidirectionalForSequenceClassification`<sup>C</sup> | Llama-based with bidirectional attention | `nvidia/llama-nemotron-rerank-1b-v2` (see note), etc. | ✅︎ | ✅︎ |
+| `Qwen2ForSequenceClassification`<sup>C</sup> | Qwen2-based | `mixedbread-ai/mxbai-rerank-base-v2` (see note), etc. | ✅︎ | ✅︎ |
+| `Qwen3ForSequenceClassification`<sup>C</sup> | Qwen3-based | `tomaarsen/Qwen3-Reranker-0.6B-seq-cls`, `Qwen/Qwen3-Reranker-0.6B` (see note), etc. | ✅︎ | ✅︎ |
 | `RobertaForSequenceClassification` | RoBERTa-based | `cross-encoder/quora-roberta-base`, etc. | | |
 | `XLMRobertaForSequenceClassification` | XLM-RoBERTa-based | `BAAI/bge-reranker-v2-m3`, etc. | | |
 | `*Model`<sup>C</sup>, `*ForCausalLM`<sup>C</sup>, etc. | Generative models | N/A | \* | \* |
@@ -561,6 +563,11 @@ These models primarily support the [`LLM.score`](./pooling_models.md#llmscore) A
 !!! note
     The second-generation GTE model (mGTE-TRM) is named `NewForSequenceClassification`. The name `NewForSequenceClassification` is too generic, you should set `--hf-overrides '{"architectures": ["GteNewForSequenceClassification"]}'` to specify the use of the `GteNewForSequenceClassification` architecture.
 
+!!! note
+    `nvidia/llama-nemotron-rerank-1b-v2` requires a specific prompt format to work correctly.
+
+    Examples: [offline_using_template.py](../../examples/pooling/score/offline_using_template.py) [online_using_template.py](../../examples/pooling/score/online_using_template.py)
+
 !!! note
     Load the official original `mxbai-rerank-v2` by using the following command.
 
```
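As a quick smoke test of the newly listed embedding architecture, something like the following should work (a sketch assuming the `LLM.embed` pooling API referenced in the table's section; the flags mirror the rerank examples added later in this commit, and this snippet is not part of the commit itself):

```python
# Hedged sketch: load the newly supported bidirectional embedding model via
# the pooling runner and embed one sentence. Adjust for your environment.
from vllm import LLM

llm = LLM(
    model="nvidia/llama-nemotron-embed-1b-v2",
    runner="pooling",
    trust_remote_code=True,
)

outputs = llm.embed(["how much protein should a female eat?"])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```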
docs/serving/openai_compatible_server.md

Lines changed: 15 additions & 0 deletions
```diff
@@ -642,6 +642,21 @@ You can find the documentation for cross encoder models at [sbert.net](https://w
 
 Code example: [examples/pooling/score/openai_cross_encoder_score.py](../../examples/pooling/score/openai_cross_encoder_score.py)
 
+#### Score Template
+
+Some scoring models require a specific prompt format to work correctly. You can specify a custom score template using the `--chat-template` parameter (see [Chat Template](#chat-template)).
+
+Score templates are supported for **cross-encoder** models only. If you are using an **embedding** model for scoring, vLLM does not apply a score template.
+
+Like chat templates, the score template receives a `messages` list. For scoring, each message has a `role` attribute, either `"query"` or `"document"`. For the usual kind of point-wise cross-encoder, you can expect exactly two messages: one query and one document. To access the query and document content, use Jinja's `selectattr` filter:
+
+- **Query**: `{{ (messages | selectattr("role", "eq", "query") | first).content }}`
+- **Document**: `{{ (messages | selectattr("role", "eq", "document") | first).content }}`
+
+This approach is more robust than index-based access (`messages[0]`, `messages[1]`) because it selects messages by their semantic role. It also avoids assumptions about message ordering if additional message types are added to `messages` in the future.
+
+Example template file: [examples/pooling/score/template/nemotron-rerank.jinja](../../examples/pooling/score/template/nemotron-rerank.jinja)
+
 #### Single inference
 
 You can pass a string to both `text_1` and `text_2`, forming a single sentence pair.
```
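To see what the documented `selectattr` pattern produces, here is a standalone rendering of the example template using the `jinja2` package. Rendering the template directly like this is an assumption made purely for illustration; vLLM applies the template server-side:

```python
# Render the nemotron-rerank.jinja template body outside vLLM to show how the
# selectattr filter resolves the query/document messages.
from jinja2 import Template

template = Template(
    'question:{{ (messages | selectattr("role", "eq", "query") | first).content }}\n'
    "\n"
    'passage:{{ (messages | selectattr("role", "eq", "document") | first).content }}'
)

messages = [
    {"role": "query", "content": "how much protein should a female eat?"},
    {"role": "document", "content": "The CDC's average requirement is 46 grams per day."},
]

print(template.render(messages=messages))
# question:how much protein should a female eat?
#
# passage:The CDC's average requirement is 46 grams per day.
```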
examples/pooling/score/offline_using_template.py

Lines changed: 27 additions & 0 deletions

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
# ruff: noqa: E501
from pathlib import Path

from vllm import LLM

model_name = "nvidia/llama-nemotron-rerank-1b-v2"

# Path to template file
template_path = Path(__file__).parent / "template" / "nemotron-rerank.jinja"
chat_template = template_path.read_text()

llm = LLM(model=model_name, runner="pooling", trust_remote_code=True)

query = "how much protein should a female eat?"
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
    "Calorie intake should not fall below 1,200 a day in women or 1,500 a day in men, except under the supervision of a health professional.",
]

outputs = llm.score(query, documents, chat_template=chat_template)

print("-" * 30)
print([output.outputs.score for output in outputs])
print("-" * 30)
```
examples/pooling/score/online_using_template.py

Lines changed: 46 additions & 0 deletions

```python
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
# ruff: noqa: E501
"""
Example of using the rerank API with template.

run:
vllm serve nvidia/llama-nemotron-rerank-1b-v2 --runner pooling --trust-remote-code --chat-template examples/pooling/score/template/nemotron-rerank.jinja
"""

import json

import requests

url = "http://127.0.0.1:8000/rerank"

headers = {"accept": "application/json", "Content-Type": "application/json"}

query = "how much protein should a female eat?"
documents = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day. But, as you can see from this chart, you'll need to increase that if you're expecting or training for a marathon. Check out the chart below to see how much protein you should be eating each day.",
    "Definition of summit for English Language Learners. : 1 the highest point of a mountain : the top of a mountain. : 2 the highest level. : 3 a meeting or series of meetings between the leaders of two or more governments.",
    "Calorie intake should not fall below 1,200 a day in women or 1,500 a day in men, except under the supervision of a health professional.",
]

data = {
    "model": "nvidia/llama-nemotron-rerank-1b-v2",
    "query": query,
    "documents": documents,
}


def main():
    response = requests.post(url, headers=headers, json=data)

    # Check the response
    if response.status_code == 200:
        print("Request successful!")
        print(json.dumps(response.json(), indent=2))
    else:
        print(f"Request failed with status code: {response.status_code}")
        print(response.text)


if __name__ == "__main__":
    main()
```
examples/pooling/score/template/nemotron-rerank.jinja

Lines changed: 3 additions & 0 deletions

```jinja
question:{{ (messages | selectattr("role", "eq", "query") | first).content }}

passage:{{ (messages | selectattr("role", "eq", "document") | first).content }}
```

tests/entrypoints/openai/test_chat_error.py

Lines changed: 12 additions & 0 deletions
```diff
@@ -225,3 +225,15 @@ async def mock_generate(*args, **kwargs):
         f"Expected error message in chunks: {chunks}"
     )
     assert chunks[-1] == "data: [DONE]\n\n"
+
+
+def test_json_schema_response_format_missing_schema():
+    """When response_format type is 'json_schema' but the json_schema field
+    is not provided, request construction should raise a validation error
+    so the API returns 400 instead of 500."""
+    with pytest.raises(Exception, match="json_schema.*must be provided"):
+        ChatCompletionRequest(
+            model=MODEL_NAME,
+            messages=[{"role": "user", "content": "hello"}],
+            response_format={"type": "json_schema"},
+        )
```
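The same contract can be checked end-to-end against a running server. A hedged sketch, where the URL and model name are deployment-specific placeholders:

```python
# End-to-end check of the behavior asserted above: response_format of type
# "json_schema" without the json_schema field should yield 400, not 500.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "<served-model-name>",
        "messages": [{"role": "user", "content": "hello"}],
        "response_format": {"type": "json_schema"},  # json_schema field omitted
    },
)
print(resp.status_code)  # expected: 400 Bad Request after the fix
print(resp.text)
```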

tests/entrypoints/openai/test_completion_error.py

Lines changed: 33 additions & 0 deletions
```diff
@@ -214,3 +214,36 @@ async def mock_generate(*args, **kwargs):
         f"Expected error message in chunks: {chunks}"
     )
     assert chunks[-1] == "data: [DONE]\n\n"
+
+
+def test_json_schema_response_format_missing_schema():
+    """When response_format type is 'json_schema' but the json_schema field
+    is not provided, request construction should raise a validation error
+    so the API returns 400 instead of 500."""
+    with pytest.raises(Exception, match="json_schema.*must be provided"):
+        CompletionRequest(
+            model=MODEL_NAME,
+            prompt="Test prompt",
+            max_tokens=10,
+            response_format={"type": "json_schema"},
+        )
+
+
+def test_negative_prompt_token_ids_nested():
+    """Negative token IDs in prompt (nested list) should raise validation error."""
+    with pytest.raises(Exception, match="greater than or equal to 0"):
+        CompletionRequest(
+            model=MODEL_NAME,
+            prompt=[[-1]],
+            max_tokens=10,
+        )
+
+
+def test_negative_prompt_token_ids_flat():
+    """Negative token IDs in prompt (flat list) should raise validation error."""
+    with pytest.raises(Exception, match="greater than or equal to 0"):
+        CompletionRequest(
+            model=MODEL_NAME,
+            prompt=[-1],
+            max_tokens=10,
+        )
```
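For orientation, a minimal, hypothetical reconstruction of the validator these tests exercise. The validator name and the ValueError-instead-of-`VLLMValidationError` choice come from the PR description; the field set is trimmed for illustration and is not the real `CompletionRequest`:

```python
# Hedged sketch of the backported model_validator (pydantic v2 style).
from typing import Any, Optional

from pydantic import BaseModel, model_validator


class CompletionRequest(BaseModel):
    model: str
    prompt: str
    max_tokens: int = 16
    response_format: Optional[dict[str, Any]] = None

    @model_validator(mode="after")
    def validate_response_format(self) -> "CompletionRequest":
        rf = self.response_format
        if rf is not None and rf.get("type") == "json_schema" and not rf.get("json_schema"):
            # A plain ValueError surfaces as a pydantic validation error, which
            # serving_engine.py already maps to HTTP 400 rather than a 500.
            raise ValueError(
                "json_schema must be provided when response_format type is 'json_schema'"
            )
        return self


# CompletionRequest(model="m", prompt="p", response_format={"type": "json_schema"})
# -> pydantic.ValidationError mentioning "json_schema must be provided"
```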

tests/entrypoints/openai/test_gptoss_structural_tags_integration.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -25,6 +25,7 @@ def mock_tokenizer(self):
         """Create a mock tokenizer."""
         tokenizer = Mock()
         tokenizer.encode = Mock(return_value=[1, 2, 3, 4, 5])
+        tokenizer.vocab = {"<|end|>": 6}
         return tokenizer
 
     @pytest.fixture
```
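For intuition, a hedged sketch of what the new vocab entry enables in the parser under test, simplified from the description of #4 above; the class is a stand-in, not the verbatim `gptoss_reasoning_parser.py` code:

```python
# The parser can resolve the end-of-message token id from tokenizer.vocab and
# early-exit in is_reasoning_end once that token appears, so structured-output
# constraints (e.g. json_object) apply to the final content turn.
from unittest.mock import Mock

tokenizer = Mock()
tokenizer.vocab = {"<|end|>": 6}


class StandInReasoningParser:
    def __init__(self, tokenizer) -> None:
        self.eom_token_id = tokenizer.vocab["<|end|>"]

    def is_reasoning_end(self, input_ids: list[int]) -> bool:
        # Early exit: once <|end|> has been generated, reasoning is finished.
        return self.eom_token_id in input_ids


parser = StandInReasoningParser(tokenizer)
assert parser.is_reasoning_end([1, 2, 6])
assert not parser.is_reasoning_end([1, 2, 3])
```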
