Added fix for bge reranker and e2e test to cover this case #862
joerunde merged 8 commits into torch-spyre:main
Conversation
👋 Hi! Thank you for contributing to vLLM support on Spyre. We also recommend installing prek and configuring it to check your code before every local commit.
force-pushed from 4c0ecc8 to 30a8a88
if isScoring:
    model = REFERENCE_MODELS["cross-encoder/stsb-roberta-large"]
    return [pytest.param(model, marks=[pytest.mark.scoring], id=model.name)]
bge_reranker_v2_m3_model = REFERENCE_MODELS["BAAI/bge-reranker-v2-m3"]
@maxdebayser Tagging you since I'm not very familiar with encoder model families: Do you think it is ok to use this reranker model in our scoring tests here which use CrossEncoder or should we have a separate test case for this reranker model?
Yes, it's ok from a functional point of view. But the model is a bit bigger than the stsb-roberta-large model and might be a bit slower to run in our tests.
Yeah, the tests are failing because the model is not in our cache. We have very limited resources available for the CI tests. I think it's better to revert the changes in this file.
Ok. That's interesting, because one test for the bge-reranker passes and the other fails only due to a failed assertion. Is there another way to add a test that you would recommend, or should I leave it at that?
Since we test this model in the pele test suite, I think we can leave it at that.
if parser is not None:
    parser.set_defaults(enable_prefix_caching=True)
    parser.set_defaults(max_num_batched_tokens=cls.DEFAULT_CHUNK_SIZE)
    parser.set_defaults(enable_chunked_prefill=True)
We don't currently use the value, but I think it would be clearer to future devs to set enable_chunked_prefill to False down in check_and_update_config since we don't actually support chunked prefill for pooling/encoder models.
Also, please add a comment here and where we set it back to False noting why we are making this change.
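A rough sketch of the pattern being suggested here, for illustration only: the method names follow the vLLM platform-plugin interface as I understand it, and the attribute paths and the runner_type check are assumptions rather than the exact final code.

```python
class SpyrePlatformSketch:
    """Illustrative only; not the actual vllm-spyre implementation."""

    @classmethod
    def pre_register_and_update(cls, parser=None):
        if parser is not None:
            # Defaulted to True only so vLLM's verify_max_model_len check
            # passes for long-context encoder/reranker models; chunked
            # prefill itself is not used by these models.
            parser.set_defaults(enable_chunked_prefill=True)

    @classmethod
    def check_and_update_config(cls, vllm_config):
        if vllm_config.model_config.runner_type == "pooling":
            # Pooling/encoder models don't support chunked prefill on Spyre,
            # so unset the default applied above once the config is known.
            vllm_config.scheduler_config.enable_chunked_prefill = False
```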
I see this log pop up confirming the change:
(EngineCore_DP0 pid=631) INFO 03-25 21:28:57 [platform.py:318] Configurations for Spyre. max_model_len=512, max_num_seqs=4, block_size=512, max_num_batched_tokens=2048, enable_chunked_prefill=False, enable_prefix_caching=True
force-pushed from 30a8a88 to 50dcc9d
cache_config = vllm_config.cache_config

# unsetting this config as it was only set to pass vllm scheduler's max_model_len check
vllm_config.enable_chunked_prefill = False
It would be confusing to set this to False when using chunked prefill with a decoder model.
Let's move this down further into a conditional that ensures we are only setting False when not using a decoder model. I'm thinking around line 246.
@AbhishekG4 , can you also disable prefix caching? In the diff in my other comment the right location is shown.
Done. I wasn't sure whether to add a comment along with it or not. Let me know if I should.
maxdebayser left a comment
The scheduler check that is raising the error is:
def verify_max_model_len(self, max_model_len: int) -> Self:
    if (
        self.max_num_batched_tokens < max_model_len
        and not self.enable_chunked_prefill
    ):
        raise ValueError(
            f"max_num_batched_tokens ({self.max_num_batched_tokens}) is "
            f"smaller than max_model_len ({max_model_len}). "
            "This effectively limits the maximum sequence length to "
            "max_num_batched_tokens and makes vLLM reject longer "
            "sequences. Please increase max_num_batched_tokens or "
            "decrease max_model_len."
        )
Setting enable_chunked_prefill = True makes the check pass but is not correct for embedding and reranker models, which for the most part don't support chunked prefill. For these models, max_num_batched_tokens should be set to max_model_len. vLLM should do that already, but it seems that something in our platform.py logic is preventing it.
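To make the constraint concrete, here is a toy illustration with assumed numbers (the real context length of bge-reranker-v2-m3 and the Spyre default chunk size may differ):

```python
def verify_max_model_len(max_num_batched_tokens, max_model_len, enable_chunked_prefill):
    # Same condition as the scheduler check quoted above.
    if max_num_batched_tokens < max_model_len and not enable_chunked_prefill:
        raise ValueError("max_num_batched_tokens is smaller than max_model_len")

max_model_len = 8192   # long-context reranker (assumed value)
chunk_size = 2048      # default max_num_batched_tokens (assumed value)

# Fails: chunked prefill is off for pooling models and the token budget is too small.
# verify_max_model_len(chunk_size, max_model_len, enable_chunked_prefill=False)

# Passes without enabling chunked prefill: pin the budget to the model length.
verify_max_model_len(max_model_len, max_model_len, enable_chunked_prefill=False)
```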
The error arises because we set …
@tjohnson31415, ok, I see now where this is coming from. In this case I think we need to disable these two flags because they are reaching the model runner enabled:
diff --git a/vllm_spyre/platform.py b/vllm_spyre/platform.py
index 9ac698a..27142b3 100644
--- a/vllm_spyre/platform.py
+++ b/vllm_spyre/platform.py
@@ -246,6 +246,7 @@ class SpyrePlatform(Platform):
scheduler_config.max_num_seqs = max_batch_size
scheduler_config.scheduler_cls = "vllm_spyre.v1.core.scheduler.PoolingSpyreScheduler"
+ vllm_config.scheduler_config.enable_chunked_prefill = False
# Apply model-specific configurations using the registry
# Only when running on Spyre device (sendnn backend)
@@ -299,6 +300,7 @@ class SpyrePlatform(Platform):
model_config.max_model_len * scheduler_config.max_num_seqs
)
cache_config.block_size = model_config.max_model_len # ty: ignore[invalid-assignment]
+ vllm_config.cache_config.enable_prefix_caching = False
else:
cache_config.block_size = cls._block_size
I also quickly tried an alternative, but I haven't run any test other than the cli test so far: #870 . I expect more things to fail due to unintended consequences of setting other defaults.
@maxdebayser @tjohnson31415 There is another issue: one of the tests (when I run test_spyre_scoring) fails due to exceeding tolerance levels.
@AbhishekG4, don't worry about the score not passing; in the pele test suite we switched from a relative tolerance to an absolute tolerance for this model because the relative comparison doesn't work very well for values between 0 and 1. It's better to remove this model from the CI tests.
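For context on the tolerance point, a quick illustration with made-up numbers of why a relative tolerance is a poor fit for scores close to zero while an absolute tolerance behaves sensibly:

```python
import math

expected, actual = 0.0012, 0.0018  # tiny cross-encoder scores (made up)

# Relative check: |a - b| must be within 5% of the larger value, i.e. ~9e-5,
# so even a negligible absolute difference of 6e-4 fails.
print(math.isclose(actual, expected, rel_tol=0.05))               # False

# Absolute check: the same difference easily clears a 0.01 tolerance.
print(math.isclose(actual, expected, rel_tol=0.0, abs_tol=0.01))  # True
```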
@maxdebayser @AbhishekG4: We do still need a test that validates this behavior, and it has to be pretty end-to-end to cover the interaction with the vLLM logic and with the CLI flags. Could we just change the tolerances to be absolute? Or we could avoid adding the model by using … @AbhishekG4: also note the failing DCO check; you will need to run the formatter too.
@tjohnson31415 @AbhishekG4, I've come up with a workaround to test this behavior without requiring weight or config files in the CI cache. It's not particularly elegant, but it works: main...maxdebayser:vllm-spyre:bge_test
Thanks @maxdebayser! I think that's exactly the sort of test we need: it checks loading a config without needing to download the whole model.
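A hedged sketch of what such a config-only test could look like. This is not the code from the linked branch: the model name comes from this PR, and everything else (the test name, the assumed default chunk size) is an assumption.

```python
import pytest
from transformers import AutoConfig


@pytest.mark.parametrize("model", ["BAAI/bge-reranker-v2-m3"])
def test_long_context_reranker_config(model):
    # AutoConfig.from_pretrained only downloads the small config.json,
    # not the model weights, so it is cheap enough for CI.
    config = AutoConfig.from_pretrained(model)
    max_model_len = getattr(config, "max_position_embeddings", None)
    assert max_model_len is not None

    # The bug fixed here (#851) only triggers when the model's context length
    # exceeds the default max_num_batched_tokens, so make sure this model
    # actually exercises that path.
    default_chunk_size = 2048  # assumed Spyre default
    assert max_model_len > default_chunk_size
```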
Great! I will add this then. Was trying some other routes that didn't work last week |
force-pushed from 69f238c to da18d05
@tjohnson31415 @maxdebayser I've added the changes and tested them. Everything is working. DCO seems to pass now too; does anything else catch your eye?
@AbhishekG4 you'll need to run the formatter; there's a helpful format.sh script for that. See the failures here: https://github.com/vllm-project/vllm-spyre/actions/runs/23767093511/job/69249427321?pr=862
bot:test
This should be good to merge, just running all the encoder tests on spyre to double check that nothing weird breaks.
Thank you all for your support and patience, I learned a lot of little things with this being my first time contributing here. |
Thanks for the contribution @AbhishekG4!
Description
This change fixes a bug preventing reranker models with large max context lengths from starting up.
vLLM has a check that expects the enable_chunked_prefill flag to be set if max_model_len > max_num_batched_tokens. Thus, this fix enables chunked prefill by default.
Related Issues
Fixes #851
Test Plan
There is already testing infrastructure to test reranker models e2e. I simply added bge-reranker-v2-m3 to the list of default rerankers to test, which covers this case. I observe that the new tests fail without the changes and that the server starts successfully with the changes.
Checklist
- Formatted code (bash format.sh)
- Commits include a Signed-off-by: line (DCO compliance)