
Added fix for bge reranker and e2e test to cover this case #862

Merged

joerunde merged 8 commits into torch-spyre:main from AbhishekG4:reranker-fix on Mar 31, 2026

Conversation

@AbhishekG4 Contributor

Description

This change fixes a bug that prevented reranker models with large maximum context lengths from starting up.
vLLM has a check that expects the enable_chunked_prefill flag to be set when max_model_len > max_num_batched_tokens, so this fix enables chunked prefill by default.
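For reference, a minimal sketch of the shape of the fix, assuming it lives where the Spyre platform plugin registers its CLI defaults in vllm_spyre/platform.py:

    # Sketch: enable chunked prefill by default so that vLLM's
    # verify_max_model_len check passes when
    # max_model_len > max_num_batched_tokens.
    if parser is not None:
        parser.set_defaults(enable_chunked_prefill=True)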

Related Issues

Fixes #851

Test Plan

There is already testing infrastructure to test reranker models end-to-end. I added bge-reranker-v2-m3 to the list of default rerankers to test, which covers this case. The new tests fail without the changes, and the server starts successfully with them.

Checklist

  • I have read the contributing guidelines
  • My code follows the project's code style (run bash format.sh)
  • I have added tests for my changes (if applicable)
  • I have updated the documentation (if applicable)
  • My commits include a Signed-off-by: line (DCO compliance)

@github-actions

👋 Hi! Thank you for contributing to vLLM support on Spyre.
Just a reminder: Make sure that your code passes all the linting checks, otherwise your PR won't be able to be merged. To do so, run ./format.sh.
Now you are good to go 🚀.

We also recommend installing prek and configuring it to check your code before every local commit.

Comment thread tests/spyre_util.py Outdated
if isScoring:
model = REFERENCE_MODELS["cross-encoder/stsb-roberta-large"]
return [pytest.param(model, marks=[pytest.mark.scoring], id=model.name)]
bge_reranker_v2_m3_model = REFERENCE_MODELS["BAAI/bge-reranker-v2-m3"]
Collaborator

@maxdebayser Tagging you since I'm not very familiar with encoder model families: Do you think it is ok to use this reranker model in our scoring tests here which use CrossEncoder or should we have a separate test case for this reranker model?

Collaborator

Yes, it's ok from a functional point of view. But the model is a bit bigger than the stsb-roberta-large model and might be a bit slower to run in our tests.

Collaborator

Yeah, the tests are failing because the model is not in our cache. We have very limited resources available for the CI tests. I think it's better to revert the changes in this file.

Contributor Author

Ok. That's interesting, because one test for the bge-reranker passes and the other fails only due to a failed assertion. Is there any other way you would recommend adding a test, or should I leave it at that?

Collaborator

Since we test this model in the pele test suite, I think we can leave it at that.

Comment thread vllm_spyre/platform.py Outdated
if parser is not None:
parser.set_defaults(enable_prefix_caching=True)
parser.set_defaults(max_num_batched_tokens=cls.DEFAULT_CHUNK_SIZE)
parser.set_defaults(enable_chunked_prefill=True)
Collaborator

@tjohnson31415 Mar 24, 2026

We don't currently use the value, but I think it would be clearer to future devs to set enable_chunked_prefill to False down in check_and_update_config since we don't actually support chunked prefill for pooling/encoder models.

Also, please add a comment here and where we set it back to False noting why we are making this change.
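A hedged sketch of what the suggestion could look like (placement inside check_and_update_config is assumed; the attribute names follow the diff discussed further down):

    # Chunked prefill was only enabled by default to get past vLLM's
    # verify_max_model_len check at startup; pooling/encoder models on
    # Spyre don't actually support it, so turn it back off here.
    vllm_config.scheduler_config.enable_chunked_prefill = False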

Contributor Author

addressed

Contributor Author

I see this log pop up confirming the change:

(EngineCore_DP0 pid=631) INFO 03-25 21:28:57 [platform.py:318] Configurations for Spyre. max_model_len=512, max_num_seqs=4, block_size=512, max_num_batched_tokens=2048, enable_chunked_prefill=False, enable_prefix_caching=True

Comment thread vllm_spyre/platform.py Outdated
cache_config = vllm_config.cache_config

# unsetting this config as it was only set to pass vllm scheduler's max_model_len check
vllm_config.enable_chunked_prefill = False
Collaborator

It would be confusing to set this to False when using chunked prefill with a decoder model.

Let's move this down further into a conditional that ensures we are only setting False when not using a decoder model. I'm thinking around line 246.

Contributor Author

Addressed

Collaborator

@AbhishekG4 , can you also disable prefix caching? In the diff in my other comment the right location is shown.

Contributor Author

Done. I wasn't sure whether to add a comment along with it; let me know if I should.

Comment thread vllm_spyre/platform.py Outdated
Collaborator

@maxdebayser left a comment

The scheduler check that is raising the error is:

    def verify_max_model_len(self, max_model_len: int) -> Self:
        if (
            self.max_num_batched_tokens < max_model_len
            and not self.enable_chunked_prefill
        ):
            raise ValueError(
                f"max_num_batched_tokens ({self.max_num_batched_tokens}) is "
                f"smaller than max_model_len ({max_model_len}). "
                "This effectively limits the maximum sequence length to "
                "max_num_batched_tokens and makes vLLM reject longer "
                "sequences. Please increase max_num_batched_tokens or "
                "decrease max_model_len."
            )

Setting enable_chunked_prefill = True makes the check pass but is not correct for embedding and reranker models, which for the most part don't support chunked prefill. For these models, max_num_batched_tokens should be set to max_model_len. vLLM should do that already, but it seems that something in our platform.py logic is preventing it.

@tjohnson31415 Collaborator commented Mar 26, 2026

@maxdebayser

For these models, max_num_batched_tokens should be set to max_model_len. vLLM should do that already, but it seems that something in out platform.py logic is preventing it.

The error arises because we set max_num_batched_tokens in the CLI args for embedding models too (there's no model info to query at that point). vLLM sees that as a user-provided setting, does not override it, and then errors when it is checked against max_model_len. We discussed with @joerunde and decided that setting enable_chunked_prefill was a simple-enough workaround.
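As a concrete illustration, with the platform's default max_num_batched_tokens of 2048 and a hypothetical large-context reranker, the quoted verify_max_model_len check reduces to (sketch only, hypothetical numbers):

    # Default set via CLI args by the platform, seen as user-provided:
    max_num_batched_tokens = 2048
    # Hypothetical max context length for a large reranker:
    max_model_len = 8192
    enable_chunked_prefill = False  # the default before this PR

    # Same condition as in verify_max_model_len above:
    if max_num_batched_tokens < max_model_len and not enable_chunked_prefill:
        raise ValueError("max_num_batched_tokens (2048) is smaller than "
                         "max_model_len (8192). ...")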

@maxdebayser Collaborator

@tjohnson31415 , ok, I see now where this is coming from. In this case I think we need to disable these two flags, because they reach the model runner enabled:

diff --git a/vllm_spyre/platform.py b/vllm_spyre/platform.py
index 9ac698a..27142b3 100644
--- a/vllm_spyre/platform.py
+++ b/vllm_spyre/platform.py
@@ -246,6 +246,7 @@ class SpyrePlatform(Platform):
             scheduler_config.max_num_seqs = max_batch_size
 
             scheduler_config.scheduler_cls = "vllm_spyre.v1.core.scheduler.PoolingSpyreScheduler"
+            vllm_config.scheduler_config.enable_chunked_prefill = False
 
         # Apply model-specific configurations using the registry
         # Only when running on Spyre device (sendnn backend)
@@ -299,6 +300,7 @@ class SpyrePlatform(Platform):
                     model_config.max_model_len * scheduler_config.max_num_seqs
                 )
                 cache_config.block_size = model_config.max_model_len  # ty: ignore[invalid-assignment]
+                vllm_config.cache_config.enable_prefix_caching = False
 
             else:
                 cache_config.block_size = cls._block_size

I also quickly tried an alternative, but I haven't run any test other than the CLI test so far: #870 . I expect more things to fail due to unintended consequences of setting other defaults.

Collaborator

@maxdebayser left a comment

LGTM. Feel free to mark it ready for review.

@AbhishekG4 Contributor Author

@maxdebayser @tjohnson31415 there is another issue: one of the tests (when I run test_spyre_scoring) fails due to exceeding tolerance levels:

FAILED test_spyre_scoring.py::test_serving[sendnn-warmup_shapes0-BAAI/bge-reranker-v2-m3] - assert np.float32(0.05998862) == 0.05717248097...1 ± 0.00114345

@maxdebayser Collaborator

@AbhishekG4 , don't worry about the score not passing, in the pele test suite we switched from a relative tolerance to an absolute tolerance for this model because the relative comparison doesn't work very well for values between 0 and 1. It's better to remove this model from the CI tests.
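To illustrate (a sketch using pytest.approx with the scores from the failure above): near zero, a relative tolerance gives a much narrower window than a fixed absolute one, which is why 0.0599 vs 0.0572 fails at rel=2% (a window of about ±0.00114) but would pass with a modest absolute tolerance:

    import pytest

    expected = 0.05717248  # reference score from the failing assertion
    actual = 0.05998862    # observed score from the failing assertion

    # Relative tolerance: 2% of ~0.057 is only ~0.00114, so this fails.
    assert actual != pytest.approx(expected, rel=0.02)

    # Absolute tolerance: a fixed window suits scores in [0, 1].
    assert actual == pytest.approx(expected, abs=0.005)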

@tjohnson31415 Collaborator commented Mar 26, 2026

@maxdebayser @AbhishekG4: We do still need a test that validates this behavior, and it has to be fairly end-to-end to validate the interaction with the vLLM logic and CLI flags as well. Could we just change the tolerances to be absolute? Or we could avoid adding the model by still using cross-encoder/stsb-roberta-large, but in a separate test where the configuration is such that it throws the same error without the fix from the PR.

@AbhishekG4 Also, note the failing DCO check:
https://github.com/vllm-project/vllm-spyre/pull/862/checks?check_run_id=68794377418
You will need to sign off on your commits. If you click the link, it includes instructions on how to add the signatures.

You will need to run the formatter too.

@maxdebayser Collaborator

@tjohnson31415 @AbhishekG4 , I've come up with a workaround to test this behavior without requiring weight or config files in the CI cache. It's not particularly elegant, but it works: main...maxdebayser:vllm-spyre:bge_test

@joerunde Collaborator

Thanks @maxdebayser! I think that's exactly the sort of test we need: one that tests loading a config without needing to download the whole model.

@AbhishekG4 Contributor Author

Great! I will add this then. I was trying some other routes last week that didn't work.

Abhishek Gautam and others added 5 commits March 30, 2026 15:24
Signed-off-by: Abhishek Gautam <abhishek.gautam@ibm.com>
Signed-off-by: Abhishek Gautam <abhishek.gautam@ibm.com>
…voiding decoder models

Signed-off-by: Abhishek Gautam <abhishek.gautam@ibm.com>
Signed-off-by: Abhishek Gautam <abhishek.gautam@ibm.com>
Signed-off-by: Abhishek Gautam <abhishek.gautam@ibm.com>
Signed-off-by: Abhishek Gautam <abhishek.gautam@ibm.com>
Signed-off-by: Abhishek Gautam <abhishek.gautam@ibm.com>
@AbhishekG4 Contributor Author

@tjohnson31415 @maxdebayser I've added the changes and tested them. Everything is working. DCO seems to pass now too; does anything else catch your eye?

@joerunde Collaborator

@AbhishekG4 you'll need to run the formatter, there's a helpful ./format.sh to run or you can set up prek locally.

Also the ty check is complaining that vllm_config.enable_chunked_prefill is not a real attribute, it should probably be vllm_config.cache_config.enable_prefix_caching.

See the failures here: https://github.com/vllm-project/vllm-spyre/actions/runs/23767093511/job/69249427321?pr=862

Signed-off-by: Abhishek Gautam <abhishek.gautam@ibm.com>
@maxdebayser maxdebayser marked this pull request as ready for review March 31, 2026 12:19
Collaborator

@maxdebayser left a comment

LGTM

@joerunde Collaborator

bot:test
MARKERS="spyre and not decoder"

@joerunde Collaborator

This should be good to merge, just running all the encoder tests on spyre to double check that nothing weird breaks.

@joerunde joerunde merged commit 19b1d6d into torch-spyre:main Mar 31, 2026
14 checks passed
@AbhishekG4 Contributor Author commented Mar 31, 2026

Thank you all for your support and patience; I learned a lot of little things, this being my first time contributing here.

@joerunde Collaborator

Thanks for the contribution @AbhishekG4!

Successfully merging this pull request may close these issues:

[Bug]: Reranker vllm server failing on RHAIIS 3.5 image