Added fix for bge reranker and e2e test to cover this case #862
joerunde merged 8 commits into torch-spyre:main
Conversation
👋 Hi! Thank you for contributing to vLLM support on Spyre. We also recommend installing prek and configuring it to check your code before every local commit.
force-pushed from 4c0ecc8 to 30a8a88
if isScoring:
    model = REFERENCE_MODELS["cross-encoder/stsb-roberta-large"]
    return [pytest.param(model, marks=[pytest.mark.scoring], id=model.name)]
bge_reranker_v2_m3_model = REFERENCE_MODELS["BAAI/bge-reranker-v2-m3"]
@maxdebayser Tagging you since I'm not very familiar with encoder model families: Do you think it is ok to use this reranker model in our scoring tests here which use CrossEncoder or should we have a separate test case for this reranker model?
Yes, it's ok from a functional point of view. But the model is a bit bigger than the stsb-roberta-large model and might be a bit slower to run in our tests.
Yeah, the tests are failing because the model is not in our cache. We have very limited resources available for the CI tests. I think it's better to revert the changes in this file.
Ok. That's interesting, because one test for the bge-reranker passes and the other fails only due to a failed assertion. Is there another way to add a test that you would recommend, or should I leave it at that?
Since we test this model in the pele test suite, I think we can leave it at that.
if parser is not None:
    parser.set_defaults(enable_prefix_caching=True)
    parser.set_defaults(max_num_batched_tokens=cls.DEFAULT_CHUNK_SIZE)
    parser.set_defaults(enable_chunked_prefill=True)
We don't currently use the value, but I think it would be clearer to future devs to set enable_chunked_prefill to False down in check_and_update_config since we don't actually support chunked prefill for pooling/encoder models.
Also, please add a comment here and where we set it back to False noting why we are making this change.
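A rough sketch of the pattern being suggested here, for illustration only: the method names follow the vLLM platform-plugin interface as I understand it, and the attribute paths and the runner_type check are assumptions rather than the exact final code.

```python
class SpyrePlatformSketch:
    """Illustrative only; not the actual vllm-spyre implementation."""

    @classmethod
    def pre_register_and_update(cls, parser=None):
        if parser is not None:
            # Defaulted to True only so vLLM's verify_max_model_len check
            # passes for long-context encoder/reranker models; chunked
            # prefill itself is not used by these models.
            parser.set_defaults(enable_chunked_prefill=True)

    @classmethod
    def check_and_update_config(cls, vllm_config):
        if vllm_config.model_config.runner_type == "pooling":
            # Pooling/encoder models don't support chunked prefill on Spyre,
            # so unset the default applied above once the config is known.
            vllm_config.scheduler_config.enable_chunked_prefill = False
```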
I see this log pop up confirming the change:
(EngineCore_DP0 pid=631) INFO 03-25 21:28:57 [platform.py:318] Configurations for Spyre. max_model_len=512, max_num_seqs=4, block_size=512, max_num_batched_tokens=2048, enable_chunked_prefill=False, enable_prefix_caching=True
force-pushed from 30a8a88 to 50dcc9d
cache_config = vllm_config.cache_config

# unsetting this config as it was only set to pass vllm scheduler's max_model_len check
vllm_config.enable_chunked_prefill = False
It would be confusing to set this to False when using chunked prefill with a decoder model.
Let's move this down further into a conditional that ensures we are only setting False when not using a decoder model. I'm thinking around line 246.
@AbhishekG4 , can you also disable prefix caching? In the diff in my other comment the right location is shown.
Done. I wasn't sure whether to add a comment along with it or not. Let me know if I should.
maxdebayser left a comment
The scheduler check that is raising the error is:
def verify_max_model_len(self, max_model_len: int) -> Self:
    if (
        self.max_num_batched_tokens < max_model_len
        and not self.enable_chunked_prefill
    ):
        raise ValueError(
            f"max_num_batched_tokens ({self.max_num_batched_tokens}) is "
            f"smaller than max_model_len ({max_model_len}). "
            "This effectively limits the maximum sequence length to "
            "max_num_batched_tokens and makes vLLM reject longer "
            "sequences. Please increase max_num_batched_tokens or "
            "decrease max_model_len."
        )
Setting enable_chunked_prefill = True makes the check pass but is not correct for embedding and reranker models, which for the most part don't support chunked prefill. For these models, max_num_batched_tokens should be set to max_model_len. vLLM should do that already, but it seems that something in our platform.py logic is preventing it.
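To make the constraint concrete, here is a toy illustration with assumed numbers (the real context length of bge-reranker-v2-m3 and the Spyre default chunk size may differ):

```python
def verify_max_model_len(max_num_batched_tokens, max_model_len, enable_chunked_prefill):
    # Same condition as the scheduler check quoted above.
    if max_num_batched_tokens < max_model_len and not enable_chunked_prefill:
        raise ValueError("max_num_batched_tokens is smaller than max_model_len")

max_model_len = 8192   # long-context reranker (assumed value)
chunk_size = 2048      # default max_num_batched_tokens (assumed value)

# Fails: chunked prefill is off for pooling models and the token budget is too small.
# verify_max_model_len(chunk_size, max_model_len, enable_chunked_prefill=False)

# Passes without enabling chunked prefill: pin the budget to the model length.
verify_max_model_len(max_model_len, max_model_len, enable_chunked_prefill=False)
```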
The error arises because we set …
@tjohnson31415, ok, I see now where this is coming from. In this case I think we need to disable these two flags because they are reaching the model runner enabled:
diff --git a/vllm_spyre/platform.py b/vllm_spyre/platform.py
index 9ac698a..27142b3 100644
--- a/vllm_spyre/platform.py
+++ b/vllm_spyre/platform.py
@@ -246,6 +246,7 @@ class SpyrePlatform(Platform):
scheduler_config.max_num_seqs = max_batch_size
scheduler_config.scheduler_cls = "vllm_spyre.v1.core.scheduler.PoolingSpyreScheduler"
+ vllm_config.scheduler_config.enable_chunked_prefill = False
# Apply model-specific configurations using the registry
# Only when running on Spyre device (sendnn backend)
@@ -299,6 +300,7 @@ class SpyrePlatform(Platform):
model_config.max_model_len * scheduler_config.max_num_seqs
)
cache_config.block_size = model_config.max_model_len # ty: ignore[invalid-assignment]
+ vllm_config.cache_config.enable_prefix_caching = False
else:
cache_config.block_size = cls._block_size
I also quickly tried an alternative, but I haven't run any test other than the cli test so far: #870 . I expect more things to fail due to unintended consequences of setting other defaults.
@maxdebayser @tjohnson31415 There is another issue: one of the tests (when I run test_spyre_scoring) fails due to exceeding tolerance levels.
@AbhishekG4, don't worry about the score not passing; in the pele test suite we switched from a relative tolerance to an absolute tolerance for this model because the relative comparison doesn't work very well for values between 0 and 1. It's better to remove this model from the CI tests.
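For context on the tolerance point, a quick illustration with made-up numbers of why a relative tolerance is a poor fit for scores close to zero while an absolute tolerance behaves sensibly:

```python
import math

expected, actual = 0.0012, 0.0018  # tiny cross-encoder scores (made up)

# Relative check: |a - b| must be within 5% of the larger value, i.e. ~9e-5,
# so even a negligible absolute difference of 6e-4 fails.
print(math.isclose(actual, expected, rel_tol=0.05))               # False

# Absolute check: the same difference easily clears a 0.01 tolerance.
print(math.isclose(actual, expected, rel_tol=0.0, abs_tol=0.01))  # True
```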
@maxdebayser @AbhishekG4: We do still need a test that validates this behavior, and it has to be pretty end-to-end to cover the interaction with the vLLM logic and with the CLI flags. Could we just change the tolerances to be absolute? Or we could avoid adding the model by using … @AbhishekG4: also note the failing DCO check; you will need to run the formatter too.
@tjohnson31415 @AbhishekG4, I've come up with a workaround to test this behavior without requiring weight or config files in the CI cache. It's not particularly elegant, but it works: main...maxdebayser:vllm-spyre:bge_test
Thanks @maxdebayser! I think that's exactly the sort of test we need: it checks loading a config without needing to download the whole model.
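A hedged sketch of what such a config-only test could look like. This is not the code from the linked branch: the model name comes from this PR, and everything else (the test name, the assumed default chunk size) is an assumption.

```python
import pytest
from transformers import AutoConfig


@pytest.mark.parametrize("model", ["BAAI/bge-reranker-v2-m3"])
def test_long_context_reranker_config(model):
    # AutoConfig.from_pretrained only downloads the small config.json,
    # not the model weights, so it is cheap enough for CI.
    config = AutoConfig.from_pretrained(model)
    max_model_len = getattr(config, "max_position_embeddings", None)
    assert max_model_len is not None

    # The bug fixed here (#851) only triggers when the model's context length
    # exceeds the default max_num_batched_tokens, so make sure this model
    # actually exercises that path.
    default_chunk_size = 2048  # assumed Spyre default
    assert max_model_len > default_chunk_size
```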
Great! I will add this then. Was trying some other routes that didn't work last week |
force-pushed from 69f238c to da18d05
@tjohnson31415 @maxdebayser I've added the changes and tested them. Everything is working. DCO seems to pass now too; does anything else catch your eye?
@AbhishekG4 you'll need to run the formatter; there's a helpful format.sh script for that. See the failures here: https://github.com/vllm-project/vllm-spyre/actions/runs/23767093511/job/69249427321?pr=862
bot:test
This should be good to merge, just running all the encoder tests on spyre to double check that nothing weird breaks.
Thank you all for your support and patience, I learned a lot of little things with this being my first time contributing here. |
Thanks for the contribution @AbhishekG4!
Description
This change fixes a bug preventing reranker models with large max context lengths from starting up.
vLLM has a check that expects the enable_chunked_prefill flag to be set if max_model_len > max_num_batched_tokens. Thus, this fix enables chunked prefill by default.
Related Issues
Fixes #851
Test Plan
There is already testing infrastructure to test reranker models e2e. I simply added bge-reranker-v2-m3 to the list of default rerankers to test, which covers this case. I observe that the new tests fail without the changes and that the server starts successfully with the changes.
Checklist
- Formatted code (bash format.sh)
- Commits include a Signed-off-by: line (DCO compliance)