Fix prefix caching is currently not supported with sliding window attention when using qwen1.5 #3377
Conversation
cadedaniel left a comment:
Thanks for the fix!
vllm/config.py (Outdated)
```diff
 def get_sliding_window(self) -> Optional[int]:
-    return getattr(self.hf_config, "sliding_window", None)
+    return (getattr(self.hf_config, "sliding_window", None)
+            if self.get_use_sliding_window() else None)
+
+def get_use_sliding_window(self) -> bool:
+    return getattr(self.hf_config, "use_sliding_window", False)
```
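With this change, `get_sliding_window()` returns `None` for Qwen1.5 checkpoints whose config defines `sliding_window` but leaves `use_sliding_window` at false, so the prefix-caching assertion in model_runner.py no longer fires for them.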
- Can we remove the public function `get_use_sliding_window` unless we need it in the codebase?
- Can we add some docstrings here?
- Can we add a unit test for this functionality?
Of course, I will submit unit tests and some docstrings later.
@cadedaniel I have submitted unit tests and docstrings.
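A test along these lines could look like the sketch below. This is illustrative, not the actual test that landed in the PR; `ModelConfig.__init__` normally pulls the config from the HF hub, so the sketch bypasses it and attaches a stand-in `hf_config` whose fields mirror Qwen1.5's config.json keys.

```python
# Sketch of a unit test for the patched getters; the test name and the
# SimpleNamespace stand-in for the HF config are assumptions.
from types import SimpleNamespace

from vllm.config import ModelConfig


def test_get_sliding_window_respects_use_sliding_window():
    model_config = ModelConfig.__new__(ModelConfig)  # skip heavyweight __init__
    model_config.hf_config = SimpleNamespace(sliding_window=4096,
                                             use_sliding_window=False)

    # Qwen1.5 ships a sliding_window value but disables it by default,
    # so the patched getter should report no sliding window.
    assert model_config.get_sliding_window() is None

    # With the flag enabled, the configured window size is returned.
    model_config.hf_config.use_sliding_window = True
    assert model_config.get_sliding_window() == 4096
```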
…emory (OOM) issues.
Hi @a516072575, sorry, it seems #3373 got in a little faster.

Good catch -- created #3437, can you review?
My model is trained with Qwen1.5. When I use prefix caching for offline batch inference, the program throws `AssertionError: Prefix caching is currently not supported with sliding window attention`. I noticed that in vllm/config.py, the `get_sliding_window` function does not take into account whether `use_sliding_window` is set to true.
`use_sliding_window` is a new configuration option introduced in Qwen1.5 that controls whether a sliding window is used, and it defaults to false. So I added a check on the value of `use_sliding_window` in `get_sliding_window`, similar to the implementation in qwen2.py.
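For reference, the guard in the qwen2.py model code follows roughly this pattern (a paraphrased sketch, not the exact upstream source):

```python
# Paraphrased sketch of the qwen2.py-style guard: the sliding-window size
# is only honored when the config explicitly enables it.
sliding_window = (config.sliding_window
                  if getattr(config, "use_sliding_window", False) else None)
```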
The full AssertionError traceback is as follows:
File "/test.py", line 45, in outputs = model.generate(prompts, sampling_params, prefix_pos=[prefix_pos] * len(prompts)) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 182, in generate return self._run_engine(use_tqdm) File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 208, in _run_engine step_outputs = self.llm_engine.step() File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 838, in step all_outputs = self._run_workers( File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 1041, in _run_workers driver_worker_output = getattr(self.driver_worker, File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 223, in execute_model output = self.model_runner.execute_model(seq_group_metadata_list, File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context return func(*args, **kwargs) File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 571, in execute_model lora_mapping) = self.prepare_input_tensors(seq_group_metadata_list) File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 490, in prepare_input_tensors lora_requests) = self._prepare_prompt(seq_group_metadata_list) File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 193, in _prepare_prompt assert prefix_len == 0, ( AssertionError: Prefix caching is currently not supported with sliding window attention