[Model] Add Gemma 2 #5908
Conversation
@WoosukKwon running with …

@robertgshaw2-neuralmagic Yes I did a similar thing in …

The only issue with what you have done is that if you set … (see line 137 in df2c007), …

If we do not want this as the default behavior, we could instead let the user know about this flag in the …

@robertgshaw2-neuralmagic Thanks for letting me know! Updated the PR. PTAL.
| "layer, vLLM currently ignores it and uses global attention " | ||
| "for all layers. This might affect the model's behavior when " | ||
| "the context length is larger than the sliding window size " | ||
| f"({self.hf_text_config.sliding_window}).") |
Changes look good.
I think this warning should be updated to say something like: "Gemma 2 uses sliding window attention for every odd layer, which is not supported by vLLM. Disabling sliding window and capping max length to the sliding window size."
@robertgshaw2-neuralmagic Oh maybe I misunderstood the change here. My intention was to enable the full (8K) context length with global attention for all layers.
Okay, sorry for the confusion; the way you had it before will do this :)
Setting `disable_sliding_window=True` will cap the max length to the sliding window size.
I think either option is reasonable.
- Capping at 4K is more conservative w.r.t. model accuracy
- Capping at 8K is less conservative w.r.t. model accuracy
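To make the trade-off concrete, here is a minimal sketch of the two options (hypothetical function and argument names, not vLLM's actual config API):

```python
# Sketch of the two options discussed above. The names here are
# illustrative only; vLLM's real config plumbing differs.

def resolve_max_len(requested_max_len: int,
                    sliding_window: int,
                    disable_sliding_window: bool) -> int:
    """Pick the usable context length for a checkpoint that interleaves
    sliding-window layers the engine cannot run natively."""
    if disable_sliding_window:
        # Conservative: every layer only behaves as trained within the
        # window, so cap the context at the window size (4K for Gemma 2).
        return min(requested_max_len, sliding_window)
    # Permissive: run global attention on every layer and allow the full
    # trained context (8K), accepting a deviation from the original model
    # on the interleaved layers.
    return requested_max_len

assert resolve_max_len(8192, 4096, disable_sliding_window=True) == 4096
assert resolve_max_len(8192, 4096, disable_sliding_window=False) == 8192
```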
@robertgshaw2-neuralmagic Hmm... OK let's use 4K context length for now and see if people want 8K context length despite the difference from the original model.
@robertgshaw2-neuralmagic Updated the warning msg. PTAL!
LGTM
```python
self.weight = nn.Parameter(torch.zeros(hidden_size))
self.variance_epsilon = eps

def forward_native(
```
We should try decorating this with `@torch.compile`, similar to what we do in Command R.
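For context, Gemma's RMSNorm variant stores its weight as an offset from 1 (hence the `torch.zeros(hidden_size)` initialization above). Below is a minimal NumPy sketch of the `forward_native` math; the real vLLM implementation operates on torch tensors and is what the `@torch.compile` suggestion would decorate:

```python
import numpy as np

def gemma_rmsnorm_native(x: np.ndarray, weight: np.ndarray,
                         eps: float = 1e-6) -> np.ndarray:
    # RMS-normalize over the hidden dimension.
    variance = np.mean(x ** 2, axis=-1, keepdims=True)
    x_normed = x / np.sqrt(variance + eps)
    # Gemma scales by (1 + weight), so a zero-initialized weight is the
    # identity scale. This is why the snippet above uses torch.zeros().
    return x_normed * (1.0 + weight)

rng = np.random.default_rng(0)
hidden_size = 8
x = rng.standard_normal((2, hidden_size))
out = gemma_rmsnorm_native(x, np.zeros(hidden_size))
# With weight == 0, each row comes out with (approximately) unit RMS.
assert np.allclose(np.sqrt(np.mean(out ** 2, axis=-1)), 1.0, atol=1e-3)
```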
Yeah I was thinking about it or writing a CUDA kernel. Let's discuss this in another PR?
Sounds good -- agree it should be in a different PR :)
For the 27b model, logit soft-capping seems to be very important: huggingface/transformers#31698. The 9b model works fine without it.
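Logit soft-capping squashes logits smoothly into a bounded range with a scaled tanh before the softmax. A minimal sketch (the 30.0 cap below matches the `final_logit_softcapping` value reported in Gemma 2's config; treat the exact value as an assumption here):

```python
import numpy as np

def soft_cap(logits: np.ndarray, cap: float) -> np.ndarray:
    # Squash logits smoothly into (-cap, cap). Near zero the transform
    # is approximately the identity, so small logits pass through
    # almost unchanged while extreme logits saturate at +/- cap.
    return cap * np.tanh(logits / cap)

logits = np.array([-100.0, -10.0, 0.0, 10.0, 100.0])
capped = soft_cap(logits, cap=30.0)  # assumed Gemma 2 final-logit cap
assert np.all(np.abs(capped) < 30.0)   # bounded
assert abs(capped[2]) < 1e-9           # identity at zero
```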
Thanks for the PR and supporting Gemma 2! Will the 8k context length be supported in the future?
Multiple sources (e.g., https://www.reddit.com/r/LocalLLaMA/comments/1dusu3s/gemma_2_finetuning_2x_faster_63_less_memory_best/) have confirmed that soft-capping is an absolute necessity for the 27b checkpoint. Are there any plans to make this available in vLLM? Otherwise, generation with the 27b checkpoint is useless...
Release v0.5.1 from the weekend supports logit soft-capping with the FLASHINFER attention backend.
Make sure that you match the proper torch and CUDA versions (torch 2.3 and likely CUDA 12.1 is what you want).
Thank you very much for your contribution, but as @Hi-archers noted in #6166 (comment), several users are currently experiencing a "Segmentation fault (core dumped)" error after performing extensive Gemma 2 inference using the FlashInfer backend. My current environment is Torch 2.3.0, CUDA 12.1, and FlashInfer 0.0.8. I hope you can address this issue. Thank you.
FlashInfer is built for specific CUDA and PyTorch versions, so you can have CUDA 12.1 and Torch 2.3.0 but may have installed a FlashInfer build for Torch 2.2.0. When this happens, we will get this error. The default wheel for vLLM targets PyTorch 2.3 and CUDA 12.1, so you likely want to install the following FlashInfer wheel:

```shell
PYTHON_VERSION=310
wget https://github.com/flashinfer-ai/flashinfer/releases/download/v0.0.8/flashinfer-0.0.8+cu121torch2.3-cp${PYTHON_VERSION}-cp${PYTHON_VERSION}-linux_x86_64.whl
```
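As an illustration of why the versions must line up: the wheel filename itself encodes the FlashInfer version, CUDA version, PyTorch version, and Python tag, and a mismatch with your environment leads to the crashes discussed above. A small, purely hypothetical helper that reconstructs the filename:

```python
# Hypothetical helper (not part of FlashInfer or vLLM) showing how the
# wheel filename is assembled from the versions it was built against.

def flashinfer_wheel_name(version: str, cuda: str, torch: str,
                          python_tag: str) -> str:
    # e.g. version="0.0.8", cuda="121", torch="2.3", python_tag="310"
    return (f"flashinfer-{version}+cu{cuda}torch{torch}"
            f"-cp{python_tag}-cp{python_tag}-linux_x86_64.whl")

name = flashinfer_wheel_name("0.0.8", "121", "2.3", "310")
# Matches the filename in the wget command above.
assert name == "flashinfer-0.0.8+cu121torch2.3-cp310-cp310-linux_x86_64.whl"
```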
Thanks. It has been running, but the OpenAI interface call answers have been empty.
Please refrain from posting images of your errors. Rather, paste the text so that I can copy/paste it and so that it is searchable. Can you please try the instruction model rather than the pretrained model with the chat interface?
Thank you for your response. However, even after following your instructions to install FLASHINFER, I still encountered a Segmentation fault (core dumped). This time it occurred at 3729/3822 lines, whereas previously it happened at 3708/3822 lines. Should I open a new issue to address this problem? |
Can you do `pip show vllm`?
pip show vllm:

```
Name: vllm
```

My code:

```python
import json
import time
from vllm import LLM, SamplingParams
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import argparse
from tqdm import tqdm
import os
import sys

os.environ["VLLM_ATTENTION_BACKEND"] = "FLASHINFER"
os.environ["HF_TOKEN"] = "<TOKEN>"

parser = argparse.ArgumentParser()
parser.add_argument('--top', type=str, default="3")
parser.add_argument('--seed', type=int, default=42)
parser.add_argument('--cuda', type=int, default=1)
parser.add_argument('--utili', type=float, default=1)
parser.add_argument('--model_name', type=str, default="/data1/**/Gemma2/gemma-2-9b-it")
parser.add_argument('--title', type=int, default=1)
parser.add_argument('--temperature', type=float, default=0.)
args = parser.parse_args()
print(args)

os.environ["CUDA_VISIBLE_DEVICES"] = str(args.cuda)
sys.path.append(os.path.abspath("../../"))
from system_prompt import system_prompts, demonstration, instruction

sampling_params = SamplingParams(
    temperature=args.temperature,
    seed=args.seed,
    max_tokens=100,
)
print(sampling_params)

model_name = args.model_name
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(
    model=model_name,
    seed=args.seed,
    gpu_memory_utilization=args.utili,
)

if __name__ == "__main__":
    answer = []
    prompts = []
    # `que`, `tmp`, and `get_template` come from code the author elided.
    for i, line in tqdm(enumerate(que), total=len(que)):
        for i_line in tmp:
            prompts.append(get_template(i_line["que"], i_line['A'], i_line["B"]))
    print(len(prompts))  # 3822
    outputs = llm.generate(prompts, sampling_params)
```

I removed the irrelevant code from my code.
I downloaded the wrong version. It's working fine now.
I can get it to run with FlashInfer as the attention backend, but results are still abysmal.
Hi! I'm using the latest vLLM image on Docker, on GCP using A100 GPUs. I'm getting the following error when making many requests to the OpenAI server, using the 27B instruction-tuned model: … This is with vLLM 0.5.1 with … What am I doing wrong?
Can you please share your code sample? How do you load the Gemma 2 27b with its params?
We use vLLM through k8s; this is the relevant snippet from the YAML: …
When I am running gemma-2-9b-it on an H100 (80GB), the speed is very slow for me, around 20 TPS; any idea why? I launched with a LoRA adapter. It's much faster on SGLang, with 43 TPS.
Is there currently a plan to support sliding windows? |
Has it been changed since then? Is sliding window attention (SWA) supported without a cap on the context length?
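For reference, "sliding window" here means each query position attends only to the most recent W key positions on the interleaved layers. A minimal NumPy sketch of such an attention mask:

```python
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    # True where query position i may attend to key position j:
    # causal (j <= i) and within the last `window` positions.
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=6, window=3)
# Each query attends to at most `window` keys.
assert mask.sum(axis=1).max() == 3
# Position 5 sees only positions 3, 4, 5.
assert mask[5].tolist() == [False, False, False, True, True, True]
```

Disabling sliding window, as this PR does for Gemma 2, amounts to replacing this mask with the plain causal mask on those layers.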



This PR adds Gemma 2, a new family of open LLMs from Google.
Two major issues to note:
These issues will also be explicitly mentioned in warning messages.