
Conversation

@NiuBlibing (Contributor) commented Jun 13, 2024

Like #4007, this adds support for qwen2-72b-instruct's LoRA adapter with tp-size 1, 2, 4, and 8.

Ref #3793

@NiuBlibing changed the title from "Add 3696 bgmv-kernel to support qwen2-72b-instruct lora" to "Add 3696 bgmv-kernel to support qwen2-72b-instruct lora with tp 8" on Jun 13, 2024
@NiuBlibing changed the title from "Add 3696 bgmv-kernel to support qwen2-72b-instruct lora with tp 8" to "Add 3696 bgmv-kernel to support qwen2-72b-instruct lora" on Jun 13, 2024
@NiuBlibing changed the title from "Add 3696 bgmv-kernel to support qwen2-72b-instruct lora" to "support load qwen2-72b-instruct lora" on Jun 13, 2024
@NiuBlibing marked this pull request as draft on June 13, 2024 10:33
@NiuBlibing closed this on Jun 13, 2024
@NiuBlibing reopened this on Jun 13, 2024
@NiuBlibing closed this on Jun 13, 2024
@NiuBlibing reopened this on Jun 14, 2024
@NiuBlibing closed this on Jun 14, 2024
@NiuBlibing reopened this on Jun 14, 2024
@NiuBlibing closed this on Jun 14, 2024
@NiuBlibing (Contributor, Author) commented

Currently the punica kernel cannot support Qwen2-72B-Instruct because 3696 is not divisible by 64. Hopefully #5036 or #5356 will resolve this.
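
A minimal sketch of the arithmetic behind that limitation (assuming an intermediate_size of 29568 for Qwen2-72B-Instruct and a punica bgmv kernel that only handles shard sizes that are multiples of 64; neither value is stated in this thread):

# Hypothetical Python illustration, not code from this PR.
intermediate_size = 29568              # assumed Qwen2-72B-Instruct config value
tp_size = 8
shard = intermediate_size // tp_size   # 3696 columns per GPU shard
print(shard, shard % 64)               # prints "3696 48": not a multiple of 64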

@jeejeelee (Collaborator) commented Jun 14, 2024

Could you provide your running script?

I can test Qwen2-72B-Instruct+LoRA on my local device using #5036.

@NiuBlibing (Contributor, Author) commented

> Could you provide your running script?
>
> I can test Qwen2-72B-Instruct+LoRA on my local device using #5356.

I just start it with the vLLM CLI:

python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-72B-Chat-test --model ./Qwen/Qwen2-72B-Instruct/ --gpu-memory-utilization 0.9 --tensor-parallel-size 8 --enable-lora --lora-dtype bfloat16 --lora-modules test=/path/to/lora/
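
For completeness, a minimal sketch of querying the adapter once that server is up, assuming the default port 8000; the adapter is addressed by the name given in --lora-modules ("test"), while "Qwen2-72B-Chat-test" addresses the base model:

# Hypothetical client request; model names and port come from the command above.
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "test",           # LoRA adapter name from --lora-modules
        "prompt": "Hello, Qwen!",
        "max_tokens": 32,
    },
)
print(resp.json())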

@jeejeelee (Collaborator) commented

> > Could you provide your running script?
> > I can test Qwen2-72B-Instruct+LoRA on my local device using #5356.
>
> I just start it with the vLLM CLI:
>
> python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-72B-Chat-test --model ./Qwen/Qwen2-72B-Instruct/ --gpu-memory-utilization 0.9 --tensor-parallel-size 8 --enable-lora --lora-dtype bfloat16 --lora-modules test=/path/to/lora/

Sorry, actually #5036 was used for the testing.

I have completed the test; #5036 can resolve this issue.

However, there are still some other issues with #5036 that need to be resolved. I will address them ASAP.

