
Add llama-cpp-python wheels with tensor cores support #5003

Merged
oobabooga merged 2 commits into dev from tensorcores on Dec 19, 2023
Conversation

@oobabooga (Owner) commented on Dec 19, 2023

This PR adds a `--tensorcores` launch option. It is recommended if you have a newer NVIDIA GPU, such as an RTX series card, since those GPUs have tensor cores that the default wheels do not take advantage of.

Hopefully this will become a runtime flag in the future. For now, I had to compile a separate set of llama_cpp_cuda_tensorcore wheels using GitHub Actions, built without the -DLLAMA_CUDA_FORCE_MMQ=ON flag, to support this feature.

llama-2-13b.Q3_K_M.gguf

Fully offloaded to the GPU.

without --tensorcores:

```
llama_print_timings:        load time =     508.69 ms
llama_print_timings:      sample time =      91.02 ms /   512 runs   (    0.18 ms per token,  5625.08 tokens per second)
llama_print_timings: prompt eval time =    3545.43 ms /  3200 tokens (    1.11 ms per token,   902.57 tokens per second)
llama_print_timings:        eval time =   14075.88 ms /   511 runs   (   27.55 ms per token,    36.30 tokens per second)
llama_print_timings:       total time =   18381.46 ms
Output generated in 18.61 seconds (27.52 tokens/s, 512 tokens, context 3200, seed 107801409)
```

with --tensorcores:

```
llama_print_timings:        load time =     272.25 ms
llama_print_timings:      sample time =      88.52 ms /   512 runs   (    0.17 ms per token,  5783.87 tokens per second)
llama_print_timings: prompt eval time =    1869.05 ms /  3200 tokens (    0.58 ms per token,  1712.10 tokens per second)
llama_print_timings:        eval time =    9889.70 ms /   511 runs   (   19.35 ms per token,    51.67 tokens per second)
llama_print_timings:       total time =   12484.12 ms
Output generated in 12.70 seconds (40.30 tokens/s, 512 tokens, context 3200, seed 1845973748)
```
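As a quick sanity check on the gain, the tokens-per-second figures reported by `llama_print_timings` above work out to roughly a 1.9x speedup for prompt processing and 1.4x for generation (a small Python sketch using those reported values):

```python
# Speedup implied by the llama-2-13b benchmark above (fully GPU-offloaded).
# Tokens/s values copied from the llama_print_timings output.
prompt_tps = {"without": 902.57, "with": 1712.10}  # prompt eval
eval_tps = {"without": 36.30, "with": 51.67}       # generation (eval)

prompt_speedup = prompt_tps["with"] / prompt_tps["without"]
eval_speedup = eval_tps["with"] / eval_tps["without"]

print(f"prompt eval speedup: {prompt_speedup:.2f}x")  # ~1.90x
print(f"generation speedup:  {eval_speedup:.2f}x")    # ~1.42x
```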

mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf

With 20/33 layers offloaded to the GPU.

without --tensorcores:

```
llama_print_timings:        load time =   14624.69 ms
llama_print_timings:      sample time =      92.02 ms /   512 runs   (    0.18 ms per token,  5563.83 tokens per second)
llama_print_timings: prompt eval time =   96262.41 ms /  3167 tokens (   30.40 ms per token,    32.90 tokens per second)
llama_print_timings:        eval time =   45573.66 ms /   511 runs   (   89.19 ms per token,    11.21 tokens per second)
llama_print_timings:       total time =  143372.46 ms
Output generated in 143.60 seconds (3.57 tokens/s, 512 tokens, context 3167, seed 870331982)
```

with --tensorcores:

```
llama_print_timings:        load time =   16319.61 ms
llama_print_timings:      sample time =      99.37 ms /   512 runs   (    0.19 ms per token,  5152.67 tokens per second)
llama_print_timings: prompt eval time =   95983.94 ms /  3167 tokens (   30.31 ms per token,    33.00 tokens per second)
llama_print_timings:        eval time =   44855.77 ms /   511 runs   (   87.78 ms per token,    11.39 tokens per second)
llama_print_timings:       total time =  142425.11 ms
Output generated in 142.65 seconds (3.59 tokens/s, 512 tokens, context 3167, seed 1493079911)
```
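For contrast, running the same calculation on the partially offloaded Mixtral numbers above shows the gain is marginal when most layers stay on the CPU, which dominates the runtime:

```python
# Mixtral-8x7B with only 20/33 layers on the GPU: tensor cores barely
# matter here because CPU-side layers dominate. Values copied from the
# llama_print_timings output above.
eval_tps = {"without": 11.21, "with": 11.39}  # generation, tokens/s

speedup = eval_tps["with"] / eval_tps["without"]
print(f"generation speedup: {speedup:.3f}x")  # ~1.016x, essentially noise
```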

@oobabooga oobabooga merged commit de138b8 into dev Dec 19, 2023
@oobabooga oobabooga deleted the tensorcores branch December 20, 2023 17:50
