Split by rows instead of layers for llama.cpp multi-gpu#5435

Merged
oobabooga merged 33 commits into oobabooga:dev from Ph0rk0z:patch-4
Feb 5, 2024
Conversation

Contributor

@Ph0rk0z Ph0rk0z commented Feb 4, 2024

On some cards, the new splitting by layer causes performance losses. Even on 3090s, per-GPU utilization drops from over 50% to 43%. P40s show demonstrable losses. This parameter lets you split by rows, like the original behavior, and should fix those speed issues. Default behavior should still be splitting by layers.
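The two strategies discussed here can be sketched as follows. This is a hypothetical illustration (not the PR's actual code, which simply passes a split-mode option through to llama.cpp): layer splitting assigns whole transformer layers to each GPU, while row splitting slices each layer's weight matrix rows across all GPUs so they compute the same layer in parallel. The function names and partitioning details are illustrative assumptions.

```python
# Hypothetical sketch contrasting llama.cpp's two multi-GPU split strategies.
# Not the actual implementation; illustrates the partitioning idea only.

def layer_split(n_layers: int, n_gpus: int) -> list[list[int]]:
    """Layer split: assign contiguous blocks of whole layers to each GPU."""
    per_gpu = [n_layers // n_gpus] * n_gpus
    for i in range(n_layers % n_gpus):  # spread any remainder layers
        per_gpu[i] += 1
    assignment, start = [], 0
    for count in per_gpu:
        assignment.append(list(range(start, start + count)))
        start += count
    return assignment

def row_split(n_rows: int, n_gpus: int) -> list[range]:
    """Row split: slice one layer's weight-matrix rows across all GPUs."""
    base, rem = divmod(n_rows, n_gpus)
    slices, start = [], 0
    for g in range(n_gpus):
        count = base + (1 if g < rem else 0)
        slices.append(range(start, start + count))
        start += count
    return slices
```

With layer splitting, GPUs process layers sequentially (only one is busy per token step), which is consistent with the low per-GPU utilization reported below; row splitting keeps all GPUs active on each layer at the cost of more inter-GPU communication.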

oobabooga and others added 30 commits December 14, 2023 22:39
@oobabooga
Owner

Is there a reason not to have split by rows as the default if it leads to better performance?

@Ph0rk0z
Contributor Author

Ph0rk0z commented Feb 5, 2024

I kept the default behavior of llama.cpp and also have no way to test a 4090 or all the different combinations. I can say the P40 gains its 2 or 3 t/s back and the 3090 goes from 40% utilization per GPU to over 5X%.

Nothing restores the pre-ggml-org/llama.cpp#4606 behavior, unfortunately.

@oobabooga
Owner

Fair enough

@oobabooga oobabooga changed the base branch from main to dev February 5, 2024 02:36
@oobabooga oobabooga merged commit 2a45620 into oobabooga:dev Feb 5, 2024
PoetOnTheRun pushed a commit to PoetOnTheRun/text-generation-webui that referenced this pull request Feb 22, 2024
@Ph0rk0z Ph0rk0z deleted the patch-4 branch May 12, 2024 17:39