CUDA: parallelize graph evaluation across multiple streams#4719
JohannesGaessler wants to merge 6 commits into ggml-org:master
Conversation
I tried to compile your branch with ROCm on Windows but it does not seem to support it.
What specifically is failing? Compilation? If yes, please provide a log.
Oh, never mind, it was just a wrong configuration: `cmake .. -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DLLAMA_HIPBLAS=ON -DLLAMA_CUDA_DMMV_X=32 -DLLAMA_CUDA_MMV_Y=2 -DBUILD_SHARED_LIBS=ON -DCMAKE_C_COMPILER="clang.exe" -DCMAKE_CXX_COMPILER="clang++.exe" -DAMDGPU_TARGETS="gfx1100"`
This PR includes an implementation that parallelizes the execution of ggml graphs across multiple CUDA streams when possible. For example, there are three independent matrix multiplications for K, Q, and V at the beginning of each layer. The rationale behind using multiple CUDA streams is to hide the latency of kernel launches and to avoid tail effects where the GPU runs out of work at the end of a kernel. The downside is that more VRAM is needed for temporary buffers, since each stream needs its own; otherwise the streams would write to the same memory.
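The stream/event pattern described above can be sketched roughly as follows. This is not the PR's actual code: `matmul_kernel`, `attention_kernel`, the launch configuration, and `main_stream` are all hypothetical placeholders; only the CUDA runtime calls (`cudaStreamCreate`, `cudaEventRecord`, `cudaStreamWaitEvent`) are real API.

```cuda
// Hedged sketch: issue the independent K, Q, and V projections on three
// separate streams, then make the main stream wait on all of them via
// events before launching the kernel that consumes their outputs.
cudaStream_t streams[3];
cudaEvent_t  done[3];
for (int i = 0; i < 3; ++i) {
    cudaStreamCreate(&streams[i]);
    // timing is not needed, only ordering:
    cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);
}

// The three projections have no data dependencies on each other,
// so the GPU is free to overlap them:
matmul_kernel<<<grid, block, 0, streams[0]>>>(x, wk, k);
matmul_kernel<<<grid, block, 0, streams[1]>>>(x, wq, q);
matmul_kernel<<<grid, block, 0, streams[2]>>>(x, wv, v);

// Record completion on each stream and block the main stream on it:
for (int i = 0; i < 3; ++i) {
    cudaEventRecord(done[i], streams[i]);
    cudaStreamWaitEvent(main_stream, done[i], 0);
}
attention_kernel<<<grid, block, 0, main_stream>>>(k, q, v, out);
```

The synchronization cost of the `cudaEventRecord`/`cudaStreamWaitEvent` pairs is exactly the overhead the benchmark results below suggest can outweigh the gains from concurrency.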
Unfortunately, on my systems this parallelization performs worse than master:
With the exception of Mixtral MMQ prompt processing, which became slightly faster, performance has either stayed the same or regressed. This suggests to me that the overhead from synchronization is generally larger than the speedup from multiple concurrent CUDA streams. Given this data I don't think this PR is worth merging. However, there may be hardware setups where it helps, so it would be useful to me if others could test the performance of this PR against master (particularly Windows users, since all of these tests were on Linux). Alternatively, if someone can spot a problem in my implementation that hurts performance, that would also be useful.
Implementation Details
`ggml_tensor_extra_gpu` has been extended by the events `src0_done` and `src1_done` to potentially wait for the input tensors to be done before execution is started. The integers `is` and `is_branch` track the stream that the tensor is evaluated on and the stream on which the next tensor using this tensor as input should be evaluated. Every time a tensor is used as input, its `is_branch` value is incremented, so a branching graph results in new CUDA streams being used. Weights have `data_constant` set to `true` so that they are ignored for branching.
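A minimal sketch of the branching logic described above, stripped of all CUDA specifics. The `node` struct and `assign_stream` function are hypothetical simplifications of the PR's per-tensor bookkeeping, assuming that a tensor inherits the branch stream of its first non-constant input and that each additional consumer of the same tensor is bumped onto a fresh stream:

```cpp
#include <cassert>

// Hypothetical, simplified mirror of the extra per-tensor fields:
// `is` is the stream this tensor is evaluated on, `is_branch` is the
// stream the next consumer of this tensor should use.
struct node {
    bool data_constant = false; // weights: ignored for branching
    int  is            = 0;     // stream this node runs on
    int  is_branch     = 0;     // stream for the next consumer
    node *src0 = nullptr;
    node *src1 = nullptr;
};

// Assign a stream to `t` from its inputs: take the branch stream of the
// first non-constant input, then bump that input's branch counter so a
// second consumer of the same tensor lands on a different stream.
void assign_stream(node *t) {
    node *srcs[2] = {t->src0, t->src1};
    for (node *s : srcs) {
        if (s == nullptr || s->data_constant) {
            continue; // constant weights do not cause branching
        }
        t->is        = s->is_branch;
        t->is_branch = s->is_branch;
        s->is_branch++; // next tensor reading `s` gets a new stream
        break;
    }
}
```

With three matmul nodes that all read the same activation tensor (and otherwise only constant weights), the first consumer stays on stream 0 while the second and third branch off to streams 1 and 2, matching the K/Q/V example from the PR description.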