
CUDA: parallelize graph evaluation across multiple streams #4719

Draft

JohannesGaessler wants to merge 6 commits into ggml-org:master from JohannesGaessler:cuda-multi-stream-2

Conversation

Contributor

@JohannesGaessler JohannesGaessler commented Dec 31, 2023

This PR parallelizes the execution of ggml graphs across multiple CUDA streams where possible. For example, at the beginning of each layer there are three independent matrix multiplications for K, Q, and V. The rationale for using multiple CUDA streams is to hide kernel launch latency and to avoid tail effects where the GPU runs out of work towards the end of a kernel. The downside is that more VRAM is needed for temporary buffers, since otherwise the concurrent CUDA streams would write to the same memory.
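For reference, the general CUDA pattern for this kind of parallelization looks roughly as follows. This is a hedged sketch, not the PR's actual code: `matmul_kernel` and `fused_kqv_example` are illustrative names, and the real implementation works on generic ggml graph nodes rather than hardcoded K/Q/V projections.

```cuda
// Sketch: launch independent matmuls on separate streams, then use events
// to rejoin before a dependent op consumes the results.
#include <cuda_runtime.h>

__global__ void matmul_kernel(const float *a, const float *b, float *c) { /* ... */ }

void fused_kqv_example(const float *x, const float *wk, const float *wq, const float *wv,
                       float *k, float *q, float *v) {
    cudaStream_t s[3];
    cudaEvent_t  e[3];
    for (int i = 0; i < 3; ++i) {
        cudaStreamCreate(&s[i]);
        cudaEventCreateWithFlags(&e[i], cudaEventDisableTiming);
    }

    // The three projections have no data dependency on each other,
    // so each is launched on its own stream and can overlap:
    matmul_kernel<<<1, 256, 0, s[0]>>>(x, wk, k);
    matmul_kernel<<<1, 256, 0, s[1]>>>(x, wq, q);
    matmul_kernel<<<1, 256, 0, s[2]>>>(x, wv, v);

    // Record completion on the side streams and make the main stream s[0]
    // wait on them before the next op (e.g. attention) reads k, q, and v:
    for (int i = 1; i < 3; ++i) {
        cudaEventRecord(e[i], s[i]);
        cudaStreamWaitEvent(s[0], e[i], 0);
    }
}
```

Note that each `cudaEventRecord`/`cudaStreamWaitEvent` pair is itself a synchronization point with nonzero cost, which is the overhead discussed below.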

Unfortunately, on my systems this parallelization performs worse than master:

| GPU | Model | Test | MMQ | t/s master | t/s PR | Speedup | VRAM master [MiB] | VRAM PR [MiB] |
|---|---|---|---|---|---|---|---|---|
| RTX 3090 | 7b q4_0 | pp512 | No | 3430 | 3438 | 1 | 4824 | 5130 |
| RTX 3090 | 7b q4_0 | pp512 | Yes | 2387 | 2394 | 1 | 4542 | 4608 |
| RTX 3090 | 7b q4_0 | tg128 | No | 136 | 118.31 | 0.87 | 4824 | 5130 |
| RTX 3090 | 8x7b q3_k_s | pp512 | No | 369 | 370 | 1 | 21766 | 22546 |
| RTX 3090 | 8x7b q3_k_s | pp512 | Yes | 305 | 335 | 1.1 | 21484 | 21604 |
| RTX 3090 | 8x7b q3_k_s | tg128 | No | 47.77 | 48.38 | 1.01 | 21766 | 22546 |
| P40 | 7b q4_0 | pp512 | Yes | 895 | 844 | 0.94 | - | - |
| P40 | 7b q4_0 | tg128 | Yes | 56.60 | 44.81 | 0.79 | - | - |
| 3x P40 | 70b q6_k | pp512 | Yes | 149 | 146 | 0.98 | - | - |
| 3x P40 | 70b q6_k | tg128 | Yes | 8.56 | 7.71 | 0.9 | - | - |
| RX 6800 | 7b q4_0 | pp512 | Yes | 1242 | 1227 | 0.99 | - | - |
| RX 6800 | 7b q4_0 | tg128 | Yes | 74.38 | 56.26 | 0.76 | - | - |

With the exception of Mixtral MMQ prompt processing, which became slightly faster, performance has either stayed the same or regressed. This suggests to me that the synchronization overhead is generally larger than the speedup from multiple concurrent CUDA streams. Given this data I don't think this PR is worth merging. However, there may be hardware setups where it helps, so it would be useful if others could test the performance of this PR against master (particularly Windows users, since all of these tests were done on Linux). Alternatively, if someone can spot a problem in my implementation that cripples performance, that would also be useful.

Implementation Details

ggml_tensor_extra_gpu has been extended with the events src0_done and src1_done, which allow waiting for the input tensors to be finished before execution starts. The integers is and is_branch track the stream on which the tensor is being evaluated and the stream on which the next tensor consuming it should be evaluated. Every time a tensor is used as an input, is_branch is incremented, so a branching graph results in new CUDA streams being used. Weights have data_constant set to true so that they are ignored for branching.
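The branching scheme described above can be modeled with a small host-side sketch. This is an illustrative reconstruction, not the PR's exact code: the struct and function names mirror the description, but the real ggml_tensor_extra_gpu carries more fields and the actual stream selection may differ.

```cpp
#include <algorithm>
#include <initializer_list>

// Minimal model of the extended per-tensor GPU metadata.
struct extra_gpu {
    int  is            = 0;     // stream this tensor is evaluated on
    int  is_branch     = 0;     // stream the next consumer of this tensor should use
    bool data_constant = false; // true for weights: ignored for branching
};

// Pick the stream for a node from its (up to two) inputs, then advance each
// non-constant input's is_branch so that a further consumer of the same
// tensor branches off onto a new stream.
int assign_stream(extra_gpu *src0, extra_gpu *src1, extra_gpu *dst) {
    int stream = 0;
    for (extra_gpu *src : {src0, src1}) {
        if (src == nullptr || src->data_constant) continue;
        stream = std::max(stream, src->is_branch);
        src->is_branch++; // next consumer of this tensor gets a new stream
    }
    dst->is        = stream;
    dst->is_branch = stream;
    return stream;
}
```

In this model, three projections that all consume the same hidden state (with constant weight tensors) land on streams 0, 1, and 2, matching the K/Q/V example above.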

@sorasoras

I tried to compile your branch with ROCm on Windows, but it does not seem to be supported.

@JohannesGaessler
Contributor Author

What specifically is failing? Compilation? If yes, please provide a log.

@sorasoras

sorasoras commented Jan 1, 2024

> What specifically is failing? Compilation? If yes, please provide a log.

Oh, never mind, it was just a wrong configuration.
But with your PR I only get about half the GPU utilization and half the speed of the main branch. On the main branch my 7900 XTX can hit ~420 W, but with this PR it only ever reaches 200 W.
PS: it also uses a lot more CPU than mainline. Mainline uses about 7% of my CPU, but your PR uses around 20%, with 3 threads at 100% load.

```shell
cmake .. -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DLLAMA_HIPBLAS=ON -DLLAMA_CUDA_DMMV_X=32 -DLLAMA_CUDA_MMV_Y=2 -DBUILD_SHARED_LIBS=ON -DCMAKE_C_COMPILER="clang.exe" -DCMAKE_CXX_COMPILER="clang++.exe" -DAMDGPU_TARGETS="gfx1100"
```
