CUDA: parallelize graph evaluation across multiple streams#4719
JohannesGaessler wants to merge 6 commits into ggml-org:master
Conversation
I tried to compile your branch with ROCm on Windows but it does not seem to support it.
What specifically is failing? Compilation? If yes, please provide a log.
Oh, never mind, it was just a wrong configuration: `cmake .. -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DLLAMA_HIPBLAS=ON -DLLAMA_CUDA_DMMV_X=32 -DLLAMA_CUDA_MMV_Y=2 -DBUILD_SHARED_LIBS=ON -DCMAKE_C_COMPILER="clang.exe" -DCMAKE_CXX_COMPILER="clang++.exe" -DAMDGPU_TARGETS="gfx1100"`
This PR includes an implementation that parallelizes the execution of ggml graphs across multiple CUDA streams when possible. For example, there are three independent matrix multiplications for K, Q, and V at the beginning of each layer. The rationale behind using multiple CUDA streams is to hide the latency of kernel launches and to avoid tail effects where the GPU runs out of work at the end of a kernel. The downside is that more VRAM is needed for temporary buffers, since each stream needs its own; otherwise the streams would write to the same memory.
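The stream/event pattern described above can be sketched roughly as follows. This is not the PR's actual code: `matmul_kernel`, `attention_kernel`, the launch configuration, and `main_stream` are all hypothetical placeholders; only the CUDA runtime calls (`cudaStreamCreate`, `cudaEventRecord`, `cudaStreamWaitEvent`) are real API.

```cuda
// Hedged sketch: issue the independent K, Q, and V projections on three
// separate streams, then make the main stream wait on all of them via
// events before launching the kernel that consumes their outputs.
cudaStream_t streams[3];
cudaEvent_t  done[3];
for (int i = 0; i < 3; ++i) {
    cudaStreamCreate(&streams[i]);
    // timing is not needed, only ordering:
    cudaEventCreateWithFlags(&done[i], cudaEventDisableTiming);
}

// The three projections have no data dependencies on each other,
// so the GPU is free to overlap them:
matmul_kernel<<<grid, block, 0, streams[0]>>>(x, wk, k);
matmul_kernel<<<grid, block, 0, streams[1]>>>(x, wq, q);
matmul_kernel<<<grid, block, 0, streams[2]>>>(x, wv, v);

// Record completion on each stream and block the main stream on it:
for (int i = 0; i < 3; ++i) {
    cudaEventRecord(done[i], streams[i]);
    cudaStreamWaitEvent(main_stream, done[i], 0);
}
attention_kernel<<<grid, block, 0, main_stream>>>(k, q, v, out);
```

The synchronization cost of the `cudaEventRecord`/`cudaStreamWaitEvent` pairs is exactly the overhead the benchmark results below suggest can outweigh the gains from concurrency.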
Unfortunately, on my systems this parallelization performs worse than master:
With the exception of Mixtral MMQ prompt processing, which became slightly faster, performance has either stayed the same or regressed. This suggests to me that the overhead from synchronization is generally larger than the speedup from multiple concurrent CUDA streams. Given this data I don't think this PR is worth merging. However, there may be hardware setups where it helps, so it would be useful to me if others could test the performance of this PR against master (particularly Windows users, since all of these tests were on Linux). Alternatively, if someone can spot a problem in my implementation that hurts performance, that would also be useful.
Implementation Details
`ggml_tensor_extra_gpu` has been extended by the events `src0_done` and `src1_done` to potentially wait for the input tensors to be done before execution is started. The integers `is` and `is_branch` track the stream that the tensor is evaluated on and the stream on which the next tensor using this tensor as input should be evaluated. Every time a tensor is used as input, its `is_branch` value is incremented, so a branching graph results in new CUDA streams being used. Weights have `data_constant` set to `true` so that they are ignored for branching.
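A minimal sketch of the branching logic described above, stripped of all CUDA specifics. The `node` struct and `assign_stream` function are hypothetical simplifications of the PR's per-tensor bookkeeping, assuming that a tensor inherits the branch stream of its first non-constant input and that each additional consumer of the same tensor is bumped onto a fresh stream:

```cpp
#include <cassert>

// Hypothetical, simplified mirror of the extra per-tensor fields:
// `is` is the stream this tensor is evaluated on, `is_branch` is the
// stream the next consumer of this tensor should use.
struct node {
    bool data_constant = false; // weights: ignored for branching
    int  is            = 0;     // stream this node runs on
    int  is_branch     = 0;     // stream for the next consumer
    node *src0 = nullptr;
    node *src1 = nullptr;
};

// Assign a stream to `t` from its inputs: take the branch stream of the
// first non-constant input, then bump that input's branch counter so a
// second consumer of the same tensor lands on a different stream.
void assign_stream(node *t) {
    node *srcs[2] = {t->src0, t->src1};
    for (node *s : srcs) {
        if (s == nullptr || s->data_constant) {
            continue; // constant weights do not cause branching
        }
        t->is        = s->is_branch;
        t->is_branch = s->is_branch;
        s->is_branch++; // next tensor reading `s` gets a new stream
        break;
    }
}
```

With three matmul nodes that all read the same activation tensor (and otherwise only constant weights), the first consumer stays on stream 0 while the second and third branch off to streams 1 and 2, matching the K/Q/V example from the PR description.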