ggml: backend-agnostic tensor parallelism #19378
JohannesGaessler wants to merge 45 commits into ggml-org:master from
Conversation
I'm not seeing a deadlock, just a crash in the driver with an invalid descriptor. I ran It seems like tensor->data (and tensor->view_src->data) are too large. I haven't debugged further.
Works for me on 2x 3090 for LLaMA 3 8B and Mistral Nemo 12B. On Devstral I get an OOM (expected because of model size?), but on Ministral 14B too. Qwen 4B has a different issue.
Interesting. If the CPU backend is able to be virtualized into multiple devices as described here, would it be possible to allow multiple NUMA nodes to be parallelized?
Will this PR help if we use multiple RPC backends connected to GPUs together with the CPU, and when we use the CPU MoE flag?
@jacekpoplawski the combination of

@DocShotgun longer-term I intend to also enable this code for better NUMA support, though I'm not yet sure what to do in terms of hardware for development. Originally I had intended to buy 1.5 TiB of DDR5 RAM and 2 EPYC CPUs, but at the current prices that would be financially irresponsible of me to do.

@gopinath87607 I don't understand what you mean.
If you're looking for a DDR4 platform, I might be able to help with that, but unfortunately not so much with RAM for the system. I'm also in Germany. I wouldn't mind giving you access to my systems if you want; I have two dual-Xeon systems, one with P40s and the other with Mi50s. Would be very happy to help either way.
Started doing some initial tests to get familiar with the changes. I'm using virtual Metal devices and things appear to be mostly working - e.g. seeing graph execution on both devices. However, the following command does not produce identical results on each run:

GGML_METAL_DEVICES=2 ./bin/llama-completion -m ~/models/llama-3.1-8b/ggml-model-f16.gguf -no-cnv -p "I believe the meaning of life is" -n 32 --top-k 1 -sm tensor

I think this either means that there could be a problem with the fallback for the missing backend API, or that I could have made an error in the implementation of the events and the async copy in Metal. Will investigate more.

In the meantime, @JohannesGaessler do you confirm that the command above has deterministic results on your end? Also, is it deterministic with the fallback calls?
Generally speaking you cannot expect bit-for-bit identical results if you split the computation across multiple virtual devices. The order in which floats are summed up will be different, which will in turn change the rounding error. If I run

If you use
Sorry, I think I misread your post. If you are saying that the results are not deterministic with 2 virtual GPUs but they are with 1 GPU, then I think that is indicative of a bug w.r.t. the synchronization.
Yes, I understand that 1 GPU vs. 2 GPUs will not be bit-for-bit identical. What I mean is that in my test, running the command with 2 GPUs several times produces non-deterministic results from one run to the next:

GGML_METAL_DEVICES=2 ./bin/llama-completion -m ~/models/llama-3.1-8b/ggml-model-f16.gguf -no-cnv -p "I believe the meaning of life is" -n 32 --top-k 1 -sm tensor
# run 1
I believe the meaning of life is to find your gift. The purpose of life is to give it away.
I believe that the meaning of life is to find your gift. The purpose of life
# run 2
I believe the meaning of life is to find your gift. The purpose of life is to give it away. To give it away, you have to find it. [end of text]
# run 3
I believe the meaning of life is to be happy. I believe that happiness is the only thing that matters. I believe that happiness is the only thing that matters. I believe that happiness is the
Yes, seems like a synchronization issue. I was wondering if you observe it on your end with and without the fallback backend API; this will give me an indication of where to look for the issue.
With the command you posted (minus the
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
I pushed a rebased version with the recent improvements by @gaugarg-nv as well as more comprehensive tests in
Not sure what the etiquette is here for folk who can't contribute meaningfully to the technical side of things, but here's some testing from the AMD/Vulkan world on a pair of R9700s:

Dense: -73% PP, -56% TG

Is an optimisation pass planned for Vulkan, or is it coming in a later PR to avoid holding this one up?
I am one of the CUDA/ROCm maintainers so those are the backends I'm targeting. I don't know what the exact plans of the Vulkan maintainers are. |
Fair enough - apologies, I shall step out of the way then :)
Vulkan can do device-to-device DMA via device groups, but Vulkan implementations have spotty support for this and the Vulkan backend always goes via the CPU. This is probably the reason for the poor performance; @0cc4m knows the exact details.
I don't know what it is exactly, I'd have to profile it. |
This PR adds support for backend-agnostic tensor parallelism, enabled via specifying `--split-mode tensor`. This is done by adding a new "meta" backend that internally wraps multiple "simple" backends but can be used in the same way as a regular ggml backend.

ggml Backend Interface Changes
This PR extends the ggml backend interface with some new functions for tensor copying:
- `set_tensor_2d_async`/`get_tensor_2d_async`, which are equivalent to `memcpy2DAsync` in CUDA. This is not needed for the computation of the meta backend itself but rather for setting/getting weights or the output. Currently not implemented; as a workaround the one-dimensional version is used in a loop.
- `shfl_tensor_async` to allow two ggml backends to exchange two tensors and to synchronize on the completion of the exchange. As a fallback `cpy_tensor_async` can be used, but this has a higher latency because the copy in one direction can only start once the one in the other direction has finished (see the sketch below). Needed for a generic AllReduce between ggml backends. Implemented.
- `allreduce_tensor_async` to allow ggml backends to specify a backend-specific way to do an AllReduce operation. Intended to be used for NCCL support in cooperation with @gaugarg-nv. Not yet implemented.

@slaren please provide feedback regarding whether you agree that these operations should be in the ggml backend interface. For context, all of them are optional and can use existing operations as a fallback.
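For illustration, here is a rough sketch of what the `cpy_tensor_async` fallback for an exchange amounts to, using only the existing public functions `ggml_backend_tensor_copy_async` and `ggml_backend_synchronize`. This is not the PR's actual implementation, and the function/parameter names are made up; it only shows why the serialized exchange has higher latency than a dedicated `shfl_tensor_async` hook:

```cpp
// Rough sketch of the cpy_tensor_async fallback for exchanging partial
// results between two backends: the copy in one direction can only start
// once the copy in the other direction has completed.
#include "ggml.h"
#include "ggml-backend.h"

static void exchange_partials_fallback(ggml_backend_t backend_a, struct ggml_tensor * a_send, struct ggml_tensor * a_recv,
                                       ggml_backend_t backend_b, struct ggml_tensor * b_send, struct ggml_tensor * b_recv) {
    // copy A's partial result into B's receive tensor
    ggml_backend_tensor_copy_async(backend_a, backend_b, a_send, b_recv);
    ggml_backend_synchronize(backend_a);
    ggml_backend_synchronize(backend_b);

    // only now can the copy in the opposite direction start
    ggml_backend_tensor_copy_async(backend_b, backend_a, b_send, a_recv);
    ggml_backend_synchronize(backend_a);
    ggml_backend_synchronize(backend_b);
}
```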
Meta Backend Implementation Details
The meta backend implements an entire ggml backend stack starting from a meta device. The meta device is created from multiple simple backend devices as well as a function to determine how the data should be split across devices ("split states"). Backend buffer types, buffers, and backends are created as per usual. When calling `ggml_backend_graph_compute` the code infers the split states of the nodes in the compute graph based on the split states assigned to the weights/KV cache. The basic pattern is to make all tensors mirrored by default. For the weight matrices, do a split in dimension 1, then a split in dimension 0, then an AllReduce. For a transformer this means two AllReduce operations, one after the attention and one after the FFN. The attention is effectively split by dimension 0, which equates to a split by attention head.

A generic AllReduce operation is performed in the meta backend by splitting the graph into subgraphs. After a subgraph is executed, `shfl_tensor_async` is called to make the backends exchange partial results, and they then execute auxiliary graphs that contain only a `GGML_ADD` operation to combine the results (a conceptual sketch follows below).
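To make the AllReduce pattern concrete, here is a conceptual sketch of the auxiliary `GGML_ADD` step on one backend (not the PR's actual code; `allreduce_add_step`, `partial_local`, and `partial_remote` are made-up names, and `partial_remote` is assumed to have already been filled by the preceding exchange):

```cpp
// Conceptual sketch of the auxiliary GGML_ADD step of the generic AllReduce.
// 'ctx' is assumed to be a fresh ggml context created with no_alloc = true;
// 'partial_local' and 'partial_remote' are same-shaped tensors on 'backend'.
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"

static struct ggml_tensor * allreduce_add_step(ggml_backend_t backend,
                                               struct ggml_context * ctx,
                                               struct ggml_tensor * partial_local,
                                               struct ggml_tensor * partial_remote) {
    // auxiliary graph containing only a single GGML_ADD node
    struct ggml_tensor * combined = ggml_add(ctx, partial_local, partial_remote);
    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, combined);

    // allocate the output of the auxiliary graph on this backend and run it
    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);
    GGML_ASSERT(buf != NULL);
    ggml_backend_graph_compute(backend, gf);

    // 'combined' now holds the full (mirrored) result; the caller owns ctx/buf
    return combined;
}
```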
The memory allocation for the compute graph is rather tricky: the way I solved it is to allocate the memory for the meta backend as per usual and to then transplant the calculated addresses, relative to the backend buffer base pointer, to the underlying simple backends (loosely sketched below). Because the simple tensors only require a fraction of the full memory this yields correct results, though it does result in overallocation for the compute graphs. For the weights/KV cache the memory allocation for the meta backend is done via a new function `ggml_backend_meta_alloc_ctx_tensors_from_buft` to prevent duplicated weights (which are much larger in size). I'm not yet sure what the best approach will be long-term; I think the graph allocation code in `ggml-alloc.c` will need to be adjusted.
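As a loose illustration of the address transplanting described above (not the PR's actual code; the tensor and buffer names are placeholders), the idea is to reuse each tensor's offset within the meta buffer as its offset within the corresponding simple backend buffer:

```cpp
// Loose illustration of "transplanting" meta-buffer offsets onto a simple
// backend buffer.
#include "ggml.h"
#include "ggml-backend.h"

static void transplant_address(const struct ggml_tensor * meta_tensor,
                               ggml_backend_buffer_t       meta_buffer,
                               struct ggml_tensor *        simple_tensor,
                               ggml_backend_buffer_t       simple_buffer) {
    // offset of the tensor within the meta backend's compute buffer ...
    const size_t offset = (size_t) ((const char *) meta_tensor->data -
                                    (const char *) ggml_backend_buffer_get_base(meta_buffer));

    // ... reused as-is within the simple backend buffer; the simple tensor
    // only needs a fraction of the space, so this is safe but overallocates
    simple_tensor->data   = (char *) ggml_backend_buffer_get_base(simple_buffer) + offset;
    simple_tensor->buffer = simple_buffer;
}
```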
- `--tensor-split` has no effect.
- `llama_params_fit` is not implemented, so the context size has to be set manually.
- `llama_params_fit`.
- `ggml_tensor::data` is set to dummy values, since that is what is checked in `ggml-alloc.c` to determine whether or not a tensor is considered allocated. This dummy value should never actually be used for any computations, but I don't consider this a good solution.

Performance
LLaMA 3 on 2x RTX 4090
Generally speaking it can be observed that parallelizing larger models has better performance than parallelizing smaller models. Similarly, parallelizing the model becomes more worthwhile as the context depth increases. This makes sense as both of these result in a larger workload per GPU vs. the overhead from parallelization. Token generation benefits more from parallelization than prompt processing because the amount of data that needs to be transferred between GPUs is proportional to batch size; long-term it may make sense to implement support for FP16/BF16 compute types, which would cut the I/O in half vs. FP32 (a back-of-the-envelope estimate follows below). For pp512 pipeline parallelism is effectively disabled, while for pp2048 it's enabled. With pipeline parallelism `-sm layer` is still faster than `-sm tensor` even at high context depths.
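As a rough back-of-the-envelope estimate of the inter-GPU traffic (my own illustration, not a figure from the PR; it assumes LLaMA 3 8B with an embedding size of 4096 and 32 layers, 2 GPUs, and the two AllReduce operations per layer described above), the data each GPU sends per generated token is on the order of

$$
32 \,\text{layers} \times 2 \,\text{AllReduces} \times 4096 \times 4\,\text{B (FP32)} = 1\,\text{MiB},
$$

which an FP16/BF16 compute type would roughly halve, and which scales linearly with the batch size during prompt processing.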