Conversation
This fixes it:

```diff
--- a/llama.cpp
+++ b/llama.cpp
@@ -2423,6 +2423,7 @@ static struct ggml_cgraph * llm_build_llama(
                         ggml_element_size(kv_self.v)*n_ctx,
                         ggml_element_size(kv_self.v)*n_ctx*n_embd_head,
                         ggml_element_size(kv_self.v)*n_ctx*n_embd_gqa*il));
+            V->src[1] = KQ_soft_max;
             offload_func_v(V);
             ggml_set_name(V, "V");
```

It seems the concurrency detection does not take into account the indirect dependency of `V` (a view into the KV cache) through the cache memory. What would be a good solution for this problem?
```diff
         for (int j = n_start; j < i; j++) {
-            if (nodes_unused[j] && gf->nodes[j]->op != GGML_OP_RESHAPE \
-                && gf->nodes[j]->op != GGML_OP_VIEW \
-                && gf->nodes[j]->op != GGML_OP_TRANSPOSE \
-                && gf->nodes[j]->op != GGML_OP_PERMUTE) {
+            if (nodes_unused[j] && gf->nodes[j]->view_src == NULL) {
```
Need confirmation that this is correct.

Btw, I see that `check_mem` is always set to false. Are there cases where we would still need it? If it's not needed, should we remove it?
Makes sense. I had a similar problem when implementing parallel execution with CUDA, and at the time my solution was this (the linked code snippet). I don't really like the solution of setting `src[1]` manually, though:
```diff
         result->nb[i] = src->nb[i];
     }

+    result->op = GGML_OP_VIEW;
```
I added your proposal since it better expresses the KV cache dependency, until we figure out a better solution.

For this to work, though, we need to set the `ggml_view_tensor` op to `GGML_OP_VIEW`, as you recently noted. I guess this change is fine.
In this case, we should also set `src[0]` to `src` in `ggml_view_tensor`, and add the additional dependency as `src[1]` instead in `llm_build_graph`. Otherwise, it would break old code that assumes that the source is in `src[0]`, such as this in ggml-cuda:
https://github.com/ggerganov/llama.cpp/blob/bd33e5ab92e7f214205792fc1cd9ca28e810f897/ggml-cuda.cu#L6539-L6544
`ggml_graph_import` may also be an issue: I am not sure that it will import the second dependency in `src[1]` correctly, since it just creates a new tensor by calling `ggml_view_4d`. Newer code should use `view_src` instead.
Ah, good points. I guess I'll come back to this later and make a better solution.
Made an issue so we don't forget: ggml-org/ggml#502
Hello! Could you share your code for implementing parallel execution with CUDA?
See PR #2239, which the linked code snippet is from.
This is supposed to work, but it breaks the results with Metal enabled. Need to understand why and fix it.