ggml-cpu: use lookup table for ggml ops and parallelize some of the memcpy, memset etc. calls before ggml_barriers #1101
Conversation
I am going to remove the continuation stuff for now as part of this MR. I used it because I am locally experimenting with a scheduler that is closer to what a game engine would use, and for that, unfortunately, all the barriers have to go: calling in all the threads and synchronizing them on the spot does not compose well with the other parallelization primitives. I have also ordered a dual-socket Epyc to get a better idea of the NUMA memory-traffic costs. I will probably bring my proposal (or something similar) back in the future once I have had more time to solidify my WiP code. If you like, we can also talk through some ideas over vidcon.
daa949d to 97d27e6
Also parallelized some of the memset, memcpy etc. calls before some of the ggml_barriers.
97d27e6 to c323c56
Handle all calls to ggml_barrier at the ggml_graph_compute_thread level, and parallelize some of the memset, memcpy etc. calls before some of the barrier calls.
All barriers are now implemented by returning a continuation point, and then the tensor op is re-run, very similar to how a coroutine would resume, but with much less boilerplate, since the tasks are simple enough that none is needed.
Ideally the code around the barriers should be split up, but I also see the benefit of keeping the code together, so this was the best compromise I could come up with while still being able to move the barrier calls to a single place (within ggml-cpu).