Faster hybrid inference with shared experts #1191
Conversation
With this PR #1191 (and without #1190 merged), I gain a solid and precious 4-5% TG improvement in hybrid inference with GLM 4.6.
This is magic. With this I've finally been able to break the 11 t/s barrier after sitting at 10.5 for so long. The only issue I've found is that, for some reason, after a long idle period (even without closing the server) the first generation is noticeably slower (around 9-9.5 t/s); after that it works fine. I don't know why that happens now, since it never slowed down that much on the first generation before. In any case, overall I'm happy with the speed.
As an addendum to #1183 (comment), I get the following speeds with this PR. Still not as good as 2a7cc09, but slightly better than #1183. This PR is definitely improving speeds, but the gain seems to be counteracted by the older PR.
This PR improves hybrid CPU/GPU performance for MoE models with shared experts (assuming the shared experts are in VRAM) when using split mode graph. It is nothing major, but not entirely negligible either: I see 4-5% better performance for GLM-4.5-AIR with all routed experts left on the CPU.
Oh, and people running with tiny batch sizes are unlikely to see a PP improvement.
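To illustrate the idea behind the speedup, here is a minimal, hypothetical sketch (not the PR's actual code, and the names are invented): with the shared expert resident in VRAM and the routed experts on the CPU, the shared-expert computation can be launched asynchronously so it overlaps with the routed-expert work, and the two partial results are summed afterwards. Threads stand in for the GPU/CPU backends, and a scaled copy stands in for a matmul.

```cpp
#include <cassert>
#include <future>
#include <vector>

// Stand-in for one expert's matmul: y = w * x.
std::vector<float> expert_forward(const std::vector<float>& x, float w) {
    std::vector<float> y(x.size());
    for (size_t i = 0; i < x.size(); ++i) y[i] = w * x[i];
    return y;
}

// Hypothetical MoE layer: the shared expert ("GPU" branch) is launched
// asynchronously and runs concurrently with the routed experts
// ("CPU" branch); the outputs are then summed.
std::vector<float> moe_layer(const std::vector<float>& x) {
    auto shared = std::async(std::launch::async, expert_forward, x, 0.5f);
    auto routed = expert_forward(x, 2.0f);  // overlaps with the shared expert
    auto s = shared.get();                  // wait for the "GPU" result
    for (size_t i = 0; i < x.size(); ++i) routed[i] += s[i];
    return routed;
}
```

The serial version would pay for both matmuls back to back; overlapping them hides the smaller (shared-expert) cost behind the larger routed-expert cost, which is consistent with the modest few-percent TG gain reported above.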
Update
It looks like the positive effect is greater with more GPUs, at least for PP. As an example, below are results for GLM-4.7-3.35bpw running on 4x3090 GPUs and a Ryzen 3995WX CPU, with all routed experts left in RAM. Here I observe ~8% better PP performance.
Main branch
This PR