POC: CUDA tensor parallel (MoE models) #1022
Conversation
|
I only have big GLM. Will it work with hybrid inference? |
|
Not yet. I'm working on it. |
|
Having a garbage output when setting only active experts offload.

Details (garbage sample):
```
XXXXXXXXXXXXXXXXXXXXX
```
compute-sanitizer memcheck output; the same pair of reports repeats many times, only the first occurrences are kept below:
```
========= Program hit cudaErrorGraphExecUpdateFailure (error 910) due to "the graph update was not performed because it included changes which violated constraints specific to instantiated graph update" on CUDA API call to cudaGraphExecUpdate.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x3090ae] in libggml.so
=========     Host Frame: ggml_backend_sched_graph_compute_async [0x1943a1] in libggml.so
=========     Host Frame: llama_decode [0x7da6c] in libllama.so
=========     Host Frame: llama_init_from_gpt_params(gpt_params&) [0x20c4aa] in llama-server
=========     Host Frame: server_context::load_model(gpt_params const&) [0xf3bb4] in llama-server
=========     Host Frame: main [0x5cb8b] in llama-server
=========
========= Program hit cudaErrorGraphExecUpdateFailure (error 910) due to "the graph update was not performed because it included changes which violated constraints specific to instantiated graph update" on CUDA API call to cudaGetLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame: ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x309437] in libggml.so
=========     Host Frame: ggml_backend_sched_graph_compute_async [0x1943a1] in libggml.so
=========     Host Frame: llama_decode [0x7da6c] in libllama.so
=========     Host Frame: llama_init_from_gpt_params(gpt_params&) [0x20c4aa] in llama-server
=========     Host Frame: server_context::load_model(gpt_params const&) [0xf3bb4] in llama-server
=========     Host Frame: main [0x5cb8b] in llama-server
=========
[... the same cudaGraphExecUpdate / cudaGetLastError report pair repeats several more times with identical backtraces ...]
```

[EDIT]: with initcheck:

Details; the same "Uninitialized __global__ memory read" report repeats for threads (8,1,0) through (31,1,0) of block (45,0,0) and threads (0,3,0) through (3,3,0) of block (9,0,0), at consecutive 16-byte addresses; only the first full report is kept below:
```
=========     Host Frame: cuGraphLaunch [0x3a1fef] in libcuda.so.1
=========     Host Frame: [0x2062a] in libcudart.so.13
=========     Host Frame: cudaGraphLaunch [0x76211] in libcudart.so.13
=========     Host Frame: ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x307a69] in libggml.so
=========     Host Frame: ggml_backend_sched_graph_compute_async [0x1943a1] in libggml.so
=========     Host Frame: llama_decode [0x7da6c] in libllama.so
=========     Host Frame: llama_init_from_gpt_params(gpt_params&) [0x20c4aa] in llama-server
=========     Host Frame: server_context::load_model(gpt_params const&) [0xf3bb4] in llama-server
=========     Host Frame: main [0x5cb8b] in llama-server
=========
========= Uninitialized __global__ memory read of size 16 bytes
=========     at void flash_attn_mma_ext_f16<(int)128, (int)2, (int)4, (int)4, (int)64, (int)1, (bool)0>(const char *, const char *, const char *, const char *, const char *, const int2 *, float *, float2 *, float, float, float, float, float, unsigned int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int, int)+0x4c10
=========     by thread (8,1,0) in block (45,0,0)
=========     Address 0x7ff3a8000780
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: cuGraphLaunch [0x3a1fef] in libcuda.so.1
=========     Host Frame: [0x2062a] in libcudart.so.13
=========     Host Frame: cudaGraphLaunch [0x76211] in libcudart.so.13
=========     Host Frame: ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) [0x307a69] in libggml.so
=========     Host Frame: ggml_backend_sched_graph_compute_async [0x1943a1] in libggml.so
=========     Host Frame: llama_decode [0x7da6c] in libllama.so
=========     Host Frame: llama_init_from_gpt_params(gpt_params&) [0x20c4aa] in llama-server
=========     Host Frame: server_context::load_model(gpt_params const&) [0xf3bb4] in llama-server
=========     Host Frame: main [0x5cb8b] in llama-server
=========
[... repeated for the remaining threads ...]
```
|
[EDIT]: forgot to remove the cpu-moe flag. The details can be safely ignored.

Details

I do not see any garbage output with ```--split-mode graph``` enabled anymore. I will drop the sweep-bench tests below in a few hours or so.

[EDIT]: As of now it's the following: |
|
You didn't say what kind of a model you are using. There was still a bug with interleaved quants (*_R4, *_R8); this is fixed now. Also, tensor overrides appear to be working as of the last commit. Here is what I have right now on the 2x3090 box. GLM-4.6, ubergarm's quant
Split mode "graph"
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 64 | 0 | 10.280 | 398.43 | 6.474 | 9.89 |
| 4096 | 64 | 4096 | 10.776 | 380.12 | 7.101 | 9.01 |
| 4096 | 64 | 8192 | 11.205 | 365.54 | 6.962 | 9.19 |
| 4096 | 64 | 12288 | 11.719 | 349.51 | 7.257 | 8.82 |
| 4096 | 64 | 16384 | 12.258 | 334.16 | 7.487 | 8.55 |
| 4096 | 64 | 20480 | 12.869 | 318.28 | 7.548 | 8.48 |
| 4096 | 64 | 24576 | 13.346 | 306.90 | 7.691 | 8.32 |
| 4096 | 64 | 28672 | 13.986 | 292.87 | 8.178 | 7.83 |
Split mode "layer"
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 64 | 0 | 10.140 | 403.93 | 5.690 | 11.25 |
| 4096 | 64 | 4096 | 11.173 | 366.60 | 6.092 | 10.51 |
| 4096 | 64 | 8192 | 12.353 | 331.58 | 6.401 | 10.00 |
| 4096 | 64 | 12288 | 13.506 | 303.26 | 6.773 | 9.45 |
| 4096 | 64 | 16384 | 14.655 | 279.50 | 7.082 | 9.04 |
| 4096 | 64 | 20480 | 15.862 | 258.23 | 7.396 | 8.65 |
| 4096 | 64 | 24576 | 16.983 | 241.18 | 7.627 | 8.39 |
| 4096 | 64 | 28672 | 18.418 | 222.39 | 7.932 | 8.07 |
GLM-4.6, Thireus 5.5 bpw mix, all MoE tensors left in RAM
Split mode "graph"
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 64 | 0 | 18.361 | 223.08 | 8.018 | 7.98 |
| 4096 | 64 | 4096 | 18.736 | 218.61 | 8.004 | 8.00 |
| 4096 | 64 | 8192 | 19.252 | 212.76 | 8.175 | 7.83 |
| 4096 | 64 | 12288 | 19.740 | 207.49 | 8.381 | 7.64 |
| 4096 | 64 | 16384 | 20.270 | 202.07 | 8.472 | 7.55 |
| 4096 | 64 | 20480 | 20.771 | 197.20 | 8.652 | 7.40 |
| 4096 | 64 | 24576 | 21.349 | 191.86 | 8.938 | 7.16 |
| 4096 | 64 | 28672 | 22.012 | 186.08 | 8.946 | 7.15 |
Split mode "layer"
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 64 | 0 | 17.839 | 229.60 | 7.337 | 8.72 |
| 4096 | 64 | 4096 | 18.814 | 217.71 | 7.540 | 8.49 |
| 4096 | 64 | 8192 | 19.966 | 205.15 | 7.861 | 8.14 |
| 4096 | 64 | 12288 | 21.140 | 193.76 | 8.215 | 7.79 |
| 4096 | 64 | 16384 | 22.291 | 183.75 | 8.503 | 7.53 |
| 4096 | 64 | 20480 | 23.535 | 174.04 | 8.804 | 7.27 |
| 4096 | 64 | 24576 | 24.691 | 165.89 | 9.068 | 7.06 |
| 4096 | 64 | 28672 | 26.167 | 156.53 | 9.384 | 6.82 |
So, on my box PP is (almost) on par with split mode "layer" at zero context, and beats it by a non-negligible margin with increasing context length.
TG is not as good with hybrid inference. One needs to get to a sufficiently long context for split mode "graph" to become better than split mode "layer".
But it also looks like the backend scheduler is not going to help:
* It copies the mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed with the attention evaluation, GPU 1 must wait for GPU 0 to finish its entire attention calculation
* Same with the FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must wait for GPU 0 to finish its entire FFN calculation before it can start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing bespoke TP scheduling
the graph is still not being computed in parallel. Why? Because the scheduler creates graph splits where the result of the computation on one GPU becomes an input for the other split. Hence, to trigger the computation on the second GPU, one needs to wait for the computation on the first GPU to finish, even though the two can be done in parallel up to the synchronization point. So, all that is left to do is to trick the scheduler into creating two splits that can be done in parallel, and then have a graph split where the results get combined.
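To illustrate the idea, here is a conceptual sketch in plain C++ (not the ggml scheduler code; `std::async` stands in for launching work on a separate GPU/stream and `sleep_for` for kernel time): when the second split consumes the first split's output the two GPUs serialize, whereas two splits that only depend on the shared layer input can overlap and be joined by a final "combine" split.

```cpp
// Conceptual sketch (not ggml code): why two independent graph splits can
// overlap across GPUs while chained splits cannot.
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

using namespace std::chrono;

static int run_split(const char * name, int input, int work_ms) {
    std::this_thread::sleep_for(milliseconds(work_ms)); // pretend GPU work
    std::printf("%s finished\n", name);
    return input + 1;
}

int main() {
    // Chained layout (default scheduler behavior): the GPU1 split consumes the
    // GPU0 split's output, so GPU1 cannot start before GPU0 is done.
    auto t0 = steady_clock::now();
    int a = run_split("chained split on GPU0", 0, 100);
    int b = run_split("chained split on GPU1", a, 100);
    auto serial_ms = duration_cast<milliseconds>(steady_clock::now() - t0).count();

    // Parallel layout (what the trick arranges): both splits depend only on the
    // shared layer input; a third "combine" split sums the partial results.
    t0 = steady_clock::now();
    auto f0 = std::async(std::launch::async, run_split, "parallel split on GPU0", 0, 100);
    auto f1 = std::async(std::launch::async, run_split, "parallel split on GPU1", 0, 100);
    int combined = f0.get() + f1.get(); // the combine split
    auto parallel_ms = duration_cast<milliseconds>(steady_clock::now() - t0).count();

    std::printf("chained: %lld ms (result %d), parallel: %lld ms (result %d)\n",
                (long long) serial_ms, b, (long long) parallel_ms, combined);
    return 0;
}
```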
This change tricks it into doing the right thing™. Still quite a bit slower than split mode layer for the 8B LLaMA model. But for the 70B LLaMA it now beats split mode layer for TG: 28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s. In comparison, split mode "row" in mainline gets 484 t/s PP and 19.3 t/s TG.
Granularity for Wq, Wo is not just the head size, but head size * gqa_ratio. Otherwise the Wk, Wv tensors end up with a split that is not a multiple of the head size when we divide the split determined by Wo by the gqa_ratio.
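To make the constraint concrete, here is a small standalone sketch with made-up example shapes (head size 128, 64 query heads, 8 KV heads; these numbers are not taken from this PR): rounding the Wq/Wo row split to a multiple of head_size * gqa_ratio guarantees that dividing that split by gqa_ratio still yields a whole number of KV heads for Wk/Wv.

```cpp
// Illustration with made-up shapes: splitting Wq/Wo rows on a
// head_size * gqa_ratio boundary keeps the derived Wk/Wv split head-aligned.
#include <cstdio>

int main() {
    const int head_size   = 128;
    const int n_head      = 64;                     // query heads
    const int n_head_kv   = 8;                      // KV heads
    const int gqa_ratio   = n_head / n_head_kv;     // 8
    const int granularity = head_size * gqa_ratio;  // 1024 rows
    const int n_rows_qo   = head_size * n_head;     // 8192 rows in Wq / Wo

    const int n_gpu = 2;
    int assigned = 0;
    for (int gpu = 0; gpu < n_gpu; ++gpu) {
        // round each GPU's share of Wq/Wo rows down to a multiple of the
        // granularity; the last GPU takes whatever is left
        int qo_rows = (gpu == n_gpu - 1) ? n_rows_qo - assigned
                                         : n_rows_qo / n_gpu / granularity * granularity;
        assigned += qo_rows;
        const int kv_rows = qo_rows / gqa_ratio;    // corresponding Wk/Wv rows
        std::printf("GPU %d: Wq/Wo rows = %d (%d heads), Wk/Wv rows = %d (%d KV heads)\n",
                    gpu, qo_rows, qo_rows / head_size, kv_rows, kv_rows / head_size);
    }
    return 0;
}
```

With a granularity of only head_size, an uneven assignment such as 31/33 query heads would give 31 * 128 / 8 = 496 Wk/Wv rows, which is no longer a multiple of the head size.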
but no tensor overrides yet, just ngl < num_layers.
Now PP is faster than split mode layer for L3-70B.
PP is already better than split mode layer, but TG for zero context is kind of low: 60 vs 92 t/s. TG becomes better than split mode layer at around 20k tokens. PP at 26k tokens is 1.55X of split mode layer.
It issues a warning that there is an extra semicolon outside of a function, but there isn't. If I remove the anonymous namespace and make the functions inside static, the warning disappears, so clearly a compiler bug.
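For context, the pattern being described is roughly the following (a hypothetical reduction, not the actual source file). Both variants have internal linkage and are equivalent, yet only the anonymous-namespace form reportedly triggers the spurious warning:

```cpp
// Hypothetical reduction of the pattern: the anonymous-namespace form is the
// one that reportedly produces the bogus "extra ';' outside of a function"
// warning; rewriting the helpers as static free functions silences it.
namespace {
int helper_in_anon_ns(int x) { return 2 * x; }
} // namespace

static int helper_static(int x) { return 2 * x; }

int use_helpers(int x) { return helper_in_anon_ns(x) + helper_static(x); }
```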
Runs, but with wrong results; I don't see where the issue could be.
Still does not work for row-interleaved quants
|
Thanks for these results. It looks like for a rig with 2 GPUs that does not have an ancient CPU it works quite well. Your results are even better than mine. At 64k context PP is 1.9X, so almost as good as it gets. |
|
Given that I used the Q4_0, I ran the same test against mainline. The PP delta is pretty wild; not sure if there is something I could pass there to improve it.
👈 Details

model=/mnt/raid/hf/GLM-4.5-Air-GGUF/Q4_0/GLM-4.5-Air-Q4_0-00001-of-00002.gguf
$ ./build/bin/llama-sweep-bench \
--model "$model"\
-c 69632 \
-ngl 99 \
-ub 4096 -b 4096 \
--threads 1
load_tensors: offloaded 48/48 layers to GPU
load_tensors: CPU_Mapped model buffer size = 333.00 MiB
load_tensors: CUDA0 model buffer size = 32937.69 MiB
load_tensors: CUDA1 model buffer size = 28166.12 MiB
....................................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 69632
llama_context: n_ctx_seq = 69632
llama_context: n_batch = 4096
llama_context: n_ubatch = 4096
llama_context: causal_attn = 1
llama_context: flash_attn = auto
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_seq (69632) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache: CUDA0 KV buffer size = 6800.00 MiB
llama_kv_cache: CUDA1 KV buffer size = 5712.00 MiB
llama_kv_cache: size = 12512.00 MiB ( 69632 cells, 46 layers, 1/1 seqs), K (f16): 6256.00 MiB, V (f16): 6256.00 MiB
llama_context: pipeline parallelism enabled (n_copies=1)
llama_context: Flash Attention was auto, set to enabled
llama_context: CUDA0 compute buffer size = 2144.08 MiB
llama_context: CUDA1 compute buffer size = 2560.00 MiB
llama_context: CUDA_Host compute buffer size = 1152.11 MiB
llama_context: graph nodes = 3146
llama_context: graph splits = 3
main: n_kv_max = 69632, n_batch = 4096, n_ubatch = 4096, flash_attn_type = -1, n_gpu_layers = 99, n_threads = 1, n_threads_batch = 1
|
|
I could try to run the test on my system with 8 GPUs, however I am unable to locate the
Edit: I was looking at the wrong user, I guess it's the unsloth one? |
To clarify my point I have to add the following... I was pointing out that in order to access a machine that does not have a dedicated IP, one has to employ a NAT traversal technique. The easiest example of that would be to use Tailscale (which is open source etc.). If one doesn't trust the code for one reason or another, one could always use something like a VM ( Many alternative options do exist. For example, TOR: it is written in C and the code is really easy to follow etc. The service provider would need to create an onion service, and the client (you, potentially) could just install tor and use the utility
If so, I suggest the following.
flowchart TD
A[Client: ikawrakow] -->|SSH with public key| B(VPS with Dedicated IP)
subgraph B [VPS Portal]
C[tmux Session 1]
D[tmux Session 2]
E[tmux Session ...]
end
C -->|autossh + eternal terminal<br>via TOR or similar NAT traversal| F[Target Machine 1]
D -->|similar setup| G[Target Machine 2]
E --> H[Other Machines]
Someone would need to order any kind of VPS with a dedicated IP address (these, with IPv6, are pretty cheap nowadays). So then, we could add your public ssh key to the
The next step would be to use a terminal multiplexer like |
I didn't upload the one I am using, which I quanted myself; you could probably use this one, which is close enough: https://huggingface.co/bartowski/zai-org_GLM-4.5-Air-GGUF/tree/main/zai-org_GLM-4.5-Air-Q4_0 UPDATE |
|
It's fine, I'm already using that one; I just need to see the relative comparison, I think. |
|
So at least in its current configuration it does not seem to work correctly on my system, unless I did it wrong (I just copied the commands you have there).
System info:
The GPUs only had like 15% usage during the test. |
|
What does the [EDIT]:
Waaah! That's pretty neat. Zen 2 single-CPU boards only support up to 4 PCIe Gen4 x16. |
Your test looks correct to me, and I see you correctly duplicated my example commands, including the first working condition with the default split mode and
Not sure if using any kind of explicit
I had one other tester on Beaver AI discord tell me:
So yours seems to run but is not performing correctly... hrm.. This feature is still pretty new so thanks for testing with your cool rig! Not sure at the moment what you could try next. Maybe try it with just 2x GPUs and see if that works? |
Has he tried to run with things like |
It's basically flawless: |
Whoa, this is nuts!! You basically achieved the NVLink speed of the RTX 3090 with that driver of yours. GJ, LGTM |
|
Current gen NVLink is (much) faster than this, but for our single-user inference tasks it does not matter. The actual data exchanged is tiny; we just want the low latency. I have the GPUs mounted in a custom open frame I built out of aluminium extrusions, so they all get good air and it's actually nice and quiet. Using MCIO cables from 10Gtek and device adapters from C-Payne in "Bottom" style for short traces. It's kinda off topic for this thread though. I've shared ~all about this setup (and the previous one with PCIe Gen4, problems I hit and how I fixed them, etc.) in the old TheBloke and ExLlama discord servers if you want to build something like it |
2 GPUs won't have enough memory to load that model, but let's try 4 from the same NUMA node later
|
Now that you have posted comparisons with mainline, it looks like something goes wrong with FA for TG. In all of my testing
The PP results are as expected. |
|
Thanks for testing on your amazing system! I think we have established that the current TP implementation is only useful on 2 GPUs, and becomes completely inadequate on 4 or more GPUs. If you want to test again, I think you will get the best results with 4 GPUs, but using only 2 of them for TP and the other 2 for computing the routed experts in part of the layers. Something like this:
You may need to adjust the layers on |
|
In testing it seems like the default
I see a few PRs came through in the past couple of hours and I'll go read them next to catch up. So either something was updated, or perhaps I made a mistake, or something on the rig wasn't quite right after first updating all the drivers yesterday.. huh..
EDIT: it is also faster PP on mainline today than yesterday with the exact same commit (I checked my logs and the git sha looks right, so maybe the GPUs had something on them I didn't notice or something was wonky)...
I added a small patch built on top of the latest main@f4def9b3, and running it prints out:
DEBUG: /home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda/fattn.cu:140: ggml_cuda_flash_attn_ext_mma_f16
This suggests the only print statement being hit is the one coming from here: https://github.com/ikawrakow/ik_llama.cpp/blob/main/ggml/src/ggml-cuda/fattn.cu#L125
👈 Patch
diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu
index 83c7cf40..4d522c40 100644
--- a/ggml/src/ggml-cuda/fattn.cu
+++ b/ggml/src/ggml-cuda/fattn.cu
@@ -16,6 +16,10 @@
#include <cstdint>
+#define DEBUG(msg) \
+ fprintf(stderr, "DEBUG: %s:%d: ", __FILE__, __LINE__); \
+ fprintf(stderr, "%s\n", msg);
+
#define FATTN_KQ_STRIDE 256
static inline bool mma_better_than_turing(const int cc) {
@@ -55,8 +59,10 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
// On AMD the tile kernels perform poorly, use the vec kernel instead:
if (cc >= CC_OFFSET_AMD) {
if (precision == GGML_PREC_DEFAULT && fast_fp16_available(cc)) {
+ DEBUG("ggml_cuda_flash_attn_ext_vec_f16");
ggml_cuda_flash_attn_ext_vec_f16(ctx, dst);
} else {
+ DEBUG("ggml_cuda_flash_attn_ext_vec_f32");
ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);
}
return;
@@ -64,8 +70,10 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
if (!fast_fp16_available(cc)) {
if (Q->ne[1] <= 8 || Q->ne[0] == 256) {
+ DEBUG("ggml_cuda_flash_attn_ext_vec_f32");
ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);
} else {
+ DEBUG("ggml_cuda_flash_attn_ext_tile_f32");
ggml_cuda_flash_attn_ext_tile_f32(ctx, dst);
}
return;
@@ -74,14 +82,18 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
if (!fp16_mma_available(cc)) {
if (precision == GGML_PREC_DEFAULT) {
if (Q->ne[1] <= 8 || Q->ne[0] == 256) {
+ DEBUG("ggml_cuda_flash_attn_ext_vec_f16");
ggml_cuda_flash_attn_ext_vec_f16(ctx, dst);
} else {
+ DEBUG("ggml_cuda_flash_attn_ext_tile_f16");
ggml_cuda_flash_attn_ext_tile_f16(ctx, dst);
}
} else {
if (Q->ne[1] <= 8 || Q->ne[0] == 256) {
+ DEBUG("ggml_cuda_flash_attn_ext_vec_f32");
ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);
} else {
+ DEBUG("ggml_cuda_flash_attn_ext_tile_f32");
ggml_cuda_flash_attn_ext_tile_f32(ctx, dst);
}
}
@@ -96,6 +108,7 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
const bool mma_faster_for_bs1 = new_mma_available(cc) && gqa_opt_applies && !(Q->ne[1] == 1 && n_swa > 0);
const bool can_use_vector_kernel = Q->ne[0] <= 256 && Q->ne[0] % (2*WARP_SIZE) == 0;
if (Q->ne[1] == 1 && can_use_vector_kernel && !mma_faster_for_bs1 && !ggml_is_quantized(K->type) && !ggml_is_quantized(V->type)) {
+ DEBUG("ggml_cuda_flash_attn_ext_vec_f32");
ggml_cuda_flash_attn_ext_vec_f32(ctx, dst);
return;
}
@@ -107,6 +120,7 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
// so no other implementation works.
//
if (new_mma_available(cc) && ((K->ne[0] == 576 && V->ne[0] == 512) || (K->ne[0] == 192 && V->ne[0] == 128 && mma_better_than_turing(cc)))) {
+ DEBUG("ggml_cuda_flash_attn_ext_mma_new");
ggml_cuda_flash_attn_ext_mma_new(ctx, dst);
return;
}
@@ -117,11 +131,13 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst
// We also need it if the new MMA is not available
//
if (!new_mma_available(cc) || K->ne[0] != V->ne[0]) {
+ DEBUG("ggml_cuda_flash_attn_ext_wmma_f16");
ggml_cuda_flash_attn_ext_wmma_f16(ctx, dst);
return;
}
// As mentioned above, the new-new MMA is slower then the new MMA.
+ DEBUG("ggml_cuda_flash_attn_ext_mma_f16");
ggml_cuda_flash_attn_ext_mma_f16(ctx, dst);
//ggml_cuda_flash_attn_ext_mma_new(ctx, dst);
}
I didn't go back and re-test on the previous commit, but will leave this here for now and go read up on the new stuff you merged. hah.. Thanks!
EDIT: here is the same graph as above, but just with today's measurements, which look better than yesterday for some reason...
|
|
Thanks! Yes, this is the kernel that is supposed to get used. I was concerned that somehow the vector kernel was getting invoked for TG, and that's why we were seeing such a performance decline with context (more than expected). So, in your case, split mode "graph" looks like a real winner. |
I guess, in that case, I would try to learn how to use Tailscale. When I wrote that I would never pipe a script that I downloaded from the internet into
Shalom! I am having the exact same thoughts about the RTX 3090 now )) |
Ahh!! Got it now. Fully agree! The explanation is highly appreciated. Very cool point! Thanks!! |
It turned out that Tailscale does not work for us, so we dropped the attempts to make it work and instead just used ngrok.
ngrok setup:
curl -s https://ngrok-agent.s3.amazonaws.com/ngrok.asc | sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null
echo "deb https://ngrok-agent.s3.amazonaws.com buster main" | sudo tee /etc/apt/sources.list.d/ngrok.list
sudo apt update
sudo apt install ngrok
Sign up & get a token:
ngrok tcp 22
It gives you something like tcp://0.tcp.ngrok.io:12345
ssh [email protected] -p 12345
No special client software needed; regular SSH works. Signing up would require providing debit card details without the CVV (they do not charge any money; as stated, they use this to combat fraud etc.). So the perfect solution is still to buy a VPS with crypto, but ngrok works for free. So it's the privacy vs. time-invested dilemma again. |
|
Oh wow, that's interesting.. I wonder if they take debit gift cards. I guess I will have to see. |
|
How complex would it be to add support for a new model to tensor parallel based on this work? Is it worth trying to add the MiniMax-M2 series of models using vibe coding? |
Hahaha!!! You're funny lad! :)) |
|
I'm not familiar with the logic of LLM inference. The only way I can help with this part is vibe coding and validating the result. : ) |
|
@hksdpc255 I appreciate that you want to help, but I cannot say that I like the current trend of vibe-coded contributions. In general, I think if someone wants to seriously contribute to a project, they need to become familiar with at least the parts of the project where they want to contribute. Even more so when it comes to building the compute graph, as this is a core part of the inference engine. |
|
None of my contributions are vibe-coded unless explicitly stated. Since you mentioned that you prefer not to rely on vibe-coding for unfamiliar parts of the codebase, I will avoid doing so accordingly. |
|
A big issue with vibe coding is that people don't read what the LLM spits out and barely test whether it works.






This is a very rough-around-the-edges POC for tensor parallelism (TP) for MoE models. It is a follow-up to PR #1018.
I have the necessary graph building changes only for GLM-4.5/4.6, just to see how it performs. On a 2x3090 system I can only fully offload a low-bpw quantized GLM-4.5-Air (full offload is needed as tensor overrides are not yet implemented in this new scheme). I'm using @ubergarm's IQ1_KT model, which is 38.7 GB and allows me to go to about 32k tokens of context (with f16 KV cache).
Here the performance results are mixed. For PP-2048 the new split mode "graph" implementation beats split mode "layer" for all context lengths, being as much as 60% faster at a context of 30k tokens. TG, on the other hand, is significantly slower at zero context, and only becomes faster than "layer" around a context of 20k tokens. Here are the sweep-bench results:
Split mode "graph"
Split mode "layer"