Description
Name and Version
latest master 418dea3
Operating systems
Linux
GGML backends
CUDA
Hardware
4 CUDA devices connected via PCIe 3.0 x1
Models
Any model of qwen35 or qwen35moe
Problem description & steps to reproduce
Graph splitting is very unstable because different cache types are assigned to different layers.
I'll give some examples using the freshly released Qwen3.5 27B.
- The split happens between layers that use the RS cache.
| model | size | params | backend | threads | n_ubatch | fa | ts | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen35 ?B Q8_0 | 26.62 GiB | 26.90 B | CUDA,BLAS | 7 | 2048 | 1 | 16.00/24.00/16.00/9.00 | pp10000 | 369.05 ± 0.08 |
| qwen35 ?B Q8_0 | 26.62 GiB | 26.90 B | CUDA,BLAS | 7 | 2048 | 1 | 16.00/24.00/16.00/9.00 | tg512 | 17.64 ± 0.03 |
- Same as 1, but with my RS cache fix, which assigns the R cache of layer i to the backend of layer i-1. I investigated the graph thoroughly and made this fix specifically for Qwen3.5-397B-A17B, because PP was very slow otherwise: roughly 10-20 t/s before the fix and 170 t/s after.
| model | size | params | backend | threads | n_ubatch | fa | ts | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen35 ?B Q8_0 | 26.62 GiB | 26.90 B | CUDA,BLAS | 7 | 2048 | 1 | 16.00/24.00/16.00/9.00 | pp10000 | 489.34 ± 0.14 |
| qwen35 ?B Q8_0 | 26.62 GiB | 26.90 B | CUDA,BLAS | 7 | 2048 | 1 | 16.00/24.00/16.00/9.00 | tg512 | 17.04 ± 0.09 |
Here it gives about +120 t/s on pp10000 (369.05 to 489.34 t/s).
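The placement rule from case 2 can be illustrated with a toy example. This is a minimal sketch, not the actual patch: `layer_dev`, `r_cache_device`, and the 8-layer, 2-device layout are hypothetical names and values chosen for illustration, and the sketch assumes every layer in the toy layout has an R cache.

```python
# Toy illustration of the placement rule from the report:
# "R cache of layer i goes to the backend of layer i-1".
# All names and the 8-layer layout are hypothetical, not llama.cpp code.

def r_cache_device(layer_dev, il, shifted):
    """Return the device index that holds the R cache of layer `il`."""
    if shifted and il > 0:
        return layer_dev[il - 1]  # follow the previous layer's backend
    return layer_dev[il]          # default: same backend as the layer itself


# Hypothetical layout: layers 0-3 on device 0, layers 4-7 on device 1.
layer_dev = [0, 0, 0, 0, 1, 1, 1, 1]
n = len(layer_dev)

r_default = [r_cache_device(layer_dev, il, shifted=False) for il in range(n)]
r_shifted = [r_cache_device(layer_dev, il, shifted=True) for il in range(n)]

# Layers whose R cache lands on a different device under the shifted rule.
moved = [il for il in range(n) if r_default[il] != r_shifted[il]]

print("default:", r_default)
print("shifted:", r_shifted)
print("moved layers:", moved)
```

In this toy layout only the first layer of the second device changes placement, so the rule only affects layers at a split boundary.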
- The split happens around ordinary KV-cache layers. No extra transfers happen.
| model | size | params | backend | threads | n_ubatch | fa | ts | test | t/s |
|---|---|---|---|---|---|---|---|---|---|
| qwen35 ?B Q8_0 | 26.62 GiB | 26.90 B | CUDA,BLAS | 7 | 2048 | 1 | 3.00/4.00/2.70/1.00 | pp10000 | 635.23 ± 0.38 |
| qwen35 ?B Q8_0 | 26.62 GiB | 26.90 B | CUDA,BLAS | 7 | 2048 | 1 | 3.00/4.00/2.70/1.00 | tg512 | 18.59 ± 0.00 |
About 1.7x as fast as case 1 on pp10000 (635.23 vs 369.05 t/s).
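To see how a small change in `ts` can move the device boundaries onto a different kind of layer, here is a rough sketch. It rests on assumptions that are not stated in the report: 64 repeating layers plus one output layer, every 4th layer (3, 7, 11, ...) using an ordinary KV cache, and layers assigned in proportion to the cumulative tensor split. The function names are hypothetical and this is not claimed to be the exact llama.cpp logic.

```python
# Sketch: which layer starts each device for a given tensor split (-ts).
# Assumptions (NOT taken from the report): 64 repeating layers plus one
# output layer, every 4th layer uses an ordinary KV cache, and layers are
# assigned in proportion to the cumulative tensor split.
from bisect import bisect_right

N_LAYER = 64          # assumed number of repeating layers
FULL_ATTN_EVERY = 4   # assumed: layers 3, 7, 11, ... use a KV cache


def is_recurrent(il):
    """True if layer `il` is assumed to use the recurrent-state (RS) cache."""
    return (il + 1) % FULL_ATTN_EVERY != 0


def layer_devices(ts, n_layer=N_LAYER):
    """Map each layer index to a device index, proportionally to `ts`."""
    total = sum(ts)
    cum, running = [], 0.0
    for share in ts:
        running += share
        cum.append(running / total)
    n_act = n_layer + 1  # assumed: the output layer also takes one slot
    last = len(ts) - 1
    return [min(bisect_right(cum, il / n_act), last) for il in range(n_layer)]


def first_layers(ts):
    """Return the index of the first layer placed on each used device."""
    dev = layer_devices(ts)
    return [il for il in range(len(dev)) if il == 0 or dev[il] != dev[il - 1]]


for ts in ([16, 24, 16, 9], [3, 4, 2.7, 1], [3, 4, 1.5, 2.7, 1]):
    starts = first_layers(ts)
    kinds = ["RS" if is_recurrent(il) else "KV" for il in starts]
    print(ts, list(zip(starts, kinds)))
```

Under these assumptions, the 16/24/16/9 split starts devices 1 to 3 on RS layers, while 3/4/2.7/1 starts them on KV layers, which is consistent with cases 1 and 3. The 5-device split starts its third device at layer 38, which is consistent with the graph excerpt in this report, where layer 37 runs on CUDA1 and layer 38 on CUDA2.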
- CUDA error case (like Eval bug: CUDA error on Qwen3.5-27B #19860).
For me it happens with -ts "3;4;1.5;2.7;1" (5 devices).
The graph looks very weird. I copy-pasted the part of it between the CUDA1 and CUDA2 layers, where all the extra transfers happen.
Graph part
node #17871 ( MUL_MAT): v_prime-37 ( 1M) [CUDA1 ] use=1,c=1: k_cumdecay-37 (view) ( 47M) [CUDA1 ] dnet_add_ch_state-37 ( 3M) [CUDA1 ]
node #17872 ( SUB): v_t_new-37 ( 1M) [CUDA1 ] use=2,c=1: (transposed) (cont) (v ( 47M) [CUDA1 ] v_prime-37 ( 1M) [CUDA1 ]
node #17873 ( MUL_MAT): node_17873 ( 3M) [CUDA1 ] use=1,c=1: key_gdiff_t-37 (view) ( 47M) [CUDA1 ] v_t_new-37 ( 1M) [CUDA1 ]
node #17874 ( ADD): dnet_add_ch_state-37 ( 3M) [CUDA1 ] use=1,c=1: node_17867 ( 3M) [CUDA1 ] node_17873 ( 3M) [CUDA1 ]
node #17878 ( CPY): cache_s_l37 (view) (cop ( 3M) [CUDA1 ] use=0,c=1: new_state-37 ( 3M) [CUDA1 ] cache_s_l37 (view) ( 3M) [CUDA1 ]
SPLIT #5: CUDA2 # 30 inputs: [ (view) ( 0K)] [ (view) ( 0K)] [q_conv-37 ( 79M)] [ (transposed) ( 384K)] [dnet_add_ch_state-37 ( 3M)] [ (reshaped) ( 48M)] [decay_mask-37 ( 24M)] [v_t_new-37 ( 1M)] [node_17582 ( 48M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)]
node #17881 ( SCALE): conv_states_updated-38 ( 120K) [CUDA2 ] use=0,c=1: conv_states_updated-38 ( 120K) [CUDA2 ]
node #17882 ( GET_ROWS): conv_states-38 ( 120K) [CUDA2 ] use=1,c=1: conv_states_updated-38 ( 120K) [CUDA2 ] CUDA2# (view)#0 ( 0K) [ NULL ]
node #17883 ( GET_ROWS): node_17883 ( 0K) [CUDA2 ] use=1,c=1: conv_states_updated-38 ( 120K) [CUDA2 ] CUDA2# (view)#0 ( 0K) [ NULL ]
node #17885 ( CPY): conv_states_updated-38 ( 0K) [CUDA2 ] use=0,c=1: node_17883 ( 0K) [CUDA2 ] conv_states_updated-38 ( 0K) [CUDA2 ]
node #17888 ( L2_NORM): node_17888 ( 16M) [CUDA2 ] use=1,c=1: CUDA2#q_conv-37#0 ( 79M) [ NULL ]
node #17889 ( REPEAT): q_conv_predelta-37 ( 48M) [CUDA2 ] use=1,c=1: node_17888 ( 16M) [CUDA2 ]
node #17890 ( SCALE): q_in-37 ( 48M) [CUDA2 ] use=1,c=1: q_conv_predelta-37 ( 48M) [CUDA2 ]
node #17892 ( PAD): node_17892 ( 48M) [CUDA2 ] use=1,c=1: q_in-37 (permuted) ( 48M) [CUDA2 ]
node #17895 ( CONT): (transposed) (cont) ( 384K) [CUDA2 ] use=1,c=1: CUDA2# (transposed)#0 ( 384K) [ NULL ]
node #17896 ( MUL): node_17896 ( 48M) [CUDA2 ] use=32,c=1: (reshaped) ( 48M) [CUDA2 ] (transposed) (cont) ( 384K) [CUDA2 ]
node #17898 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17899 ( MUL_MAT): node_17899 ( 24M) [CUDA2 ] use=1,c=1: CUDA2# (reshaped)#0 ( 48M) [ NULL ] (reshaped) ( 48M) [CUDA2 ]
node #17900 ( MUL): node_17900 ( 24M) [CUDA2 ] use=1,c=1: node_17899 ( 24M) [CUDA2 ] CUDA2#decay_mask-37#0 ( 24M) [ NULL ]
node #17901 ( TRI): kq-37 ( 24M) [CUDA2 ] use=32,c=1: node_17900 ( 24M) [CUDA2 ]
node #17903 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17904 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17905 ( SET): (view) ( 48M) [CUDA2 ] use=1,c=1: CUDA2#node_17582#0 ( 48M) [ NULL ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17907 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17909 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17910 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17911 ( SET): (view) (view) ( 48M) [CUDA2 ] use=1,c=1: (view) ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17913 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17915 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17916 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17917 ( SET): (view) (view) (view) ( 48M) [CUDA2 ] use=1,c=1: (view) (view) ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17919 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17921 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17922 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17923 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17925 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17927 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17928 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17929 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17931 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17933 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17934 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17935 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17937 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17939 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17940 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17941 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17943 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17945 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17946 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17947 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17949 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17951 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17952 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17953 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17955 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17957 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17958 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17959 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17961 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17963 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17964 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17965 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17967 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
SPLIT #6: CUDA2 # 30 inputs: [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)]
node #17969 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17970 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17971 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17973 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17975 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17976 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17977 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17979 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17981 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17982 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17983 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17985 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17987 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17988 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17989 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17991 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17993 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #17994 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #17995 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #17997 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #17999 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18000 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18001 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18003 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18005 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18006 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18007 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18009 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18011 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18012 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18013 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18015 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18017 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18018 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18019 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18021 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18023 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18024 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18025 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18027 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18029 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18030 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18031 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18033 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18035 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18036 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18037 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18039 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18041 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18042 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18043 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18045 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18047 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18048 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18049 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18051 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18053 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18054 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18055 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18057 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
SPLIT #7: CUDA2 # 11 inputs: [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] [v_t_new-37 ( 1M)]
node #18059 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18060 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18061 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18063 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18065 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18066 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18067 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18069 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18071 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18072 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18073 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18075 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18077 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18078 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18079 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18081 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18083 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18084 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18085 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
node #18087 ( MUL_MAT): attn_inter-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#dnet_add_ch_state ( 3M) [ NULL ] (view) ( 47M) [CUDA2 ]
node #18089 ( MUL_MAT): v_attn-37 ( 1M) [CUDA2 ] use=1,c=1: CUDA2#v_t_new-37#0 ( 1M) [ NULL ] kq-37 (view) ( 23M) [CUDA2 ]
node #18090 ( ADD): dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ] use=1,c=1: attn_inter-37 ( 1M) [CUDA2 ] v_attn-37 ( 1M) [CUDA2 ]
node #18091 ( SET): (view) (view) (view) ( ( 48M) [CUDA2 ] use=1,c=1: (view) (view) (view) ( ( 48M) [CUDA2 ] dnet_add_ch_attn_out-37 ( 1M) [CUDA2 ]
SPLIT #8: CUDA1 # 0 inputs
node #18094 ( RMS_NORM): norm-37 ( 48M) [CUDA1 ] use=1,c=1: attn_output-37 ( 48M) [CUDA1 ]
node #18095 ( MUL): node_18095 ( 48M) [CUDA1 ] use=1,c=1: norm-37 ( 48M) [CUDA1 ] blk.37.ssm_norm.weight ( 0K) [CUDA1 ]
node #18096 ( MUL_MAT): z-37 ( 48M) [CUDA1 ] use=1,c=1: blk.37.attn_gate.weight ( 31M) [CUDA1 ] attn_norm-37 ( 40M) [CUDA1 ]
node #18098 ( UNARY): node_18098 ( 48M) [CUDA1 ] use=1,c=1: z-37 (reshaped) ( 48M) [CUDA1 ]
node #18099 ( MUL): node_18099 ( 48M) [CUDA1 ] use=1,c=1: node_18095 ( 48M) [CUDA1 ] node_18098 ( 48M) [CUDA1 ]
node #18101 ( MUL_MAT): linear_attn_out-37 ( 40M) [CUDA1 ] use=1,c=1: blk.37.ssm_out.weight ( 31M) [CUDA1 ] final_output-37 ( 48M) [CUDA1 ]
node #18103 ( ADD): attn_residual-37 ( 40M) [CUDA1 ] use=2,c=1: linear_attn_out-37 (res ( 40M) [CUDA1 ] post_ffn-36 ( 40M) [CUDA1 ]
node #18104 ( RMS_NORM): norm-37 ( 40M) [CUDA1 ] use=1,c=1: attn_residual-37 ( 40M) [CUDA1 ]
node #18105 ( MUL): attn_post_norm-37 ( 40M) [CUDA1 ] use=2,c=1: norm-37 ( 40M) [CUDA1 ] blk.37.post_attention_n ( 20K) [CUDA1 ]
node #18106 ( MUL_MAT): ffn_gate-37 ( 136M) [CUDA1 ] use=1,c=1: blk.37.ffn_gate.weight ( 90M) [CUDA1 ] attn_post_norm-37 ( 40M) [CUDA1 ]
node #18107 ( MUL_MAT): ffn_up-37 ( 136M) [CUDA1 ] use=1,c=1: blk.37.ffn_up.weight ( 90M) [CUDA1 ] attn_post_norm-37 ( 40M) [CUDA1 ]
node #18108 ( GLU): ffn_swiglu-37 ( 136M) [CUDA1 ] use=1,c=1: ffn_gate-37 ( 136M) [CUDA1 ] ffn_up-37 ( 136M) [CUDA1 ]
node #18109 ( MUL_MAT): ffn_out-37 ( 40M) [CUDA1 ] use=1,c=1: blk.37.ffn_down.weight ( 90M) [CUDA1 ] ffn_swiglu-37 ( 136M) [CUDA1 ]
node #18110 ( ADD): post_ffn-37 ( 40M) [CUDA1 ] use=2,c=1: ffn_out-37 ( 40M) [CUDA1 ] attn_residual-37 ( 40M) [CUDA1 ]
SPLIT #9: CUDA2 # 5 inputs: [post_ffn-37 ( 40M)] [leaf_55 ( 32K)] [leaf_59 ( 16K)] [leaf_61 ( 16K)] [ (copy) ( 40M)]
node #18111 ( RMS_NORM): norm-38 ( 40M) [CUDA2 ] use=1,c=1: CUDA2#post_ffn-37#0 ( 40M) [ NULL ]
node #18112 ( MUL): attn_norm-38 ( 40M) [CUDA2 ] use=4,c=1: norm-38 ( 40M) [CUDA2 ] blk.38.attn_norm.weight ( 20K) [CUDA2 ]
node #18113 ( MUL_MAT): node_18113 ( 80M) [CUDA2 ] use=1,c=1: blk.38.attn_qkv.weight ( 53M) [CUDA2 ] attn_norm-38 ( 40M) [CUDA2 ]
node #18116 ( CONCAT): conv_input-38 ( 80M) [CUDA2 ] use=2,c=1: conv_states_reshaped-38 ( 120K) [CUDA2 ] qkv_mixed_transposed-38 ( 80M) [CUDA2 ]
node #18119 ( CPY): state_update_target-38 ( 120K) [CUDA2 ] use=0,c=1: last_conv_states-38 ( 80M) [CUDA2 ] state_update_target-38 ( 120K) [CUDA2 ]
node #18122 ( SCALE): cache_s_l38 (reshaped) ( 3M) [CUDA2 ] use=0,c=1: cache_s_l38 (reshaped) ( 3M) [CUDA2 ]
node #18123 ( GET_ROWS): node_18123 ( 3M) [CUDA2 ] use=1,c=1: cache_s_l38 (reshaped) ( 3M) [CUDA2 ] CUDA2# (view)#0 ( 0K) [ NULL ]
node #18124 ( GET_ROWS): node_18124 ( 0K) [CUDA2 ] use=1,c=1: cache_s_l38 (reshaped) ( 3M) [CUDA2 ] CUDA2# (view)#0 ( 0K) [ NULL ]
node #18126 ( CPY): cache_s_l38 (view) (cop ( 0K) [CUDA2 ] use=0,c=1: node_18124 ( 0K) [CUDA2 ] cache_s_l38 (view) ( 0K) [CUDA2 ]
node #18129 ( CONT): dnet_add_ch_state-38 ( 3M) [CUDA2 ] use=3,c=1: state_predelta-38 (tran ( 3M) [CUDA2 ]
node #18130 ( MUL_MAT): node_18130 ( 384K) [CUDA2 ] use=1,c=1: blk.38.ssm_alpha.weight ( 255K) [CUDA2 ] attn_norm-38 ( 40M) [CUDA2 ]
node #18131 ( CONT): alpha-38 ( 384K) [CUDA2 ] use=1,c=1: node_18130 ( 384K) [CUDA2 ]
node #18132 ( ADD): node_18132 ( 384K) [CUDA2 ] use=1,c=1: alpha-38 ( 384K) [CUDA2 ] blk.38.ssm_dt.bias ( 0K) [CUDA2 ]
node #18133 ( UNARY): a_softplus-38 ( 384K) [CUDA2 ] use=1,c=1: node_18132 ( 384K) [CUDA2 ]
node #18134 ( MUL): gate-38 ( 384K) [CUDA2 ] use=1,c=1: a_softplus-38 ( 384K) [CUDA2 ] blk.38.ssm_a ( 0K) [CUDA2 ]
node #18137 ( PAD): node_18137 ( 384K) [CUDA2 ] use=1,c=1: g_in-38 (permuted) ( 384K) [CUDA2 ]
node #18140 ( CONT): (reshaped) (transposed ( 384K) [CUDA2 ] use=1,c=1: (reshaped) (transposed ( 384K) [CUDA2 ]
node #18141 ( CUMSUM): g_cs-38 ( 384K) [CUDA2 ] use=5,c=1: (reshaped) (transposed ( 384K) [CUDA2 ]
node #18143 ( CONT): g_last-38 (cont) ( 6K) [CUDA2 ] use=2,c=1: g_last-38 ( 383K) [CUDA2 ]
node #18144 ( UNARY): node_18144 ( 6K) [CUDA2 ] use=1,c=1: g_last-38 (cont) ( 6K) [CUDA2 ]
node #18147 ( MUL): node_18147 ( 3M) [CUDA2 ] use=1,c=1: dnet_add_ch_state-38 ( 3M) [CUDA2 ] g_last_exp_t-38 (view) ( 5K) [CUDA2 ]
node #18148 ( SSM_CONV): conv_output_raw-38 ( 80M) [CUDA2 ] use=1,c=1: conv_input-38 ( 80M) [CUDA2 ] blk.38.ssm_conv1d.weigh ( 160K) [CUDA2 ]
node #18149 ( UNARY): conv_output_silu-38 ( 80M) [CUDA2 ] use=3,c=1: conv_output_raw-38 ( 80M) [CUDA2 ]
node #18151 ( L2_NORM): node_18151 ( 16M) [CUDA2 ] use=1,c=1: k_conv-38 ( 79M) [CUDA2 ]
node #18152 ( REPEAT): k_in-38 ( 48M) [CUDA2 ] use=1,c=1: node_18151 ( 16M) [CUDA2 ]
node #18154 ( PAD): node_18154 ( 48M) [CUDA2 ] use=2,c=1: k_in-38 (permuted) ( 48M) [CUDA2 ]
node #18156 ( SUB): node_18156 ( 384K) [CUDA2 ] use=1,c=1: g_cs-38 ( 384K) [CUDA2 ] g_last-38 (cont) ( 6K) [CUDA2 ]
node #18157 ( UNARY): g_diff-38 ( 384K) [CUDA2 ] use=1,c=1: node_18156 ( 384K) [CUDA2 ]
node #18158 ( UNARY): node_18158 ( 384K) [CUDA2 ] use=1,c=1: g_diff-38 ( 384K) [CUDA2 ]
node #18160 ( CONT): (transposed) (cont) ( 384K) [CUDA2 ] use=1,c=1: (transposed) ( 384K) [CUDA2 ]
node #18161 ( MUL): key_gdiff-38 ( 48M) [CUDA2 ] use=1,c=1: (reshaped) ( 48M) [CUDA2 ] (transposed) (cont) ( 384K) [CUDA2 ]
node #18163 ( CONT): key_gdiff_t-38 ( 48M) [CUDA2 ] use=32,c=1: key_gdiff-38 (transpose ( 48M) [CUDA2 ]
node #18167 ( PAD): node_18167 ( 48M) [CUDA2 ] use=1,c=1: v_in-38 (permuted) ( 79M) [CUDA2 ]
node #18168 ( MUL_MAT): node_18168 ( 384K) [CUDA2 ] use=1,c=1: blk.38.ssm_beta.weight ( 255K) [CUDA2 ] attn_norm-38 ( 40M) [CUDA2 ]
node #18170 ( UNARY): b_in-38 ( 384K) [CUDA2 ] use=1,c=1: beta-38 ( 384K) [CUDA2 ]
node #18172 ( PAD): node_18172 ( 384K) [CUDA2 ] use=2,c=1: b_in-38 (permuted) ( 384K) [CUDA2 ]
node #18173 ( MUL): v_b-38 ( 48M) [CUDA2 ] use=1,c=1: node_18167 ( 48M) [CUDA2 ] node_18172 ( 384K) [CUDA2 ]
node #18176 ( CONT): v_b-38 (reshaped) (tran ( 48M) [CUDA2 ] use=1,c=1: v_b-38 (reshaped) (tran ( 48M) [CUDA2 ]
node #18177 ( MUL): k_b-38 ( 48M) [CUDA2 ] use=1,c=1: node_18154 ( 48M) [CUDA2 ] node_18172 ( 384K) [CUDA2 ]
node #18179 ( MUL_MAT): node_18179 ( 24M) [CUDA2 ] use=1,c=1: (reshaped) ( 48M) [CUDA2 ] k_b-38 (reshaped) ( 48M) [CUDA2 ]
node #18181 ( REPEAT): node_18181 ( 24M) [CUDA2 ] use=1,c=1: g_cs-38 (reshaped) ( 384K) [CUDA2 ]
node #18182 ( SUB): node_18182 ( 24M) [CUDA2 ] use=1,c=1: node_18181 ( 24M) [CUDA2 ] g_cs-38 ( 384K) [CUDA2 ]
node #18183 ( TRI): node_18183 ( 24M) [CUDA2 ] use=1,c=1: node_18182 ( 24M) [CUDA2 ]
node #18184 ( UNARY): decay_mask-38 ( 24M) [CUDA2 ] use=2,c=1: node_18183 ( 24M) [CUDA2 ]
node #18185 ( MUL): node_18185 ( 24M) [CUDA2 ] use=1,c=1: node_18179 ( 24M) [CUDA2 ] decay_mask-38 ( 24M) [CUDA2 ]
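The repeated inputs in the `SPLIT` lines above are easier to compare when they are counted. The following is a small sketch of a parser for those lines, with hypothetical helper names. It assumes the line format shown in this dump, and because the sizes in the dump are rounded to K or M, the totals are only approximate.

```python
# Sketch: summarize the inputs of "SPLIT" lines from a scheduler graph dump.
# Sizes in the dump are rounded (K/M), so totals are approximate.
import re
from collections import Counter

SPLIT_RE = re.compile(r"SPLIT #(\d+):\s*(\S+)\s*#\s*(\d+) inputs")
INPUT_RE = re.compile(r"\[\s*(.*?)\s*\(\s*(\d+)([KM])\)\]")


def summarize_split(line):
    """Return (split_id, backend, total_kib, per-name counts) or None."""
    head = SPLIT_RE.search(line)
    if head is None:
        return None  # not a SPLIT line
    counts, total_kib = Counter(), 0
    for name, size, unit in INPUT_RE.findall(line):
        counts[name] += 1
        total_kib += int(size) * (1024 if unit == "M" else 1)
    return int(head.group(1)), head.group(2), total_kib, counts


# Two SPLIT lines copied from the dump in this report.
SPLIT_7 = (
    "SPLIT #7: CUDA2 # 11 inputs: "
    + "[v_t_new-37 ( 1M)] [dnet_add_ch_state-37 ( 3M)] " * 5
    + "[v_t_new-37 ( 1M)]"
)
SPLIT_9 = (
    "SPLIT #9: CUDA2 # 5 inputs: [post_ffn-37 ( 40M)] [leaf_55 ( 32K)] "
    "[leaf_59 ( 16K)] [leaf_61 ( 16K)] [ (copy) ( 40M)]"
)

for line in (SPLIT_7, SPLIT_9):
    split_id, backend, total_kib, counts = summarize_split(line)
    print(f"split {split_id} on {backend}: ~{total_kib / 1024:.1f} MiB")
    for name, n in counts.most_common():
        print(f"  {n:2d}x {name}")
```

Run over the full dump, this makes it easy to spot splits that list the same tensor, such as `v_t_new-37` or `dnet_add_ch_state-37`, many times.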
First Bad Commit
No response
Relevant log output
logs above