Addresses #15409 - Support for NVIDIA Nemotron-H hybrid architecture models DRAFT #15572
jwjohns wants to merge 22 commits into ggml-org:master from
Conversation
- Add custom cache initialization filters for LLM_ARCH_NEMOTRON_H
- Attention cache only allocated for layers 14, 21, 30, 39 (attention layers)
- Recurrent cache only allocated for SSM layers using is_recurrent()
- Reduces KV cache memory usage from 264MB (29 layers) to 64MB (4 layers)
- Implements proper Mamba2-style SSM with x/z gating and SiLU activation (see the sketch below)
- Resolves infinite hang issue during token generation
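For reference, a minimal sketch of the x/z gating step mentioned above, in ggml terms; the tensor names and surrounding graph-build context (ctx0, y, z, cur) are illustrative and not the exact code in this PR:

    // After in_proj, the projection is split into an SSM branch (x, B, C, dt)
    // and a gate branch (z). Once the SSM scan has produced y, the output is
    // gated with SiLU(z) and projected back out through ssm_out.
    // (Mamba2 typically also applies ssm_norm around this gating step.)
    struct ggml_tensor * y_gated = ggml_mul(ctx0, y, ggml_silu(ctx0, z));
    cur = ggml_mul_mat(ctx0, model.layers[il].ssm_out, y_gated);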
- fixed A/D tensor shapes from [128,1,1,1] to [1,128]
- fixed conv1d dimensions to use actual 12288 not 17728
- fixed ssm_norm and ssm_out tensor sizes to use 10240
- fixed layer_types array type from uint8 to int32
- fixed gguf numpy array serialization
- added missing template instantiations
- model now loads to tensor validation stage
- created working 18GB gguf file
that tries both orientations
I hit that same assertion. Here's what I see in my debugger: so the

Same for me.

The

Is this a duplicate of #15507?
@isaac-mcfadyen It is a duplicate. We both started working on it independently, so we have both open for reference until we converge on a single implementation, since neither is working yet.

Apologies. I did wait rather than just create the duplicate immediately, though granted I should have set the PR to his fork and branch.
    { LLM_KV_CLASSIFIER_OUTPUT_LABELS, "%s.classifier.output_labels" },

    // Nemotron-H specific
    { LLM_KV_LAYER_TYPES, "%s.layer_types" },
I think we can get away with not adding this new hparam. This is similar to a piece of feedback I got during #13550 (it's a looong PR, but it's in there somewhere). I had introduced a new array hparam similar to this one (mine was a bool), but @compilade pointed out that we could extract the same information by setting n_head_kv to an array value during conversion and then reading it per-layer (here). In this case, we can leverage n_ff in the same way so that the layer types are determined as:
- n_head_kv == 0 && n_ff == 0 => recurrent
- n_head_kv == 0 && n_ff > 0  => MLP
- n_head_kv > 0  && n_ff == 0 => attention
- n_head_kv > 0  && n_ff > 0  => INVALID (or maybe valid for a future architecture??)
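A rough sketch of what that per-layer classification could look like on the loading side; this is illustrative only and assumes the per-layer accessors hparams.n_head_kv(il) and hparams.n_ff(il) described above:

    // Illustrative: derive the Nemotron-H layer type from per-layer hparams
    // instead of a dedicated layer_types array.
    for (uint32_t il = 0; il < hparams.n_layer; ++il) {
        const bool has_attn = hparams.n_head_kv(il) > 0;
        const bool has_ffn  = hparams.n_ff(il)      > 0;

        if (!has_attn && !has_ffn) {
            // recurrent (Mamba2/SSM) layer
        } else if (!has_attn &&  has_ffn) {
            // MLP layer
        } else if ( has_attn && !has_ffn) {
            // attention layer
        } else {
            GGML_ABORT("invalid Nemotron-H layer: both attention and FFN set");
        }
    }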
src/llama-graph.cpp
Outdated
    const int64_t n_rs = mctx->get_n_rs();

    if (s_copy) {
        // Check if buffer was allocated - skip if not
I don't think anything about the architecture should require these changes to the core recurrent graph structures. Can you clarify what condition led you to adding these conditional checks?
@gabe-l-hart You're right. This was a workaround for a crash where the graph was trying to access recurrent state for all 56 layers, but I think Nemotron-H only allocates recurrent state for the 27 SSM layers (not the attention/MLP layers).
Still learning.
}

template bool llama_model_loader::get_arr<std::vector<std::string>>(enum llm_kv kid, std::vector<std::string> & result, bool required);
template bool llama_model_loader::get_arr<std::vector<unsigned char>>(enum llm_kv kid, std::vector<unsigned char> & result, bool required);
If we get rid of the new hparam, these template specializations won't be needed anymore (but nice job finding them, it took me a loong time to find them myself, and I have to re-find them every time)
    completion_token_output result;
    result.tok = id;
    result.text_to_send = common_token_to_piece(ctx, result.tok, accept_special_token(slot, result.tok));
    fprintf(stderr, "[DETOKENIZE] Token ID: %d -> Text: '%s' (length: %zu)\n", result.tok, result.text_to_send.c_str(), result.text_to_send.length());
I love adding useful logs like this! This project has its own set of logging macros to use. In the core of the project, you've got LLAMA_LOG_* (defined here). In the server tool, there are three sets defined here: SRV_* for server-level logs, SLT_* for slot-level logs, and QUE_* for queue-level logs.
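For example, the fprintf above could become something like the following; this is a hedged sketch, and the exact macro signature should be double-checked against the server's logging headers:

    // Assumes SLT_DBG(slot, fmt, ...) takes the slot followed by printf-style args.
    SLT_DBG(slot, "detokenized token %d -> '%s' (length: %zu)\n",
            result.tok, result.text_to_send.c_str(), result.text_to_send.length());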
    for (; begin != end; ++begin) {
        ret += common_token_to_piece(ctx, *begin);
        std::string piece = common_token_to_piece(ctx, *begin);
        fprintf(stderr, "[DEBUG] Token ID: %d -> Piece: '%s' (length: %zu)\n", *begin, piece.c_str(), piece.length());
Same comment about logging for these ones. They're definitely useful, so it would be great to make these proper logs!
src/llama-model.cpp
Outdated
    // Try to load layer schedule from GGUF: %s.layer_types (0=SSM,1=ATTN,2=FFN)
    std::vector<int32_t> layer_types;
    const bool has_schedule = ml.get_arr(LLM_KV_LAYER_TYPES, layer_types, false) && layer_types.size() == hparams.n_layer;
If we leverage n_ff for this, this parsing gets a lot easier!
src/llama-model.cpp
Outdated
    ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);

    // Nemotron-H attention parameters (fixed per public config)
    hparams.n_embd_head_k = 128; // attention head size
In a final version of this PR, it will be best to avoid these hard-coded numbers so that the architecture remains independent of the specific model instance we're building it for. Of course, when getting it all working these are fine and a good way to isolate variables between conversion and loading.
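For instance, a hedged sketch of reading those values from the GGUF metadata instead of hard-coding them, assuming the conversion script writes the standard attention key/value length fields:

    // Read head sizes from %s.attention.key_length / %s.attention.value_length
    // rather than fixing them to 128 for this particular model instance.
    ml.get_key(LLM_KV_ATTENTION_KEY_LENGTH,   hparams.n_embd_head_k, false);
    ml.get_key(LLM_KV_ATTENTION_VALUE_LENGTH, hparams.n_embd_head_v, false);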
src/llama-model.cpp
Outdated
    const int64_t n_group = hparams.ssm_n_group;
    const int64_t d_in_proj = 2*d_inner + 2*n_group*d_state + n_head;
    // Use actual dimension from model: 22656 instead of calculated 22608
    const int64_t d_in_proj = 22656; // 2*d_inner + 2*n_group*d_state + n_head + 48;
On my PR, I avoided the need to hard-code this by setting d_inner as mamba_num_heads (128) * mamba_head_dim (80), setting n_group to mamba_num_groups (8), setting d_state to mamba_state_dim (128), and setting n_head to head_dim (NOTE: not mamba_head_dim!). This works out to 2*128*80 + 2*8*128 + 128 == 22656.
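In other words, with d_inner = 128*80 = 10240, n_group = 8, d_state = 128, and n_head = 128, the original formula already lands on the right size and the hard-coded constant isn't needed; restating the arithmetic above with the same variable names:

    // 2*10240 + 2*8*128 + 128 == 22656
    const int64_t d_in_proj = 2*d_inner + 2*n_group*d_state + n_head;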
        /* unified     */ cparams.kv_unified,
        /* filter_attn */ (arch == LLM_ARCH_FALCON_H1) ? [&](int32_t) { return true; } : (llama_memory_hybrid::layer_filter_cb)nullptr,
        /* filter_recr */ (arch == LLM_ARCH_FALCON_H1) ? [&](int32_t) { return true; } : (llama_memory_hybrid::layer_filter_cb)nullptr);
        /* filter_attn */ (arch == LLM_ARCH_FALCON_H1 || arch == LLM_ARCH_NEMOTRON_H) ?
I'm glad you found this! Since we now have n == 2 models that need this pattern, I've tried to make it a little cleaner by having a section of if/else cases to define architecture-specific filter lambdas (here)
aaaand, it looks like my version is broken somehow! EDIT: fixed (sloppy typo)
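For context, a minimal sketch of the per-architecture filter pattern described above; the structure is hypothetical and the layer checks lean on hparams.is_recurrent() as mentioned in this PR, not the actual upstream code:

    // Default: no filtering (allocate both caches for every layer).
    llama_memory_hybrid::layer_filter_cb filter_attn = nullptr;
    llama_memory_hybrid::layer_filter_cb filter_recr = nullptr;

    if (arch == LLM_ARCH_FALCON_H1) {
        // Falcon-H1 runs attention and SSM in every layer.
        filter_attn = [&](int32_t) { return true; };
        filter_recr = [&](int32_t) { return true; };
    } else if (arch == LLM_ARCH_NEMOTRON_H) {
        // Nemotron-H: KV cache only for attention layers,
        // recurrent state only for SSM layers.
        filter_attn = [&](int32_t il) { return !hparams.is_recurrent(il); };
        filter_recr = [&](int32_t il) { return  hparams.is_recurrent(il); };
    }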
I really appreciate the feedback!
        # for security reason, we don't allow loading remote code by default
        # if a model need remote code, we will fallback to config.json
        config = AutoConfig.from_pretrained(dir_model, trust_remote_code=False).to_dict()
        config = AutoConfig.from_pretrained(dir_model, trust_remote_code=True).to_dict()
I got tired of typing it. Temporary.
| print(f"DEBUG: Failed metadata key type: {type(val)}") | ||
| print(f"DEBUG: Failed metadata value: {val}") | ||
| print(f"DEBUG: Caller info available in stack trace") | ||
| raise ValueError(f"Invalid GGUF metadata array, expecting sequence but got {type(val)}: {val}") |
More debug, didn't mean to commit. Will clean up.
Thanks for the great work here! Closing this now that we've consolidated in #15507
@gabe-l-hart really appreciate the help!!
This PR adds initial support for the Nemotron-H hybrid architecture used by NVIDIA Nemotron Nano V2
models. Nemotron-H combines Mamba2 state-space model layers with selective transformer attention layers
for efficient inference.
Note: This PR will remain in DRAFT status until full inference is working.
Architecture Details
Current issue
SSM conv assertion failure during graph building
What works
Addresses #15409 - Request for Nemotron-H architecture support