Addresses #15409 - Support for NVIDIA Nemotron-H hybrid architecture models DRAFT #15572
jwjohns wants to merge 22 commits into ggml-org:master from
Conversation
- Add custom cache initialization filters for LLM_ARCH_NEMOTRON_H
- Attention cache only allocated for layers 14, 21, 30, 39 (attention layers)
- Recurrent cache only allocated for SSM layers using is_recurrent()
- Reduces KV cache memory usage from 264MB (29 layers) to 64MB (4 layers)
- Implements proper Mamba2-style SSM with x/z gating and SiLU activation (see the sketch below)
- Resolves infinite hang issue during token generation
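For reference, a minimal sketch of the x/z gating step mentioned above, in ggml terms; the tensor names and surrounding graph-build context (ctx0, y, z, cur) are illustrative and not the exact code in this PR:

    // After in_proj, the projection is split into an SSM branch (x, B, C, dt)
    // and a gate branch (z). Once the SSM scan has produced y, the output is
    // gated with SiLU(z) and projected back out through ssm_out.
    // (Mamba2 typically also applies ssm_norm around this gating step.)
    struct ggml_tensor * y_gated = ggml_mul(ctx0, y, ggml_silu(ctx0, z));
    cur = ggml_mul_mat(ctx0, model.layers[il].ssm_out, y_gated);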
- fixed A/D tensor shapes from [128,1,1,1] to [1,128]
- fixed conv1d dimensions to use actual 12288 not 17728
- fixed ssm_norm and ssm_out tensor sizes to use 10240
- fixed layer_types array type from uint8 to int32
- fixed gguf numpy array serialization
- added missing template instantiations
- model now loads to tensor validation stage
- created working 18GB gguf file
that tries both orientations
I hit that same assertion. Here's what I see in my debugger: so the

Same for me.

The

Is this a duplicate of #15507?
@isaac-mcfadyen It is a duplicate. We both started working on it independently, so we have both open for reference until we converge on a single implementation, since neither is working yet.

Apologies. I did wait rather than just create the duplicate immediately, though granted I should have set the PR to his fork and branch.
    { LLM_KV_CLASSIFIER_OUTPUT_LABELS, "%s.classifier.output_labels" },

    // Nemotron-H specific
    { LLM_KV_LAYER_TYPES, "%s.layer_types" },
I think we can get away with not adding this new hparam. This is similar to a piece of feedback I got during #13550 (it's a looong PR, but it's in there somewhere). I had introduced a new array hparam similar to this one (mine was a bool), but @compilade pointed out that we could extract the same information by setting n_head_kv to an array value during conversion and then reading it per-layer (here). In this case, we can leverage n_ff in the same way so that the layer types are determined as:
- n_head_kv == 0 && n_ff == 0 => recurrent
- n_head_kv == 0 && n_ff > 0  => MLP
- n_head_kv > 0  && n_ff == 0 => attention
- n_head_kv > 0  && n_ff > 0  => INVALID (or maybe valid for a future architecture??)
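A rough sketch of what that per-layer classification could look like on the loading side; this is illustrative only and assumes the per-layer accessors hparams.n_head_kv(il) and hparams.n_ff(il) described above:

    // Illustrative: derive the Nemotron-H layer type from per-layer hparams
    // instead of a dedicated layer_types array.
    for (uint32_t il = 0; il < hparams.n_layer; ++il) {
        const bool has_attn = hparams.n_head_kv(il) > 0;
        const bool has_ffn  = hparams.n_ff(il)      > 0;

        if (!has_attn && !has_ffn) {
            // recurrent (Mamba2/SSM) layer
        } else if (!has_attn &&  has_ffn) {
            // MLP layer
        } else if ( has_attn && !has_ffn) {
            // attention layer
        } else {
            GGML_ABORT("invalid Nemotron-H layer: both attention and FFN set");
        }
    }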
src/llama-graph.cpp
Outdated
    const int64_t n_rs = mctx->get_n_rs();

    if (s_copy) {
        // Check if buffer was allocated - skip if not
I don't think anything about the architecture should require these changes to the core recurrent graph structures. Can you clarify what condition led you to adding these conditional checks?
@gabe-l-hart You're right. This was a workaround for a crash where the graph was trying to access recurrent state for all 56 layers, but I think Nemotron-H only allocates recurrent state for the 27 SSM layers (not the attention/MLP layers).
Still learning.
}

template bool llama_model_loader::get_arr<std::vector<std::string>>(enum llm_kv kid, std::vector<std::string> & result, bool required);
template bool llama_model_loader::get_arr<std::vector<unsigned char>>(enum llm_kv kid, std::vector<unsigned char> & result, bool required);
If we get rid of the new hparam, these template specializations won't be needed anymore (but nice job finding them, it took me a loong time to find them myself, and I have to re-find them every time)
    completion_token_output result;
    result.tok = id;
    result.text_to_send = common_token_to_piece(ctx, result.tok, accept_special_token(slot, result.tok));
    fprintf(stderr, "[DETOKENIZE] Token ID: %d -> Text: '%s' (length: %zu)\n", result.tok, result.text_to_send.c_str(), result.text_to_send.length());
I love adding useful logs like this! This project has its own set of logging macros to use. In the core of the project, you've got LLAMA_LOG_* (defined here). In the server tool, there are three sets defined here: SRV_* for server-level logs, SLT_* for slot-level logs, and QUE_* for queue-level logs.
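For example, the fprintf above could become something like the following; this is a hedged sketch, and the exact macro signature should be double-checked against the server's logging headers:

    // Assumes SLT_DBG(slot, fmt, ...) takes the slot followed by printf-style args.
    SLT_DBG(slot, "detokenized token %d -> '%s' (length: %zu)\n",
            result.tok, result.text_to_send.c_str(), result.text_to_send.length());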
    for (; begin != end; ++begin) {
        ret += common_token_to_piece(ctx, *begin);
        std::string piece = common_token_to_piece(ctx, *begin);
        fprintf(stderr, "[DEBUG] Token ID: %d -> Piece: '%s' (length: %zu)\n", *begin, piece.c_str(), piece.length());
Same comment about logging for these ones. They're definitely useful, so it would be great to make these proper logs!
src/llama-model.cpp
Outdated
    // Try to load layer schedule from GGUF: %s.layer_types (0=SSM,1=ATTN,2=FFN)
    std::vector<int32_t> layer_types;
    const bool has_schedule = ml.get_arr(LLM_KV_LAYER_TYPES, layer_types, false) && layer_types.size() == hparams.n_layer;
If we leverage n_ff for this, this parsing gets a lot easier!
src/llama-model.cpp
Outdated
    ml.get_key(LLM_KV_ATTENTION_LAYERNORM_RMS_EPS, hparams.f_norm_rms_eps);

    // Nemotron-H attention parameters (fixed per public config)
    hparams.n_embd_head_k = 128; // attention head size
In a final version of this PR, it will be best to avoid these hard-coded numbers so that the architecture remains independent of the specific model instance we're building it for. Of course, when getting it all working these are fine and a good way to isolate variables between conversion and loading.
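For instance, a hedged sketch of reading those values from the GGUF metadata instead of hard-coding them, assuming the conversion script writes the standard attention key/value length fields:

    // Read head sizes from %s.attention.key_length / %s.attention.value_length
    // rather than fixing them to 128 for this particular model instance.
    ml.get_key(LLM_KV_ATTENTION_KEY_LENGTH,   hparams.n_embd_head_k, false);
    ml.get_key(LLM_KV_ATTENTION_VALUE_LENGTH, hparams.n_embd_head_v, false);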
src/llama-model.cpp
Outdated
    const int64_t n_group = hparams.ssm_n_group;
    const int64_t d_in_proj = 2*d_inner + 2*n_group*d_state + n_head;
    // Use actual dimension from model: 22656 instead of calculated 22608
    const int64_t d_in_proj = 22656; // 2*d_inner + 2*n_group*d_state + n_head + 48;
On my PR, I avoided the need to hard-code this by setting d_inner as mamba_num_heads (128) * mamba_head_dim (80), setting n_group to mamba_num_groups (8), setting d_state to mamba_state_dim (128), and setting n_head to head_dim (NOTE: not mamba_head_dim!). This works out to 2*128*80 + 2*8*128 + 128 == 22656.
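In other words, with d_inner = 128*80 = 10240, n_group = 8, d_state = 128, and n_head = 128, the original formula already lands on the right size and the hard-coded constant isn't needed; restating the arithmetic above with the same variable names:

    // 2*10240 + 2*8*128 + 128 == 22656
    const int64_t d_in_proj = 2*d_inner + 2*n_group*d_state + n_head;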
        /* unified     */ cparams.kv_unified,
        /* filter_attn */ (arch == LLM_ARCH_FALCON_H1) ? [&](int32_t) { return true; } : (llama_memory_hybrid::layer_filter_cb)nullptr,
        /* filter_recr */ (arch == LLM_ARCH_FALCON_H1) ? [&](int32_t) { return true; } : (llama_memory_hybrid::layer_filter_cb)nullptr);
        /* filter_attn */ (arch == LLM_ARCH_FALCON_H1 || arch == LLM_ARCH_NEMOTRON_H) ?
I'm glad you found this! Since we now have n == 2 models that need this pattern, I've tried to make it a little cleaner by having a section of if/else cases to define architecture-specific filter lambdas (here)
aaaand, it looks like my version is broken somehow! EDIT: fixed (sloppy typo)
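For context, a minimal sketch of the per-architecture filter pattern described above; the structure is hypothetical and the layer checks lean on hparams.is_recurrent() as mentioned in this PR, not the actual upstream code:

    // Default: no filtering (allocate both caches for every layer).
    llama_memory_hybrid::layer_filter_cb filter_attn = nullptr;
    llama_memory_hybrid::layer_filter_cb filter_recr = nullptr;

    if (arch == LLM_ARCH_FALCON_H1) {
        // Falcon-H1 runs attention and SSM in every layer.
        filter_attn = [&](int32_t) { return true; };
        filter_recr = [&](int32_t) { return true; };
    } else if (arch == LLM_ARCH_NEMOTRON_H) {
        // Nemotron-H: KV cache only for attention layers,
        // recurrent state only for SSM layers.
        filter_attn = [&](int32_t il) { return !hparams.is_recurrent(il); };
        filter_recr = [&](int32_t il) { return  hparams.is_recurrent(il); };
    }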
I really appreciate the feedback!
        # for security reason, we don't allow loading remote code by default
        # if a model need remote code, we will fallback to config.json
        config = AutoConfig.from_pretrained(dir_model, trust_remote_code=False).to_dict()
        config = AutoConfig.from_pretrained(dir_model, trust_remote_code=True).to_dict()
I got tired of typing it. Temporary.
| print(f"DEBUG: Failed metadata key type: {type(val)}") | ||
| print(f"DEBUG: Failed metadata value: {val}") | ||
| print(f"DEBUG: Caller info available in stack trace") | ||
| raise ValueError(f"Invalid GGUF metadata array, expecting sequence but got {type(val)}: {val}") |
More debug, didn't mean to commit. Will clean up.
Thanks for the great work here! Closing this now that we've consolidated in #15507
@gabe-l-hart really appreciate the help!!
This PR adds initial support for the Nemotron-H hybrid architecture used by NVIDIA Nemotron Nano V2
models. Nemotron-H combines Mamba2 state-space model layers with selective transformer attention layers
for efficient inference.
Note: This PR will remain in DRAFT status until full inference is working.
Architecture Details
Current issue
SSM conv assertion failure during graph building
What works
Addresses #15409 - Request for Nemotron-H architecture support