Conversation

@dhiaEddineRhaiem (Contributor) commented May 29, 2025

This PR introduces support for the FalconH1 model family within the Unsloth library.

It addresses the following issue.

Training integration has been validated using the following command:

python unsloth-cli.py --model_name "tiiuae/Falcon-H1-0.5B-Base" \
  --max_seq_length 2048 --dtype bfloat16 --load_in_4bit \
  --r 64 --lora_alpha 32 --lora_dropout 0.1 --bias "none" \
  --per_device_train_batch_size 1

Note: Inference support is currently under active development and is being debugged.
@danielhanchen, do you think we can first have FalconH1 training supported in Unsloth for the community, and then raise a separate PR to fix inference?

FalconH1ForCausalLM,
FalconHybridMambaAttentionDynamicCache,
)
except:

Collaborator:

@danielhanchen I think we should consolidate all these imports and version checks into one file that gets called at init, and have flags like IS_QWEN_SUPPORTED or IS_FALCON_SUPPORTED passed over to the individual files...
Thoughts?
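
For illustration, a minimal sketch of what such a consolidated capability module could look like (the flag names come from the comment above; the version threshold, module paths and file layout are assumptions, not existing Unsloth code):

from importlib.util import find_spec

from packaging.version import Version
import transformers

# Probe the installed transformers once, at package init.
_TRANSFORMERS_VERSION = Version(transformers.__version__)

# Illustrative thresholds / probes only; the real checks would mirror what the
# individual model files currently do.
IS_QWEN_SUPPORTED   = _TRANSFORMERS_VERSION >= Version("4.37.0")
IS_FALCON_SUPPORTED = find_spec("transformers.models.falcon_h1") is not None

# A model-specific file would then only need the boolean, e.g.:
#     from .capabilities import IS_FALCON_SUPPORTED
#     if IS_FALCON_SUPPORTED:
#         from transformers.models.falcon_h1.modeling_falcon_h1 import FalconH1ForCausalLM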

Contributor:

I agree - we'll do it for a future release

K = K.view(1, K_M, n_heads, head_dim)
V = V.view(1, V_M, n_heads, head_dim)
pass
else:

Collaborator:

Are there any efficiency gains in doing this? If not, can we use the same structure for both cases?

Contributor:

Actually, from internal checks, yes - view is faster than reshape, since reshape can possibly make a copy.

Collaborator:

Oh, actually I meant checking for requires_grad and doing separate operations for both...
If I understand correctly, you mean we can get away with view for inference but it's not compatible with training or something?
Otherwise, why can't we use view for both?
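
For context on the view-vs-reshape point, a quick standalone illustration (plain PyTorch, not Unsloth code): .view() never copies and errors out on non-contiguous tensors, while .reshape() silently falls back to a copy.

import torch

x = torch.randn(4, 8, 16)                 # contiguous
v = x.view(1, 4 * 8, 16)
print(v.data_ptr() == x.data_ptr())       # True: same storage, no copy

xt = x.transpose(0, 1)                    # non-contiguous
r = xt.reshape(1, 8 * 4, 16)
print(r.data_ptr() == xt.data_ptr())      # False: reshape had to copy
# xt.view(1, 8 * 4, 16) would raise a RuntimeError here instead of copying.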

K_M = V_M = bsz * kv_seq_len
Q_M = bsz * q_len

has_swa = isinstance(causal_mask, xformers.attn_bias.BlockDiagonalCausalMask)

Collaborator:

Falcon doesn't seem to have SWA (correct me if I'm wrong). We can remove this safely

Contributor (Author):

True, removing it.

past_key_value = (K, V) if use_cache else None

# Attention module
if (not HAS_FLASH_ATTENTION and attention_mask is None):

Collaborator:

Also, @danielhanchen, we should move the attention computation logic to a single file.
Any pre/post processing can exist in the model-specific files...

Contributor:

Yes agreed - was planning to do that!

)
mamba_hidden_states = mamba_hidden_states * self.ssm_out_multiplier

hidden_states = mamba_hidden_states + attention_hidden_states

Collaborator:

Can we combine L406, L414 and this into a single operation if possible, and please check that gradients flow properly? Or
let's reorder the operations the way they are done in transformers, to look consistent.

(see `past_key_values`).
past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
"""
if use_cache and hasattr(self, "_flag_for_generation"):

Collaborator:

@danielhanchen we should also refactor these a little.
Instead of having different code blocks doing essentially the same thing (except the norm), we can consolidate:

if use_cache and hasattr(self, "_flag_for_generation"):
	layernorm = fast_rms_layernorm_inference
else:
	layernorm = fast_rms_layernorm
...
Rest of the code together 

return outputs
pass

def _FalconH1_fast_forward_inference(attention_fast_forward_inference=LlamaAttention_fast_forward_inference, mlp_fast_forward_inference=fast_swiglu_inference):

Collaborator:

NIT: Let the default be FalconAttention_fast_forward_inference

hidden_states: torch.Tensor,
causal_mask = None,
attention_mask: Optional[torch.Tensor] = None,
mamba_attention_mask: Optional[torch.Tensor] = None,

Contributor:

@Datta0, how does unsloth handle forward kwargs? Does it get them directly from HF transformers?
Context: for Mamba, the mamba_mask is created here: https://github.com/huggingface/transformers/blob/51d732709e5ae424e8fb6c4e58b72057a3e413c2/src/transformers/models/falcon_h1/modeling_falcon_h1.py#L1305 - will we need to add similar logic in LlamaModel_fast_forward to incorporate the mamba mask?

Collaborator:

Yeah, if Mamba's decoder layer expects mamba_mask, then it's the job of FalconModel's forward function to handle that.
If the model's forward functionality is similar to Llama's, with this being the only change, we can use LlamaModel's forward while checking for the model type to be falcon.
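
For reference, the linked transformers code boils down to roughly the following (a paraphrased sketch, not the exact upstream implementation): the Mamba mask is just the 2D attention mask, and it is dropped entirely when it carries no information (decoding past the first step, or no padding anywhere).

import torch

def update_mamba_mask(attention_mask, cache_position):
    # Keep the 2D mask only when it actually matters for the SSM path.
    mamba_mask = attention_mask
    if cache_position[0] > 0 or (attention_mask is not None and torch.all(attention_mask == 1)):
        mamba_mask = None
    return mamba_mask

# Prefill with no padding -> no mamba mask needed.
print(update_mamba_mask(torch.ones(2, 8, dtype=torch.long), torch.tensor([0])))  # None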

@dhiaEddineRhaiem requested a review from Datta0 on June 8, 2025 at 18:36.

@Datta0 (Collaborator) commented Jun 9, 2025

@dhiaEddineRhaiem
Thanks for the changes. Everything looks good to me. If you can provide a small sample script to verify that our outputs match HF's (for training a few steps and for inference), that'd be great and we can get this merged.

@dhiaEddineRhaiem (Contributor, Author):

Many thanks @Datta0,
and thanks also to @younesbelkada for contributing to this.
I will be providing that shortly.

@dhiaEddineRhaiem (Contributor, Author) commented Jun 18, 2025

Hey @Datta0,
for training:

  1. With Unsloth training, we got the following loss for the first 8 steps using this command:

python unsloth-cli.py --model_name "tiiuae/Falcon-H1-0.5B-Base" --max_seq_length 2048 --dtype bfloat16 --load_in_4bit --r 64 --lora_alpha 32 --lora_dropout 0.1 --bias "none" --per_device_train_batch_size 1

{'loss': 1.765, 'grad_norm': 0.7072669863700867, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                     
{'loss': 1.6431, 'grad_norm': 0.7720947265625, 'learning_rate': 4e-05, 'epoch': 0.0}                                                                                     
{'loss': 2.8533, 'grad_norm': 1.7554939985275269, 'learning_rate': 8e-05, 'epoch': 0.0}                                                                                  
{'loss': 1.4332, 'grad_norm': 0.6112934947013855, 'learning_rate': 0.00012, 'epoch': 0.0}                                                                                
{'loss': 1.751, 'grad_norm': 0.4668850898742676, 'learning_rate': 0.00016, 'epoch': 0.0}                                                                                 
{'loss': 1.5516, 'grad_norm': 0.3460712432861328, 'learning_rate': 0.0002, 'epoch': 0.0}                                                                                 
{'loss': 1.8194, 'grad_norm': 0.49015215039253235, 'learning_rate': 0.00019949367088607596, 'epoch': 0.0}                                                                
{'loss': 1.8264, 'grad_norm': 0.3050038516521454, 'learning_rate': 0.0001989873417721519, 'epoch': 0.0}       
  2. With the following HF script (using the same settings), we ran this command:

python hf-cli.py --model_name "tiiuae/Falcon-H1-0.5B-Base" --max_seq_length 2048 --load_in_4bit --r 64 --lora_alpha 32 --lora_dropout 0.1 --bias "none" --per_device_train_batch_size 1

and got:

{'loss': 1.7969, 'grad_norm': 0.7279049158096313, 'learning_rate': 0.0, 'epoch': 0.0}                                                                                                             
{'loss': 1.6988, 'grad_norm': 0.7694766521453857, 'learning_rate': 4e-05, 'epoch': 0.0}                                                                                                           
{'loss': 2.9951, 'grad_norm': 1.805290937423706, 'learning_rate': 8e-05, 'epoch': 0.0}                                                                                                            
{'loss': 1.4775, 'grad_norm': 0.6704585552215576, 'learning_rate': 0.00012, 'epoch': 0.0}                                                                                                         
{'loss': 1.7497, 'grad_norm': 0.4692195653915405, 'learning_rate': 0.00016, 'epoch': 0.0}                                                                                                         
{'loss': 1.587, 'grad_norm': 0.37044161558151245, 'learning_rate': 0.0002, 'epoch': 0.0}                                                                                                          
{'loss': 1.8655, 'grad_norm': 0.520369291305542, 'learning_rate': 0.00019949367088607596, 'epoch': 0.0}                                                                                           
{'loss': 1.879, 'grad_norm': 0.3259218633174896, 'learning_rate': 0.0001989873417721519, 'epoch': 0.0}

The small differences may be the result of numerical differences between the kernels in Unsloth and the native kernels in HF, plus potentially LoRA initializations not being consistent.

For inference:
We can address the inference part in a follow-up PR, as it will require some time to make it work (we are still working on that on our end).

@danielhanchen (Contributor):

@dhiaEddineRhaiem Oh, apologies for the delay - is it possible to get a plot, in say Excel / Google Sheets, of the HF loss vs the Unsloth loss over 60 steps?

If they mainly match, then we can merge this :) Thanks!

@danielhanchen (Contributor):

Oh also is inference ok? I'm assuming a new PR will be needed?

Also, if it's possible to make a notebook and place it in the Unsloth notebooks repo, that'd also be cool - also, a reminder to add "contributed by the folks at TII" for example, and link your website / HF page for extra recognition :)

@dhiaEddineRhaiem (Contributor, Author):

@danielhanchen, many thanks for the review.
On our side, we will be preparing the Excel sheet and the notebook quickly.

For inference, we think it might be better to raise it in a separate PR.

@danielhanchen (Contributor):

Ok cool! I think it's because it's Mamba, right - so inference is different since there's no KV cache. A dumb approach: if inference is enabled, don't use the fast path, and use the original Mamba forward.

@dhiaEddineRhaiem (Contributor, Author):

Falcon H1 is a hybrid (parallel design with Mamba and attention).
For that, we designed a new FalconHybridMambaAttentionDynamicCache, a dynamic cache manager that tracks:

  1. key_cache / value_cache for attention layers

  2. conv_states / ssm_states for Mamba layers

We are still debugging on our side to make it work as expected.
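
For readers unfamiliar with the design, here is a simplified, illustrative sketch of the idea behind such a hybrid cache (not the actual transformers class): attention layers append K/V along the sequence dimension, while Mamba layers keep fixed-size conv/SSM states that are overwritten each step.

import torch

class HybridMambaAttentionCacheSketch:
    def __init__(self, num_layers):
        self.key_cache   = [None] * num_layers  # per-layer, grows with sequence length
        self.value_cache = [None] * num_layers
        self.conv_states = [None] * num_layers  # per-layer, fixed-size Mamba conv buffer
        self.ssm_states  = [None] * num_layers  # per-layer, fixed-size Mamba SSM state

    def update_attention(self, key, value, layer_idx):
        # Standard KV-cache behaviour: concatenate along the sequence dimension.
        if self.key_cache[layer_idx] is None:
            self.key_cache[layer_idx], self.value_cache[layer_idx] = key, value
        else:
            self.key_cache[layer_idx]   = torch.cat([self.key_cache[layer_idx], key], dim=-2)
            self.value_cache[layer_idx] = torch.cat([self.value_cache[layer_idx], value], dim=-2)
        return self.key_cache[layer_idx], self.value_cache[layer_idx]

    def update_mamba(self, conv_state, ssm_state, layer_idx):
        # Mamba state is recurrent: replace rather than concatenate.
        self.conv_states[layer_idx] = conv_state
        self.ssm_states[layer_idx]  = ssm_state
        return self.conv_states[layer_idx], self.ssm_states[layer_idx]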

@dhiaEddineRhaiem (Contributor, Author):

@danielhanchen, @Datta0, we managed to do 60 steps for FalconH1 with Unsloth and HF.
This is a plot of the losses obtained, with a dataframe containing them all:
loss_comparison.csv

[plot: Unsloth vs HF training loss over 60 steps]

Also, what is the best way to push the notebook? (Inside HF, or maybe inside this repo?)

@Datta0 (Collaborator) left a comment:

Thanks a lot for the work. Have a few minor comments. Rest looks good, especially the losses matching with HF :)

IS_GRANITE = self.config.model_type.startswith("granite")
IS_FALCON_H1 = self.config.model_type.startswith("falcon_h1")

if IS_FALCON_H1:

Collaborator:

NIT: Either move this to L790/803 or move that here, as the code is very similar.
Maybe we can do:

if IS_FALCON_H1 or IS_GRANITE:
    inputs_embeds *= self.config.embedding_multiplier

(fast_rms_layernorm_inference_gemma if IS_GEMMA else fast_rms_layernorm_inference)\
(self.norm, hidden_states)
if IS_FALCON_H1:
hidden_states = fast_rms_layernorm_inference(self.final_layernorm, hidden_states)

Collaborator:

NIT: Fix indentation

lm_head_device = lm_head.device

if self.config.model_type == "falcon_h1":
lm_head = lm_head * self.config.lm_head_multiplier

Collaborator:

Do we multiply lm_head or do we multiply the logits?
If the latter, we can look at merging this with L1189/L1205
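
A quick toy check of the equivalence behind this question (illustrative only; the multiplier value is a stand-in): scaling the lm_head weight by a constant gives the same logits as scaling the matmul output, since scalar multiplication commutes with the matrix product.

import torch

x = torch.randn(2, 16)             # hidden states
W = torch.randn(32, 16)            # lm_head weight (vocab x hidden)
c = 0.5                            # stand-in for config.lm_head_multiplier

logits_from_scaled_head    = x @ (W * c).T   # multiply the weight
logits_scaled_after_matmul = (x @ W.T) * c   # multiply the logits
print(torch.allclose(logits_from_scaled_head, logits_scaled_after_matmul, atol=1e-6))  # True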

@younesbelkada (Contributor):

Done addressing the comments @Datta0 ! Let us know if all is good

@Datta0 (Collaborator) left a comment:

LGTM. Great work :)

@dhiaEddineRhaiem (Contributor, Author):

Many thanks @Datta0, @danielhanchen, @younesbelkada!

@danielhanchen (Contributor):

Great work!

@danielhanchen merged commit a8f3d69 into unslothai:main on Jun 28, 2025.