Training a Vision model on a Text-Only Dataset #3199

ppraval · 2025-10-05T09:47:13Z

I'm planning to fine-tune Llama 3.2 11B Vision Instruct on a JSONL dataset of domain-specific question-answer pairs (purely text, no images). The goal is to improve its instruction-following behavior for specialized text tasks while retaining its ability to handle multimodal inputs such as OCR and image-based queries.

The examples include a sample .yaml file for this:

base_model: alpindale/Llama-3.2-11B-Vision-Instruct
# optionally might have model_type or tokenizer_type or processor_type
processor_type: AutoProcessor
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name


# these 3 lines are needed for now to handle vision chat templates w images
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false

chat_template: llama3_2_vision
datasets:
  - path: HuggingFaceH4/llava-instruct-mix-vsft
    type: chat_template
    split: train[:1%]
dataset_prepared_path:
val_set_size: 0.0
output_dir: ./outputs/out

adapter: lora
lora_model_dir:

sequence_len: 8192
pad_to_sequence_len: false

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_modules: 'model.language_model.layers.[\d]+.(mlp|cross_attn|self_attn).(up|down|gate|q|k|v|o)_proj'

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

bf16: true
fp16:
tf32: true

gradient_checkpointing: true
logging_steps: 1
# flash_attention: true  # use for text-only mode
sdp_attention: true

warmup_ratio: 0.1
evals_per_epoch: 1
saves_per_epoch: 1
weight_decay: 0.0

# save_first_step: true  # uncomment this to validate checkpoint saving works with your config

Based on it, I wrote a similar .yaml file:

base_model: alpindale/Llama-3.2-11B-Vision-Instruct
processor_type: AutoProcessor
tokenizer_config: <path_to_custom_tokenizer>
tokenizer_type: AutoTokenizer

# Vision-chat template handling
# skip_prepare_dataset: true
# remove_unused_columns: false
# sample_packing: false

chat_template: llama3_2_vision

datasets:
  - path: <path_to_dataset>
    type: chat_template
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      system: 
        - system
      user: 
        - user
      assistant: 
        - assistant
    train_on_inputs: false

output_dir: <path_to_output_directory>

# Training parameters
sequence_len: 8192
pad_to_sequence_len: false
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 1

optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002
weight_decay: 0.0
warmup_ratio: 0.1

# Precision & performance
bf16: true
fp16:
tf32: true

gradient_checkpointing: true
logging_steps: 1
flash_attention: true   # text-only mode
# sdp_attention: true

# Checkpointing
evals_per_epoch: 1
saves_per_epoch: 1
save_first_step: true
save_total_limit: 3

special_tokens:
  pad_token: <|end_of_text|>

But when I run
axolotl train config.yaml
with processor_type set:

base_model: alpindale/Llama-3.2-11B-Vision-Instruct
processor_type: AutoProcessor
tokenizer_config: <path_to_custom_tokenizer>
tokenizer_type: AutoTokenizer

I get the error
KeyError: 'Indexing with integers is not available when using Python based feature extractors'

But when I remove that field

base_model: alpindale/Llama-3.2-11B-Vision-Instruct
tokenizer_config: <path_to_custom_tokenizer>
tokenizer_type: AutoTokenizer

or even

base_model: alpindale/Llama-3.2-11B-Vision-Instruct
processor_type: AutoProcessor
tokenizer_config: <path_to_custom_tokenizer>

# Vision-chat template handling
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false

I get the error
AttributeError: 'MllamaTextSelfAttention' object has no attribute 'is_causal'

What happened here?
How does one do this?
Will this fine-tuning lead to loss of Vision Capabilities of the model?
Is there a guide to writing config.yaml files for different models?

Python Version: 3.12
Axolotl Version: Latest
Dataset: a .jsonl with

{
	"messages": 
	[
		{"role": "system", "content": "<system_prompt>"}, 
		{"role": "user", "content": "<question>"}, 
		{"role": "assistant", "content": "<answer>"}
	]
}

This dataset was previously used to fine-tune Llama 3.1 8B with the following config.yaml:

base_model: NousResearch/Meta-Llama-3.1-8B-Instruct
tokenizer_config: <path_to_custom_tokenizer>
tokenizer_type: AutoTokenizer

chat_template: llama3
datasets:
  - path: <path_to_dataset>
    type: chat_template
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      system:
        - system
      user:
        - user
      assistant:
        - assistant
train_on_inputs: false

output_dir: <path_to_output_directory>

sequence_len: 2048
sample_packing: true


gradient_accumulation_steps: 8
micro_batch_size: 2
num_epochs: 4

optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5

bf16: auto
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
auto_resume_from_checkpoints: true
save_only_model: false


logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 2
saves_per_epoch: 1
save_total_limit: 3
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>

Thank you.

NanoCode012 · 2025-10-06T04:01:16Z

NanoCode012
Oct 6, 2025
Maintainer

Hello, you're almost there. The key is processor_type: AutoProcessor. If you set this, training enters multimodal mode. You want text-only mode.

What you should do instead is take a regular llama3 config (like the NousResearch one), change the model path, and train on your data. That's all :)

If you're worried about vision being lost, you can do a short training run on mllama and then verify that the result still loads in your inference tools.
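
The config rules mentioned in this thread can be collected into a small pre-flight check. This is only a sketch: `preflight` and `VISION_KEYS` are hypothetical names, not part of Axolotl, and the rules are taken from the example config comments and the errors reported here rather than from Axolotl's own validation.

```python
# Hypothetical helper, not part of Axolotl: it only encodes the rules of thumb
# mentioned in this thread for Llama 3.2 Vision configs.
VISION_KEYS = {
    "skip_prepare_dataset": True,
    "remove_unused_columns": False,
    "sample_packing": False,
}


def preflight(cfg):
    """Return a list of warnings for an already-parsed config dict."""
    warnings = []
    if cfg.get("processor_type"):
        # Per the maintainer's note, this switches training to multimodal mode.
        warnings.append("processor_type is set: multimodal training mode")
        for key, expected in VISION_KEYS.items():
            if cfg.get(key) is not expected:
                warnings.append(
                    f"{key} should be {str(expected).lower()} in multimodal mode"
                )
    if cfg.get("flash_attention"):
        # Based on the AttributeError reported in this thread, not on the docs.
        warnings.append("flash_attention was reported to fail with Mllama; try sdp_attention")
    return warnings


text_only = {
    "base_model": "alpindale/Llama-3.2-11B-Vision-Instruct",
    "chat_template": "llama3",
    "sample_packing": True,
    "sdp_attention": True,
}
mixed = dict(text_only, processor_type="AutoProcessor", flash_attention=True)

print(preflight(text_only))
for warning in preflight(mixed):
    print("-", warning)
```

The text-only dict passes without warnings, while the mixed one is flagged for each setting that conflicts with multimodal mode.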

5 replies

ppraval Oct 7, 2025
Author

Hi @NanoCode012, thank you very much for the clear and helpful guidance! The note about processor_type: AutoProcessor and switching to a regular LLaMA3 config was exactly what I needed. I appreciate your time and insight.

Answer selected by NanoCode012

ppraval Oct 10, 2025
Author

@NanoCode012 I have another question, if you don't mind helping me understand.

I used a standard llama3 config:

base_model: alpindale/Llama-3.2-11B-Vision-Instruct
tokenizer_config: ./itai_tokenizer
tokenizer_type: AutoTokenizer

chat_template: llama3
datasets:
  - path: ./income_tax_finetune.jsonl
    type: chat_template
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      system:
        - system
      user:
        - user
      assistant:
        - assistant
train_on_inputs: false

output_dir: ./outputs/it_1_text_only

sequence_len: 2048
sample_packing: true


gradient_accumulation_steps: 8
micro_batch_size: 2
num_epochs: 4

optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5

bf16: auto
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
auto_resume_from_checkpoints: true
save_only_model: false


logging_steps: 1
# flash_attention: true
sdp_attention: true

warmup_ratio: 0.1
evals_per_epoch: 2
saves_per_epoch: 1
save_total_limit: 3
weight_decay: 0.0
special_tokens:
  pad_token: <|end_of_text|>

and then ran inference on the model using this code:

from transformers import MllamaForCausalLM, AutoTokenizer
import torch

def run_inference():
    # Paths
    # model_path = ""
    model_path = ""
    tokenizer_path = ""

    # Load tokenizer from your custom path
    tokenizer = AutoTokenizer.from_pretrained(tokenizer_path, use_fast=False)

    # Load model, allow size mismatch just in case
    model = MllamaForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16,
        device_map="auto",
        ignore_mismatched_sizes=True
    )

    # Ensure embeddings match tokenizer
    model.resize_token_embeddings(len(tokenizer))

    # Conversation
    conversation = [
        {"role": "system", "content": "<system_prompt>"},
        {"role": "user", "content": "<question>"}
    ]

    formatted_prompt = tokenizer.apply_chat_template(
        conversation,
        tokenize=False,
        add_generation_prompt=True
    )
    print("Formatted prompt:\n", formatted_prompt)

    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            # temperature=0.7,   
            # top_p=0.0,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id
        )

    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print("\n=== FULL RESPONSE ===")
    print(full_response)

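    # Decode only the newly generated tokens instead of splitting on the
    # word "assistant", which breaks if the answer itself contains that word.
    prompt_len = inputs["input_ids"].shape[1]
    assistant_response = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()
    print("\n=== EXTRACTED ASSISTANT RESPONSE ===")
    print(assistant_response)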

if __name__ == "__main__":
    run_inference()

I got the output

istrovstvíSections 10(23FCA)Section 115TC(2)(i)Section 115BAC(2)(ii)(a)Section 115TC(2)(zzw)Section 269M(5)Rule 2BAmarket linked debentureRule 11UD(a)financial yearSection 47(xiizzzzzzl)Section 35CCA(2)Section 206C(3ZZZZZZZS)Prescribed InformationSection 32Section 263(1)(iii)Section 92CC(5)Section 133A(3)(ii)Section 54ED(3)(a)Rule 42(2)(iii)Form No. 3CF‑IIRule 37BA(5)Section 124(4)Section 286(1)(k)GenerationStrategySection 10C(2)(a)Rule 8B(1)(b)Section 32A(2)(d)Section 245A(d)Sub‑section (3E)1st April 2017Section 280B(a)Section 245-OA(3)(i)Section 35AD(8)(b)Section 140B(3)(i)Section 226(8)Section 2(1)(ta)Section 102(7)Section 115AC(2)80JJASection 80HHE(1B)(iii)Rule 10TD(3)(ii)Rule 40BA(2)Section 245A(b)(iv)Section 23(3)(b)Rule 48E(2)(g)Rule 8BA(2)Section 272AA(2)Communal Harmonydomestic companiesSection 158BE(4)(i)Rule 37BBBA(2)Rule 112(8A)Section 245T(4)Rule 10TFSections 208, 140ATax on capital gainsseized materialRule 17A(3)(ii)CodeAt23 ofRule 121A(2)Section 269UO(d)TonnageSection 133B(2)(e)Section 115JB(2A)(c)Rule 11UAE(3)(a)conversion into moneySection 80D(5)Section 139B(4)Section 116(i)Rule 73(1)Foreign ExchangeSection 13B(3)Section 269T(1)(d)Section 112(1)(c)Section 44AF(1)Section 115VX(1)(b)(i)(a)Section 80C(2)(xiiia)uyếtreySection 285BA(7)recognised provident fund1st April, 2021Section 9A(4)(f) rencontSection 88158BGSection 54EE(3)(a)Section 92A(2)Section 115JHrychITTERSection 47(vii)(a)

Section 115JG(2) ExplanationSection 10B(6)Section 184(4)Section 246(1)(j)Section 80G(4)(A)Section 115WDRule 10CB(1)(c)(i)Section 239A(1)(b)Section 115TC(2)(zzw)Section 293A(2)(c)Section 144B(6)(vi)Rule 44H(5)Section 287A(2)(f)Section 292C(1)(b)advance pricing agreementSection 252A(1)(b)stakingSection 115VX(2)(ii)Rule 28AA(1)ismetSection 245BA(6B)Section 112A(1)(a)(i)Rule 12D(4)Rule 44C(3)(g)urette245Tuz TrevSection 254.scalablytypedSection 60Section 115VZ(1)Sections 220 to 232BSection 58(1)(c)Section 134(1)Section 89A(4) HOLDERSSection 115V-O(1)(i)Section 92BA(vb)Rule 11RA(5)wilful attemptSection 115JBSection 115BAB(2)(b)(i)Section 80TTA(1)(c)Section 47(v)(a)Section 115BA(2)(a)(ii)ýtRule 21AAA(2)Section 133A(3)Rule 11TążRule 114‑I(1)Section 47(xiizzzb)Section 151(2)(iii)Section 115TC(2)(zy)Section 285BA(374)2025-26Minimum additionalSection 80QQB(3)(c)Section 158BC(1)(b)Notifications under Section 197A(1F)Section 27(iiiaa)Excluded transactionsRule 31A(6)(ii)wilRule 44E(5)Section 133(1)(d)Rule 10F(b)Section 115AC(2)(a)Rule 128(1)Section 180A(11)Section 35AD(5)(ak)iteralsSection 133A(1)(iii)Section 285BA(49)80GGCSection 115JB(7)Section 407Section 139C(1)Section 80HHE(3)Section 270A(3)(iii)Section 80-IBA(2)(a)(i)Explanation to Section 80-IA(4)(iv)(c)Section 115VD(3)(iii)Rule 10TE(6)Rule 10V(1)Section 285BA(66)quiaEquity Linked SavingsDepositories Act, 1996Section 3(36)Section 115VD(1)(j)mutatis mutandisRule 125(3)Section 40(ba)Chapter VI-BClause (xxiv)Section 92CC(9)Rule 10H(9)SPVSection 115BBI(2)(b)Section 12AC(2)(c)Section 144B(3)(v)Section 115TC(2)(h)Section 93(4)Section 115ACA(a)(ii)Section 10(20)Section 80‑IBA(2)(e)Section 42(2)(b)Section 245A(f)Section 88E(4)Rule 21A(3)(i)any directorForm No. 
10BBBPart IISection 245W(2)(b)Section 246A(1)(e)Rule 114(2)Section 198(1)Section 12AB(1)(d)Section 10(29A)(b)Section 115JG(3)(iii)Section 80U(4)Section 270A(7)(a)Section 170A(3)(b)234BSection 116(cc)Section 271AAB(1)(a)(i)Rule 17C(1)Section 156(2)(b)Section 47(xiizza)Section 276B(b)(iii)Form No. 15D167BTax Return PreparerSection 285BA(295)Rule 65Section 139BRule 30(1)(d)Rule 10MA(4) ProvisoSection 245BA(3)any other allowanceSection 80CCG(2)Specified proceedingForm No. 10CCQSection 112A(2)(ii)Joint Directors of Income-taxnotified institutionsSection 264B(1)(a)Section 115WB(2)(E)(vi)Gross Annual ValueSection 115J(4)tonnage tax businessSection 295(2)(h)Section 54B(1)(i)Section 277(1)Beneficial OwnerSection 285BA(380)Section 115VT(3)(b)Section 269-UD(1)Section 115WKC(4)Section 80-IBA(2)(c)geoisSections 251Section 110(a)Section 269M(1)(a)Exclude freightSection 245BC(2)(b)Section 145(2B)Section 151(2)Section 115AD(3ZZZZZZR)kieRules 48–57Section 13(2)Section 275ASection 115WE(1A)Rule 6AB(1)(e)CBDT circularsSection 228A(1)Rule 114DSection 271AAB(1)(a)(ii)Section 245AA(3)(b)Section 115WC(1)(D)Section 245A(m)amalgamating companyForm No. 
10BSection 115R(2)(i)Section 139AA(iv)271ESection 80HHE(b)aravelForm 16DSection 269UB(3)(b)Rule 28(3)(i)Rule 30(6A)Section 295(2)(b)Section 259(2)(a)Section 47(xiizzzzc)Sections 158BESection 115VR(2)accoSection 80JJA(5)60/2018Section 115WE(1)(c)(i)limited liability partnershipSection 45(2A)Section 297(2)(l)reibSection 9A(8A)Rule 37CA(1)(ii)Section 92BA(vb)Section 80‑IA(10)Section 286(9)(l)Section 2(1)(q)Section 11(1)(c)(i)Section 144B(7)(ix)private discretionarySection 115AD(3ZZZG)Rule 10TA(1)(iv)Section 271AAB(1A)(a)(i)Rule 6G(1)(a)Section 155(5L)Section 54EC(1)(a)Section 47(xiizl)Section 115BAC(2)(iii)Set‑off of LossSection 206C(3ZZZA)Excess interestTaxable salarySection 272A(2)(m)ernerWealth-tax Act, 1957Section 10(6B)Section 47(xiizg)Section 144BA(3)Paragraph 3Section 80HHB(2)(b)(iii)Rule 40(1)(E)Annexure VSection 35(5)claim disallowedSection 115AD(3ZZZZZZB)Section 151A(2)(ii)Section 43D(f)Rule 31A(2)(b)Section 269UO(a)Rule 6ABA(1)(d)Section 269N(a) Section 269UO(a)Rule 10UD(1)(i)Section 115WKA(2)(d)Section 269UA(b)(2)(i)Section 245MA(2)(b)(iii)Section 192ASection 153CRule 31(3)(v) مجSection 285BA(207)Section 115WB(1)(c)Rule 47Section 232(5)Section 160(2)Sections 272BRule 41BRule 11UA(1)(c)(b)(L)245CSection 112A(2)(ii)Rule 10H(3)Section 80EEB(5)(b)(ii)Section 115BBHSection 35CCA(2)(e)Section 2(25A)èoSection 133B(2)(a)Section CodeSection 115R(2)(b)Section 115JA(2)(v)Rule 48K(1) DünForm No. 
35ASection 80AC(1)(b)Sections 166Section 194N(a)Clause (xii)(b)Section 245D(6)infrastructure facilitySection 245T(1)(c)Section 97(1)(f)Category II AIFSection 91(4)Section 80-IA(3)(ii)Winnings coveredegersequity sharesSection 35ERule 11UAD(1)(v)auditorSection 234A(3)(c)Section 33(1)(b)(iii)(b)Section 167B(2)Section 142B(2)Section 31(3)Section 35AD(5)(ii)Section 285BA(446)ICDS IIISection 115BAB(2)(b)Section 80-IB(10)(e)Section 176(5)(a)Section 80CCH(1)Section 115TC(2)(zr)Rule 31A(2)(iii)EFAULTningerSection 286(9)(d)(i)Section 245F(1)Section 115V(2)(e)Section 115JA(1A)Rule 10TB(1)(iv)alseSection 10B(1A)1st April, 201943/2017House Rent AllowanceSection 115UA(2)(i)Finance Act, 1988Section 194J(3)Section 33B(2)(a)Section 172(1) ProvisoSection 245Q(2)Section 206C(3ZZZO)Rule 12CB(1)(b)ilogySection 285BA(31)Section 118(1)(b)Section 47(vii)346Rule 16F(2)Section 234C(1)(b)(iii)Section 144C(8)(b)Rule 12B(5)Section 47(xiizzzq)skoquoted sharesSections 139(4A)Section 97(5)any other propertyRule 42Section 197A(2)Section 59(1)(b)Section 250(7)Rule 44G(1)Section 285BA(440)Rule 112D(2)ivicンダRule 46A(2)Section 155(10E)Section 9B(i)Section 88E(2)(d)Section 33AC(1)(b)Fourth ScheduleSection 72A(4)Section 44AARule 133(4)(iii)IntelligenceRule 10D(1)(c)–(f)acadesSection 285BA(250)Section 16(iia)Section 115QD(2)azinesSection 124(3)(c)nature of incomeSection 273A(4)Rule 11Q(3)Rule 48K(3)Section 245BD(3)Rule 8B(1)(b)Section 245HA(1)(iii)Section 45(1A)(ii)LastErrorSection 115ACA(1)(ii)(B)Rule 114-I(1)(d)deenspecified sumRule 10UOCarry ForwardSection 115V-I(4)(b)Excess PaymentRule 114A(1)(b)Specified incomeSection 35A(1)Section 80DD(1)Section 282A(4)ситSection 206C(3ZZZZZZC)Section 285BA(176)Section 273(1)(a)Section 115V(2)(d)Section 115C(f)(iv)Form 16ASection 234F(1)Section 115VK(4)(c)̧Rule 19AE(4)Section 115WC(2)Rule 10D(4)(vi)Prescribed ParticularsulpSection 206CB(1)(b)(v)Section 144B(6)(i)(A)Rule 21AJE(8)(vii)Section 80‑IC(3)(i)Section 285B(1)Section 115ACAVOKE

which is just a jumble of the custom tokens I added to the tokenizer. This is the tokenizer I used to train Llama-3.2-11B-Vision:

base_model: alpindale/Llama-3.2-11B-Vision-Instruct
tokenizer_config: ./itai_tokenizer
tokenizer_type: AutoTokenizer

except that this tokenizer was created with code that looks like

    def create_tokenizer(self):
        # Load the base tokenizer
        tokenizer = AutoTokenizer.from_pretrained("NousResearch/Meta-Llama-3.1-8B-Instruct")

should this tokenizer have been from alpindale/Llama-3.2-11B-Vision-Instruct?
or is this fine since I used chat_template: llama3 to train the model along with the tokenizer of NousResearch/Meta-Llama-3.1-8B-Instruct?

Also, starting from these settings:

logging_steps: 1
# flash_attention: true
sdp_attention: true

if I enable flash_attention instead, I get the error

AttributeError: 'MllamaTextSelfAttention' object has no attribute 'is_causal'

Why is that, even though the example config for Llama 3.2 Vision says

gradient_checkpointing: true
logging_steps: 1
flash_attention: true  # use for text-only mode

Could you or someone else please help me out on what the issue might be?
Also could you tell me how to learn more on this? I would really appreciate it.

Thank You.

NanoCode012 Oct 13, 2025
Maintainer

Hey!

> should this tokenizer have been from alpindale/Llama-3.2-11B-Vision-Instruct?

What is the reason for using a different model's tokenizer?

> or is this fine since I used chat_template: llama3 to train the model along with the tokenizer of NousResearch/Meta-Llama-3.1-8B-Instruct?

I would recommend using the tokenizer that ships with the model. If you want to add custom tokens, you can do so with the tokens: or added_tokens_overrides: feature.

> if I set Flash Attention I get the error AttributeError: 'MllamaTextSelfAttention' object has no attribute 'is_causal'

Could you clarify your setup? Text-only mode means that processor_type: isn't provided. Is that how it was? Could you provide a repro and a fuller trace?
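
One possible reason a tokenizer built from another model misbehaves, offered as a hedged sketch rather than a confirmed diagnosis of this thread: tokens added to a base tokenizer normally receive the next free ids after the base vocabulary, and those ids may already be in use by the target model. The two constants below are assumptions taken from the public model configs and should be checked against your own checkpoints; `assign_added_token_ids` and the token strings are hypothetical.

```python
# Assumed values from the public model configs; verify against your checkpoints.
LLAMA31_VOCAB_SIZE = 128_256     # Llama 3.1 tokenizer, ids 0..128255
MLLAMA_IMAGE_TOKEN_ID = 128_256  # id of <|image|> in Llama 3.2 Vision


def assign_added_token_ids(base_vocab_size, new_tokens):
    """Mimic how added tokens get the next free ids after the base vocabulary."""
    return {tok: base_vocab_size + i for i, tok in enumerate(new_tokens)}


# Hypothetical custom tokens, in the style of the ones seen in the output above.
custom = ["Section 115JB", "Rule 11UA", "Form 16A"]
ids = assign_added_token_ids(LLAMA31_VOCAB_SIZE, custom)

# Any custom token that lands on the image token id would clash with the
# meaning the vision model already gives to that id.
collisions = [tok for tok, i in ids.items() if i == MLLAMA_IMAGE_TOKEN_ID]
print(ids)
print("collides with <|image|>:", collisions)
```

Under these assumptions the first added token lands on the id that the vision model uses for its image placeholder, which is one more reason to start from the model's own tokenizer.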

ppraval Oct 13, 2025
Author

Hi, I have figured it out:

# Llama-3.2 11B Vision Instruct — text-only SFT while preserving multimodality
# Key changes:
# - Use the model's AutoProcessor + llama3_2_vision chat template
# - Disable Flash-Attention, use SDPA
# - LoRA on language self-attn + MLP only; freeze vision/cross-attn by default
# Docs:
#   - HF Mllama: https://huggingface.co/docs/transformers/en/model_doc/mllama
#   - Axolotl multimodal: https://docs.axolotl.ai/docs/multimodal.html
#   - Chat templating (multimodal): https://huggingface.co/docs/transformers/en/chat_templating_multimodal

base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
# model_type: MllamaForCausalLM

# Use the model's processor/tokenizer. Do NOT point at a 3.1 tokenizer.
processor_type: AutoProcessor
processor_config: ./<self_made_processor>
chat_template: llama3_2_vision

datasets:
  - path: ./<dataset>.jsonl
    type: chat_template
    field_messages: messages
    message_property_mappings:
      role: role
      content: content
    roles:
      system: [system]
      user: [user]
      assistant: [assistant]
train_on_inputs: false

# Multimodal plumbing even if training is text-only
skip_prepare_dataset: true
remove_unused_columns: false
sample_packing: false

output_dir: ./outputs/<model>

# Sequence + training
sequence_len: 1024
gradient_accumulation_steps: 8
micro_batch_size: 2
num_epochs: 4

# Optim
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 2e-5
weight_decay: 0.0

# Precision
bf16: true
tf32: false

# Checkpointing
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
auto_resume_from_checkpoints: true
save_only_model: false

# Logging/saving
logging_steps: 1
evals_per_epoch: 2
saves_per_epoch: 1
save_total_limit: 2

# Attention kernel
# flash_attention: false   # keep disabled for Mllama
sdp_attention: true        # use PyTorch SDPA

# Pad token aligned to model defaults
special_tokens:
  pad_token: "<|end_of_text|>"

# LoRA to preserve multimodality: adapt ONLY language blocks
# adapter: lora
# lora_r: 16
# lora_alpha: 32
# lora_dropout: 0.05
# # Regex targets language self-attn and MLP projections, excludes cross-attn and vision
# # See PEFT issue on regex patterns and module naming:
# #   https://github.com/huggingface/peft/issues/2165
# # lora_target_modules: >
# #   model.language_model.layers.[\d]+.(self_attn|mlp).(q_proj|k_proj|v_proj|o_proj|up_proj|down_proj|gate_proj)
# lora_target_modules: 'model.language_model.layers.[\d]+.(self_attn|mlp).(q|k|v|o|up|down|gate)_proj'

A config such as this works. Flash attention does not work with this multimodal model (Mllama), so I use SDPA instead.
I also tried it with tokenizer_config:, but the model just produced a random, jumbled mess of the tokens that I added.
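
A quick way to see which modules a `lora_target_modules` regex would select, as a sketch under two assumptions: that a string pattern is applied to each module name with a full match (as PEFT does for string `target_modules`), and that the module names look like the hypothetical ones listed here. It also shows why the backslash matters: YAML single-quoted and folded scalars keep backslashes literally, so a doubled backslash reaches the regex engine as "backslash or d" instead of "digit".

```python
import re

# Pattern as it arrives from a YAML single-quoted scalar with one backslash.
PATTERN = r"model.language_model.layers.[\d]+.(self_attn|mlp).(q|k|v|o|up|down|gate)_proj"
# Same pattern with a doubled backslash: the class now means "backslash or d".
DOUBLED = r"model.language_model.layers.[\\d]+.(self_attn|mlp).(q|k|v|o|up|down|gate)_proj"

# Hypothetical module names, for illustration only.
NAMES = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.12.mlp.gate_proj",
    "model.language_model.layers.3.cross_attn.k_proj",
    "model.vision_model.transformer.layers.0.self_attn.q_proj",
]


def select(pattern, names):
    """Return the names that the pattern matches in full."""
    return [name for name in names if re.fullmatch(pattern, name)]


selected = select(PATTERN, NAMES)
selected_doubled = select(DOUBLED, NAMES)
print(selected)
print(selected_doubled)
```

With the single backslash, only the language self-attention and MLP projections are selected and the cross-attention and vision modules are left out; with the doubled backslash nothing matches at all.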

Thanks

NanoCode012 Oct 14, 2025
Maintainer

> I tried it with tokenizer_config: but it just gave me a random jumbled mess of the tokens that I added.

Could you clarify a bit more what you mean?
