LoRA fine-tuned model produces identical inference results to base model + checkpoint loading warnings #176
Description
Environment:
- Model: nvidia/Cosmos-Predict2-2B-Video2World
- Training: LoRA (rank=16, alpha=16)
- Hardware: 2x GPUs with FSDP
- Dataset: 100 robot manipulation videos (77 frames, 720p @ 16fps)
Issue Description
After successfully completing LoRA post-training for 500 iterations, the fine-tuned model generates identical outputs to the base model during inference. Additionally, I observe warnings during checkpoint loading that suggest potential issues with LoRA weight application.
Training Setup
Training completed successfully:

```bash
export EXP=predict2_video2world_lora_training_2b_cosmos_nemo_assets
torchrun --nproc_per_node=2 --master_port=12341 \
  -m scripts.train \
  --config=cosmos_predict2/configs/base/config.py -- \
  experiment=${EXP} \
  job.name=77f_test_lora \
  model.config.fsdp_shard_size=2 \
  model.config.train_architecture=lora \
  dataloader_train.dataset.num_frames=77 \
  trainer.max_iter=1000 \
  checkpoint.save_iter=500
```

Training configuration (from config.yaml):
```yaml
model:
  train_architecture: lora
  lora_rank: 16
  lora_alpha: 16
  lora_target_modules: q_proj,k_proj,v_proj,output_proj,mlp.layer1,mlp.layer2
  fsdp_shard_size: 2
optimizer:
  lr: 4.315837287515549e-05  # 2**(-14.5)
  type: fusedadamw
scheduler:
  type: lambdalinear
  warm_up_steps: [0]
  cycle_lengths: [1000]
  f_max: [0.6]
trainer:
  max_iter: 1000
  batch_size: 1
```

LoRA injection confirmed during training:
```
[INFO] LoRA injection successful: 22,937,600 trainable parameters out of 1,979,351,040 total (1.159%)
[INFO] LoRA parameter breakdown:
  lora_A: 11,010,048 parameters
  lora_B: 11,927,552 parameters
  Total LoRA: 22,937,600 parameters
```
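
For completeness, this is roughly the check I used to confirm that the trainable parameters are all LoRA weights (a minimal sketch; `model` stands for the instantiated network before FSDP wrapping, and the `lora_A`/`lora_B` name filters are assumptions about how the injected parameters are named):

```python
# Minimal sketch: list trainable parameters and confirm they are all LoRA weights.
# `model` is the instantiated network before FSDP wrapping; the "lora_A"/"lora_B"
# name filters are assumptions about how the injected parameters are named.
def summarize_lora_params(model):
    total = sum(p.numel() for p in model.parameters())
    trainable = {n: p.numel() for n, p in model.named_parameters() if p.requires_grad}
    lora_a = sum(v for n, v in trainable.items() if "lora_A" in n)
    lora_b = sum(v for n, v in trainable.items() if "lora_B" in n)
    non_lora = [n for n in trainable if "lora" not in n.lower()]
    print(f"trainable {sum(trainable.values()):,} / {total:,} "
          f"({100.0 * sum(trainable.values()) / total:.3f}%)")
    print(f"lora_A: {lora_a:,}  lora_B: {lora_b:,}")
    if non_lora:
        print("WARNING: non-LoRA parameters marked trainable:", non_lora[:5])
```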
Dataset:
- 100 robot arm manipulation videos
- All videos: exactly 77 frames (verified with ffprobe)
- Resolution: 720p (1280x704) @ 16fps
- Content: First-person view robot manipulation tasks
- All videos have corresponding text prompts
Problem 1: Identical Inference Results
Inference with base model:

```bash
python scripts/hf_video2world_lora.py \
  /workspace/dream-outputs/test_ori_model \
  --prompt dream-datasets/prompts/test_prompt.txt \
  --image dream-datasets/picture/test_robot_1.jpg \
  --model nvidia/Cosmos-Predict2-2B-Video2World \
  --height 720 --width 1280 \
  --fps 16 --frames 77 --steps 35 -v
```

Inference with LoRA fine-tuned model:
```bash
python scripts/hf_video2world_lora.py \
  /workspace/dream-outputs/test_model \
  --prompt dream-datasets/prompts/test_prompt.txt \
  --image dream-datasets/picture/test_robot_1.jpg \
  --model nvidia/Cosmos-Predict2-2B-Video2World \
  --lora_checkpoint checkpoints/posttraining/video2world_lora/77f_test_lora/checkpoints/model/iter_000000500.pt \
  --height 720 --width 1280 \
  --fps 16 --frames 77 --steps 35 -v
```

Result: Both outputs are visually identical (a frame-by-frame comparison shows no differences).
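
The frame-by-frame comparison was along these lines (a minimal sketch assuming imageio with an ffmpeg backend; the output file names are placeholders):

```python
# Minimal sketch of the frame-by-frame comparison (output file names are placeholders).
import imageio.v3 as iio
import numpy as np

base = iio.imread("/workspace/dream-outputs/test_ori_model/output.mp4")  # (T, H, W, 3)
lora = iio.imread("/workspace/dream-outputs/test_model/output.mp4")

assert base.shape == lora.shape
diff = np.abs(base.astype(np.int16) - lora.astype(np.int16))
print("max per-pixel diff:", diff.max(), "| mean diff:", diff.mean())
# In my runs both values are ~0, i.e. the two videos are effectively identical.
```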
Problem 2: Checkpoint Loading Warnings
During LoRA checkpoint loading at inference time, I see these warnings:

```
Loading LoRA checkpoint: checkpoints/.../iter_000000500.pt
✅ Found 1120 LoRA parameters
Missing keys: 567
Unexpected keys: 1120
✅ LoRA weights loaded! Scale: 1.0
```
Questions:
- What do "Missing keys: 567" and "Unexpected keys: 1120" indicate?
- Does this mean the LoRA weights are not being properly applied to the model?
- Is this why inference results are identical to the base model?
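
To make the mismatch concrete, this is the kind of inspection I can run on the checkpoint versus the model (a minimal sketch; the `"model"` nesting key and the `pipe.model` handle are assumptions, not confirmed against the repo's loader):

```python
# Minimal sketch: compare checkpoint keys against the model's state_dict keys to see
# whether the mismatch is a prefix/naming issue (e.g. "net." vs "model.") or whether
# the LoRA tensors are genuinely not mapped onto the network.
# Assumptions: the weights may be nested under a "model" key, and `pipe.model` is a
# placeholder for the network object used by the inference script.
import torch

ckpt_path = "checkpoints/posttraining/video2world_lora/77f_test_lora/checkpoints/model/iter_000000500.pt"
state = torch.load(ckpt_path, map_location="cpu")
state = state.get("model", state) if isinstance(state, dict) else state

ckpt_keys = set(state.keys())
model_keys = set(pipe.model.state_dict().keys())

print("sample checkpoint keys:", sorted(ckpt_keys)[:3])
print("sample model keys:     ", sorted(model_keys)[:3])
print("missing (in model, not in ckpt):   ", len(model_keys - ckpt_keys))
print("unexpected (in ckpt, not in model):", len(ckpt_keys - model_keys))
print("LoRA keys in checkpoint:", sum("lora" in k.lower() for k in ckpt_keys))
```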
Additional Observation: Training Loss Behavior
During training, the loss shows unusual patterns:
```
Iter 1-8:  Loss oscillates between 0.6-0.8 (normal range)
Iter 9:    Loss: 4.1003  ← sudden spike!
Iter 10+:  Loss returns to 0.6-0.8
...
Iter 310:  Loss: 2.2639  ← another spike
...
Iter 500:  Loss: 0.5779
```
The loss does not show a clear downward trend and has occasional large spikes. Not sure if this is related to the inference issue.
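
A small sketch of how I'd smooth the logged losses to look for a trend underneath the spikes (`losses` is a placeholder for per-iteration values parsed from the training log):

```python
# Small sketch: EMA-smooth the per-iteration losses to see whether there is a
# downward trend underneath the occasional spikes. `losses` is a placeholder
# for the values parsed from the training log.
def ema(values, beta=0.98):
    smoothed, avg = [], values[0]
    for v in values:
        avg = beta * avg + (1 - beta) * v
        smoothed.append(avg)
    return smoothed

# losses = [...]               # one value per iteration, parsed from the log
# print(ema(losses)[::50])     # inspect the smoothed curve every 50 iterations
```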
Questions
- Are the checkpoint loading warnings ("Missing keys", "Unexpected keys") preventing LoRA weights from being applied during inference?
- How can I verify that LoRA weights are actually being used during inference? Is there a way to check if the model behavior has changed? (See the sketch after this list.)
- Is it expected for LoRA fine-tuning to produce visible differences with only 500-1000 iterations? Should I train for longer?
- Are there any known issues with LoRA training on Cosmos-Predict2-2B-Video2World? Should I consider full fine-tuning instead?
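
For the verification question above, what I have in mind is something like the following (a minimal sketch; `pipe.model` is a placeholder for however the inference script exposes the network after loading the LoRA checkpoint):

```python
# Minimal sketch: after the LoRA checkpoint has been "loaded", check whether the
# lora_B matrices are non-zero. lora_B is typically initialized to zero, so if the
# checkpoint weights never land on the model, the LoRA branch contributes nothing
# and the output matches the base model exactly. `pipe.model` is a placeholder.
nonzero, total = 0, 0
for name, param in pipe.model.named_parameters():
    if "lora_B" in name:
        total += 1
        if param.detach().abs().max().item() > 0:
            nonzero += 1
print(f"lora_B tensors with non-zero values: {nonzero}/{total}")
# 0/total would mean the fine-tuned weights were never actually applied.
```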
What I've Verified
✅ All 100 training videos are exactly 77 frames (verified with ffprobe)
✅ Training completes without errors
✅ LoRA parameters are injected successfully (22.9M trainable params)
✅ Checkpoint files are saved correctly
✅ Using the same prompt and image for both base and fine-tuned inference
Expected Behavior
After LoRA post-training, I expect the fine-tuned model to generate outputs that differ from the base model and are more aligned with my robot manipulation training data.
Request
Could you help clarify:
- Whether the checkpoint loading warnings indicate a problem
- If there are additional steps needed to properly load/apply LoRA weights during inference
- Recommended training parameters or best practices for LoRA on Video2World
Thank you!