
LoRA fine-tuned model produces identical inference results to base model + checkpoint loading warnings #176

@WillKang-1204

Description


Environment:

  • Model: nvidia/Cosmos-Predict2-2B-Video2World
  • Training: LoRA (rank=16, alpha=16)
  • Hardware: 2x GPUs with FSDP
  • Dataset: 100 robot manipulation videos (77 frames, 720p @ 16fps)

Issue Description

After LoRA post-training completes successfully (500 iterations), the fine-tuned model generates outputs identical to the base model's during inference. Additionally, I observe warnings during checkpoint loading that suggest the LoRA weights may not be applied correctly.

Training Setup

Training completed successfully:

export EXP=predict2_video2world_lora_training_2b_cosmos_nemo_assets

torchrun --nproc_per_node=2 --master_port=12341 \
    -m scripts.train \
    --config=cosmos_predict2/configs/base/config.py -- \
    experiment=${EXP} \
    job.name=77f_test_lora \
    model.config.fsdp_shard_size=2 \
    model.config.train_architecture=lora \
    dataloader_train.dataset.num_frames=77 \
    trainer.max_iter=1000 \
    checkpoint.save_iter=500

Training configuration (from config.yaml):

model:
  train_architecture: lora
  lora_rank: 16
  lora_alpha: 16
  lora_target_modules: q_proj,k_proj,v_proj,output_proj,mlp.layer1,mlp.layer2
  fsdp_shard_size: 2

optimizer:
  lr: 4.315837287515549e-05  # 2**(-14.5)
  type: fusedadamw

scheduler:
  type: lambdalinear
  warm_up_steps: [0]
  cycle_lengths: [1000]
  f_max: [0.6]

trainer:
  max_iter: 1000
  batch_size: 1

LoRA injection confirmed during training:

[INFO] LoRA injection successful: 22,937,600 trainable parameters out of 1,979,351,040 total (1.159%)
[INFO] LoRA parameter breakdown:
   lora_A: 11,010,048 parameters
   lora_B: 11,927,552 parameters
   Total LoRA: 22,937,600 parameters
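
For reference, my understanding of how the adapter should affect the network: the effective weight at inference is the base weight plus (alpha/rank) · B·A, so with rank=16 and alpha=16 the scale is 1.0, which matches the "Scale: 1.0" printed during loading below. A minimal sketch of that arithmetic, with illustrative shapes and names rather than the repo's actual modules:

import torch

# Illustrative only: a square projection with a rank-16 adapter.
d_model, rank, alpha = 2048, 16, 16
scale = alpha / rank  # = 1.0 for rank=16, alpha=16

W = torch.randn(d_model, d_model)            # frozen base weight
lora_A = torch.randn(rank, d_model) * 0.01   # trained down-projection
lora_B = torch.zeros(d_model, rank)          # trained up-projection (zero-init, so the delta starts at 0)

# Weight the model should effectively use once the adapter is applied/merged:
W_eff = W + scale * (lora_B @ lora_A)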

Dataset:

  • 100 robot arm manipulation videos
  • All videos: exactly 77 frames (verified with ffprobe)
  • Resolution: 720p (1280x704) @ 16fps
  • Content: First-person view robot manipulation tasks
  • All videos have corresponding text prompts

Problem 1: Identical Inference Results

Inference with base model:

python scripts/hf_video2world_lora.py \
    /workspace/dream-outputs/test_ori_model \
    --prompt dream-datasets/prompts/test_prompt.txt \
    --image dream-datasets/picture/test_robot_1.jpg \
    --model nvidia/Cosmos-Predict2-2B-Video2World \
    --height 720 --width 1280 \
    --fps 16 --frames 77 --steps 35 -v

Inference with LoRA fine-tuned model:

python scripts/hf_video2world_lora.py \
    /workspace/dream-outputs/test_model \
    --prompt dream-datasets/prompts/test_prompt.txt \
    --image dream-datasets/picture/test_robot_1.jpg \
    --model nvidia/Cosmos-Predict2-2B-Video2World \
    --lora_checkpoint checkpoints/posttraining/video2world_lora/77f_test_lora/checkpoints/model/iter_000000500.pt \
    --height 720 --width 1280 \
    --fps 16 --frames 77 --steps 35 -v

Result: Both outputs are visually identical (frame-by-frame comparison shows no differences).
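
To back up "visually identical", I also compared the two outputs numerically. A rough sketch of that comparison (imageio is just what I used to read the videos; the file names are placeholders for whatever the script writes):

import imageio.v3 as iio
import numpy as np

# Paths are placeholders for the two generated videos.
base = iio.imread("/workspace/dream-outputs/test_ori_model/output.mp4")
lora = iio.imread("/workspace/dream-outputs/test_model/output.mp4")

assert base.shape == lora.shape  # (frames, H, W, C)
diff = np.abs(base.astype(np.int16) - lora.astype(np.int16))
print("max abs pixel diff:", diff.max())
print("mean abs pixel diff:", float(diff.mean()))
# A max diff of 0 means the two videos are bit-identical, which should not
# happen if different weights were actually used for the two runs.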


Problem 2: Checkpoint Loading Warnings

When loading the LoRA checkpoint for inference, I see these warnings:

Loading LoRA checkpoint: checkpoints/.../iter_000000500.pt
✅ Found 1120 LoRA parameters
   Missing keys: 567
   Unexpected keys: 1120
✅ LoRA weights loaded! Scale: 1.0
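
To understand where those numbers come from, I inspected the raw checkpoint contents. A minimal sketch, assuming the checkpoint is a plain dict saved with torch.save (possibly nested under a "model" key):

import torch

ckpt_path = "checkpoints/posttraining/video2world_lora/77f_test_lora/checkpoints/model/iter_000000500.pt"
ckpt = torch.load(ckpt_path, map_location="cpu")
state = ckpt.get("model", ckpt)  # unwrap if the weights are nested under a "model" key

lora_keys = [k for k in state if "lora" in k.lower()]
print(f"{len(lora_keys)} LoRA keys out of {len(state)} total keys in the checkpoint")

# Prefix mismatches ("net.", "module.", "_orig_mod.", ...) are a common reason why
# load_state_dict(strict=False) reports every checkpoint key as "unexpected",
# in which case nothing from the checkpoint is actually copied into the model.
for k in lora_keys[:5]:
    print(k, tuple(state[k].shape))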

Questions:

  • What do "Missing keys: 567" and "Unexpected keys: 1120" indicate?
  • Does this mean the LoRA weights are not being properly applied to the model?
  • Is this why inference results are identical to the base model?

Additional Observation: Training Loss Behavior

During training, the loss shows unusual patterns:

Iter 1-8:   Loss oscillates between 0.6-0.8 (normal range)
Iter 9:     Loss: 4.1003  ← sudden spike!
Iter 10+:   Loss returns to 0.6-0.8
...
Iter 310:   Loss: 2.2639  ← another spike
...
Iter 500:   Loss: 0.5779

The loss does not show a clear downward trend and has occasional large spikes. I'm not sure whether this is related to the inference issue.
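
To check whether there is any trend hiding under the noise, I plan to smooth the per-iteration losses pulled from the training log. A small sketch (the losses file is a placeholder for values extracted from the log):

import numpy as np

# One loss value per iteration, extracted from the training log (path is a placeholder).
losses = np.loadtxt("losses.txt")

window = 50
smoothed = np.convolve(losses, np.ones(window) / window, mode="valid")
print("moving average over first window:", smoothed[0])
print("moving average over last window: ", smoothed[-1])
# The raw loss is noisy (it depends on the sampled noise level), so I'm mainly
# looking at whether the moving average drifts down at all.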


Questions

  1. Are the checkpoint loading warnings ("Missing keys", "Unexpected keys") preventing LoRA weights from being applied during inference?

  2. How can I verify that LoRA weights are actually being used during inference?

    • Is there a way to check whether the model behavior has changed? (See the sketch after this list for the kind of check I have in mind.)
  3. Is LoRA fine-tuning expected to produce visible differences after only 500-1000 iterations?

    • Should I train for longer?
  4. Are there any known issues with LoRA training on Cosmos-Predict2-2B-Video2World?

    • Should I consider full fine-tuning instead?
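
For question 2, the kind of check I have in mind is a parameter fingerprint taken before and after the LoRA checkpoint is applied inside scripts/hf_video2world_lora.py. A sketch (the helper and its placement are my own, not part of the repo):

import hashlib
import torch

def param_fingerprint(module: torch.nn.Module) -> str:
    """Digest of all parameter values; any change to the weights changes the digest."""
    h = hashlib.sha256()
    for name, p in sorted(module.named_parameters()):
        h.update(name.encode())
        h.update(p.detach().float().cpu().numpy().tobytes())
    return h.hexdigest()[:16]

# Intended usage inside the inference script (exact variable name is a guess):
#   print("before LoRA:", param_fingerprint(model))
#   ... load/apply the LoRA checkpoint ...
#   print("after LoRA: ", param_fingerprint(model))
# Identical digests would mean the weights did not change, regardless of what
# the loading log prints.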

What I've Verified

✅ All 100 training videos are exactly 77 frames (verified with ffprobe; the check is sketched after this list)
✅ Training completes without errors
✅ LoRA parameters are injected successfully (22.9M trainable params)
✅ Checkpoint files are saved correctly
✅ Using the same prompt and image for both base and fine-tuned inference
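
For the frame-count item above, the check was along these lines (the dataset directory is a placeholder; ffprobe with -count_frames reports the exact decoded frame count):

import subprocess
from pathlib import Path

def count_frames(video: Path) -> int:
    # ffprobe decodes the video stream and reports the exact number of frames.
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-count_frames", "-select_streams", "v:0",
         "-show_entries", "stream=nb_read_frames", "-of", "csv=p=0", str(video)],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

# Directory is a placeholder for wherever the 100 training clips live.
bad = [v.name for v in Path("dream-datasets/videos").glob("*.mp4") if count_frames(v) != 77]
print("clips not exactly 77 frames:", bad or "none")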


Expected Behavior

After LoRA post-training, I expect the fine-tuned model to generate outputs that differ from the base model and are more aligned with my robot manipulation training data.


Request

Could you help clarify:

  1. Whether the checkpoint loading warnings indicate a problem
  2. Whether additional steps are needed to properly load/apply the LoRA weights during inference
  3. Recommended training parameters or best practices for LoRA on Video2World

Thank you!
