Add QAT NVFP4 configs for blogpost #3280
Conversation
📝 Walkthrough

Adds nine new YAML configuration files for quantization-aware training (QAT) using NVFP4 quantization across multiple LLMs (Gemma3-12B, Gemma3-27B, Qwen2.5-72B). Configurations define model setup, quantization parameters, Liger plugin settings, FSDP distributed training options, optimizer/scheduler settings, and datasets.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 1
🧹 Nitpick comments (3)
examples/qat_nvfp4/Qwen2.5-72B_baseline.yml (1)
51-51: Consider reducing logging verbosity for large-scale training.
`logging_steps: 1` logs on every training step. For a 72B model, this can generate significant overhead and noise. Consider increasing this to 10, 50, or 100 depending on your dataset size and desired observability.

examples/qat_nvfp4/Math-Qwen2.5-72B_qat.yml (1)
50-52: Consider enabling checkpoint saving to prevent training loss on failure.

Setting `save_strategy: "no"` means no intermediate checkpoints are saved during training. If training fails, crashes, or is interrupted mid-way, all progress is lost. This is particularly risky for long-running 72B model training on expensive hardware. Consider changing this to one of the safer alternatives:

```diff
- save_strategy: "no"
+ save_strategy: "steps"
+ save_steps: 100
```

Alternatively, if you only want to save the final model, ensure you have a separate monitoring/recovery strategy in place.
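For reference, a slightly fuller sketch of the steps-based alternative, with a cap on retained checkpoints; the key names follow the Hugging Face Trainer conventions these configs already use, and `save_total_limit` is our addition, not part of the original suggestion:

```yaml
save_strategy: "steps"
save_steps: 100
save_total_limit: 2   # keep only the two most recent checkpoints to bound disk usage
```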
examples/qat_nvfp4/Math-Gemma3-27B_qat.yml (1)
59-59: Note: Zero weight decay disables L2 regularization.
`weight_decay: 0.0` means no L2 regularization is applied. While this may be intentional for QAT (quantization sometimes works better without weight regularization), it's worth confirming this choice aligns with your training objectives. If regularization is desired, consider increasing to a typical value like 0.01.
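If regularization is wanted, the change is a one-liner; 0.01 is a common AdamW-style default, not a value taken from these configs:

```yaml
weight_decay: 0.01   # re-enables L2 regularization; 0.0 isolates quantization effects
```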
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- examples/qat_nvfp4/Gemma3-12B_baseline.yml (1 hunks)
- examples/qat_nvfp4/Gemma3-12B_qat.yml (1 hunks)
- examples/qat_nvfp4/Math-Gemma3-12B_baseline.yml (1 hunks)
- examples/qat_nvfp4/Math-Gemma3-12B_qat.yml (1 hunks)
- examples/qat_nvfp4/Math-Gemma3-27B_baseline.yml (1 hunks)
- examples/qat_nvfp4/Math-Gemma3-27B_qat.yml (1 hunks)
- examples/qat_nvfp4/Math-Qwen2.5-72B_baseline.yml (1 hunks)
- examples/qat_nvfp4/Math-Qwen2.5-72B_qat.yml (1 hunks)
- examples/qat_nvfp4/Qwen2.5-72B_baseline.yml (1 hunks)
- examples/qat_nvfp4/Qwen2.5-72B_qat.yml (1 hunks)
🔇 Additional comments (17)
examples/qat_nvfp4/Math-Gemma3-27B_baseline.yml (2)
29-32: Well-structured QAT configuration with documented constraints.

The NVFP4 quantization settings are clear, with the `group_size` constraint appropriately documented. The FSDP configuration and Liger optimizations are well-integrated for Gemma3-27B distributed training.
59-59: Verify that `weight_decay: 0.0` is intentional for this QAT baseline.

Disabling weight decay is unconventional for finetuning. Please confirm this choice aligns with your QAT training strategy, as weight decay often stabilizes training. If this is intentional to isolate quantization effects, consider documenting the decision in a comment.
examples/qat_nvfp4/Math-Gemma3-12B_qat.yml (1)
50-55: Verify checkpoint saving strategy—currently disabled.

With `save_strategy: "no"` and `evals_per_epoch`/`saves_per_epoch` commented out (lines 54-55), no checkpoints will be saved during training. This differs from the regular Gemma3-12B_qat.yml config, which has evaluation and checkpointing enabled. Clarify the intent: is this intended for quick validation runs, or should evaluation/checkpoint saving be enabled for actual training?
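If the intent is to match the regular Gemma3-12B_qat.yml, a minimal sketch is to restore the per-epoch settings the review mentions (these are the axolotl keys referenced above):

```yaml
evals_per_epoch: 1
saves_per_epoch: 1
```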
examples/qat_nvfp4/Gemma3-12B_baseline.yml (1)
1-73: Configuration structure looks correct—consistent with the baseline pattern.

The baseline config is well-structured and consistent with other baseline files (Math-Gemma3-12B_baseline.yml). Checkpoint saving is appropriately disabled for baseline runs. No issues detected.
examples/qat_nvfp4/Gemma3-12B_qat.yml (1)
50-54: Inconsistency with Math-Gemma3-12B_qat.yml checkpoint strategy.

This file enables evaluation and checkpointing (`evals_per_epoch: 1`, `saves_per_epoch: 1` on lines 53-54), but the parallel Math-Gemma3-12B_qat.yml has these commented out and uses `save_strategy: "no"`. Both are QAT configs for the same base model—clarify whether this difference is intentional. If not, align the Math version to match this pattern.
examples/qat_nvfp4/Math-Gemma3-12B_baseline.yml (1)
1-73: Baseline configuration is consistent with the baseline pattern.

Math-Gemma3-12B_baseline.yml follows the expected baseline convention (checkpoint saving disabled). No issues detected in this file.
examples/qat_nvfp4/Qwen2.5-72B_qat.yml (2)
39-45: Clarify training hyperparameter choices vs. the Math baseline.

This config uses significantly different hyperparameters compared to Math-Qwen2.5-72B_baseline.yml:

- Learning rate: 2e-5 (vs. 5e-6 in baseline) — 4x higher
- Micro batch size: 16 (vs. 8 in baseline) — 2x larger
- Missing `eta_min` parameter (baseline has 7e-7)

Verify these differences are intentional based on the dataset and task differences (alpaca vs. NuminaMath), or align them for consistency.
25-25: Sequence length value is correct for this configuration.

The review conflates two different configurations. The file uses `sequence_len: 8096` because it trains on the Alpaca dataset (tatsu-lab/alpaca), while the referenced "Math baseline" (Math-Qwen2.5-72B_baseline.yml) uses 4096 because it trains on a different dataset (AI-MO/NuminaMath-CoT). The proper baseline for this config is Qwen2.5-72B_baseline.yml, which also uses `sequence_len: 8096` for the same Alpaca dataset. The sequence length is intentionally dataset-dependent, not a typo.

Likely an incorrect or invalid review comment.
examples/qat_nvfp4/Math-Qwen2.5-72B_baseline.yml (1)
1-73: Config structure and settings look solid.

The Math baseline configuration is well-structured with appropriate hyperparameters for math-focused fine-tuning: conservative learning rate (5e-6), proper cosine scheduler configuration with `eta_min`, and standard sequence length. FSDP settings align with other configs in this PR.
Once the sequence_len issue in Qwen2.5-72B_qat.yml is resolved and hyperparameter choices are justified, verify that all nine NVFP4 configs mentioned in the PR summary follow consistent patterns for model setup, quantization, and FSDP configuration.
examples/qat_nvfp4/Qwen2.5-72B_baseline.yml (3)
12-16: LGTM: Comprehensive QAT + FSDP + Liger configuration.

The quantization, distributed training, and optimization settings are well-structured:
- NVFP4 quantization with group_size 16 (correctly documented as the only supported option).
- All Liger optimizations enabled for Qwen compatibility.
- FSDP v2 with activation checkpointing, appropriate for 72B model training.
- bf16 precision and efficient CPU RAM loading enabled.
Assuming the sequence length and checkpoint strategy issues above are clarified, this baseline is well-configured.
Also applies to: 29-32, 61-69
50-52:
Config is intentionally validation-only — no issue.
This baseline configuration is designed as a minimal reference setup for validation and testing, not for producing a trained model. The configuration has:
- Only 1 epoch (`num_train_epochs: 1`)
- No evaluation strategy (`eval_strategy` not set)
- `save_strategy: "no"` (intentional for baseline/validation runs)
- `resume_from_checkpoint` empty (correct, since no checkpoints are saved)

Disabling checkpoint saving in a one-epoch validation baseline is the expected and correct behavior. This type of config is used to verify the training pipeline works without consuming disk space or resources.
Likely an incorrect or invalid review comment.
25-26:
Evidence gathered:

- Dataset confirmed: the config explicitly uses the `tatsu-lab/alpaca` dataset
- Comparison with other configs:
  - Math-Qwen2.5-72B configs use `sequence_len: 4096` + `micro_batch_size: 8`
  - Alpaca-Qwen2.5-72B configs use `sequence_len: 8096` + `micro_batch_size: 16`
  - This results in ~4x more tokens per training step for Alpaca (129,536 vs 32,768)
- Mitigations in place: the config includes `activation_checkpointing: true`, `flash_attention: true`, `sample_packing: true`, FSDP v2, and bf16, all designed to handle large contexts efficiently
- Context: this is presented as a "baseline" configuration, suggesting it should be a known working reference
The original review comment raises a valid concern about the high sequence length for Alpaca, but it conflates two separate issues and makes assumptions not fully supported by the evidence.
Verify whether this Alpaca configuration has been validated to run successfully, especially given the 4x higher tokens-per-step (129,536) compared to the Math baselines (32,768).
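The tokens-per-step figures quoted above can be reproduced with simple arithmetic. A quick sketch; `tokens_per_step` is our helper, not an axolotl function, and gradient accumulation is ignored:

```python
def tokens_per_step(sequence_len: int, micro_batch_size: int) -> int:
    """Tokens processed per micro-step (ignores gradient_accumulation_steps)."""
    return sequence_len * micro_batch_size

alpaca_cfg = tokens_per_step(8096, 16)  # Qwen2.5-72B Alpaca configs
math_cfg = tokens_per_step(4096, 8)     # Math-Qwen2.5-72B configs

print(alpaca_cfg)                       # 129536
print(math_cfg)                         # 32768
print(round(alpaca_cfg / math_cfg, 2))  # 3.95 -- the "4x" in the review is approximate
```

Note the ratio is slightly under 4x because `sequence_len` is 8096, not 8192.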
`sequence_len: 8096` with `micro_batch_size: 16` is notably high for Alpaca finetuning. While `sample_packing: true` mitigates token waste from short sequences, the configuration processes ~4x more tokens per training step than the Math baseline configs (which use `sequence_len: 4096`, `micro_batch_size: 8`). Confirm this setting has been tested with your target hardware to validate memory efficiency, especially since `save_strategy: "no"` prevents checkpoint validation.

examples/qat_nvfp4/Math-Qwen2.5-72B_qat.yml (2)
64-65: Qwen2DecoderLayer class is correct and compatible with the FSDP configuration.

Verification confirms:

- `Qwen2DecoderLayer` is the standard decoder layer class for Qwen2/Qwen2.5 models in Hugging Face Transformers
- The dependency specifies `transformers==4.57.1`, which includes full support for Qwen2.5-72B with this layer class
- The configuration is consistent with existing FSDP test cases in the codebase (tests/e2e/multigpu/test_fsdp1.py and test_fsdp2.py use identical class naming)
- The `TRANSFORMER_BASED_WRAP` policy is compatible with this decoder layer class
29-32: NVFP4 quantization is supported but in prototype stage—document hardware/API stability requirements.

NVFP4 support was added in PR #3107 using torchao 0.13.0, which provides QAT via `QATConfig` wrapping `NVFP4InferenceConfig`. Gradients do flow through fake quantizers during training as expected. However, torchao's NVFP4 implementation is explicitly prototype-stage with an unstable API, and it requires NVIDIA Blackwell GPUs (CUDA 12.8+). These hardware and stability constraints are not documented in the example or in the QAT configuration guidance, which could mislead users into attempting to use this configuration on unsupported hardware or expecting API stability.

Consider adding a note to docs/qat.qmd and to these example configs about the prototype status of NVFP4 and the Blackwell GPU requirement.
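For context, the NVFP4 QAT section of these example configs presumably looks roughly like the sketch below; the key names are our best reading of axolotl's QAT configuration and should be checked against docs/qat.qmd rather than copied verbatim:

```yaml
qat:
  weight_dtype: nvfp4
  activation_dtype: nvfp4
  group_size: 16   # per the configs, 16 is the only supported group size for NVFP4
```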
examples/qat_nvfp4/Math-Gemma3-27B_qat.yml (3)
40-40: Verify micro-batch size has been tested on target hardware.

A `micro_batch_size: 16` with a 27B model may be challenging depending on your GPU memory (could cause OOM on smaller/consumer GPUs, or be underutilized on large enterprise GPUs). Confirm this batch size has been validated with your intended training hardware (number of GPUs, GPU type, available VRAM). If you haven't validated this configuration end-to-end, consider starting with a smaller batch size (e.g., 8 or 4) and scaling up based on memory availability and throughput.
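A rough back-of-envelope for why this needs validating on real hardware. Everything here is an assumption for illustration: bf16 weights (2 bytes/param), even FSDP full sharding, 8 GPUs; activations, gradients, and optimizer state, which scale with `micro_batch_size`, are ignored and typically dominate:

```python
def sharded_param_gib(n_params: float, bytes_per_param: int = 2, n_gpus: int = 8) -> float:
    """Per-GPU parameter memory in GiB under even full sharding (bf16 = 2 bytes/param)."""
    return n_params * bytes_per_param / n_gpus / 1024**3

# Gemma3-27B: ~6.3 GiB/GPU for weights alone; activation memory at
# micro_batch_size 16 comes on top, hence the suggestion to test first.
print(round(sharded_param_gib(27e9), 2))  # 6.29
```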
9-10: Verify Liger plugin compatibility with NVFP4 quantization.

The configuration enables the Liger plugin with multiple kernel optimizations (rope, RMS norm, GLU activation, layer norm, fused cross-entropy) in combination with NVFP4 quantization. Confirm that this combination has been tested and is stable—mixing low-precision quantization with custom optimized kernels can sometimes introduce unexpected numerical behavior.
Run a quick validation to ensure training converges and produces expected results with this plugin + quantization mix.
Also applies to: 12-16
72-73: Confirm whether `special_tokens` is required for the dataset.

The `special_tokens` field is empty. Depending on the AI-MO/NuminaMath-CoT dataset's format, you may need to define special tokens for the chat template. If not required, this can be left empty; otherwise, populate it with the necessary tokens. Check the dataset documentation or run a quick test to confirm whether special tokens are needed for proper chat template formatting.
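If the chat template does turn out to need them, the field is populated like this; the token below is the Gemma-style turn delimiter, used purely as an illustration, so verify against the actual tokenizer and chat template before adopting it:

```yaml
special_tokens:
  eos_token: "<end_of_turn>"   # illustrative only; confirm with the model's chat template
```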
Should these configs be under the appropriate model-specific example directories instead?
I thought about that but I feel like it's quite hard to find certain configs that way, and grouping them under a single directory would make it easier. |