Conversation

@SalmanMohammadi (Contributor) commented Nov 26, 2025

Summary by CodeRabbit

  • New Features
    • Added quantization-aware training (QAT) example configurations with NVFP4 quantization support for Gemma3-12B, Gemma3-27B, and Qwen2.5-72B models
    • Includes baseline and math-focused training variants with distributed training and optimization settings


@coderabbitai bot (Contributor) commented Nov 26, 2025

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Adds ten new YAML configuration files for quantization-aware training (QAT) using NVFP4 quantization across multiple LLMs (Gemma3-12B, Gemma3-27B, Qwen2.5-72B). The configurations define model setup, quantization parameters, Liger plugin settings, FSDP distributed training options, optimizer/scheduler settings, and datasets.

Changes

Cohort / File(s) | Summary
QAT NVFP4 Baseline Configurations
examples/qat_nvfp4/Gemma3-12B_baseline.yml, examples/qat_nvfp4/Math-Gemma3-12B_baseline.yml, examples/qat_nvfp4/Math-Gemma3-27B_baseline.yml, examples/qat_nvfp4/Math-Qwen2.5-72B_baseline.yml, examples/qat_nvfp4/Qwen2.5-72B_baseline.yml
New baseline QAT configuration files specifying model, nvfp4 quantization (activation/weight dtypes, group_size 16), Liger optimization features, FSDP v2 distributed training, AdamW fused optimizer, cosine scheduler, and training hyperparameters.
QAT NVFP4 Fine-tuning Configurations
examples/qat_nvfp4/Gemma3-12B_qat.yml, examples/qat_nvfp4/Math-Gemma3-12B_qat.yml, examples/qat_nvfp4/Math-Gemma3-27B_qat.yml, examples/qat_nvfp4/Math-Qwen2.5-72B_qat.yml, examples/qat_nvfp4/Qwen2.5-72B_qat.yml
New QAT fine-tuning configuration files with nvfp4 quantization, flash attention, transformer-based FSDP wrapping, activation checkpointing, and various learning rate/warmup settings tailored per model variant.
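
To make the cohorts above concrete, here is a hedged sketch of what one of these configs might contain. All key names and values below are assumptions based on the walkthrough and common Axolotl conventions, not the actual contents of the PR files:

```yaml
# Hypothetical excerpt (keys and values assumed, not copied from the PR)
base_model: google/gemma-3-12b-it   # assumed model ID

qat:
  activation_dtype: nvfp4
  weight_dtype: nvfp4
  group_size: 16                    # described as the only supported value

flash_attention: true
bf16: true

fsdp_version: 2
fsdp_config:
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  activation_checkpointing: true

optimizer: adamw_torch_fused
lr_scheduler: cosine
```

The baseline and QAT variants would differ mainly in checkpointing and learning-rate settings, per the summaries above.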

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Verify YAML syntax and schema consistency across all 10 configuration files
  • Cross-check model names, dataset paths, and hyperparameter alignment between baseline and QAT variants
  • Validate FSDP configuration parameters (transformer layer class references, wrapping policies) match target model architectures
  • Ensure quantization settings (nvfp4, group_size) are consistently applied
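
The consistency checks listed above can be partially automated. A minimal sketch operating on already-parsed config dicts follows; the `qat` key names are assumptions inferred from the walkthrough, not confirmed against the PR files:

```python
def check_nvfp4_config(cfg: dict) -> list[str]:
    """Collect consistency problems in the (assumed) qat section of a config dict."""
    problems = []
    qat = cfg.get("qat", {})
    # Both dtypes should be nvfp4 in these example configs
    for key in ("weight_dtype", "activation_dtype"):
        if qat.get(key) != "nvfp4":
            problems.append(f"{key} should be 'nvfp4', got {qat.get(key)!r}")
    # group_size 16 is documented as the only supported value for nvfp4
    if qat.get("group_size") != 16:
        problems.append(f"group_size must be 16 for nvfp4, got {qat.get('group_size')!r}")
    return problems

# Example: one well-formed and one inconsistent config
good = {"qat": {"weight_dtype": "nvfp4", "activation_dtype": "nvfp4", "group_size": 16}}
bad = {"qat": {"weight_dtype": "nvfp4", "group_size": 32}}
print(check_nvfp4_config(good))  # []
print(check_nvfp4_config(bad))   # two problems reported
```

In practice each of the ten YAML files would be loaded (e.g. with a YAML parser) and passed through such a check in a loop.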

Possibly related PRs

Suggested labels

ready to merge

Suggested reviewers

  • winglian
  • djsaunde
  • NanoCode012

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'Add QAT NVFP4 configs for blogpost' accurately summarizes the main change: adding multiple YAML configuration files for QAT with NVFP4 quantization across different models.
  • Docstring Coverage: ✅ Passed. No functions found in the changed files to evaluate docstring coverage; skipping the check.


@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
examples/qat_nvfp4/Qwen2.5-72B_baseline.yml (1)

51-51: Consider reducing logging verbosity for large-scale training.

logging_steps: 1 logs on every training step. For a 72B model, this can generate significant overhead and noise. Consider increasing this to 10, 50, or 100 depending on your dataset size and desired observability.
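
As an illustration, the suggested change is a one-line edit; the exact value (50 here) is a judgment call, not a recommendation from the config files themselves:

```yaml
logging_steps: 50   # was 1; log every 50 steps to reduce logging overhead
```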

examples/qat_nvfp4/Math-Qwen2.5-72B_qat.yml (1)

50-52: Consider enabling checkpoint saving to prevent training loss on failure.

Setting save_strategy: "no" means no intermediate checkpoints are saved during training. If training fails, crashes, or is interrupted mid-way, all progress is lost. This is particularly risky for long-running 72B model training on expensive hardware.

Consider changing this to one of the safer alternatives:

- save_strategy: "no"
+ save_strategy: "steps"
+ save_steps: 100

Alternatively, if you only want to save the final model, ensure you have a separate monitoring/recovery strategy in place.

examples/qat_nvfp4/Math-Gemma3-27B_qat.yml (1)

59-59: Note: Zero weight decay disables L2 regularization.

weight_decay: 0.0 means no L2 regularization is applied. While this may be intentional for QAT (quantization sometimes works better without weight regularization), it's worth confirming this choice aligns with your training objectives. If regularization is desired, consider increasing to a typical value like 0.01.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8990ca3 and 311128a.

📒 Files selected for processing (10)
  • examples/qat_nvfp4/Gemma3-12B_baseline.yml (1 hunks)
  • examples/qat_nvfp4/Gemma3-12B_qat.yml (1 hunks)
  • examples/qat_nvfp4/Math-Gemma3-12B_baseline.yml (1 hunks)
  • examples/qat_nvfp4/Math-Gemma3-12B_qat.yml (1 hunks)
  • examples/qat_nvfp4/Math-Gemma3-27B_baseline.yml (1 hunks)
  • examples/qat_nvfp4/Math-Gemma3-27B_qat.yml (1 hunks)
  • examples/qat_nvfp4/Math-Qwen2.5-72B_baseline.yml (1 hunks)
  • examples/qat_nvfp4/Math-Qwen2.5-72B_qat.yml (1 hunks)
  • examples/qat_nvfp4/Qwen2.5-72B_baseline.yml (1 hunks)
  • examples/qat_nvfp4/Qwen2.5-72B_qat.yml (1 hunks)
🔇 Additional comments (17)
examples/qat_nvfp4/Math-Gemma3-27B_baseline.yml (2)

29-32: Well-structured QAT configuration with documented constraints.

The NVFP4 quantization settings are clear, with the group_size constraint appropriately documented. The FSDP configuration and Liger optimizations are well-integrated for Gemma3-27B distributed training.


59-59: Verify that weight_decay: 0.0 is intentional for this QAT baseline.

Disabling weight decay is unconventional for finetuning. Please confirm this choice aligns with your QAT training strategy, as weight decay often stabilizes training. If this is intentional to isolate quantization effects, consider documenting this decision in a comment.

examples/qat_nvfp4/Math-Gemma3-12B_qat.yml (1)

50-55: Verify checkpoint saving strategy—currently disabled.

With save_strategy: "no" and evals_per_epoch/saves_per_epoch commented out (lines 54-55), no checkpoints will be saved during training. This differs from the regular Gemma3-12B_qat.yml config, which has evaluation and checkpointing enabled.

Clarify the intent: Is this intended for quick validation runs, or should evaluation/checkpoint saving be enabled for actual training?

examples/qat_nvfp4/Gemma3-12B_baseline.yml (1)

1-73: Configuration structure looks correct—consistent with baseline pattern.

The baseline config is well-structured and consistent with other baseline files (Math-Gemma3-12B_baseline.yml). Checkpoint saving is appropriately disabled for baseline runs. Minor note: no issues detected.

examples/qat_nvfp4/Gemma3-12B_qat.yml (1)

50-54: Inconsistency with Math-Gemma3-12B_qat.yml checkpoint strategy.

This file enables evaluation and checkpointing (evals_per_epoch: 1, saves_per_epoch: 1 on lines 53-54), but the parallel Math-Gemma3-12B_qat.yml has these commented out and uses save_strategy: "no".

Both are QAT configs for the same base model—clarify whether this difference is intentional. If not, align the Math version to match this pattern.

examples/qat_nvfp4/Math-Gemma3-12B_baseline.yml (1)

1-73: Baseline configuration is consistent with the baseline pattern.

Math-Gemma3-12B_baseline.yml follows the expected baseline convention (checkpoint saving disabled). No issues detected in this file.

examples/qat_nvfp4/Qwen2.5-72B_qat.yml (2)

39-45: Clarify training hyperparameter choices vs. Math baseline.

This config uses significantly different hyperparameters compared to the Math-Qwen2.5-72B_baseline.yml:

  • Learning rate: 2e-5 (vs. 5e-6 in baseline) — 4x higher
  • Micro batch size: 16 (vs. 8 in baseline) — 2x larger
  • Missing eta_min parameter (baseline has 7e-7)

Verify these differences are intentional based on the dataset and task differences (alpaca vs. NuminaMath), or align them for consistency.


25-25: Sequence length value is correct for this configuration.

The review conflates two different configurations. The file uses sequence_len: 8096 because it trains on the Alpaca dataset (tatsu-lab/alpaca), while the referenced "Math baseline" (Math-Qwen2.5-72B_baseline.yml) uses 4096 because it trains on a different dataset (AI-MO/NuminaMath-CoT). The proper baseline for this config is Qwen2.5-72B_baseline.yml, which also uses sequence_len: 8096 for the same Alpaca dataset. The sequence length is intentionally dataset-dependent, not a typo.

Likely an incorrect or invalid review comment.

examples/qat_nvfp4/Math-Qwen2.5-72B_baseline.yml (1)

1-73: Config structure and settings look solid.

The Math baseline configuration is well-structured with appropriate hyperparameters for math-focused fine-tuning: conservative learning rate (5e-6), proper cosine scheduler configuration with eta_min, and standard sequence length. FSDP settings align with other configs in this PR.

Once the sequence_len issue in Qwen2.5-72B_qat.yml is resolved and the hyperparameter choices are justified, verify that all ten NVFP4 configs mentioned in the PR summary follow consistent patterns for model setup, quantization, and FSDP configuration.

examples/qat_nvfp4/Qwen2.5-72B_baseline.yml (3)

12-16: LGTM: Comprehensive QAT + FSDP + Liger configuration.

The quantization, distributed training, and optimization settings are well-structured:

  • NVFP4 quantization with group_size 16 (correctly documented as the only supported option).
  • All Liger optimizations enabled for Qwen compatibility.
  • FSDP v2 with activation checkpointing, appropriate for 72B model training.
  • bf16 precision and efficient CPU RAM loading enabled.

Assuming the sequence length and checkpoint strategy issues above are clarified, this baseline is well-configured.

Also applies to: 29-32, 61-69


50-52: Config is intentionally validation-only — no issue.

This baseline configuration is designed as a minimal reference setup for validation and testing, not for producing a trained model. The configuration has:

  • Only 1 epoch (num_train_epochs: 1)
  • No evaluation strategy (eval_strategy not set)
  • save_strategy: "no" (intentional for baseline/validation runs)
  • resume_from_checkpoint empty (correct, since no checkpoints are saved)

Disabling checkpoint saving in a one-epoch validation baseline is the expected and correct behavior. This type of config is used to verify the training pipeline works without consuming disk space or resources.

Likely an incorrect or invalid review comment.


25-26: Verify the Alpaca configuration's tokens-per-step against the Math baselines.


Evidence gathered:

  1. Dataset confirmed: The config explicitly uses tatsu-lab/alpaca dataset
  2. Comparison with other configs:
    • Math-Qwen2.5-72B configs use: sequence_len: 4096 + micro_batch_size: 8
    • Alpaca-Qwen2.5-72B configs use: sequence_len: 8096 + micro_batch_size: 16
    • This results in 4x more tokens per training step for Alpaca (129,536 vs 32,768)
  3. Mitigations in place: The config includes activation_checkpointing=true, flash_attention=true, sample_packing=true, FSDP v2, and bf16—all designed to handle large contexts efficiently
  4. Context: This is presented as a "baseline" configuration, suggesting it should be a known working reference

The concern about the high sequence length is valid, but the original draft conflated two separate configurations and made assumptions not fully supported by the evidence.


Verify whether this Alpaca configuration has been validated to run successfully, especially given the 4x higher tokens-per-step (129,536) compared to the Math baselines (32,768).

sequence_len: 8096 with micro_batch_size: 16 is notably high for Alpaca finetuning. While sample_packing: true mitigates token waste from short sequences, the configuration processes 4x more tokens per training step than the Math baseline configs (which use sequence_len: 4096, micro_batch_size: 8). Confirm this setting has been tested with your target hardware to validate memory efficiency, especially since save_strategy: "no" prevents checkpoint validation.

examples/qat_nvfp4/Math-Qwen2.5-72B_qat.yml (2)

64-65: Qwen2DecoderLayer class is correct and compatible with FSDP configuration.

Verification confirms:

  • Qwen2DecoderLayer is the standard decoder layer class for Qwen2/Qwen2.5 models in Hugging Face Transformers
  • The dependency specifies transformers==4.57.1, which includes full support for Qwen2.5-72B with this layer class
  • The configuration is consistent with existing FSDP test cases in the codebase (tests/e2e/multigpu/test_fsdp1.py and test_fsdp2.py use identical class naming)
  • The TRANSFORMER_BASED_WRAP policy is compatible with this decoder layer class
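
For reference, the verified FSDP section would look roughly like the following. This is a hedged sketch; the key names follow Axolotl's FSDP config conventions as understood from the review, and are not copied from the PR file:

```yaml
# Hypothetical excerpt (keys assumed from Axolotl FSDP conventions)
fsdp_version: 2
fsdp_config:
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen2DecoderLayer   # standard Qwen2/Qwen2.5 decoder layer
  activation_checkpointing: true
```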

29-32: NVFP4 quantization is supported but in prototype stage—document hardware/API stability requirements.

NVFP4 support was added in PR #3107 using torchao 0.13.0, which provides QAT via QATConfig wrapping NVFP4InferenceConfig. Gradients do flow through fake quantizers during training as expected. However, torchao's NVFP4 implementation is explicitly prototype-stage with an unstable API, and it requires NVIDIA Blackwell GPUs (CUDA 12.8+). These hardware and stability constraints are not documented in the example or in the QAT configuration guidance, which could mislead users into attempting to use this configuration on unsupported hardware or expecting API stability.

Consider adding a note to docs/qat.qmd and to these example configs about the prototype status of NVFP4 and the Blackwell GPU requirement.

examples/qat_nvfp4/Math-Gemma3-27B_qat.yml (3)

40-40: Verify micro-batch size has been tested on target hardware.

A micro_batch_size: 16 with a 27B model may be challenging depending on your GPU memory (could cause OOM on smaller/consumer GPUs, or be underutilized on large enterprise GPUs). Confirm this batch size has been validated with your intended training hardware (number of GPUs, GPU type, available VRAM).

If you haven't validated this configuration end-to-end, consider starting with a smaller batch size (e.g., 8 or 4) and scaling up based on memory availability and throughput.


9-10: Verify Liger plugin compatibility with NVFP4 quantization.

The configuration enables the Liger plugin with multiple kernel optimizations (rope, RMS norm, GLU activation, layer norm, fused cross-entropy) in combination with NVFP4 quantization. Confirm that this combination has been tested and is stable—mixing low-precision quantization with custom optimized kernels can sometimes introduce unexpected numerical behavior.

Run a quick validation to ensure training converges and produces expected results with this plugin + quantization mix.

Also applies to: 12-16


72-73: Confirm if special_tokens is required for the dataset.

The special_tokens field is empty. Depending on the AI-MO/NuminaMath-CoT dataset's format, you may need to define special tokens for the chat template. If not required, this can be left empty; otherwise, populate it with the necessary tokens.

Check the dataset documentation or a quick test run to confirm whether special tokens are needed for proper chat template formatting.

@NanoCode012 (Collaborator) commented:

Should these configs be under the appropriate examples/{arch}/qat_nvfp4?

@SalmanMohammadi (Contributor, Author) replied:

> Should these configs be under the appropriate examples/{arch}/qat_nvfp4?

I thought about that but I feel like it's quite hard to find certain configs that way, and grouping them under a single directory would make it easier.
