For model quantization, we recommend using llmcompressor.
The following examples should be run from the root of the repository.
```shell
./scripts/quantize.py -o /tmp/cosmos-reason2/checkpoints
```

To list available arguments:

```shell
./scripts/quantize.py --help
```

| Option | Values | Default | Description |
|---|---|---|---|
| `--model` | string | `nvidia/Cosmos-Reason2-2B` | Model name or local path |
| `--precision` | `nvfp4`, `fp8`, `fp8_dynamic` | `nvfp4` | `nvfp4` (smallest/fastest), `fp8` (better quality), `fp8_dynamic` (best quality, slower inference) |
| `--kv-precision` | `bf16`, `fp8` | `bf16` | KV cache precision |
| `--num-samples` | integer | `512` | Number of calibration samples; increasing it improves accuracy at the cost of longer runtime |
| `--smoothing-strength` | 0.0-1.0 | `0.8` | SmoothQuant strength for handling activation outliers |
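To give intuition for `--smoothing-strength`, here is a minimal NumPy sketch of the SmoothQuant idea (not llmcompressor's actual implementation): the strength is the exponent `alpha` in a per-channel scale that migrates quantization difficulty from activations to weights, leaving the layer's output mathematically unchanged.

```python
import numpy as np

def smooth(X, W, alpha=0.8):
    """Illustrative SmoothQuant scaling (hypothetical helper, not the llmcompressor API).

    Per input channel j: s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).
    Dividing activations by s and multiplying weights by s preserves
    X @ W exactly while shrinking activation outliers.
    """
    act_max = np.abs(X).max(axis=0)            # per-channel activation range
    w_max = np.abs(W).max(axis=1)              # per-channel weight range
    s = act_max**alpha / w_max**(1.0 - alpha)  # scale per input channel
    return X / s, W * s[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50.0                                # inject an outlier channel
W = rng.normal(size=(8, 16))

Xs, Ws = smooth(X, W)
assert np.allclose(X @ W, Xs @ Ws)             # layer output is preserved
assert np.abs(Xs[:, 3]).max() < np.abs(X[:, 3]).max()  # outlier is tamed
```

Higher strength pushes more of the outlier range into the weights; the default of 0.8 favors activation quantizability.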