# Quantization using llmcompressor

For model quantization, we recommend using llmcompressor.

The following examples should be run from the root of the repository.

Example:

```shell
./scripts/quantize.py -o /tmp/cosmos-reason2/checkpoints
```
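A fuller invocation might override several of the defaults at once. This is a hypothetical example (the flag names come from the Quantization Options table below; the specific values and output path are illustrative, not recommendations):

```shell
# Quantize with fp8 weights and an fp8 KV cache, using more calibration
# samples than the default 512 for potentially better accuracy.
./scripts/quantize.py \
  --model nvidia/Cosmos-Reason2-2B \
  --precision fp8 \
  --kv-precision fp8 \
  --num-samples 1024 \
  -o /tmp/cosmos-reason2/checkpoints
```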

To list available arguments:

```shell
./scripts/quantize.py --help
```

## Quantization Options

| Option | Values | Default | Description |
| --- | --- | --- | --- |
| `--model` | string | `nvidia/Cosmos-Reason2-2B` | Model name or local path |
| `--precision` | `nvfp4`, `fp8`, `fp8_dynamic` | `nvfp4` | `nvfp4` (smallest/fastest), `fp8` (better quality), `fp8_dynamic` (best quality, slower inference) |
| `--kv-precision` | `bf16`, `fp8` | `bf16` | KV cache precision |
| `--num-samples` | integer | `512` | Number of calibration samples; more samples improve accuracy but lengthen runtime |
| `--smoothing-strength` | 0.0–1.0 | `0.8` | SmoothQuant strength for handling outliers |
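To get an intuition for the size trade-off between the precision options, here is a back-of-envelope sketch of weight-storage size for a 2B-parameter model (the parameter count is inferred from the default model name; real checkpoints are somewhat larger because quantized formats also store scale factors, which this estimate ignores):

```python
# Approximate bits per weight for each precision the script supports.
# bf16 (the unquantized baseline) is included for comparison.
BITS_PER_PARAM = {"bf16": 16, "fp8": 8, "fp8_dynamic": 8, "nvfp4": 4}

def approx_weight_gb(num_params: int, precision: str) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), scales excluded."""
    return num_params * BITS_PER_PARAM[precision] / 8 / 1e9

for p in ("bf16", "fp8", "nvfp4"):
    print(f"{p:>6}: ~{approx_weight_gb(2_000_000_000, p):.1f} GB")
# bf16 ~4.0 GB, fp8 ~2.0 GB, nvfp4 ~1.0 GB
```

This is why `nvfp4` is listed as smallest/fastest: at 4 bits per weight it needs roughly a quarter of the memory bandwidth of the bf16 baseline, at some cost in quality.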