Description
Your current environment
The output of python collect_env.py
Your output of `python collect_env.py` here
🐛 Describe the bug
Please help me quantize this model, and tell me which flag I should use for quantization.
Here is my deployment spec for meta-llama/Llama-3.3-70B-Instruct:
```yaml
- --quantization
- fp8
- --kv-cache-dtype
- fp8
```
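For reference, the container args above should be equivalent to a `vllm serve` invocation like the one below. This is a sketch, not the exact deployment command: the tensor-parallel size is an assumption inferred from the `Worker_TP6` frames in the trace, and is not stated in the spec.

```shell
# Assumed equivalent CLI invocation (tensor-parallel size of 8 is a guess
# based on the Worker_TP6 frames in the stack trace below):
vllm serve meta-llama/Llama-3.3-70B-Instruct \
  --quantization fp8 \
  --kv-cache-dtype fp8 \
  --tensor-parallel-size 8
```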
Enabling `--quantization fp8` causes the deployment to fail.
Spec:
```
NVIDIA-SMI 550.127.08   Driver Version: 550.127.08   CUDA Version: 12.9
```
Error:
```
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:42 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7f9b9b4d9eb0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: + 0x111c7 (0x7f9b9b56c1c7 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::symmetric_memory::AllocationRef::~AllocationRef() + 0xce (0x7f9b3eb25ece in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::symmetric_memory::Block::~Block() + 0x1d2 (0x7f9b3eb2cc92 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::symmetric_memory::CUDASymmetricMemoryAllocator::free(void*) + 0x1de (0x7f9b3eb247fe in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: + 0x30ddc (0x7f9b9b478ddc in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #6: + 0x434a5f (0x7f9b8e172a5f in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f9b9b4b7179 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #8: + 0x6c9e58 (0x7f9b8e407e58 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #9: + 0x6ca22d (0x7f9b8e40822d in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_python.so)
frame #10: VLLM::Worker_TP6() [0x59b914]
frame #11: VLLM::Worker_TP6() [0x59b914]
frame #12: VLLM::Worker_TP6() [0x53be84]
frame #13: VLLM::Worker_TP6() [0x59bded]
frame #14: VLLM::Worker_TP6() [0x5312d3]
```