Description
Your current environment
The output of python collect_env.py
Collecting environment information...
==============================
System Info
==============================
OS : macOS 15.6.1 (arm64)
GCC version : Could not collect
Clang version : 17.0.0 (clang-1700.0.13.5)
CMake version : Could not collect
Libc version : N/A
==============================
PyTorch Info
==============================
PyTorch version : 2.7.1
Is debug build : False
CUDA used to build PyTorch : None
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.5 (main, Aug 14 2024, 04:32:18) [Clang 18.1.8 ] (64-bit runtime)
Python platform : macOS-15.6.1-arm64-arm-64bit
==============================
CUDA / GPU Info
==============================
Is CUDA available : False
CUDA runtime version : No CUDA
CUDA_MODULE_LOADING set to : N/A
GPU models and configuration : No CUDA
Nvidia driver version : No CUDA
cuDNN version : No CUDA
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Apple M4 Max
==============================
Versions of relevant libraries
==============================
[pip3] curated-transformers==0.1.1
[pip3] mypy==1.17.1
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] pyzmq==27.0.2
[pip3] sentence-transformers==3.4.1
[pip3] spacy-curated-transformers==0.3.1
[pip3] torch==2.7.1
[pip3] torchaudio==2.7.1
[pip3] torchvision==0.22.1
[pip3] transformers==4.55.0
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.10.1.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
Could not collect
==============================
Environment Variables
==============================
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
When passing a batch of prompts to .chat() on the LLM class, vLLM validates and adds the requests one by one to the internal scheduler queue. If one of the requests raises a ValueError (for example because its tokens exceed the max model length), the requests before it have already been added to the scheduler queue.
Since the exception prevents _run_engine() from being called and draining the queue, those requests are left orphaned in the scheduler queue. In our code base we catch the exception and split the batch into two halves to isolate the offending request and make progress on the unproblematic ones. However, when we call .chat() again after catching the exception, its output contains more outputs than inputs (the fresh batch plus the orphaned requests).
The LLM class does not expose the request_ids in any way, so it is not possible to correlate inputs and outputs when this happens. It also violates the contract of .chat() that "outputs are returned in the same order as inputs".
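A minimal reproduction sketch of the behaviour described above (the model name, max_model_len, prompt sizes, and printed counts are illustrative assumptions, not taken from our setup):

from vllm import LLM, SamplingParams

# Assumption: any chat model with a small max_model_len; adjust to your environment.
model = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", max_model_len=512)
params = SamplingParams(max_tokens=8)

ok = [{"role": "user", "content": "Hello"}]
too_long = [{"role": "user", "content": "word " * 10_000}]  # tokenizes past max_model_len

# The over-length conversation raises ValueError during validation, but the two
# valid conversations before it have already been added to the scheduler queue.
try:
    model.chat([ok, ok, too_long], params)
except ValueError as e:
    print("expected failure:", e)

# A fresh batch of 2 now also drains the orphaned requests from the failed call,
# so len(outputs) no longer matches the number of inputs.
outputs = model.chat([ok, ok], params)
print(len(outputs))  # can be 4 instead of 2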
Simplified example of our batch-splitting fallback:
def infer(chat_messages):
    try:
        outputs = model.chat(chat_messages)
    except Exception as e:
        if len(chat_messages) == 1:
            return [ErrorOutput(...)]  # our own error marker for the offending request
        # Split the batch into two halves to isolate the offending request
        mid = len(chat_messages) // 2
        first_half = chat_messages[:mid]
        second_half = chat_messages[mid:]
        return infer(first_half) + infer(second_half)
    return outputs
Running the above scenario, we saw that whenever we hit the except branch, the length of outputs was very often greater than the length of chat_messages passed to the method, because of orphaned requests left over from previous calls to infer().
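As a stopgap on our side, a client-side pre-validation sketch like the following avoids triggering the mid-batch ValueError in the first place. It does not fix the orphaned-request behaviour; the function name, the max_model_len value, and the reliance on the tokenizer's chat template are assumptions:

def filter_by_length(model, conversations, max_model_len):
    # Tokenize each conversation with the model's chat template and drop those
    # whose prompt alone already exceeds max_model_len.
    tokenizer = model.get_tokenizer()
    keep, rejected = [], []
    for conv in conversations:
        token_ids = tokenizer.apply_chat_template(
            conv, add_generation_prompt=True, tokenize=True
        )
        (keep if len(token_ids) <= max_model_len else rejected).append(conv)
    return keep, rejected

keep, rejected = filter_by_length(model, chat_messages, max_model_len=4096)
outputs = model.chat(keep)  # only pre-validated conversations reach the scheduler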
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.