Description
Your current environment
The output of python collect_env.py
Collecting environment information...
==============================
System Info
==============================
OS : macOS 15.6.1 (arm64)
GCC version : Could not collect
Clang version : 17.0.0 (clang-1700.0.13.5)
CMake version : Could not collect
Libc version : N/A
==============================
PyTorch Info
==============================
PyTorch version : 2.7.1
Is debug build : False
CUDA used to build PyTorch : None
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.12.5 (main, Aug 14 2024, 04:32:18) [Clang 18.1.8 ] (64-bit runtime)
Python platform : macOS-15.6.1-arm64-arm-64bit
==============================
CUDA / GPU Info
==============================
Is CUDA available : False
CUDA runtime version : No CUDA
CUDA_MODULE_LOADING set to : N/A
GPU models and configuration : No CUDA
Nvidia driver version : No CUDA
cuDNN version : No CUDA
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Apple M4 Max
==============================
Versions of relevant libraries
==============================
[pip3] curated-transformers==0.1.1
[pip3] mypy==1.17.1
[pip3] mypy_extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] pyzmq==27.0.2
[pip3] sentence-transformers==3.4.1
[pip3] spacy-curated-transformers==0.3.1
[pip3] torch==2.7.1
[pip3] torchaudio==2.7.1
[pip3] torchvision==0.22.1
[pip3] transformers==4.55.0
[conda] Could not collect
==============================
vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.10.1.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
Could not collect
==============================
Environment Variables
==============================
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
🐛 Describe the bug
When passing a batch of prompts to .chat() on the LLM class, vLLM validates and adds the requests one by one to the internal scheduler queue. If one of the requests raises a ValueError (for example because its tokens exceed the max model length), the requests before it have already been added to the scheduler queue.
Since the exception prevents _run_engine() from being called and draining the queue, those requests are left orphaned in the scheduler queue. In our code base we catch the exception and split the batch into two halves to isolate the offending request and make progress on the unproblematic ones. However, when we call .chat() again after catching the exception, its output contains more outputs than inputs (the fresh batch plus the orphaned requests).
The LLM class does not expose the request_ids in any way, so it is not possible to correlate inputs and outputs when this happens. It also violates the contract of .chat() that "outputs are returned in the same order as inputs".
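A minimal reproduction sketch of the behaviour described above (the model name, max_model_len, prompt sizes, and printed counts are illustrative assumptions, not taken from our setup):

from vllm import LLM, SamplingParams

# Assumption: any chat model with a small max_model_len; adjust to your environment.
model = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", max_model_len=512)
params = SamplingParams(max_tokens=8)

ok = [{"role": "user", "content": "Hello"}]
too_long = [{"role": "user", "content": "word " * 10_000}]  # tokenizes past max_model_len

# The over-length conversation raises ValueError during validation, but the two
# valid conversations before it have already been added to the scheduler queue.
try:
    model.chat([ok, ok, too_long], params)
except ValueError as e:
    print("expected failure:", e)

# A fresh batch of 2 now also drains the orphaned requests from the failed call,
# so len(outputs) no longer matches the number of inputs.
outputs = model.chat([ok, ok], params)
print(len(outputs))  # can be 4 instead of 2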
Simplified example of our batch-splitting fallback:
def infer(chat_messages):
    try:
        outputs = model.chat(chat_messages)
    except Exception as e:
        if len(chat_messages) == 1:
            return [ErrorOutput(...)]  # our own error marker for the offending request
        # Split the batch into two halves to isolate the offending request
        mid = len(chat_messages) // 2
        first_half = chat_messages[:mid]
        second_half = chat_messages[mid:]
        return infer(first_half) + infer(second_half)
    return outputs
Running the above scenario, we saw that whenever we hit the except branch, the length of outputs was very often greater than the length of chat_messages passed to the method, because of orphaned requests left over from previous calls to infer().
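As a stopgap on our side, a client-side pre-validation sketch like the following avoids triggering the mid-batch ValueError in the first place. It does not fix the orphaned-request behaviour; the function name, the max_model_len value, and the reliance on the tokenizer's chat template are assumptions:

def filter_by_length(model, conversations, max_model_len):
    # Tokenize each conversation with the model's chat template and drop those
    # whose prompt alone already exceeds max_model_len.
    tokenizer = model.get_tokenizer()
    keep, rejected = [], []
    for conv in conversations:
        token_ids = tokenizer.apply_chat_template(
            conv, add_generation_prompt=True, tokenize=True
        )
        (keep if len(token_ids) <= max_model_len else rejected).append(conv)
    return keep, rejected

keep, rejected = filter_by_length(model, chat_messages, max_model_len=4096)
outputs = model.chat(keep)  # only pre-validated conversations reach the scheduler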
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.