
The logprobs in ChatCompletion responses incorrectly reuse the Completions schema, not following the OpenAI API spec #3179

@ghost

Description

According to the OpenAI documentation, logprobs are returned in a different format by the Chat Completions API than by the legacy Completions API:
https://platform.openai.com/docs/api-reference/chat/create

However, vLLM returns logprobs for ChatCompletion responses in the Completions API format.
This is the example chat completion output from the OpenAI API documentation:

{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "created": 1702685778,
  "model": "gpt-3.5-turbo-0125",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello"
      },
      "logprobs": {
        "content": [
          {
            "token": "Hello",
            "logprob": -0.31725305,
            "bytes": [72, 101, 108, 108, 111],
            "top_logprobs": [
              {
                "token": "Hello",
                "logprob": -0.31725305,
                "bytes": [72, 101, 108, 108, 111]
              },
              {
                "token": "Hi",
                "logprob": -1.3190403,
                "bytes": [72, 105]
              }
            ]
          }
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 1,
    "total_tokens": 11
  },
  "system_fingerprint": null
}
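A client written against this spec (for example with the openai Python SDK v1) reads the per-token entries from choices[0].logprobs.content. The snippet below is a minimal sketch of such a client pointed at a vLLM server; the base URL and model name are only examples:

# Minimal sketch, assuming the openai Python SDK v1 and an OpenAI-compatible
# vLLM server at http://localhost:8000/v1 (URL and model name are examples).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen1.5-72B-Chat-GPTQ-Int4",
    messages=[{"role": "user", "content": "Say hello"}],
    max_tokens=1,
    logprobs=True,
    top_logprobs=2,
)

# Per the spec, logprobs.content is a list of per-token objects, each carrying
# token, logprob, bytes, and its own list of top_logprobs alternatives.
for entry in resp.choices[0].logprobs.content:
    print(entry.token, entry.logprob)
    for alt in entry.top_logprobs:
        print("  alt:", alt.token, alt.logprob)

Code like this has nothing to iterate over when the server returns the Completions-style payload shown next.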

And this is what vLLM returns as of commit 901cf4c:

{
  "id": "cmpl-feb5333e02ef436e95c09b8f2255e4c0",
  "object": "chat.completion",
  "created": 997761,
  "model": "Qwen/Qwen1.5-72B-Chat-GPTQ-Int4",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello"
      },
      "logprobs": {
        "text_offset": [
          0
        ],
        "token_logprobs": [
          -0.015691734850406647
        ],
        "tokens": [
          "Hello"
        ],
        "top_logprobs": [
          {
            "Hello": -0.015691734850406647,
            "How": -5.140691757202148
          }
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "total_tokens": 11,
    "completion_tokens": 1
  }
}

Here is how the Completions API's LogProbs schema is reused inside the ChatCompletion response:

from typing import Dict, List, Literal, Optional

from pydantic import BaseModel, Field

# ChatMessage is defined elsewhere in the same module.


class LogProbs(BaseModel):
    text_offset: List[int] = Field(default_factory=list)
    token_logprobs: List[Optional[float]] = Field(default_factory=list)
    tokens: List[str] = Field(default_factory=list)
    top_logprobs: Optional[List[Optional[Dict[int, float]]]] = None


class ChatCompletionResponseChoice(BaseModel):
    index: int
    message: ChatMessage
    logprobs: Optional[LogProbs] = None
    finish_reason: Optional[Literal["stop", "length"]] = None
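For comparison, a response model matching the Chat Completions spec shown at the top of this issue would look roughly like the sketch below. The class names are hypothetical placeholders, not existing vLLM code:

# Hedged sketch of a spec-shaped schema; the class names are hypothetical and
# do not exist in vLLM at this commit.
from typing import List, Optional

from pydantic import BaseModel, Field


class ChatCompletionLogProb(BaseModel):
    token: str
    logprob: float
    bytes: Optional[List[int]] = None


class ChatCompletionLogProbsContent(ChatCompletionLogProb):
    # Each generated token carries its own list of alternative tokens.
    top_logprobs: List[ChatCompletionLogProb] = Field(default_factory=list)


class ChatCompletionLogProbs(BaseModel):
    content: Optional[List[ChatCompletionLogProbsContent]] = None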

And the _create_logprobs function, which was written for the Completions API, is then reused in serving_chat.py:

logprobs = self._create_logprobs(
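A possible direction for a fix, sketched below under the assumption that the Completions-style lists (tokens, token_logprobs, top_logprobs) remain available, is to reshape them into the nested per-token structure the spec expects before building the ChatCompletion response. The helper name to_chat_logprobs is hypothetical:

# Hedged sketch, not vLLM's actual code: reshape Completions-style arrays
# (as shown in the response above) into the spec's nested per-token format.
from typing import Dict, List, Optional


def to_chat_logprobs(
    tokens: List[str],
    token_logprobs: List[Optional[float]],
    top_logprobs: Optional[List[Optional[Dict[str, float]]]] = None,
) -> Dict[str, list]:
    content = []
    top_logprobs = top_logprobs or [None] * len(tokens)
    for token, logprob, alternatives in zip(tokens, token_logprobs, top_logprobs):
        content.append({
            "token": token,
            "logprob": logprob,
            "bytes": list(token.encode("utf-8")),
            "top_logprobs": [
                {"token": t, "logprob": lp, "bytes": list(t.encode("utf-8"))}
                for t, lp in (alternatives or {}).items()
            ],
        })
    return {"content": content}

Applied to the values in the vLLM response above, this yields the same shape as the OpenAI example at the top of the issue.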
