server : avoid antiprompt in probabilities of final response #2849
Merged
jhen0409 merged 1 commit into ggml-org:master on Sep 2, 2023
Conversation
SlyEcho (Contributor) reviewed on Sep 1, 2023 and left a comment:
Yep, that does the trick.
SlyEcho approved these changes on Sep 1, 2023
sayap added a commit to sayap/ik_llama.cpp that referenced this pull request on Nov 22, 2025:
The logic to skip the logprobs of the stop token was originally from ggml-org/llama.cpp#2849, and was later modified as part of ggml-org/llama.cpp#10643 to be applied only to STOP_TYPE_WORD. The latter change wasn't included in ikawrakow#723. Then, after ikawrakow#958 got merged, the logic got inadvertently applied to GLM-4.5/4.6 and Kimi K2, resulting in truncated logprobs when streaming is off.

This commit reverts the logic from ggml-org/llama.cpp#2849, so that the logprobs of the stop token are always included in the response when logprobs is enabled. From testing, this matches the behavior of the Fireworks inference server for both the chat completions and text completions endpoints.

Also fix logprobs param handling for the text completion endpoint.
ikawrakow pushed a commit to ikawrakow/ik_llama.cpp that referenced this pull request on Nov 24, 2025:
The logic to skip the logprobs of the stop token was originally from ggml-org/llama.cpp#2849, and was later modified as part of ggml-org/llama.cpp#10643 to be applied only to STOP_TYPE_WORD. The latter change wasn't included in #723. Then, after #958 got merged, the logic got inadvertently applied to GLM-4.5/4.6 and Kimi K2, resulting in truncated logprobs when streaming is off.

This commit reverts the logic from ggml-org/llama.cpp#2849, so that the logprobs of the stop token are always included in the response when logprobs is enabled. From testing, this matches the behavior of the Fireworks inference server for both the chat completions and text completions endpoints.

Also fix logprobs param handling for the text completion endpoint.
This fixes an issue where the probabilities of the stopping_word were included in the final response of /completion.

To test the response without stream mode:
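The original test commands are not preserved here; the following is a minimal sketch, assuming a llama.cpp server running locally on port 8080 and the /completion parameters `n_probs` (to request token probabilities) and `stop` (the stopping word). The prompt, port, and stopping word are placeholders.

```python
import json
import urllib.request

# Sketch: query /completion without streaming and inspect the probabilities
# attached to the final response. With this fix, the stopping word should no
# longer appear among them. Server URL and parameter values are assumptions.
payload = {
    "prompt": "Building a website can be done in 10 simple steps:",
    "n_predict": 32,
    "n_probs": 5,          # ask for top-5 token probabilities
    "stop": ["step"],      # stopping word (antiprompt)
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(json.dumps(body.get("completion_probabilities"), indent=2))
```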
With stream mode (see the last event):
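A sketch of the streaming case, under the same assumptions as above; it additionally assumes the server delivers the stream as server-sent events, one JSON payload per `data:` line, with the final response carried by the last event.

```python
import json
import urllib.request

# Sketch: stream the completion and keep only the last event, which carries
# the final response whose probabilities should not include the stopping word.
payload = {
    "prompt": "Building a website can be done in 10 simple steps:",
    "n_predict": 32,
    "n_probs": 5,
    "stop": ["step"],
    "stream": True,
}

req = urllib.request.Request(
    "http://localhost:8080/completion",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
last_event = None
with urllib.request.urlopen(req) as resp:
    for raw in resp:
        line = raw.decode().strip()
        if line.startswith("data: "):
            last_event = json.loads(line[len("data: "):])

# Inspect the last event (the final response).
print(json.dumps(last_event, indent=2))
```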
Or see the console output in the web UI.