server: add OpenAI compatible response format for /completions#10627
Nero7991 wants to merge 0 commits into ggml-org:master from …
Conversation
I think we can just replace the … Besides, @Nero7991 you should add a test in …
@ngxson So test_completion.py is currently testing for the existing response format. Unless we're switching completely to OpenAI compat, we'd need to figure out what response type the server is compiled for, right? Or should I create a new test_completion_oai_compat.py file?
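One way to sidestep the "which format was the server compiled for" problem is for the test to detect the response shape instead of assuming it. The helper below is a sketch only — the function name and exact fields are assumptions for illustration, not existing test code:

```python
# Hypothetical helper for a server test: detect whether a /completions
# response uses the OpenAI "text_completion" shape or the existing
# llama.cpp-native shape. Illustrative only; not real test code.
def is_oai_completion_response(res: dict) -> bool:
    if res.get("object") != "text_completion":
        return False
    choices = res.get("choices")
    return isinstance(choices, list) and all(
        {"text", "index", "finish_reason"} <= set(c) for c in choices
    )

oai_res = {
    "object": "text_completion",
    "choices": [{"text": "hi", "index": 0, "logprobs": None, "finish_reason": "stop"}],
}
native_res = {"content": "hi", "stop": True}
print(is_oai_completion_response(oai_res))     # → True
print(is_oai_completion_response(native_res))  # → False
```

With a shape check like this, a single test file could assert the correct invariants for whichever format the server returns.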
ngxson left a comment
The idea is good overall, but I think this needs a bit more refactoring to make it more "clean".
Also, I don't think we should hide it behind a compiler flag; it's just not convenient for most users. Another idea is that we can add a specific field to each request, say "oai_compat", and set it to true by default. Users who don't want the OAI response would need to explicitly add "oai_compat": false.
I'll propose my approach via another PR
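As a rough sketch of that default-on behaviour (the field handling below is an assumption about the proposed design, not merged code):

```python
# Sketch of the proposed per-request flag: "oai_compat" defaults to true,
# so clients that want the existing llama.cpp-native response must opt
# out explicitly. Purely illustrative; not the server's actual parsing.
def wants_oai_compat(request_body: dict) -> bool:
    value = request_body.get("oai_compat", True)
    if not isinstance(value, bool):
        raise ValueError('"oai_compat" must be a boolean')
    return value

print(wants_oai_compat({"prompt": "hello"}))                       # → True
print(wants_oai_compat({"prompt": "hello", "oai_compat": False}))  # → False
```

The default-true choice means existing OpenAI-client integrations work out of the box, while the old behaviour stays one field away.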
examples/server/server.cpp
Outdated
send_final_* is called from the inference thread, but what we're doing here is only formatting the response, which should be done at the HTTP layer. I'd suggest moving your code to a new function format_final_*_oaicompat, much like what we have with format_final_response_oaicompat.
I noticed this yesterday. I found that there's a function called handle_completions_generic (there was a TODO suggesting merging it with handle_chat_completions). I've done that, and I'll create another PR with it, since it's probably the right way to do this.
I can probably do the oai_compat set-to-true-by-default part later and send a draft PR.
Be aware that I'm doing a big refactoring in #10643 to reduce the usage of JSON internally. This can introduce quite a lot of conflicts with your code.
Adds support for an (almost) full OpenAI API response format for the completion-related endpoints, including when logprobs is specified.
The frontend is also modified to support this format alongside the existing one, so it remains functional.
The HELM benchmarks from CRFM support an OpenAI-compatible API server, so this enables testing differently quantized models for degradation against that benchmark. I tested it on a QwQ Preview 32B GGUF Q4_K_M to evaluate the model against other frontier models.
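To illustrate the logprobs part of the format: the legacy OpenAI /v1/completions response carries a logprobs object with parallel tokens, token_logprobs, top_logprobs, and text_offset arrays. Below is a minimal sketch of building that object from per-token probabilities; the input field names ("token", "prob", "top") are assumptions for illustration, not the server's actual schema:

```python
import math

# Illustrative mapping from a list of per-token probabilities (assumed
# input schema) to the OpenAI legacy completions "logprobs" object.
def to_oai_logprobs(token_probs):
    return {
        "tokens": [t["token"] for t in token_probs],
        "token_logprobs": [math.log(t["prob"]) for t in token_probs],
        "top_logprobs": [
            {alt["token"]: math.log(alt["prob"]) for alt in t["top"]}
            for t in token_probs
        ],
        "text_offset": [],  # byte offsets omitted in this sketch
    }

probs = [{"token": "Hello", "prob": 0.5, "top": [{"token": "Hello", "prob": 0.5}]}]
lp = to_oai_logprobs(probs)
print(lp["tokens"])                       # → ['Hello']
print(round(lp["token_logprobs"][0], 4))  # → -0.6931
```

The main point is that OpenAI clients expect log-probabilities in parallel arrays, so whatever per-token data the server tracks has to be transposed into this shape.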
This support can be enabled at compile time with the OAI_FULL_COMPAT preprocessor definition, like so:
Using make:
make CXXFLAGS="-DOAI_FULL_COMPAT" llama-server
When compiled correctly and after running llama-server, the output should include "INFO: OpenAI full compatibility mode enabled", as seen in the following output snippet.
Example:
Response: