Description
I would like to make a humble suggestion for an additional feature in llama-swap. When configuring frontends, I would like to access metadata about the different models provided by my llama-swap instance, preferably in the response from `/v1/models`.

Take the context window, for example: I'd like to specify it in exactly one place and have that value used both for the command-line options and in the response of `/v1/models`. I could envision a syntax loosely along the lines of:
```yaml
models:
  llamacpp-mistral-small-3.2-24b-2506:
    macros:
      - context_len=24000
      - n_concurrent=2
    cmd: |
      llama-server
      --port ${PORT}
      --hf-repo bartowski/mistralai_Mistral-Small-3.2-24B-Instruct-2506-GGUF:Q6_K_L
      --jinja
      --ctx-size ${context_len}
      --cache-type-k q8_0
      --cache-type-v q5_1
      --parallel ${n_concurrent}
      --flash-attn
      --temp 0.15
    metadata:
      meta:
        - context_window: ${context_len}
        - concurrency: ${n_concurrent}
        - mime-types:
            - "image/jpeg"
            - "image/png"
    proxy: http://127.0.0.1:${PORT}
```

Here, macros could be scoped to the respective model (and not only global, as I believe is the case today), and `metadata` would be a new keyword in llama-swap's YAML parser. The response from `/v1/models` could then look something like:
```json
{
  "object": "list",
  "data": [
    {
      "id": "llamacpp-mistral-small-3.2-24b-2506",
      "object": "model",
      "created": 1686935002,
      "owned_by": "llama-swap",
      "meta": {
        "context_window": 24000,
        "concurrency": 2,
        "mime-types": [
          "image/jpeg",
          "image/png"
        ]
      }
    }
  ]
}
```
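As a rough sketch of what the expansion could look like, the snippet below substitutes the same model-scoped macros into both the `cmd` template and the proposed `metadata` block, and attaches the result as a `meta` field on a `/v1/models` entry. The type and field names here are purely hypothetical and not taken from llama-swap's actual code; this only illustrates the intent.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// ModelConfig is a hypothetical, simplified view of one model entry in the
// config; llama-swap's real structs will differ.
type ModelConfig struct {
	Macros   map[string]string      // e.g. {"context_len": "24000"}
	Cmd      string                 // command template with ${...} placeholders
	Metadata map[string]interface{} // the proposed free-form metadata block
}

// expandMacros substitutes ${name} placeholders in a string.
func expandMacros(s string, macros map[string]string) string {
	for k, v := range macros {
		s = strings.ReplaceAll(s, "${"+k+"}", v)
	}
	return s
}

// expandMetadata applies the same substitution to string values in the
// metadata block, so the /v1/models response reflects the macro values.
func expandMetadata(meta map[string]interface{}, macros map[string]string) map[string]interface{} {
	out := make(map[string]interface{}, len(meta))
	for k, v := range meta {
		if s, ok := v.(string); ok {
			out[k] = expandMacros(s, macros)
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	cfg := ModelConfig{
		Macros: map[string]string{"context_len": "24000", "n_concurrent": "2"},
		Cmd:    "llama-server --ctx-size ${context_len} --parallel ${n_concurrent}",
		Metadata: map[string]interface{}{
			"context_window": "${context_len}",
			"concurrency":    "${n_concurrent}",
			"mime-types":     []string{"image/jpeg", "image/png"},
		},
	}

	// The expanded command line the model would be launched with.
	fmt.Println(expandMacros(cfg.Cmd, cfg.Macros))

	// What the extra "meta" field in /v1/models could carry for this model.
	entry := map[string]interface{}{
		"id":       "llamacpp-mistral-small-3.2-24b-2506",
		"object":   "model",
		"owned_by": "llama-swap",
		"meta":     expandMetadata(cfg.Metadata, cfg.Macros),
	}
	json.NewEncoder(os.Stdout).Encode(entry)
}
```

A real implementation would presumably preserve the YAML types of the substituted values (so `context_window` stays a number rather than the string this string-based sketch produces), but the single-source-of-truth idea is the same.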
There is some precedent for adding extra fields to the `/v1/models` response; consider, for example, what llama.cpp does:
```console
$ curl -s -X GET http://localhost:8686/upstream/llamacpp-Qwen3-Coder-30B-A3B-it/v1/models | jq
{
  "models": [
    {
      "name": "/root/.cache/llama.cpp/unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf",
      "model": "/root/.cache/llama.cpp/unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf",
      "modified_at": "",
      "size": "",
      "digest": "",
      "type": "model",
      "description": "",
      "tags": [
        ""
      ],
      "capabilities": [
        "completion"
      ],
      "parameters": "",
      "details": {
        "parent_model": "",
        "format": "gguf",
        "family": "",
        "families": [
          ""
        ],
        "parameter_size": "",
        "quantization_level": ""
      }
    }
  ],
  "object": "list",
  "data": [
    {
      "id": "/root/.cache/llama.cpp/unsloth_Qwen3-Coder-30B-A3B-Instruct-GGUF_Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf",
      "object": "model",
      "created": 1755869917,
      "owned_by": "llamacpp",
      "meta": {
        "vocab_type": 2,
        "n_vocab": 151936,
        "n_ctx_train": 262144,
        "n_embd": 2048,
        "n_params": 30532122624,
        "size": 17659361280
      }
    }
  ]
}
```
In this case, not all of the information I need to configure the frontend is available. In any case, it would be infeasible for me to use the `/upstream/` path, since that would mean loading each and every model in my llama-swap config.
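To illustrate how a frontend could use the proposed field, here is a minimal client-side sketch that queries `/v1/models` once and reads the hypothetical `meta` block. The field names mirror the proposal above and are not part of any existing API; the URL follows the local setup in the curl example and would need adjusting.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// modelEntry mirrors the proposed /v1/models entry; the "meta" field is the
// hypothetical extension suggested in this issue, not an existing API.
type modelEntry struct {
	ID   string `json:"id"`
	Meta struct {
		ContextWindow int      `json:"context_window"`
		Concurrency   int      `json:"concurrency"`
		MimeTypes     []string `json:"mime-types"`
	} `json:"meta"`
}

type modelList struct {
	Data []modelEntry `json:"data"`
}

func main() {
	// Assumes llama-swap is reachable on the same local address as in the
	// example above; adjust for your own setup.
	resp, err := http.Get("http://localhost:8686/v1/models")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var list modelList
	if err := json.NewDecoder(resp.Body).Decode(&list); err != nil {
		panic(err)
	}

	// A frontend could now pick per-model settings (context size, parallel
	// request limits, accepted media types) from the same place the backend
	// command line is configured, without loading any model.
	for _, m := range list.Data {
		fmt.Printf("%s: ctx=%d, concurrency=%d, mime=%v\n",
			m.ID, m.Meta.ContextWindow, m.Meta.Concurrency, m.Meta.MimeTypes)
	}
}
```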
Resources
- OpenAI's documentation of /v1/models: https://platform.openai.com/docs/api-reference/models/list
- An example issue of how rich output from `/v1/models` could be used to configure a frontend (the example is for gptel with Emacs).