Conversation

@Bobbins228 (Contributor) commented Dec 12, 2025:

What does this PR do?

Adds the llama.cpp server as a remote inference provider.

Test Plan

Manually tested the provider.

# Add llama-cpp-server as a remote inference provider in your config.yaml
providers:
  inference:
  - provider_id: ${env.LLAMA_CPP_SERVER_URL:+llama-cpp-server}
    provider_type: remote::llama-cpp-server
    config:
      base_url: ${env.LLAMA_CPP_SERVER_URL:=http://localhost:8080/v1}
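
Once the stack is running, you can sanity-check that the provider registered. A minimal sketch (not part of this PR), assuming a Llama Stack server at http://localhost:8321 and that the client exposes providers.list():

# Sketch: list inference providers and confirm llama-cpp-server appears.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
for provider in client.providers.list():
    if provider.api == "inference":
        print(provider.provider_id, provider.provider_type)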

Run a model with llama-server (see the llama.cpp docs).

llama-server \
  -m qwen2.5-32b-instruct-q4_k_m-00001-of-00005.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 999 \
  --threads 4 \
  --mlock \
  --alias qwen2.5
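
Before wiring the provider in, a quick smoke test confirms the server is up and serving the alias. A hedged sketch, assuming llama-server's OpenAI-compatible endpoint at http://localhost:8080/v1:

# Sketch: llama-server exposes an OpenAI-style API under /v1, so listing
# models should include the alias configured above (e.g. "qwen2.5").
import requests

resp = requests.get("http://localhost:8080/v1/models")
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])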

Get a response from the model and exercise basic RAG.

from llama_stack_client import LlamaStackClient
import requests
import io

client = LlamaStackClient(base_url="http://localhost:8321")
sources = ["https://www.paulgraham.com/greatwork.html"]

file_ids = []
for source in sources:
    print("Downloading and uploading document:", source)
    
    response = requests.get(source)
    file_content = io.BytesIO(response.content)
    filename = source.split("/")[-1]

    file = client.files.create(
        file=(filename, file_content, "text/html"),  # the source is an HTML page, not a PDF
        purpose="assistants"
    )
    file_ids.append(file.id)
    print(f"✓ Uploaded {filename} (file_id: {file.id})")

vector_store = client.vector_stores.create(
    name="example_vector_store",
    file_ids=file_ids,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 512,
            "chunk_overlap_tokens": 128
        }
    },
    extra_body={
        "embedding_model": "sentence-transformers/nomic-ai/nomic-embed-text-v1.5",
        "embedding_dimension": 768,
        "provider_id": "sqlite-vec"
    }
)
print("Created vector store with ID:", vector_store.id)

resp = client.responses.create(
    model="llama-cpp-server/qwen2.5",
    instructions="you are a helpful assistant",
    tools=[{"type": "file_search", "vector_store_ids": [vector_store.id]}],
    input="""How do you do great work?""",
    stream=True,
)

for chunk in resp:
    if hasattr(chunk, 'type') and chunk.type == "response.output_text.delta":
        if hasattr(chunk, 'delta') and chunk.delta:
            print(chunk.delta, end="", flush=True)
print()

Example output:

To do great work, one can follow several key principles:

1. **Set Clear Goals:** Knowing exactly what you want to achieve helps you stay focused and motivated.
2. **Plan and Organize:** Break your work into manageable tasks and set deadlines. This helps in maintaining progress and avoiding procrastination.
3. **Master Your Skills:** Continuously improve your skills relevant to the task. This could be through training, practice, or seeking feedback.
4. **Stay Motivated:** Keep a positive attitude and remind yourself of the purpose and benefits of your work.
5. **Seek Feedback:** Regular feedback can help you improve and correct any mistakes early on.
6. **Collaborate and Communicate:** Working well with others and clear communication can enhance your productivity and the quality of your work.
7. **Reflect and Adapt:** Regularly reflect on your work process and outcomes. Be open to changing your methods if necessary.

Would you like more detailed information on any of these points?
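
For comparison, the same request can be made without streaming. A minimal sketch, assuming the Responses API mirrors OpenAI's and the returned object exposes an output_text convenience property:

# Sketch: non-streaming variant of the request above (output_text is
# assumed to aggregate the generated text, as in the OpenAI client).
resp = client.responses.create(
    model="llama-cpp-server/qwen2.5",
    instructions="you are a helpful assistant",
    tools=[{"type": "file_search", "vector_store_ids": [vector_store.id]}],
    input="How do you do great work?",
)
print(resp.output_text)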

The meta-cla bot added the CLA Signed label on Dec 12, 2025.
@nathan-weinberg (Contributor) commented:

I wonder if it might be a bit clearer to use llama-cpp-server for the name of the provider/configs. There is unfortunately a naming collision between these projects, and it could prove confusing to have references to both a "Llama Server" and a "Llama Stack Server".

@Bobbins228 (Contributor, Author) commented:

@nathan-weinberg Great point, I'll update

@Bobbins228 force-pushed the llama-server-inference branch from 4acbeab to 1484a8f on December 15, 2025.
@mattf (Collaborator) left a comment:

@Bobbins228 this looks great. please include the output of the inference tests against a llama.cpp server.

@Bobbins228 (Contributor, Author) commented:

Thanks @mattf, I have already added an example output to the PR description. Are you looking for it elsewhere?

@mattf (Collaborator) commented Dec 15, 2025:

> Thanks, I have already added an example output to the PR description. Are you looking for it elsewhere?

yes, an integration suite run using this provider. you can limit it to --pattern inference / -k inference

take a look at scripts/integration-tests.sh
