Conversation

@Bobbins228 (Contributor) commented Dec 12, 2025:

What does this PR do?

Adds the llama.cpp server as a remote inference provider.

Test Plan

Manually tested the provider.

# Add llama-cpp-server as a remote inference provider in your config.yaml
providers:
  inference:
  - provider_id: ${env.LLAMA_CPP_SERVER_URL:+llama-cpp-server}
    provider_type: remote::llama-cpp-server
    config:
      base_url: ${env.LLAMA_CPP_SERVER_URL:=http://localhost:8080/v1}
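
Once the stack is running, you can sanity-check that the provider registered. A minimal sketch (not part of this PR), assuming a Llama Stack server at http://localhost:8321 and that the client exposes providers.list():

# Sketch: list inference providers and confirm llama-cpp-server appears.
from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")
for provider in client.providers.list():
    if provider.api == "inference":
        print(provider.provider_id, provider.provider_type)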

Run a model with llama-server (see the llama.cpp docs).

llama-server \
  -m qwen2.5-32b-instruct-q4_k_m-00001-of-00005.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 999 \
  --threads 4 \
  --mlock \
  --alias qwen2.5
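
Before wiring the provider in, a quick smoke test confirms the server is up and serving the alias. A hedged sketch, assuming llama-server's OpenAI-compatible endpoint at http://localhost:8080/v1:

# Sketch: llama-server exposes an OpenAI-style API under /v1, so listing
# models should include the alias configured above (e.g. "qwen2.5").
import requests

resp = requests.get("http://localhost:8080/v1/models")
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])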

Get a response from the model and exercise basic RAG.

from llama_stack_client import LlamaStackClient
import requests
import io

client = LlamaStackClient(base_url="http://localhost:8321")
sources = ["https://www.paulgraham.com/greatwork.html"]

file_ids = []
for source in sources:
    print("Downloading and uploading document:", source)
    
    response = requests.get(source)
    file_content = io.BytesIO(response.content)
    filename = source.split("/")[-1]

    file = client.files.create(
        file=(filename, file_content, "text/html"),  # the source is an HTML page, not a PDF
        purpose="assistants"
    )
    file_ids.append(file.id)
    print(f"✓ Uploaded {filename} (file_id: {file.id})")

vector_store = client.vector_stores.create(
    name="example_vector_store",
    file_ids=file_ids,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 512,
            "chunk_overlap_tokens": 128
        }
    },
    extra_body={
        "embedding_model": "sentence-transformers/nomic-ai/nomic-embed-text-v1.5",
        "embedding_dimension": 768,
        "provider_id": "sqlite-vec"
    }
)
print("Created vector store with ID:", vector_store.id)

resp = client.responses.create(
    model="llama-cpp-server/qwen2.5",
    instructions="you are a helpful assistant",
    tools=[{"type": "file_search", "vector_store_ids": [vector_store.id]}],
    input="""How do you do great work?""",
    stream=True,
)

for chunk in resp:
    if hasattr(chunk, 'type') and chunk.type == "response.output_text.delta":
        if hasattr(chunk, 'delta') and chunk.delta:
            print(chunk.delta, end="", flush=True)
print()

Example output:

To do great work, one can follow several key principles:

1. **Set Clear Goals:** Knowing exactly what you want to achieve helps you stay focused and motivated.
2. **Plan and Organize:** Break your work into manageable tasks and set deadlines. This helps in maintaining progress and avoiding procrastination.
3. **Master Your Skills:** Continuously improve your skills relevant to the task. This could be through training, practice, or seeking feedback.
4. **Stay Motivated:** Keep a positive attitude and remind yourself of the purpose and benefits of your work.
5. **Seek Feedback:** Regular feedback can help you improve and correct any mistakes early on.
6. **Collaborate and Communicate:** Working well with others and clear communication can enhance your productivity and the quality of your work.
7. **Reflect and Adapt:** Regularly reflect on your work process and outcomes. Be open to changing your methods if necessary.

Would you like more detailed information on any of these points?
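
For comparison, the same request can be made without streaming. A minimal sketch, assuming the Responses API mirrors OpenAI's and the returned object exposes an output_text convenience property:

# Sketch: non-streaming variant of the request above (output_text is
# assumed to aggregate the generated text, as in the OpenAI client).
resp = client.responses.create(
    model="llama-cpp-server/qwen2.5",
    instructions="you are a helpful assistant",
    tools=[{"type": "file_search", "vector_store_ids": [vector_store.id]}],
    input="How do you do great work?",
)
print(resp.output_text)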

The meta-cla bot added the CLA Signed label on Dec 12, 2025.
@nathan-weinberg (Contributor) commented:

I wonder if it might be a bit clearer to use llama-cpp-server for the name of the provider/configs. There is unfortunately a naming collision between these projects, and it could prove confusing to have references to both a "Llama Server" and a "Llama Stack Server".

@Bobbins228 (Contributor, Author) commented:

@nathan-weinberg Great point, I'll update

@Bobbins228 force-pushed the llama-server-inference branch from 4acbeab to 1484a8f on December 15, 2025.
@mattf (Collaborator) left a comment:

@Bobbins228 this looks great. please include the output of the inference tests against a llama.cpp server.

@Bobbins228 (Contributor, Author) commented:

Thanks @mattf, I have already added an example output to the PR description. Are you looking for it elsewhere?

@mattf (Collaborator) commented Dec 15, 2025:

> Thanks, I have already added an example output to the PR description. Are you looking for it elsewhere?

yes, an integration suite run using this provider. you can limit it to --pattern inference / -k inference

take a look at scripts/integration-tests.sh
