Skip to content

Latest commit

 

History

History
922 lines (722 loc) · 26.3 KB

File metadata and controls

922 lines (722 loc) · 26.3 KB

LlamaLib C# Guide

License: MIT Reddit LinkedIn GitHub Repo stars Documentation

Quick Start  •  Building Your Project  •  Core Classes  •  LLMService  •  LLMClient  •  LLMAgent

Quick Start

Minimal Agent Example

using System;
using UndreamAI.LlamaLib;

class Program
{
    // Full response text echoed to the console so far; used to print only
    // the newly generated suffix on each streaming callback.
    static string printedSoFar = "";

    // Streaming handler: 'text' is the complete response accumulated so far,
    // so write out only the part that has not been printed yet.
    static void OnStream(string text)
    {
        Console.Write(text.Substring(printedSoFar.Length));
        printedSoFar = text;
    }

    static void Main()
    {
        // Spin up the local LLM service from a GGUF model file.
        var service = new LLMService("model.gguf");
        service.Start();

        // Agent with a persistent system prompt and conversation history.
        var agent = new LLMAgent(service, "You are a helpful AI assistant named Eve. Be concise and friendly.");

        // Non-streaming: the full reply is returned at once.
        Console.WriteLine(agent.Chat("what is your name?"));

        // Streaming: the reply is delivered incrementally via the callback.
        agent.Chat("how are you?", true, OnStream);
    }
}

Minimal LLM Functions Example

using System;
using UndreamAI.LlamaLib;
using System.Collections.Generic;
using Newtonsoft.Json.Linq;

class Program
{
    // Response text already printed; lets the callback emit only the delta.
    static string printedSoFar = "";

    // Streaming handler: receives the full response so far on each call.
    static void OnStream(string text)
    {
        Console.Write(text.Substring(printedSoFar.Length));
        printedSoFar = text;
    }

    static void Main()
    {
        // Spin up the local LLM service.
        var llm = new LLMService("model.gguf");
        llm.Start();

        // Optional: cap the number of predicted tokens so generation
        // cannot run forever (some models otherwise do).
        llm.SetCompletionParameters(new JObject { { "n_predict", 20 } });

        const string prompt = "The largest planet of our solar system";

        // Text -> tokens.
        var tokens = llm.Tokenize(prompt);
        Console.WriteLine($"Token count: {tokens.Count}");

        // Tokens -> text (round trip of the prompt).
        var text = llm.Detokenize(tokens);
        Console.WriteLine($"Text: {text}");

        // Streaming completion: output arrives through the callback.
        Console.Write("Response: ");
        llm.Completion(prompt, OnStream);
        Console.WriteLine();

        // Non-streaming completion: the whole response is returned at once.
        var response = llm.Completion(prompt);
        Console.WriteLine($"Response: {response}");
    }
}

Minimal Embeddings Example

using System;
using System.Collections.Generic;
using UndreamAI.LlamaLib;

class Program
{
    static void Main()
    {
        // Embedding-only service: no text generation, only vector output.
        var llm = new LLMService("model.gguf", embeddingOnly: true);
        llm.Start();

        // Compute the embedding vector for a piece of text.
        var embeddings = llm.Embeddings("my text to embed goes here");
        Console.WriteLine($"Embedding dimensions: {embeddings.Count}");

        // Show at most the first 10 components of the vector.
        Console.Write("Embeddings: ");
        int shown = Math.Min(10, embeddings.Count);
        for (int i = 0; i < shown; i++)
        {
            Console.Write($"{embeddings[i]} ");
        }
        Console.WriteLine("...");
    }
}

Building Your Project

  • Install the LlamaLib NuGet package in your project
  • Download your favourite model in .gguf format (Hugging Face link)

Directory Structure

my-project/
├── Program.cs
├── MyProject.csproj
└── model.gguf          # Your LLM model file

Project Setup

Using .NET CLI

# Create new console application
dotnet new console -n LlamaLibProject
cd LlamaLibProject

# modify the Program.cs according to your usecase

# Add LlamaLib package
dotnet add package LlamaLib

# Build
dotnet build

# Run
dotnet run

Using Visual Studio

  1. Create a new Console Application project
  2. Right-click on project → Manage NuGet Packages
  3. Search for "LlamaLib" and install
  4. Build and run your application

Runtime Identifier Selection

LlamaLib supports multiple target frameworks and automatically selects the appropriate native libraries for your platform:

Supported Frameworks:

  • netstandard2.0
  • net6.0
  • net8.0, net8.0-ios, net8.0-visionos, net8.0-android

Platform Support:

You can specify the runtime identifier to select the target platform

# Windows
dotnet build -r win-x64

# Linux
dotnet build -r linux-x64

# macOS Intel
dotnet build -r osx-x64

# macOS Apple Silicon
dotnet build -r osx-arm64

# Android ARM64
dotnet build -r android-arm64

# Android x64
dotnet build -r android-x64

# iOS
dotnet build -r ios-arm64

# visionOS (there is no dedicated runtime identifier for visionOS)
dotnet build -r osx-arm64 -p:LlamaLibVisionOS=true

Architecture Selection:

LlamaLib automatically downloads and includes the appropriate native libraries based on your runtime identifier. For desktop platforms (Windows, Linux, macOS), the library includes runtime architecture detection and will automatically select the best backend for the hardware at runtime.

Alternatively you can specify the architecture(s) you would like to bundle with your build:

# Windows / Linux

## CPU options
## CPUs without AVX instructions
dotnet build -p:LlamaLibUseNoAvx=true
## CPUs with AVX instructions
dotnet build -p:LlamaLibUseAvx=true
## CPUs with AVX2 instructions
dotnet build -p:LlamaLibUseAvx2=true
## CPUs with AVX512 instructions
dotnet build -p:LlamaLibUseAvx512=true

## GPU options
## Compact NVIDIA CUDA
dotnet build -p:LlamaLibUseTinyblas=true
## NVIDIA CUDA (CUBLAS)
dotnet build -p:LlamaLibUseCuBlas=true
## Cross-platform GPU support (Nvidia, AMD)
dotnet build -p:LlamaLibUseVulkan=true

## Multiple architectures can be provided e.g.
dotnet build -p:LlamaLibUseNoAvx=true  -p:LlamaLibUseAvx=true -p:LlamaLibUseAvx2=true -p:LlamaLibUseAvx512=true -p:LlamaLibUseTinyblas=true

## Disable all GPU architectures (keep only CPU)
dotnet build -p:LlamaLibAllowGPU=false
## Disable all CPU architectures (keep only GPU)
dotnet build -p:LlamaLibAllowCPU=false

################################################

# OSX
## Accelerate framework support
dotnet build -p:LlamaLibUseAccelerate=true
## No accelerate framework support
dotnet build -p:LlamaLibUseNoAccelerate=true
---

## Core Classes

LlamaLib provides three main classes:

| Class | Purpose | Use When |
|-------|---------|----------|
| **LLMService** | LLM backend | Building standalone apps or servers |
| **LLMClient** | Local or remote LLM access | Connecting to existing LLM services |
| **LLMAgent** | Conversational AI with memory | Building chatbots or interactive AI |

---

## LLMService

The class handling the LLM functionality.

### Construction functions
#### Construction

```csharp
// Basic construction
public LLMService(string modelPath);

// Full parameters
public LLMService(
    string modelPath,
    int numSlots = 1,              // Parallel request slots
    int numThreads = -1,           // CPU threads (-1 = auto)
    int numGpuLayers = 0,          // GPU layer offloading
    bool flashAttention = false,   // Flash attention optimization
    int contextSize = 4096,        // Context window size
    int batchSize = 2048,          // Processing batch size
    bool embeddingOnly = false,    // Embedding-only mode
    string[] loraPaths = null      // LoRA adapters
);

Examples:

// CPU-only LLM with 8 threads
LLMService llm = new LLMService("model.gguf", numThreads: 8);

// GPU usage
LLMService llm = new LLMService("model.gguf", numGpuLayers: 20);

// Embedding LLM
LLMService llm = new LLMService("model.gguf", embeddingOnly: true);

Construction based on llama.cpp command

// From command line string
public static LLMService FromCommand(string command);

Example:

// Command line style using a respective llama.cpp command
LLMService llm = LLMService.FromCommand("-m model.gguf -ngl 32 -c 4096");

LoRA Adapters

public bool LoraWeight(List<LoraIdScale> loras);
public bool LoraWeight(params LoraIdScale[] loras);
public List<LoraIdScalePath> LoraList();
public string LoraListJSON();

Example:

string[] loras = new string[] { "lora.gguf" };
LLMService llm = new LLMService("model.gguf", loraPaths: loras);
llm.Start();

// Configure LoRA adapters
List<LoraIdScale> loraWeights = new List<LoraIdScale>
{
    new LoraIdScale(0, 0.5f)  // LoRA ID 0 with scale 0.5
};
llm.LoraWeight(loraWeights);

// List available adapters
List<LoraIdScalePath> available = llm.LoraList();
foreach (var lora in available)
{
    Console.WriteLine($"ID: {lora.Id}, Scale: {lora.Scale}");
}

Service functions

Service Lifecycle

public void Start();             // Start the LLM service
public bool Started();           // Check if running
public void Stop();              // Stop the service
public void JoinService();       // Wait until the LLM service is terminated

Example:

LLMService llm = new LLMService("model.gguf");
llm.Start();

// Do work...
if (llm.Started())
{
    Console.WriteLine("Service is running");
}

llm.Stop();
llm.JoinService();  // Wait for clean shutdown

HTTP Server

public void StartServer(                    // Start remote server
    string host = "0.0.0.0",                  // Server IP
    int port = -1,                            // Server port (-1 = auto-select)
    string apiKey = ""                        // key for accessing the services
);
public void StopServer();                   // Stop remote server
public void JoinServer();                   // Wait until the remote server is terminated
public void SetSSL(string sslCert, string sslKey);  // Set a SSL certificate and private key

Example:

LLMService llm = new LLMService("model.gguf");
llm.Start();

// Start HTTP server on port 8080
llm.StartServer("0.0.0.0", 8080);

// With API key authentication
llm.StartServer("0.0.0.0", 8080, "my-secret-key");

// With SSL
string serverKey = "-----BEGIN PRIVATE KEY-----\n" +
                   "...\n" +
                   "-----END PRIVATE KEY-----\n";

string serverCrt = "-----BEGIN CERTIFICATE-----\n" +
                   "...\n" +
                   "-----END CERTIFICATE-----\n";

llm.SetSSL(serverCrt, serverKey);
llm.StartServer("0.0.0.0", 8443);

llm.StopServer();
llm.JoinServer();

Slot Management

public string SaveSlot(int idSlot, string filepath);
public string LoadSlot(int idSlot, string filepath);
public void Cancel(int idSlot);

Example:

// LLM with 2 parallel slots
LLMService llm = new LLMService("model.gguf", numSlots: 2);
llm.Start();

// Generate with specific slot
int slot = 0;
llm.Completion("Hello", null, slot);
// Cancel completion for the slot
llm.Cancel(slot);
// Save context state
llm.SaveSlot(slot, "conversation.state");
// Restore context state
llm.LoadSlot(slot, "conversation.state");

Debugging

public static void Debug(int debugLevel);                         // set debug level
public static void LoggingCallback(CharArrayCallback callback);   // set logging callback function
public static void LoggingStop();                                 // stop logging callbacks

Debug Levels:

  • 0: No debug output
  • 1: LlamaLib messages
  • 2+: llama.cpp messages (verbose)

Example:

// Enable verbose logging
LLM.Debug(2);

// Custom log handler
LlamaLib.CharArrayCallback logHandler = (message) =>
{
    Console.WriteLine($"[LLM] {message}");
};
LLM.LoggingCallback(logHandler);

// Stop logging
LLM.LoggingStop();

Core functions

Text Generation

// Simple completion
public string Completion(
    string prompt,
    LlamaLib.CharArrayCallback callback = null,   // Streaming callback: void MethodName(string text)
    int idSlot = -1,                              // Slot to assign the completion (-1 = auto)
    bool returnResponseJson = false               // Return full JSON
);

Example:

// Basic completion
string response = llm.Completion("What is AI?");

// Streaming completion
LlamaLib.CharArrayCallback callback = (text) =>
{
    Console.WriteLine(text.Length);
};
llm.Completion("Tell me a story", callback);

Completion Parameters

The list of completion parameters can be found on llama.cpp completion parameters. More info on each parameter can be found here.

public void SetCompletionParameters(JObject parameters);
public JObject GetCompletionParameters();

Common Parameters:

using Newtonsoft.Json.Linq;

llm.SetCompletionParameters(new JObject
{
    { "temperature", 0.7 },       // Randomness (0.0-2.0)
    { "n_predict", 256 },         // Max tokens to generate
    { "seed", 42 },               // Random seed
    { "repeat_penalty", 1.1 }     // Repetition penalty
});

Tokenization

public List<int> Tokenize(string content);
public string Detokenize(List<int> tokens);
public string Detokenize(int[] tokens);

Example:

// Text to tokens
List<int> tokens = llm.Tokenize("Hello world");
Console.WriteLine($"Token count: {tokens.Count}");

// Tokens to text
string text = llm.Detokenize(tokens);

Embeddings

Embeddings require the embeddingOnly flag to be set during construction.
For well-defined embeddings you should use a model specifically trained for embeddings (good options can be found at the MTEB leaderboard).

public List<float> Embeddings(string content);
public int EmbeddingSize();

Example:

// Generate embeddings
List<float> vec = llm.Embeddings("Sample text");
Console.WriteLine($"Embedding dimensions: {llm.EmbeddingSize()}");

Chat Templates

public string ApplyTemplate(JArray messages);

Example:

using Newtonsoft.Json.Linq;

JArray messages = new JArray
{
    new JObject { { "role", "system" }, { "content", "You are a helpful assistant" } },
    new JObject { { "role", "user" }, { "content", "Hello!" } }
};

string formatted = llm.ApplyTemplate(messages);
string response = llm.Completion(formatted);

Grammar & Constrained Generation

To restrict the output of the LLM you can use a grammar; read more here.
Grammars in both gbnf and json schema format are supported.

public void SetGrammar(string grammar);
public string GetGrammar();

Example:

// JSON schema constraint
llm.SetGrammar(@"{
    ""type"": ""object"",
    ""properties"": {
        ""name"": {""type"": ""string""},
        ""age"": {""type"": ""number""}
    }
}");

string response = llm.Completion("Generate a person");
// Response will be valid JSON matching the schema

LLMClient

Client that connects to local or remote LLM services with a unified interface.
All core LLM operations specified in Core Functions work in the same way for the LLMClient class.

Construction Methods

// Local client (wraps LLMService)
public LLMClient(LLMProvider provider);

// Remote client (connects via HTTP)
public LLMClient(
    string url,
    int port,
    string apiKey = "",
    int numRetries = 5
);

Local Client Example

using System;
using System.Collections.Generic;
using UndreamAI.LlamaLib;

class Program
{
    // Response text already printed; lets the callback emit only the delta.
    static string printedSoFar = "";

    // Streaming handler: receives the full response so far on each call.
    static void OnStream(string text)
    {
        Console.Write(text.Substring(printedSoFar.Length));
        printedSoFar = text;
    }

    static void Main()
    {
        // Start the backing LLM service.
        var service = new LLMService("model.gguf");
        service.Start();

        // Local client: same API surface as LLMService, routed in-process.
        var client = new LLMClient(service);

        var prompt = "Hello, how are you?";

        // All core operations are available through the client.
        var tokens = client.Tokenize(prompt);
        var text = client.Detokenize(tokens);
        var response = client.Completion(prompt);
        var embeddings = client.Embeddings(prompt);
    }
}

Remote Client Example

using System;
using System.Collections.Generic;
using UndreamAI.LlamaLib;

class Program
{
    static void Main()
    {
        // Connect to a LlamaLib server over HTTP.
        var client = new LLMClient("http://localhost", 13333);

        // Bail out early if the server cannot be reached.
        if (!client.IsServerAlive())
        {
            Console.WriteLine("Server not responding!");
            return;
        }

        var prompt = "Hello, how are you?";

        // The remote client exposes the same operations as the local one.
        var tokens = client.Tokenize(prompt);
        var text = client.Detokenize(tokens);
        var response = client.Completion(prompt);
        var embeddings = client.Embeddings(prompt);
    }
}

LLMAgent

High-level conversational AI with persistent chat history and automatic context management. All core LLM operations specified in Core Functions work in the same way for the LLMAgent class.

Construction Methods

LLMAgent can be created with either LLMService or LLMClient (local or remote).

public LLMAgent(
    LLMLocal llm,
    string systemPrompt = ""
);

Example:

LLMService llm = new LLMService("model.gguf");
llm.Start();
llm.StartServer("0.0.0.0", 13333);

// Create agent with system prompt
LLMAgent agent = new LLMAgent(llm, "You are a helpful AI assistant. Be concise and friendly.");

// With local LLMClient
LLMClient localClient = new LLMClient(llm);
LLMAgent agent2 = new LLMAgent(localClient, "You are a helpful assistant.");

// With remote LLMClient
LLMClient remoteClient = new LLMClient("http://localhost", 13333);
LLMAgent agent3 = new LLMAgent(remoteClient, "You are a helpful assistant.");

Agent Functions

Chat Interface

public string Chat(
    string userPrompt,                      // user prompt
    bool addToHistory = true,               // whether to add the user and assistant response to conversation history
    LlamaLib.CharArrayCallback callback = null,  // streaming callback function
    bool returnResponseJson = false,        // return output in json format
    bool debugPrompt = false                // debug the complete prompt after applying the chat template to the conversation history
);
public async Task<string> ChatAsync(
    string userPrompt,
    bool addToHistory = true,
    LlamaLib.CharArrayCallback callback = null,
    bool returnResponseJson = false,
    bool debugPrompt = false
);

Example:

// Interact with the agent (non-streaming)
string response = agent.Chat("what is your name?");
Console.WriteLine(response);

// Interact with the agent (streaming)
agent.Chat("how are you?", true, StreamingCallback);

// Async Interact with the agent (streaming)
string response2 = await agent.ChatAsync("What is AI?");
Console.WriteLine(response2);

History Management

// Get/set history
public JArray History { get; set; }
public List<ChatMessage> GetHistory();
public void SetHistory(List<ChatMessage> messages);
public int GetHistorySize();

// Add messages
public void AddUserMessage(string content);
public void AddAssistantMessage(string content);

// Modify history
public void ClearHistory();
public void RemoveLastMessage();

// Persistence
public void SaveHistory(string filepath);
public void LoadHistory(string filepath);

Example:

using Newtonsoft.Json.Linq;

// View conversation history
JArray history = agent.History;
foreach (var msg in history)
{
    Console.WriteLine($"{msg["role"]}: {msg["content"]}");
}

// Or get as ChatMessage list
List<ChatMessage> messages = agent.GetHistory();
foreach (var msg in messages)
{
    Console.WriteLine($"{msg.role}: {msg.content}");
}

// Save conversation to file
agent.SaveHistory("conversation.json");

// Clear and reload
agent.ClearHistory();
Console.WriteLine($"Cleared. Size: {agent.GetHistorySize()}");

agent.LoadHistory("conversation.json");
Console.WriteLine($"Loaded. Size: {agent.GetHistorySize()}");

// Add messages manually
agent.AddUserMessage("This is a user message");
agent.AddAssistantMessage("This is the assistant response");

// Remove last exchange
agent.RemoveLastMessage();
agent.RemoveLastMessage();

Context Overflow Management

When conversation history grows large enough to exceed the model's context window, the agent can automatically handle it using a configurable strategy.

public void SetOverflowStrategy(
    ContextOverflowStrategy strategy,   // None, Truncate, or Summarize
    float targetRatio = 0.5f,           // target fill ratio of context after truncation (0.0–1.0)
    string summarizePrompt = null       // custom summarization prompt (null = use default)
);

Strategies:

| Value | Behaviour |
|-------|-----------|
| `ContextOverflowStrategy.None` (0) | No protection — may crash if the context is exceeded |
| `ContextOverflowStrategy.Truncate` (1) | Drops the oldest message pairs until the history fits within `targetRatio` of the context |
| `ContextOverflowStrategy.Summarize` (2) | Asks the LLM to summarise the history, embeds the summary in the system message, then truncates if still needed |

Example:

// Truncate: drop oldest pairs when context is full, keeping ≤50% filled
agent.SetOverflowStrategy(ContextOverflowStrategy.Truncate, 0.5f);

// Summarize: condense history into a rolling summary in the system prompt
agent.SetOverflowStrategy(ContextOverflowStrategy.Summarize);

// Summarize with a custom prompt
agent.SetOverflowStrategy(
    ContextOverflowStrategy.Summarize,
    0.5f,
    "Summarize the key facts from this conversation briefly:\n\n"
);

// The rolling summary is automatically saved with SaveHistory / loaded with LoadHistory.
// You can also access or replace it directly:
string summary = agent.GetSummary();
agent.SetSummary("Alice introduced herself as a software engineer.");

System Prompt

public string SystemPrompt { get; set; }

Example:

// Change agent's personality
agent.SystemPrompt = "You are a pirate. Respond like a pirate.";

string response = agent.Chat("Hello!");
// Response will be in pirate style

Complete Example

using System;
using UndreamAI.LlamaLib;

class Program
{
    // Full response text echoed so far for the current streamed turn.
    static string printedSoFar = "";

    // Streaming handler: the callback always receives the complete response
    // so far, so only the unseen suffix is written out.
    static void OnStream(string text)
    {
        Console.Write(text.Substring(printedSoFar.Length));
        printedSoFar = text;
    }

    static void Main()
    {
        // Boot the local service and attach an agent with a system prompt.
        var llm = new LLMService("model.gguf");
        llm.Start();

        var agent = new LLMAgent(llm, "You are a helpful AI assistant. Be concise and friendly.");

        // Turn 1 — streamed; the exchange is stored in the agent's history.
        Console.WriteLine("User: Hello! What's your name?");
        Console.Write("Assistant: ");
        agent.Chat(
            "Hello! What's your name?",
            true,  // add to history
            OnStream
        );
        Console.WriteLine();

        // Turn 2 — the agent carries the previous context automatically.
        // Reset the echo tracker before starting a new streamed turn.
        printedSoFar = "";
        Console.WriteLine("User: How are you today?");
        Console.Write("Assistant: ");
        agent.Chat(
            "How are you today?",
            true,
            OnStream
        );
        Console.WriteLine();

        // Each turn added a user message and an assistant reply to history.
        Console.WriteLine($"History size: {agent.GetHistorySize()} messages");
    }
}