---
layout: default
title: "LocalAI Tutorial - Chapter 1: Getting Started"
nav_order: 1
has_children: false
parent: LocalAI Tutorial
---
Welcome to Chapter 1: Getting Started with LocalAI. In this part of LocalAI Tutorial: Self-Hosted OpenAI Alternative, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Install LocalAI, run your first model, and make your initial API call to the OpenAI-compatible endpoint.
flowchart TD
CLIENT[OpenAI SDK or curl] -->|HTTP /v1/...| LOCALAI[LocalAI Server :8080]
LOCALAI --> ROUTER[Model Router]
ROUTER --> LLM[LLM Backend\nllama.cpp, gpt4all]
ROUTER --> IMG[Image Backend\nStable Diffusion]
ROUTER --> AUDIO[Audio Backend\nWhisper, TTS]
ROUTER --> EMB[Embedding Backend\nSentence Transformers]
LOCALAI --> GALLERY[Model Gallery\nauto-download + config]
LocalAI provides a drop-in replacement for OpenAI's API that runs entirely on your local machine. This chapter covers installation, basic setup, and your first local AI inference.
- RAM: Minimum 4GB, recommended 8GB+
- Storage: 2GB free space for models
- CPU: x64 architecture (ARM64 supported)
- GPU: Optional, NVIDIA/AMD GPUs supported
- Docker: Recommended installation method
- curl: For API testing
- Python: Optional, for SDK usage
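If you want to sanity-check these prerequisites before installing anything, here is a small convenience sketch (not part of LocalAI; it assumes a Linux or macOS machine with Python 3) that verifies docker and curl are on the PATH and reports total RAM against the 4GB minimum:
import os
import shutil

# Tools this chapter relies on: Docker to run LocalAI, curl to test the API
for tool in ("docker", "curl"):
    path = shutil.which(tool)
    print(f"{tool}: {path if path else 'NOT FOUND'}")

# Rough RAM check against the 4GB minimum / 8GB recommendation above
try:
    total_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / (1024 ** 3)
    print(f"RAM: {total_gb:.1f} GB ({'ok' if total_gb >= 4 else 'below minimum'})")
except (ValueError, OSError):
    print("RAM: could not be determined on this platform")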
# Pull and run LocalAI CPU image
docker run -p 8080:8080 \
-v localai-models:/models \
-e DEBUG=true \
localai/localai:latest-cpu
# Access at http://localhost:8080
# Ensure NVIDIA drivers are installed
nvidia-smi
# Run with GPU support
docker run -p 8080:8080 \
--gpus all \
-v localai-models:/models \
-e DEBUG=true \
localai/localai:latest-gpu-nvidia-cuda-12
# For AMD GPUs
docker run -p 8080:8080 \
--device /dev/kfd --device /dev/dri \
-v localai-models:/models \
localai/localai:latest-gpu-amd-hip
Create docker-compose.yml:
version: '3.8'
services:
localai:
image: localai/localai:latest-cpu # or gpu version
ports:
- "8080:8080"
volumes:
- ./models:/models:cached # Mount local models directory
- ./config:/config # Optional config directory
environment:
- DEBUG=true
- THREADS=4 # CPU threads
- MODELS_PATH=/models # Models directory
- GALLERY_MODELS_PATH=/models/gallery
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8080/readyz"]
interval: 30s
timeout: 10s
retries: 3
Run with:
docker-compose up -d
Alternatively, build LocalAI from source:
# Clone repository
git clone https://github.com/mudler/LocalAI
cd LocalAI
# Build with Go
go build -o localai .
# Create models directory
mkdir models
# Run server
./localai --models-path ./models --debug
Verify LocalAI is running:
# Check if service is responding
curl http://localhost:8080/readyz
# Should return: {"status":"ready"}
LocalAI can automatically download models from HuggingFace or use local files.
# Install a small, fast model for testing
curl -X POST http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{
"id": "phi-2",
"name": "phi-2"
}'
# Check installation status
curl http://localhost:8080/models/jobs/phi-2
To download model files manually instead:
# Create models directory
mkdir -p models
# Download the original Phi-2 model weights (safetensors format)
cd models
wget https://huggingface.co/microsoft/phi-2/resolve/main/model-00001-of-00002.safetensors
wget https://huggingface.co/microsoft/phi-2/resolve/main/model-00002-of-00002.safetensors
wget https://huggingface.co/microsoft/phi-2/resolve/main/tokenizer.json
wget https://huggingface.co/microsoft/phi-2/resolve/main/tokenizer_config.json
# Or download a GGUF version
wget https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf
If you have existing GGUF models:
# Copy to models directory
cp /path/to/your/model.gguf ./models/
# Create model configuration
cat > ./models/phi-2.yaml << EOF
name: phi-2
backend: llama
parameters:
model: phi-2.Q4_K_M.gguf
temperature: 0.7
top_p: 0.9
top_k: 40
EOF
# Test chat completion
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-2",
"messages": [
{
"role": "user",
"content": "Hello! Can you tell me a joke?"
}
],
"temperature": 0.7,
"max_tokens": 150
}'
The same request using the OpenAI Python SDK:
from openai import OpenAI
# Initialize client pointing to LocalAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed" # LocalAI doesn't require authentication
)
# Make a chat completion
response = client.chat.completions.create(
model="phi-2",
messages=[
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
temperature=0.7,
max_tokens=100
)
print("Assistant:", response.choices[0].message.content)
The same pattern with the OpenAI Node.js SDK:
const OpenAI = require('openai');
const client = new OpenAI({
baseURL: 'http://localhost:8080/v1',
apiKey: 'not-needed'
});
async function chat() {
const response = await client.chat.completions.create({
model: 'phi-2',
messages: [
{ role: 'user', content: 'Explain quantum computing simply' }
],
temperature: 0.7
});
console.log(response.choices[0].message.content);
}
chat();
Try various models available in the gallery:
# Small and fast models
curl -X POST http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "tinyllama"}'
# Larger models (requires more RAM)
curl -X POST http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "mistral-7b-instruct"}'
If something isn't working, start with basic connectivity checks:
# Check if LocalAI is running
curl http://localhost:8080/readyz
# Check Docker logs
docker logs localai-container
# Test basic connectivity
curl http://localhost:8080/v1/models
# Check model directory permissions
ls -la models/
# Check available disk space
df -h
# Verify model file integrity
file models/your-model.gguf
# Check LocalAI logs for errors
docker logs localai-container 2>&1 | grep -i error
# Check system resources
top # CPU usage
free -h # Memory usage
nvidia-smi # GPU usage (if applicable)
# Adjust thread count
curl -X POST http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{
"id": "phi-2",
"parameters": {
"threads": 4
}
}'
# Use smaller models
curl -X POST http://localhost:8080/models/apply \
-H "Content-Type: application/json" \
-d '{"id": "phi-2"}' # Smaller than 7B models
# Reduce context size
curl -X POST http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "phi-2",
"messages": [{"role": "user", "content": "Hello"}],
"max_tokens": 50,
"context_window": 512
}'
# Docker logs
docker logs -f localai-container
# Follow logs in real-time
docker logs -f localai-container 2>&1 | grep -E "(INFO|ERROR|WARN)"
# Get system metrics
curl http://localhost:8080/metrics
# Check model status
curl http://localhost:8080/v1/models
# Monitor resource usage
watch -n 1 'ps aux | grep localai'
Create a LocalAI configuration file for advanced settings:
# config.yaml
debug: true
threads: 4
models_path: /models
context_size: 2048
f16: false # Use f32 for compatibility
model_library:
- url: "https://raw.githubusercontent.com/mudler/LocalAI/master/gallery/index.yaml"
name: "model-gallery"
preload_models:
- name: phi-2
parameters:
model: phi-2.Q4_K_M.gguf
temperature: 0.7
top_p: 0.9
top_k: 40
Now that you have LocalAI running with your first model, let's explore the model gallery and installation options in the next chapter.
Before moving on, here is a complete end-to-end example: a simple interactive chat loop against your local server using the Python SDK.
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")
def chat_with_ai():
messages = [{"role": "system", "content": "You are a helpful assistant."}]
while True:
user_input = input("You: ")
if user_input.lower() in ['quit', 'exit']:
break
messages.append({"role": "user", "content": user_input})
response = client.chat.completions.create(
model="phi-2",
messages=messages,
max_tokens=150
)
ai_response = response.choices[0].message.content
print(f"AI: {ai_response}")
messages.append({"role": "assistant", "content": ai_response})
if __name__ == "__main__":
    chat_with_ai()
This setup gives you a fully functional local AI server that can replace OpenAI API calls in your applications. The next chapter will show you how to install more models and manage your model collection.
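Because the server is OpenAI-compatible, an application that already uses the official SDK can usually be redirected without code changes by overriding the SDK's environment variables. A minimal sketch of that pattern (assuming the openai Python package v1.x, which reads OPENAI_BASE_URL and OPENAI_API_KEY, and the phi-2 model installed earlier):
import os
from openai import OpenAI

# Point the standard OpenAI SDK at LocalAI without touching application code.
# The v1.x SDK reads these variables when no explicit arguments are passed.
os.environ.setdefault("OPENAI_BASE_URL", "http://localhost:8080/v1")
os.environ.setdefault("OPENAI_API_KEY", "not-needed")  # LocalAI ignores the key by default

client = OpenAI()  # picks up the environment variables above

response = client.chat.completions.create(
    model="phi-2",
    messages=[{"role": "user", "content": "Summarize what LocalAI does in one sentence."}],
    max_tokens=60,
)
print(response.choices[0].message.content)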
Most teams struggle here because the hard part is not writing more code, but drawing clear boundaries between the model files, the LocalAI server, and the HTTP clients that call it, so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about your LocalAI installation as an operating subsystem with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes in this chapter (the localhost endpoints, the model configuration, and the curl examples) as your checklist when adapting these patterns to your own repository.
Under the hood, the setup described in this chapter usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for the models.
- Input normalization: shape incoming data so LocalAI receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through HTTP.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
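To make that walkthrough concrete, the sketch below (a debugging aid, not part of LocalAI) checks each externally visible stage using only the endpoints introduced in this chapter. It assumes the defaults used so far, LocalAI on http://localhost:8080 and a model named phi-2; adjust both to match your setup.
import json
import urllib.request

BASE = "http://localhost:8080"  # LocalAI address used throughout this chapter
MODEL = "phi-2"                 # assumed model name; change to match your install

def get(path, timeout=30):
    with urllib.request.urlopen(BASE + path, timeout=timeout) as resp:
        return resp.read().decode()

# 1. Context bootstrap: is the server up and past its startup sequence?
print("readyz:", get("/readyz"))

# 2. Model discovery: is the model visible to the router?
models = json.loads(get("/v1/models"))
print("models:", [m["id"] for m in models.get("data", [])])

# 3. Core execution: smallest possible request through the OpenAI-compatible route
payload = json.dumps({
    "model": MODEL,
    "messages": [{"role": "user", "content": "Say hello in one word."}],
    "max_tokens": 10,
}).encode()
req = urllib.request.Request(
    BASE + "/v1/chat/completions",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=300) as resp:
    body = json.loads(resp.read().decode())
print("completion:", body["choices"][0]["message"]["content"])

# 4. Operational telemetry: the metrics endpoint from the monitoring section above
print("metrics (first lines):", get("/metrics").splitlines()[:3])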
Use the following upstream sources to verify implementation details while reading this chapter:
- `core/http/app.go`: Entry point for the LocalAI HTTP server built on Go Fiber. Registers all API routes, including `/v1/chat/completions`, `/v1/completions`, `/v1/images/generations`, and the health endpoints. This is where OpenAI compatibility is wired in.
- `core/config/application_config.go`: The `ApplicationConfig` struct holding runtime configuration: model directory, address/port, backend concurrency limits, gallery URLs, and feature flags. This is what `--models-path`, `--address`, and environment variables map to.
- `core/startup/startup.go`: Server initialization sequence: loads application config, initializes backend pools, discovers model files, loads the gallery index, and starts the HTTP server. Tracing this file gives a complete picture of the startup process.
- `Makefile`: Build targets including `make build`, `make docker`, and backend-specific targets. Shows which C/C++ backends (llama.cpp, whisper, stable-diffusion) are compiled in and what GPU acceleration flags are used.
Suggested trace strategy:
- Start at `core/startup/startup.go` to trace the initialization sequence from config loading to HTTP server ready
- Check the `core/http/app.go` route registration to confirm which OpenAI API endpoints are supported
- Review the `core/config/application_config.go` fields to understand all available environment variable overrides