Feature Description
Improve the developer tool setup by adding the option to use self-hosted OSS models (behind a proxy for easy swapping) instead of relying purely on Cursor's default models. This feature must be entirely optional for developers who wish to use it.
The output should be contained within a new subfolder under the tools/ directory (e.g., tools/local-llm-proxy/).
Why Is This Needed?
- Reduce costs: Self-hosted models can lower API usage expenses.
- Improve resiliency: Less dependence on external service uptime.
- Configurable: Easier to replace tools and models without changing the IDE setup.
Requirements
- Configurability: All important configuration settings must be configurable via environment variables (e.g.,
.env) or similar configuration files (e.g., .yaml).
- Optional: The setup should not interfere with the default workflow for developers who do not want to use it.
- Directory Structure: All related files (Docker Compose, config files) should be placed in a dedicated subfolder under the
tools/ directory.
Acceptance Criteria (V1 MVP)
For the first version of this feature, the focus is purely on the MVP to provide basic, functional capabilities. Secondary or advanced options can be skipped for now.
- Basic Proxy Setup: A functional
docker-compose.yml containing just LiteLLM to serve as the proxy.
- Configuration:
litellm-config.yaml is present, configuring a specific local model (e.g., pointing to host.docker.internal:11434) and securing it via a static key.
- Environment Setup: A
.env.example file with dummy values for the proxy key and configurable external-facing ports, ensuring .env is .gitignore'd. At a minimum, the external facing ports for the proxy (e.g., port 4000) must be configurable via .env to avoid port conflicts on the developer's machine.
- Primary Model Loader: A functional
load-model.sh script that pulls and serves the recommended 4-bit quantized model to the native macOS Ollama instance (target: MacBook Pro M2, 32GB RAM).
- Basic Documentation: A simple
README.md instructing the developer on how to use load-model.sh, start LiteLLM via Docker Compose, and configure Cursor to connect to it.
(Note: Advanced features like Ngrok/Cloudflare tunneling, Uptime Kuma monitoring, Docker profiles with init sidecars for Ollama, and SQLite token logging are explicitly out of scope for V1 and can be added in later iterations).
Suggested Solutions
Use a setup involving LiteLLM (secured with a static API key), Ngrok (for tunneling, if needed), and optionally Uptime Kuma for monitoring.
1. Configuration (tools/local-llm-proxy/litellm-config.yaml)
Force requests from Cursor to provide a specific key.
model_list:
- model_name: gpt-4o
litellm_params:
model: openai/qwen3-coder-30b-a3b
api_base: http://host.docker.internal
api_key: "not-used" # Internal key for local model
general_settings:
master_key: ${CURSOR_TO_LITELLM_KEY} # The key Cursor must send
2. Environment Variables (.env or tools/local-llm-proxy/.env)
# Proxy/Tunnel configuration
NGROK_AUTHTOKEN=your_ngrok_token
KUMA_USER=admin
KUMA_PASS=your_kuma_password
# Ports Configuration
LITELLM_PORT=4000
KUMA_PORT=3001
# Create a unique key for Cursor to use
CURSOR_TO_LITELLM_KEY=sk-local-agent-secure-12345
3. Docker Compose (tools/local-llm-proxy/docker-compose.yml)
Integrate LiteLLM, Ngrok, Uptime Kuma, and AutoKuma using environment variables for sensitive data.
services:
litellm:
image: ghcr.io/berriai/litellm:main-latest
container_name: litellm-proxy
volumes:
- ./litellm-config.yaml:/app/config.yaml
ports:
- "${LITELLM_PORT:-4000}:4000"
environment:
- CURSOR_TO_LITELLM_KEY=${CURSOR_TO_LITELLM_KEY}
command: ["--config", "/app/config.yaml", "--port", "4000"]
extra_hosts:
- "host.docker.internal:host-gateway"
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:4000/health/readiness || exit 1"]
interval: 10s
timeout: 5s
retries: 5
labels:
kuma.litellm.http.name: "LiteLLM Proxy"
kuma.litellm.http.url: "http://litellm:4000/health/readiness"
ngrok:
image: ngrok/ngrok:latest
container_name: ngrok-tunnel
environment:
- NGROK_AUTHTOKEN=${NGROK_AUTHTOKEN}
command: ["http", "litellm:4000"]
depends_on:
litellm:
condition: service_healthy
labels:
kuma.ngrok.http.name: "Ngrok Tunnel"
kuma.ngrok.http.url: "http://ngrok:4040/api/tunnels"
uptime-kuma:
image: louislam/uptime-kuma:1
container_name: uptime-kuma
volumes:
- ./uptime-kuma-data:/app/data
- /var/run/docker.sock:/var/run/docker.sock:ro
ports:
- "127.0.0.1:${KUMA_PORT:-3001}:3001"
restart: always
autokuma:
image: ghcr.io/bigboot/autokuma:latest
container_name: autokuma-sidecar
environment:
- AUTOKUMA__KUMA__URL=http://uptime-kuma:3001
- AUTOKUMA__KUMA__USERNAME=${KUMA_USER}
- AUTOKUMA__KUMA__PASSWORD=${KUMA_PASS}
volumes:
- /var/run/docker.sock:/var/run/docker.sock:ro
depends_on:
- uptime-kuma
How to use this in Cursor
- Navigate to
tools/local-llm-proxy and run: docker compose up -d.
- Get URL: Run
curl http://localhost:4040/api/tunnels to find your public Ngrok address.
- In Cursor Settings:
- Add model:
gpt-4o.
- Override OpenAI Base URL: Your Proxy URL +
/v1.
- API Key: Use the value of
CURSOR_TO_LITELLM_KEY.
Additional Suggestions for Robustness
To make this setup more developer-friendly and secure, we should also include the following enhancements to the tools/local-llm-proxy/ scope:
-
Tunneling Alternatives:
- While Ngrok is a good start, the free tier rotates the public URL on restart. We should document or provide alternative configurations for Cloudflare Tunnels (persistent free URLs) or Tailscale (mesh VPN for secure, local-only access without exposing to the public internet).
-
Configuration Safety:
- Create a
tools/local-llm-proxy/.env.example file with dummy values so developers know exactly what to configure.
- Ensure
tools/local-llm-proxy/.env is explicitly added to the project's .gitignore to prevent leaking the proxy keys (CURSOR_TO_LITELLM_KEY and NGROK_AUTHTOKEN).
-
Dedicated Documentation:
- Add a
README.md inside tools/local-llm-proxy/ detailing:
- Prerequisites (Docker, local model runners like Ollama/LM Studio).
- Step-by-step setup instructions.
- OS-specific troubleshooting (e.g.,
host.docker.internal nuances on Linux).
-
Model Loading Architecture & Best Performance (MacBook Pro M2):
- Important Hardware Note: For developers using a MacBook Pro M2 with 32 GB of RAM, the absolute best performance is achieved by running the native macOS Ollama app rather than running Ollama inside Docker. This gives the model direct, un-virtualized access to Apple's Metal API and Unified Memory.
- Quantization Impact: Using 4-bit or 8-bit quantized models significantly reduces the model's VRAM footprint with minimal degradation in coding capability. To leave enough RAM for the IDE and other tools, use models sized around 14B to 32B parameters with 4-bit quantization (requiring 8GB to 20GB of RAM).
- The Mixed Loading Approach:
load-model.sh (Primary): Provide a dedicated shell script to automatically fetch and serve the right quantized model. This handles both native Ollama installations (checking port 11434 on the host) and Docker installations seamlessly.
- Docker Profile (Secondary): For Linux/Windows users who want an "everything in Docker" setup, provide an optional
docker-compose.yml profile with an Ollama container and an ollama-init sidecar that auto-pulls the model when the container boots.
- LiteLLM proxy remains in Docker for everyone, pointing to
http://host.docker.internal:11434.
-
Token Logging and Cost Tracking:
- Configure LiteLLM's built-in observability to log requests locally (e.g., to a local SQLite DB or simple dashboard) so developers can track their token usage and estimate cost savings compared to the paid API.
Feature Description
Improve the developer tool setup by adding the option to use self-hosted OSS models (behind a proxy for easy swapping) instead of relying purely on Cursor's default models. This feature must be entirely optional for developers who wish to use it.
The output should be contained within a new subfolder under the
tools/directory (e.g.,tools/local-llm-proxy/).Why Is This Needed?
Requirements
.env) or similar configuration files (e.g.,.yaml).tools/directory.Acceptance Criteria (V1 MVP)
For the first version of this feature, the focus is purely on the MVP to provide basic, functional capabilities. Secondary or advanced options can be skipped for now.
docker-compose.ymlcontaining just LiteLLM to serve as the proxy.litellm-config.yamlis present, configuring a specific local model (e.g., pointing tohost.docker.internal:11434) and securing it via a static key..env.examplefile with dummy values for the proxy key and configurable external-facing ports, ensuring.envis.gitignore'd. At a minimum, the external facing ports for the proxy (e.g., port 4000) must be configurable via.envto avoid port conflicts on the developer's machine.load-model.shscript that pulls and serves the recommended 4-bit quantized model to the native macOS Ollama instance (target: MacBook Pro M2, 32GB RAM).README.mdinstructing the developer on how to useload-model.sh, start LiteLLM via Docker Compose, and configure Cursor to connect to it.(Note: Advanced features like Ngrok/Cloudflare tunneling, Uptime Kuma monitoring, Docker profiles with init sidecars for Ollama, and SQLite token logging are explicitly out of scope for V1 and can be added in later iterations).
Suggested Solutions
Use a setup involving LiteLLM (secured with a static API key), Ngrok (for tunneling, if needed), and optionally Uptime Kuma for monitoring.
1. Configuration (
tools/local-llm-proxy/litellm-config.yaml)Force requests from Cursor to provide a specific key.
2. Environment Variables (
.envortools/local-llm-proxy/.env)3. Docker Compose (
tools/local-llm-proxy/docker-compose.yml)Integrate LiteLLM, Ngrok, Uptime Kuma, and AutoKuma using environment variables for sensitive data.
How to use this in Cursor
tools/local-llm-proxyand run:docker compose up -d.curl http://localhost:4040/api/tunnelsto find your public Ngrok address.gpt-4o./v1.CURSOR_TO_LITELLM_KEY.Additional Suggestions for Robustness
To make this setup more developer-friendly and secure, we should also include the following enhancements to the
tools/local-llm-proxy/scope:Tunneling Alternatives:
Configuration Safety:
tools/local-llm-proxy/.env.examplefile with dummy values so developers know exactly what to configure.tools/local-llm-proxy/.envis explicitly added to the project's.gitignoreto prevent leaking the proxy keys (CURSOR_TO_LITELLM_KEYandNGROK_AUTHTOKEN).Dedicated Documentation:
README.mdinsidetools/local-llm-proxy/detailing:host.docker.internalnuances on Linux).Model Loading Architecture & Best Performance (MacBook Pro M2):
load-model.sh(Primary): Provide a dedicated shell script to automatically fetch and serve the right quantized model. This handles both native Ollama installations (checking port 11434 on the host) and Docker installations seamlessly.docker-compose.ymlprofile with an Ollama container and anollama-initsidecar that auto-pulls the model when the container boots.http://host.docker.internal:11434.Token Logging and Cost Tracking: