Skip to content

[Feature]: Optional local LLM setup for coding agents via LiteLLM proxy #61

@se02035

Description

@se02035

Feature Description

Improve the developer tool setup by adding the option to use self-hosted OSS models (behind a proxy for easy swapping) instead of relying purely on Cursor's default models. This feature must be entirely optional for developers who wish to use it.

The output should be contained within a new subfolder under the tools/ directory (e.g., tools/local-llm-proxy/).

Why Is This Needed?

  • Reduce costs: Self-hosted models can lower API usage expenses.
  • Improve resiliency: Less dependence on external service uptime.
  • Configurable: Easier to replace tools and models without changing the IDE setup.

Requirements

  1. Configurability: All important configuration settings must be configurable via environment variables (e.g., .env) or similar configuration files (e.g., .yaml).
  2. Optional: The setup should not interfere with the default workflow for developers who do not want to use it.
  3. Directory Structure: All related files (Docker Compose, config files) should be placed in a dedicated subfolder under the tools/ directory.

Acceptance Criteria (V1 MVP)

For the first version of this feature, the focus is purely on the MVP to provide basic, functional capabilities. Secondary or advanced options can be skipped for now.

  1. Basic Proxy Setup: A functional docker-compose.yml containing just LiteLLM to serve as the proxy.
  2. Configuration: litellm-config.yaml is present, configuring a specific local model (e.g., pointing to host.docker.internal:11434) and securing it via a static key.
  3. Environment Setup: A .env.example file with dummy values for the proxy key and configurable external-facing ports, ensuring .env is .gitignore'd. At a minimum, the external facing ports for the proxy (e.g., port 4000) must be configurable via .env to avoid port conflicts on the developer's machine.
  4. Primary Model Loader: A functional load-model.sh script that pulls and serves the recommended 4-bit quantized model to the native macOS Ollama instance (target: MacBook Pro M2, 32GB RAM).
  5. Basic Documentation: A simple README.md instructing the developer on how to use load-model.sh, start LiteLLM via Docker Compose, and configure Cursor to connect to it.

(Note: Advanced features like Ngrok/Cloudflare tunneling, Uptime Kuma monitoring, Docker profiles with init sidecars for Ollama, and SQLite token logging are explicitly out of scope for V1 and can be added in later iterations).

Suggested Solutions

Use a setup involving LiteLLM (secured with a static API key), Ngrok (for tunneling, if needed), and optionally Uptime Kuma for monitoring.

1. Configuration (tools/local-llm-proxy/litellm-config.yaml)

Force requests from Cursor to provide a specific key.

model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/qwen3-coder-30b-a3b
      api_base: http://host.docker.internal
      api_key: "not-used" # Internal key for local model
      general_settings:
  master_key: ${CURSOR_TO_LITELLM_KEY} # The key Cursor must send

2. Environment Variables (.env or tools/local-llm-proxy/.env)

# Proxy/Tunnel configuration
NGROK_AUTHTOKEN=your_ngrok_token
KUMA_USER=admin
KUMA_PASS=your_kuma_password

# Ports Configuration
LITELLM_PORT=4000
KUMA_PORT=3001

# Create a unique key for Cursor to use
CURSOR_TO_LITELLM_KEY=sk-local-agent-secure-12345

3. Docker Compose (tools/local-llm-proxy/docker-compose.yml)

Integrate LiteLLM, Ngrok, Uptime Kuma, and AutoKuma using environment variables for sensitive data.

services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    ports:
      - "${LITELLM_PORT:-4000}:4000"
    environment:
      - CURSOR_TO_LITELLM_KEY=${CURSOR_TO_LITELLM_KEY}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    extra_hosts:
      - "host.docker.internal:host-gateway"
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:4000/health/readiness || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5
    labels:
      kuma.litellm.http.name: "LiteLLM Proxy"
      kuma.litellm.http.url: "http://litellm:4000/health/readiness"

  ngrok:
    image: ngrok/ngrok:latest
    container_name: ngrok-tunnel
    environment:
      - NGROK_AUTHTOKEN=${NGROK_AUTHTOKEN}
    command: ["http", "litellm:4000"]
    depends_on:
      litellm:
        condition: service_healthy
    labels:
      kuma.ngrok.http.name: "Ngrok Tunnel"
      kuma.ngrok.http.url: "http://ngrok:4040/api/tunnels"

  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: uptime-kuma
    volumes:
      - ./uptime-kuma-data:/app/data
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "127.0.0.1:${KUMA_PORT:-3001}:3001"
    restart: always

  autokuma:
    image: ghcr.io/bigboot/autokuma:latest
    container_name: autokuma-sidecar
    environment:
      - AUTOKUMA__KUMA__URL=http://uptime-kuma:3001
      - AUTOKUMA__KUMA__USERNAME=${KUMA_USER}
      - AUTOKUMA__KUMA__PASSWORD=${KUMA_PASS}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on:
      - uptime-kuma

How to use this in Cursor

  1. Navigate to tools/local-llm-proxy and run: docker compose up -d.
  2. Get URL: Run curl http://localhost:4040/api/tunnels to find your public Ngrok address.
  3. In Cursor Settings:
    • Add model: gpt-4o.
    • Override OpenAI Base URL: Your Proxy URL + /v1.
    • API Key: Use the value of CURSOR_TO_LITELLM_KEY.

Additional Suggestions for Robustness

To make this setup more developer-friendly and secure, we should also include the following enhancements to the tools/local-llm-proxy/ scope:

  1. Tunneling Alternatives:

    • While Ngrok is a good start, the free tier rotates the public URL on restart. We should document or provide alternative configurations for Cloudflare Tunnels (persistent free URLs) or Tailscale (mesh VPN for secure, local-only access without exposing to the public internet).
  2. Configuration Safety:

    • Create a tools/local-llm-proxy/.env.example file with dummy values so developers know exactly what to configure.
    • Ensure tools/local-llm-proxy/.env is explicitly added to the project's .gitignore to prevent leaking the proxy keys (CURSOR_TO_LITELLM_KEY and NGROK_AUTHTOKEN).
  3. Dedicated Documentation:

    • Add a README.md inside tools/local-llm-proxy/ detailing:
      • Prerequisites (Docker, local model runners like Ollama/LM Studio).
      • Step-by-step setup instructions.
      • OS-specific troubleshooting (e.g., host.docker.internal nuances on Linux).
  4. Model Loading Architecture & Best Performance (MacBook Pro M2):

    • Important Hardware Note: For developers using a MacBook Pro M2 with 32 GB of RAM, the absolute best performance is achieved by running the native macOS Ollama app rather than running Ollama inside Docker. This gives the model direct, un-virtualized access to Apple's Metal API and Unified Memory.
    • Quantization Impact: Using 4-bit or 8-bit quantized models significantly reduces the model's VRAM footprint with minimal degradation in coding capability. To leave enough RAM for the IDE and other tools, use models sized around 14B to 32B parameters with 4-bit quantization (requiring 8GB to 20GB of RAM).
    • The Mixed Loading Approach:
      1. load-model.sh (Primary): Provide a dedicated shell script to automatically fetch and serve the right quantized model. This handles both native Ollama installations (checking port 11434 on the host) and Docker installations seamlessly.
      2. Docker Profile (Secondary): For Linux/Windows users who want an "everything in Docker" setup, provide an optional docker-compose.yml profile with an Ollama container and an ollama-init sidecar that auto-pulls the model when the container boots.
      3. LiteLLM proxy remains in Docker for everyone, pointing to http://host.docker.internal:11434.
  5. Token Logging and Cost Tracking:

    • Configure LiteLLM's built-in observability to log requests locally (e.g., to a local SQLite DB or simple dashboard) so developers can track their token usage and estimate cost savings compared to the paid API.

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions