[Feature]: Optional local LLM setup for coding agents via LiteLLM proxy

### Feature Description
Improve the developer tool setup by adding the option to use self-hosted OSS models (behind a proxy for easy swapping) instead of relying purely on Cursor's default models. This feature must be entirely optional for developers who wish to use it.

The output should be contained within a new subfolder under the `tools/` directory (e.g., `tools/local-llm-proxy/`).

### Why Is This Needed?
- **Reduce costs**: Self-hosted models can lower API usage expenses.
- **Improve resiliency**: Less dependence on external service uptime.
- **Configurable**: Easier to replace tools and models without changing the IDE setup.

### Requirements
1. **Configurability**: All important configuration settings must be configurable via environment variables (e.g., `.env`) or similar configuration files (e.g., `.yaml`).
2. **Optional**: The setup should not interfere with the default workflow for developers who do not want to use it.
3. **Directory Structure**: All related files (Docker Compose, config files) should be placed in a dedicated subfolder under the `tools/` directory.

### Acceptance Criteria (V1 MVP)
For the first version of this feature, the focus is purely on the MVP to provide basic, functional capabilities. Secondary or advanced options can be skipped for now.
1. **Basic Proxy Setup**: A functional `docker-compose.yml` containing just LiteLLM to serve as the proxy.
2. **Configuration**: `litellm-config.yaml` is present, configuring a specific local model (e.g., pointing to `host.docker.internal:11434`) and securing it via a static key.
3. **Environment Setup**: A `.env.example` file with dummy values for the proxy key and configurable external-facing ports, ensuring `.env` is `.gitignore`'d. At a minimum, the external facing ports for the proxy (e.g., port 4000) must be configurable via `.env` to avoid port conflicts on the developer's machine.
4. **Primary Model Loader**: A functional `load-model.sh` script that pulls and serves the recommended 4-bit quantized model to the native macOS Ollama instance (target: MacBook Pro M2, 32GB RAM).
5. **Basic Documentation**: A simple `README.md` instructing the developer on how to use `load-model.sh`, start LiteLLM via Docker Compose, and configure Cursor to connect to it.

*(Note: Advanced features like Ngrok/Cloudflare tunneling, Uptime Kuma monitoring, Docker profiles with init sidecars for Ollama, and SQLite token logging are explicitly out of scope for V1 and can be added in later iterations).*

### Suggested Solutions
Use a setup involving LiteLLM (secured with a static API key), Ngrok (for tunneling, if needed), and optionally Uptime Kuma for monitoring. 

#### 1. Configuration (`tools/local-llm-proxy/litellm-config.yaml`)
Force requests from Cursor to provide a specific key.
```yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/qwen3-coder-30b-a3b
      api_base: http://host.docker.internal
      api_key: "not-used" # Internal key for local model
      general_settings:
  master_key: ${CURSOR_TO_LITELLM_KEY} # The key Cursor must send
```

#### 2. Environment Variables (`.env` or `tools/local-llm-proxy/.env`)
```dotenv
# Proxy/Tunnel configuration
NGROK_AUTHTOKEN=your_ngrok_token
KUMA_USER=admin
KUMA_PASS=your_kuma_password

# Ports Configuration
LITELLM_PORT=4000
KUMA_PORT=3001

# Create a unique key for Cursor to use
CURSOR_TO_LITELLM_KEY=sk-local-agent-secure-12345
```

#### 3. Docker Compose (`tools/local-llm-proxy/docker-compose.yml`)
Integrate LiteLLM, Ngrok, Uptime Kuma, and AutoKuma using environment variables for sensitive data.

```yaml
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    container_name: litellm-proxy
    volumes:
      - ./litellm-config.yaml:/app/config.yaml
    ports:
      - "${LITELLM_PORT:-4000}:4000"
    environment:
      - CURSOR_TO_LITELLM_KEY=${CURSOR_TO_LITELLM_KEY}
    command: ["--config", "/app/config.yaml", "--port", "4000"]
    extra_hosts:
      - "host.docker.internal:host-gateway"
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:4000/health/readiness || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 5
    labels:
      kuma.litellm.http.name: "LiteLLM Proxy"
      kuma.litellm.http.url: "http://litellm:4000/health/readiness"

  ngrok:
    image: ngrok/ngrok:latest
    container_name: ngrok-tunnel
    environment:
      - NGROK_AUTHTOKEN=${NGROK_AUTHTOKEN}
    command: ["http", "litellm:4000"]
    depends_on:
      litellm:
        condition: service_healthy
    labels:
      kuma.ngrok.http.name: "Ngrok Tunnel"
      kuma.ngrok.http.url: "http://ngrok:4040/api/tunnels"

  uptime-kuma:
    image: louislam/uptime-kuma:1
    container_name: uptime-kuma
    volumes:
      - ./uptime-kuma-data:/app/data
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "127.0.0.1:${KUMA_PORT:-3001}:3001"
    restart: always

  autokuma:
    image: ghcr.io/bigboot/autokuma:latest
    container_name: autokuma-sidecar
    environment:
      - AUTOKUMA__KUMA__URL=http://uptime-kuma:3001
      - AUTOKUMA__KUMA__USERNAME=${KUMA_USER}
      - AUTOKUMA__KUMA__PASSWORD=${KUMA_PASS}
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on:
      - uptime-kuma
```

#### How to use this in Cursor
1. Navigate to `tools/local-llm-proxy` and run: `docker compose up -d`.
2. Get URL: Run `curl http://localhost:4040/api/tunnels` to find your public Ngrok address.
3. In Cursor Settings:
   - Add model: `gpt-4o`.
   - Override OpenAI Base URL: Your Proxy URL + `/v1`.
   - API Key: Use the value of `CURSOR_TO_LITELLM_KEY`.

---

### Additional Suggestions for Robustness

To make this setup more developer-friendly and secure, we should also include the following enhancements to the `tools/local-llm-proxy/` scope:

1. **Tunneling Alternatives**:
   - While Ngrok is a good start, the free tier rotates the public URL on restart. We should document or provide alternative configurations for **Cloudflare Tunnels** (persistent free URLs) or **Tailscale** (mesh VPN for secure, local-only access without exposing to the public internet).

2. **Configuration Safety**:
   - Create a `tools/local-llm-proxy/.env.example` file with dummy values so developers know exactly what to configure.
   - Ensure `tools/local-llm-proxy/.env` is explicitly added to the project's `.gitignore` to prevent leaking the proxy keys (`CURSOR_TO_LITELLM_KEY` and `NGROK_AUTHTOKEN`).

3. **Dedicated Documentation**:
   - Add a `README.md` inside `tools/local-llm-proxy/` detailing:
     - Prerequisites (Docker, local model runners like Ollama/LM Studio).
     - Step-by-step setup instructions.
     - OS-specific troubleshooting (e.g., `host.docker.internal` nuances on Linux).

4. **Model Loading Architecture & Best Performance (MacBook Pro M2)**:
   - **Important Hardware Note:** For developers using a **MacBook Pro M2 with 32 GB of RAM**, the absolute best performance is achieved by running the **native macOS Ollama app** rather than running Ollama inside Docker. This gives the model direct, un-virtualized access to Apple's Metal API and Unified Memory.
   - **Quantization Impact:** Using 4-bit or 8-bit quantized models significantly reduces the model's VRAM footprint with minimal degradation in coding capability. To leave enough RAM for the IDE and other tools, use models sized around **14B to 32B parameters** with **4-bit quantization** (requiring 8GB to 20GB of RAM).
   - **The Mixed Loading Approach**:
     1. **`load-model.sh` (Primary)**: Provide a dedicated shell script to automatically fetch and serve the right quantized model. This handles both native Ollama installations (checking port 11434 on the host) and Docker installations seamlessly.
     2. **Docker Profile (Secondary)**: For Linux/Windows users who want an "everything in Docker" setup, provide an optional `docker-compose.yml` profile with an Ollama container and an `ollama-init` sidecar that auto-pulls the model when the container boots.
     3. LiteLLM proxy remains in Docker for everyone, pointing to `http://host.docker.internal:11434`.

5. **Token Logging and Cost Tracking**:
   - Configure LiteLLM's built-in observability to log requests locally (e.g., to a local SQLite DB or simple dashboard) so developers can track their token usage and estimate cost savings compared to the paid API.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature]: Optional local LLM setup for coding agents via LiteLLM proxy #61

Feature Description

Why Is This Needed?

Requirements

Acceptance Criteria (V1 MVP)

Suggested Solutions

1. Configuration (`tools/local-llm-proxy/litellm-config.yaml`)

2. Environment Variables (`.env` or `tools/local-llm-proxy/.env`)

3. Docker Compose (`tools/local-llm-proxy/docker-compose.yml`)

How to use this in Cursor

Additional Suggestions for Robustness

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

[Feature]: Optional local LLM setup for coding agents via LiteLLM proxy #61

Description

Feature Description

Why Is This Needed?

Requirements

Acceptance Criteria (V1 MVP)

Suggested Solutions

1. Configuration (tools/local-llm-proxy/litellm-config.yaml)

2. Environment Variables (.env or tools/local-llm-proxy/.env)

3. Docker Compose (tools/local-llm-proxy/docker-compose.yml)

How to use this in Cursor

Additional Suggestions for Robustness

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions

1. Configuration (`tools/local-llm-proxy/litellm-config.yaml`)

2. Environment Variables (`.env` or `tools/local-llm-proxy/.env`)

3. Docker Compose (`tools/local-llm-proxy/docker-compose.yml`)