19 commits
2aefbf6
Revert "chore: remove accidentally merged Telegram channel code"
gavrielc Mar 25, 2026
8463826
fix(telegram): support message_thread_id for topics
gavrielc Mar 25, 2026
85748ff
feat: download Telegram file attachments to group directory
Mar 10, 2026
1eebd6b
feat: pass Telegram reply/quoted message context to agent
leonalfredbot-ship-it Apr 2, 2026
21fe9e6
fix: persist reply context to DB and add tests
Apr 2, 2026
7907dad
feat(telegram): add voice message transcription via OpenAI Whisper
Apr 6, 2026
d12bab9
feat(transcription): use local whisper.cpp instead of OpenAI API
Apr 6, 2026
61bc375
chore: remove unused openai dependency
Apr 6, 2026
8c2c709
docs(skill): update use-local-whisper for Telegram + Linux support
Apr 6, 2026
985ac8c
docs(skill): update add-voice-transcription for Telegram + channel cl…
Apr 6, 2026
a75065a
Merge remote-tracking branch 'upstream/main'
Apr 8, 2026
a27ff4a
Merge branch 'qwibitai:main' into main
Saxin Apr 8, 2026
f8e5c48
fix: proactive OAuth token refresh to prevent inactivity 401s
Apr 8, 2026
c5a5b4c
feat(telegram): voice message transcription via local whisper.cpp
Apr 8, 2026
0afd4cd
feat: HA MCP server, mcp-remote in image, ha-direct-control skill, sy…
Apr 8, 2026
01bf2da
feat: add home-assistant-best-practices and skill-creator container s…
Apr 8, 2026
b57f8e4
style: apply prettier formatting to oauth-token and index
Apr 8, 2026
608ac1f
Merge remote-tracking branch 'origin/main'
Apr 8, 2026
5e61b28
fix(oauth): include client_id, scope, and rotating refresh token in t…
Apr 10, 2026
19 changes: 13 additions & 6 deletions .claude/skills/add-voice-transcription/SKILL.md
@@ -1,11 +1,17 @@
---
name: add-voice-transcription
description: Add voice message transcription to NanoClaw using OpenAI's Whisper API. Automatically transcribes WhatsApp voice notes so the agent can read and respond to them.
description: Add voice message transcription to NanoClaw using OpenAI's Whisper API. Automatically transcribes voice notes so the agent can read and respond to them. Supports Telegram and WhatsApp channels.
---

# Add Voice Transcription

This skill adds automatic voice message transcription to NanoClaw's WhatsApp channel using OpenAI's Whisper API. When a voice note arrives, it is downloaded, transcribed, and delivered to the agent as `[Voice: <transcript>]`.
This skill adds automatic voice message transcription to NanoClaw using OpenAI's Whisper API. When a voice note arrives, it is transcribed and delivered to the agent as `[Voice: <transcript>]`.

**Channel support:** Telegram and WhatsApp.
- **Telegram:** Built into the Telegram channel — no extra code changes needed if `src/transcription.ts` exists.
- **WhatsApp:** Requires the WhatsApp channel to be installed first (`skill/whatsapp` merged).

> **Prefer local transcription?** Use the `use-local-whisper` skill instead — no API key, no cost, fully on-device via whisper.cpp.

## Phase 1: Pre-flight

@@ -21,7 +27,9 @@ AskUserQuestion: Do you have an OpenAI API key for Whisper transcription?

If yes, collect it now. If no, direct them to create one at https://platform.openai.com/api-keys.

## Phase 2: Apply Code Changes
## Phase 2: Apply Code Changes (WhatsApp only)

Skip this phase if you are only setting up Telegram — `src/transcription.ts` already handles Telegram via `transcribeAudioBuffer(buffer, filename)`.

**Prerequisite:** WhatsApp must be installed first (`skill/whatsapp` merged). This skill modifies WhatsApp channel files.

@@ -49,7 +57,6 @@ git merge whatsapp/skill/voice-transcription || {
```

This merges in:
- `src/transcription.ts` (voice transcription module using OpenAI Whisper)
- Voice handling in `src/channels/whatsapp.ts` (isVoiceMessage check, transcribeAudioMessage call)
- Transcription tests in `src/channels/whatsapp.test.ts`
- `openai` npm dependency in `package.json`
@@ -105,7 +112,7 @@ The container reads environment from `data/env/env`, not `.env` directly.
```bash
npm run build
launchctl kickstart -k gui/$(id -u)/com.nanoclaw # macOS
# Linux: systemctl --user restart nanoclaw
# Linux: kill -TERM $(pgrep -f "nanoclaw/dist/index.js") # systemd restarts automatically
```

## Phase 4: Verify
@@ -114,7 +121,7 @@ launchctl kickstart -k gui/$(id -u)/com.nanoclaw # macOS

Tell the user:

> Send a voice note in any registered WhatsApp chat. The agent should receive it as `[Voice: <transcript>]` and respond to its content.
> Send a voice note in any registered chat. The agent should receive it as `[Voice: <transcript>]` and respond to its content.

### Check logs if needed

203 changes: 133 additions & 70 deletions .claude/skills/use-local-whisper/SKILL.md
@@ -1,152 +1,215 @@
---
name: use-local-whisper
description: Use when the user wants local voice transcription instead of OpenAI Whisper API. Switches to whisper.cpp running on Apple Silicon. WhatsApp only for now. Requires voice-transcription skill to be applied first.
description: Use when the user wants local voice transcription instead of OpenAI Whisper API. Switches to whisper.cpp running locally. Works for Telegram and WhatsApp channels. No API key, no network, no cost.
---

# Use Local Whisper

Switches voice transcription from OpenAI's Whisper API to local whisper.cpp. Runs entirely on-device — no API key, no network, no cost.

**Channel support:** Currently WhatsApp only. The transcription module (`src/transcription.ts`) uses Baileys types for audio download. Other channels (Telegram, Discord, etc.) would need their own audio-download logic before this skill can serve them.

**Note:** The Homebrew package is `whisper-cpp`, but the CLI binary it installs is `whisper-cli`.
**Channel support:** Telegram and WhatsApp. The transcription module (`src/transcription.ts`) exposes a generic `transcribeAudioBuffer(buffer, filename)` API — any channel that downloads audio can use it.

## Prerequisites

- `voice-transcription` skill must be applied first (WhatsApp channel)
- macOS with Apple Silicon (M1+) recommended
- `whisper-cpp` installed: `brew install whisper-cpp` (provides the `whisper-cli` binary)
- `ffmpeg` installed: `brew install ffmpeg`
- A GGML model file downloaded to `data/models/`
- `src/transcription.ts` must exist (created by the voice transcription feature)
- `whisper-cli` binary installed and in PATH
- `ffmpeg` installed
- A GGML model file at `data/models/ggml-base.bin` (or configured via `WHISPER_MODEL`)

## Phase 1: Pre-flight

### Check if already applied

Check if `src/transcription.ts` already uses `whisper-cli`:

```bash
grep 'whisper-cli' src/transcription.ts && echo "Already applied" || echo "Not applied"
```

If already applied, skip to Phase 3 (Verify).

### Check dependencies are installed
### Check dependencies

```bash
whisper-cli --help >/dev/null 2>&1 && echo "WHISPER_OK" || echo "WHISPER_MISSING"
ffmpeg -version >/dev/null 2>&1 && echo "FFMPEG_OK" || echo "FFMPEG_MISSING"
ls data/models/ggml-*.bin 2>/dev/null || echo "NO_MODEL"
```

If missing, install via Homebrew:
## Phase 2: Install Dependencies

### macOS (Apple Silicon)

```bash
brew install whisper-cpp ffmpeg
```

### Check for model file
The Homebrew package is `whisper-cpp` but the binary is `whisper-cli`.

```bash
ls data/models/ggml-*.bin 2>/dev/null || echo "NO_MODEL"
```
### Linux (Debian/Ubuntu)

If no model exists, download the base model (148MB, good balance of speed and accuracy):
```bash
mkdir -p data/models
curl -L -o data/models/ggml-base.bin "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin"
# System packages (git is needed for the clone below)
sudo apt-get install -y ffmpeg build-essential cmake git

# Build whisper.cpp from source
git clone https://github.com/ggml-org/whisper.cpp.git --depth=1 /tmp/whisper.cpp
cd /tmp/whisper.cpp
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)

# Install binary (adjust destination to a directory in PATH)
mkdir -p ~/.local/bin
cp build/bin/whisper-cli ~/.local/bin/whisper-cli
chmod +x ~/.local/bin/whisper-cli

For better accuracy at the cost of speed, use `ggml-small.bin` (466MB) or `ggml-medium.bin` (1.5GB).

## Phase 2: Apply Code Changes

### Ensure WhatsApp fork remote
### Download model

```bash
git remote -v
mkdir -p data/models
curl -L -o data/models/ggml-base.bin \
"https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin"
```

If `whatsapp` is missing, add it:

```bash
git remote add whatsapp https://github.com/qwibitai/nanoclaw-whatsapp.git
For better accuracy at the cost of speed: `ggml-small.bin` (466MB) or `ggml-medium.bin` (1.5GB).

## Phase 3: Apply Code Changes

Replace `src/transcription.ts` with the whisper.cpp implementation:

```typescript
import { execFile } from 'child_process';
import fs from 'fs';
import os from 'os';
import path from 'path';
import { promisify } from 'util';

import { logger } from './logger.js';

const execFileAsync = promisify(execFile);

const WHISPER_BIN = process.env.WHISPER_BIN || 'whisper-cli';
const WHISPER_MODEL =
process.env.WHISPER_MODEL ||
path.join(process.cwd(), 'data', 'models', 'ggml-base.bin');

export async function transcribeAudioBuffer(
buffer: Buffer,
filename: string,
): Promise<string | null> {
const tmpDir = os.tmpdir();
const id = `nanoclaw-voice-${Date.now()}`;
const ext = path.extname(filename) || '.ogg';
const tmpIn = path.join(tmpDir, `${id}${ext}`);
const tmpWav = path.join(tmpDir, `${id}.wav`);

try {
fs.writeFileSync(tmpIn, buffer);

await execFileAsync(
'ffmpeg',
['-i', tmpIn, '-ar', '16000', '-ac', '1', '-f', 'wav', '-y', tmpWav],
{ timeout: 30_000 },
);

const { stdout } = await execFileAsync(
WHISPER_BIN,
['-m', WHISPER_MODEL, '-f', tmpWav, '--no-timestamps', '-nt'],
{ timeout: 60_000 },
);

const transcript = stdout.trim();
if (!transcript) return null;

logger.info(
{ bin: WHISPER_BIN, model: WHISPER_MODEL, chars: transcript.length },
'whisper.cpp transcription complete',
);
return transcript;
} catch (err) {
logger.error({ err }, 'whisper.cpp transcription failed');
return null;
} finally {
for (const f of [tmpIn, tmpWav]) {
try { fs.unlinkSync(f); } catch { /* best-effort cleanup */ }
}
}
}
```
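How a channel might consume this API can be sketched as follows. Only the `transcribeAudioBuffer(buffer, filename)` signature above is taken from the module; `formatVoiceMessage` and the fallback string are illustrative, not code from the repo.

```typescript
// Sketch of a channel-side caller (hypothetical helper; assumes only
// the transcribeAudioBuffer(buffer, filename) signature shown above).
type Transcriber = (buffer: Buffer, filename: string) => Promise<string | null>;

async function formatVoiceMessage(
  buffer: Buffer,
  filename: string,
  transcribe: Transcriber,
): Promise<string> {
  const transcript = await transcribe(buffer, filename);
  // Null means transcription failed; fall back to a plain file reference
  // so the agent still sees that a voice message arrived.
  return transcript !== null
    ? `[Voice: ${transcript}]`
    : `[Voice message] (${filename})`;
}
```

A Telegram or WhatsApp handler would pass the downloaded audio buffer together with `transcribeAudioBuffer` and forward the returned string to the agent.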

### Merge the skill branch
Then build:

```bash
git fetch whatsapp skill/local-whisper
git merge whatsapp/skill/local-whisper || {
git checkout --theirs package-lock.json
git add package-lock.json
git merge --continue
}
npm run build
```

This modifies `src/transcription.ts` to use the `whisper-cli` binary instead of the OpenAI API.
## Phase 4: Configure PATH (if needed)

### Validate
The nanoclaw service may run with a restricted PATH. Verify `whisper-cli` is reachable:

```bash
npm run build
which whisper-cli
```

## Phase 3: Verify
If not found, set `WHISPER_BIN` in `.env` to the absolute path:

### Ensure launchd PATH includes Homebrew
```
WHISPER_BIN=/home/youruser/.local/bin/whisper-cli
```

The NanoClaw launchd service runs with a restricted PATH. `whisper-cli` and `ffmpeg` are in `/opt/homebrew/bin/` (Apple Silicon) or `/usr/local/bin/` (Intel), which may not be in the plist's PATH.
Sync to container environment:

Check the current PATH:
```bash
grep -A1 'PATH' ~/Library/LaunchAgents/com.nanoclaw.plist
mkdir -p data/env && cp .env data/env/env
```

If `/opt/homebrew/bin` is missing, add it to the `<string>` value inside the `PATH` key in the plist. Then reload:
**macOS launchd only:** If using launchd, add `/opt/homebrew/bin` to the PATH key in the plist, then reload:
```bash
launchctl unload ~/Library/LaunchAgents/com.nanoclaw.plist
launchctl load ~/Library/LaunchAgents/com.nanoclaw.plist
```

### Build and restart
## Phase 5: Build and Restart

```bash
npm run build
# Linux (systemd):
kill -TERM $(pgrep -f "nanoclaw/dist/index.js") # systemd Restart=always brings it back
# macOS (launchd):
launchctl kickstart -k gui/$(id -u)/com.nanoclaw
```

### Test

Send a voice note in any registered group. The agent should receive it as `[Voice: <transcript>]`.
## Phase 6: Verify

### Check logs
Send a voice message to any registered chat. The agent should receive it as `[Voice: <transcript>]`.

Check logs:
```bash
tail -f logs/nanoclaw.log | grep -i -E "voice|transcri|whisper"
```

Look for:
- `Transcribed voice message` — successful transcription
- `whisper.cpp transcription failed` — check model path, ffmpeg, or PATH
- `whisper.cpp transcription complete` — success
- `whisper.cpp transcription failed` — check PATH, model path, ffmpeg

## Configuration
## Troubleshooting

Environment variables (optional, set in `.env`):
**"whisper.cpp transcription failed"**
- Verify both `whisper-cli` and `ffmpeg` are in PATH (or set `WHISPER_BIN` in `.env`)
- Test manually:
```bash
ffmpeg -f lavfi -i anullsrc=r=16000:cl=mono -t 1 -f wav /tmp/test.wav -y
whisper-cli -m data/models/ggml-base.bin -f /tmp/test.wav --no-timestamps -nt
```

**Falls back to `[Voice message] (/path/to/file.oga)` instead of transcribing**
- Transcription returned null — check the above test
- Check `WHISPER_MODEL` path exists: `ls data/models/ggml-base.bin`

**Slow transcription**
- The base model processes ~30s of audio in <1s on Apple Silicon, ~5s on x86_64
- Use `ggml-small.bin` only if accuracy is insufficient — speed tradeoff

## Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `WHISPER_BIN` | `whisper-cli` | Path to whisper.cpp binary |
| `WHISPER_MODEL` | `data/models/ggml-base.bin` | Path to GGML model file |

## Troubleshooting

**"whisper.cpp transcription failed"**: Ensure both `whisper-cli` and `ffmpeg` are in PATH. The launchd service uses a restricted PATH — see Phase 3 above. Test manually:
```bash
ffmpeg -f lavfi -i anullsrc=r=16000:cl=mono -t 1 -f wav /tmp/test.wav -y
whisper-cli -m data/models/ggml-base.bin -f /tmp/test.wav --no-timestamps -nt
```

**Transcription works in dev but not as service**: The launchd plist PATH likely doesn't include `/opt/homebrew/bin`. See "Ensure launchd PATH includes Homebrew" in Phase 3.

**Slow transcription**: The base model processes ~30s of audio in <1s on M1+. If slower, check CPU usage — another process may be competing.

**Wrong language**: whisper.cpp auto-detects language. To force a language, you can set `WHISPER_LANG` and modify `src/transcription.ts` to pass `-l $WHISPER_LANG`.
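That change can be sketched as below — `buildWhisperArgs` is a hypothetical helper and `WHISPER_LANG` a hypothetical env var, neither in the repo today; `-l` is whisper-cli's language flag.

```typescript
// Sketch: building whisper-cli arguments with an optional forced language.
// buildWhisperArgs is illustrative; WHISPER_LANG would be read from the
// environment alongside WHISPER_BIN and WHISPER_MODEL.
function buildWhisperArgs(
  model: string,
  wavPath: string,
  lang?: string,
): string[] {
  const args = ['-m', model, '-f', wavPath, '--no-timestamps'];
  if (lang) {
    args.push('-l', lang); // e.g. 'en', 'de' — skips auto-detection
  }
  return args;
}
```

In `transcribeAudioBuffer`, the inline argument array passed to `execFileAsync` would then become `buildWhisperArgs(WHISPER_MODEL, tmpWav, process.env.WHISPER_LANG)`.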
2 changes: 1 addition & 1 deletion container/Dockerfile
@@ -31,7 +31,7 @@ ENV AGENT_BROWSER_EXECUTABLE_PATH=/usr/bin/chromium
ENV PLAYWRIGHT_CHROMIUM_EXECUTABLE_PATH=/usr/bin/chromium

# Install agent-browser and claude-code globally
RUN npm install -g agent-browser @anthropic-ai/claude-code
RUN npm install -g agent-browser @anthropic-ai/claude-code mcp-remote

# Create app directory
WORKDIR /app
8 changes: 8 additions & 0 deletions container/agent-runner/src/index.ts
@@ -484,6 +484,14 @@ async function runQuery(
NANOCLAW_IS_MAIN: containerInput.isMain ? '1' : '0',
},
},
'ha-mcp': {
command: 'npx',
args: ['-y', 'mcp-remote', 'http://host.docker.internal:9583/private_PWWE28FuDIflsITGNI9VDQ', '--allow-http'],
env: {
NO_PROXY: 'host.docker.internal',
no_proxy: 'host.docker.internal',
},
},
},
hooks: {
PreCompact: [