Local-first voice-to-text dictation app. Runs in the system tray, records audio via a global hotkey, transcribes with your choice of speech recognition engine, and pastes the result into the active application.
All processing happens on your machine — no data leaves your computer unless you choose a cloud provider. API keys are stored securely in the OS keychain, never in plaintext files.
- Menu bar app — lives in the system tray, no dock icon
- Global hotkey — push-to-talk or toggle mode, any key combination (modifier, combo, or standalone key)
- 6 ASR engines — Whisper, Canary, Parakeet-TDT, Qwen3-ASR, Voxtral (all local, GPU-accelerated), plus any OpenAI-compatible cloud API
- 24 cloud presets — OpenAI, Groq, Cerebras, Gemini, Mistral, Fireworks, Together, DeepSeek, OpenRouter, xAI, SambaNova, Nebius, Anthropic, Deepgram, GitHub Copilot, Gemini ASR, Rev.ai, AssemblyAI, ElevenLabs, Cohere, Gladia, Speechmatics, Azure Speech, Google Speech
- Text cleanup pipeline — VAD silence trimming, hallucination filter, dictation commands, punctuation (BERT/PCS), grammar correction (T5), or full LLM cleanup (local/cloud)
- Floating pill — real-time spectrum visualization, recording/transcribing states, cancel support
- Model manager — parallel downloads with progress, pause/resume, benchmarks
- Paginated history — backend-driven search, infinite scroll, processing badges
- Bilingual UI — French and English
| Engine | Params | Size | Languages | GPU | Best WER | RTF |
|---|---|---|---|---|---|---|
| Whisper | 39M–1.55B | 75 MB–3.1 GB | 99 | Metal | 1.5% | 0.05–0.50 |
| Canary | 182M | 213 MB | 4 (FR/EN/DE/ES) | CoreML | 1.87% | 0.15 |
| Parakeet-TDT | 600M | 703 MB | 25 European | CoreML | 1.5% | 0.10 |
| Qwen3-ASR | 600M | 1.88 GB | 30 | Accelerate/AMX | 2.0% | 0.15 |
| Voxtral | 4.4B | 8.9 GB | 13 | Metal | 8.7% | 0.40 |
| Cloud (OpenAI API) | — | — | depends on provider | — | — | — |
Recommendations: Parakeet-TDT for best accuracy (1.5% WER, 25 languages). Whisper V3 Turbo for best balance (2.1% WER, 99 languages, 0.25 RTF). Whisper Tiny for speed (75 MB, 0.05 RTF).
| Engine | Params | Size | Languages | Capitalization | Speed |
|---|---|---|---|---|---|
| BERT Fullstop Large (ort) | 560M | 562 MB | 4 (FR/EN/DE/IT) | No | ~100ms |
| BERT Fullstop Base (Candle) | 280M | 1.1 GB | 5 (+ NL) | No | ~80ms |
| PCS 47 Languages (ort) | 230M | 233 MB | 47 | Yes (4 heads) | ~50ms |
Recommendation: PCS — smaller, faster, 47 languages, native capitalization.
| Engine | Params | Size | Languages | Speed |
|---|---|---|---|---|
| GEC T5 Small | 60M | 96 MB | 11 (multilingual) | ~200ms |
| T5 Spell FR | 220M | 276 MB | FR | ~500ms |
| FlanEC Base | 250M | 276 MB | EN | ~500ms |
| FlanEC Large | 800M | 821 MB | EN | ~1s |
All run via ONNX Runtime with CoreML, using autoregressive decoding with a repeat penalty and n-gram blocking.
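The two decoding constraints can be sketched as plain logit transforms. This is an illustrative std-only sketch, not the app's actual decoder; the token IDs and penalty value are made up for the example.

```rust
// Repeat penalty: tokens already generated get their logit reduced
// (applied once per occurrence in this sketch).
fn apply_repeat_penalty(logits: &mut [f32], generated: &[usize], penalty: f32) {
    for &tok in generated {
        if logits[tok] > 0.0 {
            logits[tok] /= penalty;
        } else {
            logits[tok] *= penalty;
        }
    }
}

// N-gram blocking: forbid any token that would recreate an n-gram
// already present in the generated sequence.
fn block_repeated_ngrams(logits: &mut [f32], generated: &[usize], n: usize) {
    if generated.len() + 1 < n {
        return;
    }
    let prefix = &generated[generated.len() - (n - 1)..];
    for start in 0..=generated.len().saturating_sub(n) {
        if &generated[start..start + n - 1] == prefix {
            logits[generated[start + n - 1]] = f32::NEG_INFINITY;
        }
    }
}

fn main() {
    let mut logits = vec![1.0_f32; 8];
    let generated = vec![3usize, 5, 3, 5];
    apply_repeat_penalty(&mut logits, &generated, 1.5);
    block_repeated_ngrams(&mut logits, &generated, 2); // bigram blocking
    // Token 3 would recreate the bigram (5, 3), so it is now blocked.
    println!("{:?}", logits);
}
```

Both transforms run on the logits before the argmax of each decoding step, which is what keeps small seq2seq correction models from looping on a phrase.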
| Model | Params | Size | Languages |
|---|---|---|---|
| Qwen3 0.6B | 0.6B | 484 MB | FR/EN/ES/DE |
| Gemma 3 1B | 1.0B | 806 MB | FR/EN/ES/DE |
| Qwen3 1.7B | 1.7B | 1.28 GB | FR/EN/ES/DE |
| Ministral 3B | 3.0B | 2.15 GB | FR/EN/ES/DE |
| Qwen3 4B | 4.0B | 2.50 GB | FR/EN/ES/DE |
All models are GGUF Q4-quantized and run via llama.cpp with Metal GPU acceleration. 11 models are available in total.
Mic (cpal) → WAV 16 kHz → VAD (Silero v6.2) → Trim silence → ASR
| Stage | Status | Description |
|---|---|---|
| VAD (Silero) | Done | Discards silent recordings, trims leading/trailing silence |
| Denoising | Planned | Hybrid approach: denoised for VAD boundaries, original for ASR |
| Device presets | Planned | Per-mic gain, noise gate, normalization |
See docs/AUDIO-PIPELINE.md for the full architecture.
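The trim step above can be approximated with a frame-level energy gate. Silero VAD does the real speech-boundary detection in the app; this std-only sketch stands in for it with a fixed threshold, and the frame size and threshold values are illustrative.

```rust
const FRAME: usize = 320; // 20 ms at 16 kHz mono

// A frame counts as speech if its mean energy exceeds the threshold.
fn frame_is_speech(frame: &[f32], threshold: f32) -> bool {
    let energy: f32 = frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32;
    energy > threshold
}

// Keep only the span from the first speech frame to the last one;
// an all-silent recording is discarded entirely.
fn trim_silence(samples: &[f32], threshold: f32) -> &[f32] {
    let frames: Vec<bool> = samples
        .chunks(FRAME)
        .map(|f| frame_is_speech(f, threshold))
        .collect();
    match (
        frames.iter().position(|&s| s),
        frames.iter().rposition(|&s| s),
    ) {
        (Some(a), Some(b)) => &samples[a * FRAME..((b + 1) * FRAME).min(samples.len())],
        _ => &[], // no speech at all
    }
}

fn main() {
    // 100 ms silence, 100 ms "speech", 100 ms silence
    let mut samples = vec![0.0_f32; 1600];
    samples.extend(std::iter::repeat(0.5).take(1600));
    samples.extend(std::iter::repeat(0.0).take(1600));
    let trimmed = trim_silence(&samples, 1e-4);
    println!("kept {} of {} samples", trimmed.len(), samples.len());
}
```

Trimming before ASR both shortens inference and removes the silent stretches where hallucination-prone models tend to invent text.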
ASR raw → Hallucination filter → Dictation commands → Disfluency removal → Punctuation → Spell-check → Correction/LLM → Finalize → ITN → Paste
| Stage | Status | Description |
|---|---|---|
| Hallucination filter | Done | 30+ regex patterns (FR/EN) |
| Dictation commands | Done | Voice commands → punctuation ("virgule" → ",") |
| Disfluency removal | Done | Strip fillers (euh, uh, um) — regex-based, FR/EN |
| Punctuation | Done | BERT or PCS token classification |
| Correction | Done | T5 encoder-decoder (grammar, spelling) |
| LLM cleanup | Done | Local (llama.cpp) or cloud (OpenAI/Anthropic) |
| ITN | Done | Inverse text normalization — numbers, ordinals, %, hours, currencies, units (FR/EN) |
| Finalize | Done | Spacing, capitalization |
See docs/TEXT-PIPELINE.md for the full architecture.
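The dictation-command stage can be sketched as a token-by-token substitution pass. The word list here is a small illustrative subset, not the app's full FR/EN command table, and the spacing rules are simplified.

```rust
// Replace spoken punctuation words with symbols, attaching each symbol
// to the preceding word without a leading space.
fn apply_dictation_commands(text: &str) -> String {
    let mut out = String::new();
    for word in text.split_whitespace() {
        let replacement = match word.to_lowercase().as_str() {
            "virgule" | "comma" => Some(","),  // FR / EN
            "point" | "period" => Some("."),   // FR / EN
            _ => None,
        };
        match replacement {
            Some(p) => out.push_str(p),
            None => {
                if !out.is_empty() {
                    out.push(' ');
                }
                out.push_str(word);
            }
        }
    }
    out
}

fn main() {
    println!("{}", apply_dictation_commands("bonjour virgule ça va point"));
    // → "bonjour, ça va."
}
```

Running this stage before the punctuation model means explicitly dictated punctuation always wins over predicted punctuation.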
- macOS 14.0+ (Apple Silicon)
- Rust (stable)
- Node.js 24+
- Xcode Command Line Tools (`xcode-select --install`)
```sh
./build.sh
open build/JonaWhisper.app
```

The build script installs npm dependencies automatically, then produces build/JonaWhisper.app and build/JonaWhisper.dmg. Tauri handles code signing automatically: if a Developer certificate is available (via APPLE_SIGNING_IDENTITY), the app is signed with hardened runtime and entitlements. Notarization is supported via the APPLE_ID, APPLE_PASSWORD, and APPLE_TEAM_ID environment variables.
For a debug build:

```sh
./build.sh debug
```

For development with hot reload:

```sh
npm install
npm run tauri dev
```

This starts the Vite dev server with hot reload for the frontend. Rust changes trigger a rebuild automatically. Note: npm install is only needed once for dev mode; build.sh handles it automatically for production builds.
On first launch, a setup wizard asks for three macOS permissions:
| Permission | Used for | macOS API |
|---|---|---|
| Microphone | Audio recording | AVCaptureDevice authorization |
| Accessibility | Paste simulation (Cmd+V via CGEvent) | AXIsProcessTrusted |
| Input Monitoring | Global hotkey detection (CGEvent tap) | TCC ListenEvent |
- Launch the app — it appears as a menu bar icon
- Press and hold the hotkey (default: Right Command) to record
- Release to transcribe — the text is pasted into the active app
- In toggle mode, press once to start, press again to stop
Cancel: Press Escape at any time to cancel recording or transcription.
Settings: Open from the tray menu to configure language, ASR model, text cleanup, hotkey, microphone, and cloud providers.
Models are downloaded and managed from within the app. All models are stored in ~/Library/Application Support/JonaWhisper/models/.
| Layer | Technologies |
|---|---|
| Framework | Tauri 2 |
| Backend | Rust |
| Frontend | Vue 3, TypeScript, Pinia, Tailwind CSS, shadcn-vue |
| Audio | cpal + hound (recording), rustfft (spectrum), CoreAudio FFI (ducking) |
| ASR | whisper-rs (Metal), ort + CoreML (Canary, Parakeet), qwen-asr (AMX), voxtral.c (Metal) |
| Text cleanup | ort + CoreML (BERT, PCS, T5 correction), candle (BERT Metal), llama-cpp-2 (local LLM Metal) |
| Icons | SDF (Signed Distance Field) in Rust, RGBA bitmaps — zero image dependencies |
| Hotkey | Raw CGEvent tap (CoreGraphics FFI) |
| Permissions | objc2 (AVFoundation, CoreGraphics, ApplicationServices) |
| i18n | vue-i18n (frontend), rust-i18n (backend) |
See docs/DEPENDENCIES.md for the full dependency list with rationale for each choice.
JonaWhisper follows a thin-orchestrator design. The main Tauri crate (src-tauri/src/) contains no engine logic or model definitions; it orchestrates infrastructure crates and independent engine crates via dynamic dispatch.
Workspace crates (26 total):
| Crate | Role |
|---|---|
| `jona-types` | Shared types (`ASREngine` trait, `ASRModel`, `EngineError`, `Preferences`, `Provider`) |
| `jona-engines` | Infrastructure: `EngineCatalog`, downloader, ort session builder, mel features |
| `jona-platform` | OS-specific code: hotkey (CGEvent tap), permissions, paste, audio devices, ducking |
| `jona-provider` + `jona-provider-*` (13) | Cloud provider backends (OpenAI-compatible, Anthropic, Deepgram, Copilot, Gemini ASR, Rev.ai, AssemblyAI, ElevenLabs, Cohere, Gladia, Speechmatics, Azure Speech, Google Speech) |
| `jona-engine-*` (11) | One crate per engine: whisper, canary, parakeet, qwen, voxtral, llama, bert, pcs, correction, spellcheck, lm |
Plug-and-play engines: each engine crate registers itself via inventory::submit! at link time. The main crate calls EngineCatalog::init_auto() at startup — no hardcoded engine list, no re-exports. Adding an engine = add a crate + cargo dependency, zero changes to the orchestrator.
Dynamic dispatch: all ASR and cleanup operations go through the ASREngine trait. The orchestrator resolves models by ID, obtains the engine, and calls transcribe() or cleanup() — it never knows which engine it is talking to.
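The registry-plus-trait-object pattern can be reduced to a std-only sketch. In the real app, engine crates register themselves via inventory::submit! at link time; here registration is explicit so the example stays self-contained, and the names (`AsrEngine`, `Catalog`, `EchoEngine`) are illustrative rather than the actual API.

```rust
use std::collections::HashMap;

// The trait every engine crate implements.
trait AsrEngine {
    fn id(&self) -> &'static str;
    fn transcribe(&self, audio: &[f32]) -> String;
}

// A stand-in engine for the example.
struct EchoEngine;
impl AsrEngine for EchoEngine {
    fn id(&self) -> &'static str { "echo" }
    fn transcribe(&self, audio: &[f32]) -> String {
        format!("{} samples transcribed", audio.len())
    }
}

// The orchestrator owns only boxed trait objects keyed by engine id.
struct Catalog {
    engines: HashMap<&'static str, Box<dyn AsrEngine>>,
}

impl Catalog {
    fn new() -> Self {
        Catalog { engines: HashMap::new() }
    }
    fn register(&mut self, engine: Box<dyn AsrEngine>) {
        self.engines.insert(engine.id(), engine);
    }
    // Resolve by id and dispatch dynamically: the caller never
    // learns the concrete engine type.
    fn transcribe(&self, engine_id: &str, audio: &[f32]) -> Option<String> {
        self.engines.get(engine_id).map(|e| e.transcribe(audio))
    }
}

fn main() {
    let mut catalog = Catalog::new();
    catalog.register(Box::new(EchoEngine));
    println!("{:?}", catalog.transcribe("echo", &[0.0; 160]));
}
```

The inventory crate removes the explicit `register` calls: each engine crate submits its constructor at link time and the catalog collects them at startup, which is what makes adding an engine a dependency-only change.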
```
src/                   Vue frontend
  views/               Pages (Panel, SetupWizard)
  sections/            Settings sections (10 sections)
  components/          UI components
  stores/              Pinia stores (app, history, settings, engines, downloads)
  utils/               Shared utilities (shortcut, formatting)
  i18n/                Translations (en.json, fr.json)
src-tauri/             Rust backend (thin orchestrator)
  src/
    lib.rs             Tauri setup & app lifecycle (10 modules)
    state.rs           App state & persistent preferences
    recording/         Recording lifecycle, transcription pipeline, thread spawning
    commands/          Tauri IPC commands (7 sub-modules: audio, engines, history, ...)
    audio.rs           cpal recording & FFT
    cleanup/           Text cleanup (bert, candle, pcs, t5, vad, llm, post_processor)
    platform/          macOS-specific (hotkey, permissions, paste, audio devices)
    ui/                Native UI (tray, pill overlay, SDF icons)
  voxtral-c/           Vendored voxtral.c sources (C + Metal)
crates/                Workspace crates (types, engines, platform, 13 provider crates, 11 engine crates)
docs/                  Technical documentation
  AUDIO-PIPELINE.md    Audio preprocessing architecture & roadmap
  TEXT-PIPELINE.md     Text postprocessing architecture & roadmap
  BENCHMARK.md         Full benchmark data (ASR, LLM, punctuation, correction)
build.sh               Build + package script (signing handled by Tauri)
```
See ARCHITECTURE.md for the full architecture guide with data flows, threading model, and module responsibilities.
See CONTRIBUTING.md for commit conventions, development setup, and how to submit PRs.
This project is licensed under the GNU General Public License v3.0 or later.