Skip to content

JonaWhisper/jonawhisper

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

887 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

JonaWhisper

Local-first voice-to-text dictation app. Runs in the system tray, records audio via a global hotkey, transcribes with your choice of speech recognition engine, and pastes the result into the active application.

All processing happens on your machine — no data leaves your computer unless you choose a cloud provider. API keys are stored securely in the OS keychain, never in plaintext files.

CI

Features

  • Menu bar app — lives in the system tray, no dock icon
  • Global hotkey — push-to-talk or toggle mode, any key combination (modifier, combo, or standalone key)
  • 6 ASR engines — Whisper, Canary, Parakeet-TDT, Qwen3-ASR, Voxtral (all local, GPU-accelerated), plus any OpenAI-compatible cloud API
  • 24 cloud presets — OpenAI, Groq, Cerebras, Gemini, Mistral, Fireworks, Together, DeepSeek, OpenRouter, xAI, SambaNova, Nebius, Anthropic, Deepgram, GitHub Copilot, Gemini ASR, Rev.ai, AssemblyAI, ElevenLabs, Cohere, Gladia, Speechmatics, Azure Speech, Google Speech
  • Text cleanup pipeline — VAD silence trimming, hallucination filter, dictation commands, punctuation (BERT/PCS), grammar correction (T5), or full LLM cleanup (local/cloud)
  • Floating pill — real-time spectrum visualization, recording/transcribing states, cancel support
  • Model manager — parallel downloads with progress, pause/resume, benchmarks
  • Paginated history — backend-driven search, infinite scroll, processing badges
  • Bilingual UI — French and English

Engines

ASR (Speech-to-Text)

Engine Params Size Languages GPU Best WER RTF
Whisper 39M–1.55B 75 MB–3.1 GB 99 Metal 1.5% 0.05–0.50
Canary 182M 213 MB 4 (FR/EN/DE/ES) CoreML 1.87% 0.15
Parakeet-TDT 600M 703 MB 25 European CoreML 1.5% 0.10
Qwen3-ASR 600M 1.88 GB 30 Accelerate/AMX 2.0% 0.15
Voxtral 4.4B 8.9 GB 13 Metal 8.7% 0.40
Cloud (OpenAI API) depends on provider

Recommendations: Parakeet-TDT for best accuracy (1.5% WER, 25 languages). Whisper V3 Turbo for best balance (2.1% WER, 99 languages, 0.25 RTF). Whisper Tiny for speed (75 MB, 0.05 RTF).

Punctuation & Capitalization

Engine Params Size Languages Capitalization Speed
BERT Fullstop Large (ort) 560M 562 MB 4 (FR/EN/DE/IT) No ~100ms
BERT Fullstop Base (Candle) 280M 1.1 GB 5 (+ NL) No ~80ms
PCS 47 Languages (ort) 230M 233 MB 47 Yes (4 heads) ~50ms

Recommendation: PCS — smaller, faster, 47 languages, native capitalization.

Grammar & Spelling Correction

Engine Params Size Languages Speed
GEC T5 Small 60M 96 MB 11 (multilingual) ~200ms
T5 Spell FR 220M 276 MB FR ~500ms
FlanEC Base 250M 276 MB EN ~500ms
FlanEC Large 800M 821 MB EN ~1s

All run via ONNX Runtime with CoreML, autoregressive decoding with repeat penalty and n-gram blocking.

Local LLM (Text Cleanup)

Model Params Size Languages
Qwen3 0.6B 0.6B 484 MB FR/EN/ES/DE
Gemma 3 1B 1.0B 806 MB FR/EN/ES/DE
Qwen3 1.7B 1.7B 1.28 GB FR/EN/ES/DE
Ministral 3B 3.0B 2.15 GB FR/EN/ES/DE
Qwen3 4B 4.0B 2.50 GB FR/EN/ES/DE

All GGUF Q4 quantized, run via llama.cpp with Metal GPU. 11 models available total.

Processing pipelines

Audio pipeline

Mic (cpal) → WAV 16 kHz → VAD (Silero v6.2) → Trim silence → ASR
Stage Status Description
VAD (Silero) Done Discards silent recordings, trims leading/trailing silence
Denoising Planned Hybrid approach: denoised for VAD boundaries, original for ASR
Device presets Planned Per-mic gain, noise gate, normalization

See docs/AUDIO-PIPELINE.md for the full architecture.

Text pipeline

ASR raw → Hallucination filter → Dictation commands → Disfluency removal → Punctuation → Spell-check → Correction/LLM → Finalize → ITN → Paste
Stage Status Description
Hallucination filter Done 30+ regex patterns (FR/EN)
Dictation commands Done Voice commands → punctuation ("virgule" → ",")
Disfluency removal Done Strip fillers (euh, uh, um) — regex-based, FR/EN
Punctuation Done BERT or PCS token classification
Correction Done T5 encoder-decoder (grammar, spelling)
LLM cleanup Done Local (llama.cpp) or cloud (OpenAI/Anthropic)
ITN Done Inverse text normalization — numbers, ordinals, %, hours, currencies, units (FR/EN)
Finalize Done Spacing, capitalization

See docs/TEXT-PIPELINE.md for the full architecture.

Requirements

  • macOS 14.0+ (Apple Silicon)
  • Rust (stable)
  • Node.js 24+
  • Xcode Command Line Tools (xcode-select --install)

Build

./build.sh
open build/JonaWhisper.app

The build script installs npm dependencies automatically, then produces build/JonaWhisper.app and build/JonaWhisper.dmg. Tauri handles code signing automatically — if a Developer certificate is available (via APPLE_SIGNING_IDENTITY), the app is signed with hardened runtime and entitlements. Notarization is supported via APPLE_ID, APPLE_PASSWORD, and APPLE_TEAM_ID environment variables.

For a debug build:

./build.sh debug

Development

npm install
npm run tauri dev

This starts the Vite dev server with hot reload for the frontend. Rust changes trigger a rebuild automatically. Note: npm install is only needed once for dev mode — build.sh handles it automatically for production builds.

Permissions

On first launch, a setup wizard asks for three macOS permissions:

Permission Used for macOS API
Microphone Audio recording AVCaptureDevice authorization
Accessibility Paste simulation (Cmd+V via CGEvent) AXIsProcessTrusted
Input Monitoring Global hotkey detection (CGEvent tap) TCC ListenEvent

Usage

  1. Launch the app — it appears as a menu bar icon
  2. Press and hold the hotkey (default: Right Command) to record
  3. Release to transcribe — the text is pasted into the active app
  4. In toggle mode, press once to start, press again to stop

Cancel: Press Escape at any time to cancel recording or transcription.

Settings: Open from the tray menu to configure language, ASR model, text cleanup, hotkey, microphone, and cloud providers.

Models are downloaded and managed from within the app. All models are stored in ~/Library/Application Support/JonaWhisper/models/.

Tech stack

Layer Technologies
Framework Tauri 2
Backend Rust
Frontend Vue 3, TypeScript, Pinia, Tailwind CSS, shadcn-vue
Audio cpal + hound (recording), rustfft (spectrum), CoreAudio FFI (ducking)
ASR whisper-rs (Metal), ort + CoreML (Canary, Parakeet), qwen-asr (AMX), voxtral.c (Metal)
Text cleanup ort + CoreML (BERT, PCS, T5 correction), candle (BERT Metal), llama-cpp-2 (local LLM Metal)
Icons SDF (Signed Distance Field) in Rust, RGBA bitmaps — zero image dependencies
Hotkey Raw CGEvent tap (CoreGraphics FFI)
Permissions objc2 (AVFoundation, CoreGraphics, ApplicationServices)
i18n vue-i18n (frontend), rust-i18n (backend)

See docs/DEPENDENCIES.md for the full dependency list with rationale for each choice.

Architecture

JonaWhisper follows a thin orchestrator design. The main Tauri crate (src-tauri/src/) contains no engine logic nor model definitions — it orchestrates infrastructure crates and independent engine crates via dynamic dispatch.

Workspace crates (26 total):

Crate Role
jona-types Shared types (ASREngine trait, ASRModel, EngineError, Preferences, Provider)
jona-engines Infrastructure: EngineCatalog, downloader, ort session builder, mel features
jona-platform OS-specific code: hotkey (CGEvent tap), permissions, paste, audio devices, ducking
jona-provider + jona-provider-* (13) Cloud provider backends (OpenAI-compatible, Anthropic, Deepgram, Copilot, Gemini ASR, Rev.ai, AssemblyAI, ElevenLabs, Cohere, Gladia, Speechmatics, Azure Speech, Google Speech)
jona-engine-* (11) One crate per engine: whisper, canary, parakeet, qwen, voxtral, llama, bert, pcs, correction, spellcheck, lm

Plug-and-play engines: each engine crate registers itself via inventory::submit! at link time. The main crate calls EngineCatalog::init_auto() at startup — no hardcoded engine list, no re-exports. Adding an engine = add a crate + cargo dependency, zero changes to the orchestrator.

Dynamic dispatch: all ASR and cleanup operations go through the ASREngine trait. The orchestrator resolves models by ID, obtains the engine, and calls transcribe() or cleanup() — it never knows which engine it is talking to.

Project structure

src/                       Vue frontend
  views/                   Pages (Panel, SetupWizard)
  sections/                Settings sections (10 sections)
  components/              UI components
  stores/                  Pinia stores (app, history, settings, engines, downloads)
  utils/                   Shared utilities (shortcut, formatting)
  i18n/                    Translations (en.json, fr.json)
src-tauri/                 Rust backend (thin orchestrator)
  src/
    lib.rs                 Tauri setup & app lifecycle (10 modules)
    state.rs               App state & persistent preferences
    recording/             Recording lifecycle, transcription pipeline, thread spawning
    commands/              Tauri IPC commands (7 sub-modules: audio, engines, history, ...)
    audio.rs               cpal recording & FFT
    cleanup/               Text cleanup (bert, candle, pcs, t5, vad, llm, post_processor)
    platform/              macOS-specific (hotkey, permissions, paste, audio devices)
    ui/                    Native UI (tray, pill overlay, SDF icons)
  voxtral-c/               Vendored voxtral.c sources (C + Metal)
crates/                    Workspace crates (types, engines, platform, 13 provider crates, 11 engine crates)
docs/                      Technical documentation
  AUDIO-PIPELINE.md        Audio preprocessing architecture & roadmap
  TEXT-PIPELINE.md         Text postprocessing architecture & roadmap
  BENCHMARK.md             Full benchmark data (ASR, LLM, punctuation, correction)
build.sh                   Build + package script (signing handled by Tauri)

See ARCHITECTURE.md for the full architecture guide with data flows, threading model, and module responsibilities.

Contributing

See CONTRIBUTING.md for commit conventions, development setup, and how to submit PRs.

License

This project is licensed under the GNU General Public License v3.0 or later.

About

No description, website, or topics provided.

Resources

License

Contributing

Stars

Watchers

Forks

Sponsor this project

 

Packages

 
 
 

Contributors