Chinese/multilingual speech recognition service based on NVIDIA NeMo, providing an OpenAI Whisper-compatible /v1/audio/transcriptions interface. Packaged as a GPU-enabled Docker image for one-click deployment.
- Pre-configured model: `nvidia/parakeet-tdt-0.6b-v3` by default; supports 25 languages with automatic language detection
- Supports long-audio chunking with overlapped stitching and provides SRT/VTT/verbose_json output formats
- Automatically detects CUDA compatibility: falls back to CPU mode when incompatible or no GPU is present (slower)
- OpenAI Whisper API compatible format, including error responses
- Quick Start (PowerShell)
- Prerequisites
- Running with Pre-built Images
- Building and Running from Source
- API Usage Examples
- Language Detection and Support
- Configuration and Environment Variables
- Ports, Volumes and File Structure
- Health Checks and Monitoring
- Frequently Asked Questions and Troubleshooting
- License and Acknowledgements
- Prepare directories and start the container (using the pre-built image)
```shell
# Execute in the repository root directory
mkdir -p ./models ./temp_uploads
# Start (requires the NVIDIA Container Toolkit)
docker compose up -d
# View logs (optional)
docker compose logs -f
```
- Health checks
  - Simple health: http://localhost:5092/health/simple
  - Detailed health: http://localhost:5092/health
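
  For a quick liveness probe from the shell (a minimal sketch; assumes the default `5092:5092` port mapping):

  ```shell
  # Exits non-zero until the service is healthy
  curl -fsS http://localhost:5092/health/simple && echo "service is up"
  ```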
- Test the API (example: JSON text output)
```shell
# Example using curl
curl -X POST "http://localhost:5092/v1/audio/transcriptions" \
  -F "file=@/path/to/audio.mp3" \
  -F "model=whisper-1" \
  -F "response_format=json"
```
If an API key is enabled, add `-H "Authorization: Bearer YOUR_API_KEY"`.
- Operating System: Linux/macOS/Windows
- Docker: Docker Desktop or Docker Engine (Compose V2)
- GPU (optional but recommended):
  - NVIDIA GPU with drivers installed (535+ recommended) and the NVIDIA Container Toolkit
  - The image is based on `nvidia/cuda:13.0.0-runtime-ubuntu22.04` and requires compatible drivers
  - The service can also run without a GPU (automatic CPU mode), but inference will be significantly slower
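
To verify that Docker can reach the GPU before starting the service, a quick check using the same base image the project builds on:

```shell
# Should print the driver version and a GPU table; failure usually means
# the NVIDIA Container Toolkit is missing or misconfigured
docker run --rm --gpus all nvidia/cuda:13.0.0-runtime-ubuntu22.04 nvidia-smi
```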
The project provides `docker-compose.yml`, which by default pulls the image `ghcr.io/fqscfqj/parakeet-api-docker:full`.
```shell
mkdir -p ./models ./temp_uploads
docker compose up -d
# To update the image
# docker compose pull; docker compose up -d
```
Main Compose configuration:
- Port mapping: `5092:5092`
- Volumes:
  - `./models:/app/models` (models and cache)
  - `./temp_uploads:/app/temp_uploads` (temporary transcoding and chunking files)
- GPU: requests all available GPUs via `deploy.resources.reservations.devices`
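
Once the stack is up, you can confirm the GPU reservation took effect inside the container (a sketch; `parakeet-api` is a placeholder, substitute the service name from your docker-compose.yml):

```shell
# Should list the GPU(s) granted to the container
docker compose exec parakeet-api nvidia-smi
```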
If you need to customize the Dockerfile, or to use regional mirrors to speed up the build, use docker-compose-build.yml:
```shell
mkdir -p ./models ./temp_uploads
docker compose -f docker-compose-build.yml up -d --build
```
The built image includes:
- Python 3.10 + Pip
- PyTorch/cu130 + torchaudio (from official CUDA 13.0 wheels)
- NeMo ASR and dependencies, FFmpeg, health check script
- Endpoint: `POST /v1/audio/transcriptions`
- Fields (multipart/form-data):
  - `file`: audio/video file
  - `model`: compatibility field, default `whisper-1`
  - `response_format`: `json` | `text` | `srt` | `vtt` | `verbose_json`
  - `language`: optional, automatic by default
  - `prompt`, `temperature`: optional
Example: Return SRT subtitles
```shell
curl -X POST "http://localhost:5092/v1/audio/transcriptions" \
  -F "file=@/path/to/audio.wav" \
  -F "model=whisper-1" \
  -F "response_format=srt"
```
With an API key enabled:
```shell
# After setting the API_KEY environment variable in docker-compose.yml, include the header when calling
curl -X POST "http://localhost:5092/v1/audio/transcriptions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@/path/to/audio.mp3" -F "response_format=json"
```
The API supports transcription of the following 25 languages (based on the parakeet-tdt-0.6b-v3 model):
| Language Code | Language Name | Language Code | Language Name | Language Code | Language Name |
|---|---|---|---|---|---|
| bg | Bulgarian | hr | Croatian | cs | Czech |
| da | Danish | nl | Dutch | en | English |
| et | Estonian | fi | Finnish | fr | French |
| de | German | el | Greek | hu | Hungarian |
| it | Italian | lv | Latvian | lt | Lithuanian |
| mt | Maltese | pl | Polish | pt | Portuguese |
| ro | Romanian | sk | Slovak | sl | Slovenian |
| es | Spanish | sv | Swedish | ru | Russian |
| uk | Ukrainian |  |  |  |  |
When the language parameter is not specified in the request, the system automatically detects the audio language:
- Detection process:
  - Extract the beginning of the audio (default 45 seconds) for a quick transcription pass
  - Use the langdetect library to analyze the language of the transcribed text
  - If a supported language is detected, use it for the full transcription
  - If an unsupported language is detected, handle it according to the `ENABLE_AUTO_LANGUAGE_REJECTION` setting
- Processing rules:
  - Explicitly specified language: verified against the supported list; an OpenAI-format error is returned if the language is unsupported
  - Automatically detected supported language: the detected language is used for transcription
  - Automatically detected unsupported language:
    - If `ENABLE_AUTO_LANGUAGE_REJECTION=true`: an OpenAI-format error is returned
    - If `ENABLE_AUTO_LANGUAGE_REJECTION=false`: transcription falls back to English
- Response format:
  ```json
  {
    "text": "transcribed text content",
    "language": "auto-detected-lang-code"
  }
  ```
  The `language` field is only returned in the `verbose_json` format.
- Configuration options:
  - `ENABLE_AUTO_LANGUAGE_REJECTION`: whether to reject unsupported languages (default `true`)
  - `LID_CLIP_SECONDS`: audio clip length used for language detection (default `45` seconds)
```shell
# Explicitly specify a supported language
curl -X POST "http://localhost:5092/v1/audio/transcriptions" \
  -F "file=@/path/to/audio.mp3" \
  -F "language=en" \
  -F "response_format=json"

# Automatically detect the language
curl -X POST "http://localhost:5092/v1/audio/transcriptions" \
  -F "file=@/path/to/audio.mp3" \
  -F "response_format=verbose_json" # Returns the detected language

# Explicitly specify an unsupported language (returns an error)
curl -X POST "http://localhost:5092/v1/audio/transcriptions" \
  -F "file=@/path/to/audio.mp3" \
  -F "language=zh" # Returns an OpenAI-format error response
```
Common environment variables (can be set under Compose's `environment:`):
- Model and Loading
  - `MODEL_ID`: default `nvidia/parakeet-tdt-0.6b-v3`
  - `MODEL_LOCAL_PATH`: preferred path for loading a local `.nemo` file (after mounting `./models`, it can point to `/app/models/xxx.nemo`)
  - `ENABLE_LAZY_LOAD`: whether to lazy-load the model (default `true`)
  - `IDLE_TIMEOUT_MINUTES`: minutes of idleness before the model is auto-unloaded, `0` to disable (default `30`)
  - `API_KEY`: if set, enables Bearer token authentication
  - `HF_ENDPOINT`: Hugging Face mirror endpoint, default `https://hf-mirror.com`
- Performance and GPU Memory
  - `PRESET`: `speed` | `balanced` | `quality` | `simple` (= `balanced`); used to derive parameters at startup
  - `GPU_VRAM_GB`: GPU memory capacity (integer, GB); auto-detected if not set
  - `CHUNK_MINUTE`: chunk duration per segment (minutes, default `10`; lower it to use less GPU memory)
  - `MAX_CONCURRENT_INFERENCES`: maximum concurrent inferences (default `1`)
  - `GPU_MEMORY_FRACTION`: GPU memory fraction available to a single process (default `0.90`~`0.95`)
  - `DECODING_STRATEGY`: `greedy` | `beam`; `RNNT_BEAM_SIZE`: beam width
  - `AGGRESSIVE_MEMORY_CLEANUP`: aggressive GPU memory cleanup (default `true`)
  - `ENABLE_TENSOR_CORE`, `ENABLE_CUDNN_BENCHMARK`, `TENSOR_CORE_PRECISION`: Tensor Core/cuDNN benchmark tuning
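
  To choose a sensible `GPU_VRAM_GB` when auto-detection fails, you can query the card directly (standard `nvidia-smi` query flags; the value is reported in MiB):

  ```shell
  # Total VRAM in MiB; divide by 1024 for the GB value to put in GPU_VRAM_GB
  nvidia-smi --query-gpu=memory.total --format=csv,noheader
  ```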
- Idle Resource Optimization
  - `IDLE_MEMORY_CLEANUP_INTERVAL`: idle memory cleanup interval (seconds, default `120`)
  - `IDLE_DEEP_CLEANUP_THRESHOLD`: deep cleanup threshold (seconds, default `600`)
  - `ENABLE_IDLE_CPU_OPTIMIZATION`: enable CPU optimization when idle (default `true`)
  - `IDLE_MONITORING_INTERVAL`: idle monitoring interval (seconds, default `30`)
  - `ENABLE_AGGRESSIVE_IDLE_OPTIMIZATION`: enable aggressive memory optimization (default `true`)
  - `IMMEDIATE_CLEANUP_AFTER_REQUEST`: clean up immediately after a request completes (default `true`)
  - `MEMORY_USAGE_ALERT_THRESHOLD_GB`: force a cleanup when memory usage exceeds this value (default `6.0` GB)
  - `AUTO_MODEL_UNLOAD_THRESHOLD_MINUTES`: auto model unload threshold (default `10` minutes)

  💡 Resource optimization tip: the new version significantly strengthens the idle optimization strategy, reducing roughly 8 GB of idle memory to 2-3 GB. With aggressive optimization enabled, the system performs multiple rounds of deep cleanup while the model is idle. Monitor `idle_status` and resource usage via the `/health` endpoint.
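
  A minimal way to watch this from the shell (assumes `jq` is installed and that `idle_status` is a top-level field of the `/health` JSON):

  ```shell
  # Poll the health endpoint every 30 seconds and print the idle status
  watch -n 30 'curl -s http://localhost:5092/health | jq .idle_status'
  ```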
- Chunking and Sentence Integrity
  - `ENABLE_OVERLAP_CHUNKING`: overlapping chunks (default `true`); `CHUNK_OVERLAP_SECONDS`: overlap length in seconds (default `30`)
  - `ENABLE_SILENCE_ALIGNED_CHUNKING`: silence-aligned splitting (default `true`)
  - `SILENCE_THRESHOLD_DB` (default `-38` dB), `MIN_SILENCE_DURATION` (default `0.35`), `SILENCE_MAX_SHIFT_SECONDS` (default `2.0`)
- Subtitle Post-processing and Line Breaks
  - `MERGE_SHORT_SUBTITLES` (default `true`), `MIN_SUBTITLE_DURATION_SECONDS` (default `1.5`)
  - `SHORT_SUBTITLE_MERGE_MAX_GAP_SECONDS`, `SHORT_SUBTITLE_MIN_CHARS`, `SUBTITLE_MIN_GAP_SECONDS`
  - `SPLIT_LONG_SUBTITLES` (default `true`), `MAX_SUBTITLE_DURATION_SECONDS` (default `6.0`)
  - `MAX_SUBTITLE_CHARS_PER_SEGMENT` (default `84`), `PREFERRED_LINE_LENGTH` (default `42`), `MAX_SUBTITLE_LINES` (default `2`)
  - `ENABLE_WORD_TIMESTAMPS_FOR_SPLIT` (default `false`)
- Other
  - `ENABLE_FFMPEG_DENOISE` (default `false`), `DENOISE_FILTER`: FFmpeg denoise/equalizer/dynamic-range preprocessing
  - `NUMBA_CACHE_DIR` (default `/tmp/numba_cache`): already created with the correct permissions in the image
  - `PUID`/`PGID`: the container switches to the specified UID/GID at startup, which simplifies volume permission management
Tip: if you just want it to work, keep the default values; if you run into GPU memory shortages, reduce `CHUNK_MINUTE`, set `PRESET=quality`, or set `DECODING_STRATEGY=greedy`.
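
As a concrete illustration, a hedged `docker run` equivalent of a low-VRAM setup (these settings normally go under `environment:` in docker-compose.yml; image name and paths mirror the Compose defaults above):

```shell
# Minimal sketch: low-VRAM configuration via plain docker run
docker run -d --gpus all -p 5092:5092 \
  -v "$(pwd)/models:/app/models" \
  -v "$(pwd)/temp_uploads:/app/temp_uploads" \
  -e CHUNK_MINUTE=6 \
  -e DECODING_STRATEGY=greedy \
  -e GPU_MEMORY_FRACTION=0.90 \
  ghcr.io/fqscfqj/parakeet-api-docker:full
```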
- Port: the container listens on `5092` internally; map it to any host port in Compose
- Volumes:
  - `./models:/app/models`: saves/caches models (a local `.nemo` is loaded preferentially)
  - `./temp_uploads:/app/temp_uploads`: temporary transcoding and chunking data
- Key files:
  - `app.py`: Flask + Waitress service providing the API and the chunking/post-processing logic
  - `Dockerfile`: CUDA 13.0 runtime + dependency installation + health check + startup script
  - `docker-compose.yml`: runs the pre-built image
  - `docker-compose-build.yml`: local build
  - `healthcheck.sh`: container health check script
- `/health/simple`: returns 200 when alive
- `/health`: returns JSON with GPU/CPU, memory, and model loading status, etc.
- A `HEALTHCHECK` is built into the container; Compose/orchestration platforms can use it to drive restart policies
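
For ad-hoc inspection, the detailed endpoint can be pretty-printed from the shell (`jq` is optional):

```shell
# Dump the full health report (GPU/CPU, memory, model loading status)
curl -s http://localhost:5092/health | jq .
```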
- Q: The log reports "CUDA unavailable/compatibility error" and the service falls back to CPU?
  - A: Check that the host NVIDIA drivers meet the CUDA 13.x runtime requirements; confirm the NVIDIA Container Toolkit is installed; verify that the device reservations in Compose take effect. CPU mode still works when these requirements cannot be met, but it is slower.
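
    A quick host-side driver check (standard `nvidia-smi` query flags):

    ```shell
    # Prints only the installed driver version, e.g. 535 or newer
    nvidia-smi --query-gpu=driver_version --format=csv,noheader
    ```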
- Q: First-time model loading is slow or fails?
  - A: By default the model is pulled from Hugging Face. Set `MODEL_LOCAL_PATH` to point to a local `.nemo` file, or configure `HF_ENDPOINT` to use a mirror. Ensure the `./models` volume is writable.
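
    A sketch of the local-model route (the `.nemo` filename below is a placeholder; use the file you actually downloaded):

    ```shell
    # Place the model where the container sees /app/models, then set
    # MODEL_LOCAL_PATH=/app/models/parakeet-tdt-0.6b-v3.nemo in the Compose environment
    mkdir -p ./models
    mv ~/Downloads/parakeet-tdt-0.6b-v3.nemo ./models/
    ```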
- Q: GPU memory shortage / frequent OOM?
  - A: Reduce `CHUNK_MINUTE` (e.g. 6~8); set `DECODING_STRATEGY=greedy`; `PRESET=quality` automatically lowers concurrency and the GPU memory share; disable `ENABLE_OVERLAP_CHUNKING` if necessary.
- Q: Want to further reduce resource usage when idle?
  - A: The new version provides aggressive memory optimization options:
    - `ENABLE_AGGRESSIVE_IDLE_OPTIMIZATION=true`: enable aggressive memory cleanup
    - `IMMEDIATE_CLEANUP_AFTER_REQUEST=true`: clean up immediately after a request completes
    - `MEMORY_USAGE_ALERT_THRESHOLD_GB=6.0`: force a cleanup when memory usage exceeds 6 GB
    - `AUTO_MODEL_UNLOAD_THRESHOLD_MINUTES=10`: unload the model after 10 minutes of idleness
    - `IDLE_MEMORY_CLEANUP_INTERVAL=120`: run memory cleanup every 2 minutes
    - `IDLE_DEEP_CLEANUP_THRESHOLD=600`: run deep cleanup after 10 minutes of idleness
    - `IDLE_MONITORING_INTERVAL=30`: check idle status every 30 seconds
- Q: How to solve the 8 GB idle memory usage issue?
  - A: The following environment variable configuration can significantly reduce idle memory:
    ```shell
    ENABLE_AGGRESSIVE_IDLE_OPTIMIZATION=true
    MEMORY_USAGE_ALERT_THRESHOLD_GB=4.0
    AUTO_MODEL_UNLOAD_THRESHOLD_MINUTES=5
    IDLE_MEMORY_CLEANUP_INTERVAL=60
    IMMEDIATE_CLEANUP_AFTER_REQUEST=true
    ```
- Q: Returned subtitles are too fragmented or flicker?
  - A: Adjust `MIN_SUBTITLE_DURATION_SECONDS`, `SHORT_SUBTITLE_MERGE_MAX_GAP_SECONDS`, and `SHORT_SUBTITLE_MIN_CHARS`; or disable merging entirely with `MERGE_SHORT_SUBTITLES=false`.
- Q: Port conflict?
  - A: Modify the Compose `ports` mapping, for example `"18080:5092"`.
- Q: Volume permission issues?
  - A: Set `PUID`/`PGID`, or make sure Docker Desktop's shared-disk permissions are correct. If permission restrictions persist, deleting and recreating the volume directory can also help.
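
    A hedged example of aligning host ownership with the container user (`1000:1000` is an assumption; match the `PUID`/`PGID` you configured):

    ```shell
    # Hand the mounted directories to the UID/GID the container runs as
    sudo chown -R 1000:1000 ./models ./temp_uploads
    ```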
- This project: see `LICENSE`
- Models and dependencies: NVIDIA NeMo (ASR), PyTorch, FFmpeg, Hugging Face, and other open-source ecosystems