Summary
Upstream qwen3-embed is releasing v2.0.0 with two major changes that affect this project:
- Smaller embedding model: `tiny-embed-v1` (~60MB ONNX INT8) replaces `Qwen3-Embedding-0.6B` (573MB) — ~10x smaller, faster inference, multilingual (50+ languages) plus code support
- New local reranking: `tiny-reranker-v1` (~70-90MB ONNX INT8) — a cross-encoder reranker that runs entirely locally via ONNX Runtime
Impact on mnemo-mcp
Embedding (existing)
- Model download size drops from ~573MB to ~60MB
- Memory search quality maintained (target >= 90% of Qwen3-Embedding-8B teacher)
- Output dimension remains 768 — backward compatible with existing memory indices
- First-launch and cold-start times significantly reduced
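Because the output dimension stays at 768, vectors already stored in a memory index remain directly comparable with embeddings produced by the new model. A minimal sketch of that compatibility, using a pure-Python cosine-similarity search over an existing index (the `search` helper and index layout here are illustrative, not mnemo-mcp internals):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(index, query_vec, top_k=5):
    # index: list of (memory_id, vector) pairs built with the old model.
    # Since tiny-embed-v1 keeps the 768-dim output, these stored vectors
    # stay comparable with vectors embedded by the new model.
    scored = [(mid, cosine(vec, query_vec)) for mid, vec in index]
    scored.sort(key=lambda t: t[1], reverse=True)
    return scored[:top_k]
```

The search logic is dimension-agnostic; the only compatibility requirement is that stored and query vectors have matching length, which the unchanged 768-dim output guarantees.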
Reranking (new capability)
- Memory retrieval can now be reranked locally before returning results
- Cross-encoder attention between query and memory content provides better relevance scoring
- Particularly useful for distinguishing between semantically similar but contextually different memories
- Zero network dependency for the entire memory search flow
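Unlike the bi-encoder, which compares precomputed vectors, a cross-encoder scores each (query, memory) pair jointly in one forward pass, which is what lets it separate memories that embed similarly but answer different questions. A sketch of that flow, with a token-overlap stub standing in for the actual `tiny-reranker-v1` forward pass (the stub and function names are hypothetical):

```python
def token_overlap_score(query: str, doc: str) -> float:
    # Stand-in scorer: in the real flow this would be one forward pass
    # of tiny-reranker-v1 over the concatenated (query, doc) pair via
    # ONNX Runtime. A token-overlap ratio keeps the sketch runnable.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query, candidates, scorer=token_overlap_score):
    # candidates: list of (memory_id, text) pairs from vector retrieval.
    # The cross-encoder sees query and text together, so it can rank
    # contextually relevant memories above merely similar ones.
    scored = [(mid, text, scorer(query, text)) for mid, text in candidates]
    scored.sort(key=lambda t: t[2], reverse=True)
    return scored
```

Only the scorer changes when wiring in the real model; the surrounding retrieve-then-rerank control flow stays the same.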
What needs to change
- Bump `qwen3-embed` dependency to `>=2.0.0`
- Integrate reranking into memory search (after vector retrieval, before returning results)
- Test with multilingual memory content and code-related memories
- Verify embedding quality hasn't regressed
- Consider if reranking should be opt-in or default (latency trade-off: ~5-10ms per rerank call for 10 documents)
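The opt-in question above can be kept out of the retrieval code by threading the reranker through as an optional hook: retrieve a wider candidate set cheaply, then rerank only when the caller asks for it. A sketch under that assumption (all parameter names here are hypothetical hooks, not existing mnemo-mcp APIs):

```python
def search_memories(query, embed, retrieve, rerank_fn=None,
                    top_k=5, candidate_k=20):
    # Vector retrieval first (cheap), then optional local reranking,
    # which adds roughly 5-10ms for ~10 candidates per the estimate above.
    qvec = embed(query)
    candidates = retrieve(qvec, candidate_k)  # [(memory_id, text, score)]
    if rerank_fn is None:
        return candidates[:top_k]  # opt-out path: raw vector order
    rescored = [(mid, text, rerank_fn(query, text))
                for mid, text, _ in candidates]
    rescored.sort(key=lambda t: t[2], reverse=True)
    return rescored[:top_k]
```

Retrieving `candidate_k > top_k` candidates before reranking matters: the reranker can only promote memories that vector search surfaced in the first place.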
Technical details
Both models share the same backbone: mDeBERTa-v3-base pruned to 6 layers, with the vocabulary pruned from 250K to 64K tokens. The embedder is distilled from Qwen3-Embedding-8B and the reranker from Qwen3-Reranker-8B. Both are Apache-2.0 licensed and auto-downloaded from the HuggingFace Hub.
Timeline
Blocked on qwen3-embed v2.0.0 release.