## Summary
Nomic released nomic-embed-text-v2-moe in February 2025 — the first general-purpose Mixture of Experts embedding model. It's been out for a year, outperforms v1.5 on BEIR and MIRACL, supports ~100 languages, and keeps the same 768-dim Matryoshka output. No one has added it to a Rust embedding library yet.
I'd like to implement this for fastembed-rs and am happy to open a PR. Posting this issue first to align on the approach.
## Why Candle, not ONNX
The v2-moe architecture uses dynamic expert routing (8 experts, top-2 per token) that cannot be cleanly exported to ONNX. The MoE gating layer calls `.tolist()` and uses Python control flow to dispatch tokens to experts, so the routing decisions get baked in as constants during JIT tracing, producing incorrect results on any other input.
The only known ONNX workaround (documented here) runs all 8 experts unconditionally on every token, then masks. It works, but at ~4x the compute cost — defeating the purpose of MoE.
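To make the cost difference concrete, here is a minimal stdlib-only Rust sketch (toy elementwise "experts" standing in for real expert MLPs; this is not candle code) comparing proper top-2 routing against the run-all-and-mask workaround. Both paths produce the same output, but the dense path makes 8 expert calls per token instead of 2.

```rust
// Toy "expert" MLP: a fixed elementwise map standing in for a real expert.
fn expert(id: usize, x: &[f32]) -> Vec<f32> {
    x.iter().map(|v| v * (id as f32 + 1.0)).collect()
}

// Indices of the two largest routing probabilities.
fn top2(probs: &[f32; 8]) -> (usize, usize) {
    let mut idx: Vec<usize> = (0..8).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    (idx[0], idx[1])
}

// Proper MoE: evaluate only the 2 selected experts (2 expert calls).
fn routed(x: &[f32], probs: &[f32; 8], calls: &mut usize) -> Vec<f32> {
    let (a, b) = top2(probs);
    let norm = probs[a] + probs[b]; // renormalize the selected weights
    *calls += 2;
    let (ea, eb) = (expert(a, x), expert(b, x));
    ea.iter()
        .zip(&eb)
        .map(|(va, vb)| (probs[a] / norm) * va + (probs[b] / norm) * vb)
        .collect()
}

// ONNX workaround: evaluate all 8 experts, zero-mask the rest (8 calls).
fn dense_masked(x: &[f32], probs: &[f32; 8], calls: &mut usize) -> Vec<f32> {
    let (a, b) = top2(probs);
    let norm = probs[a] + probs[b];
    let mut out = vec![0.0; x.len()];
    for id in 0..8 {
        *calls += 1;
        let w = if id == a || id == b { probs[id] / norm } else { 0.0 };
        for (o, v) in out.iter_mut().zip(expert(id, x)) {
            *o += w * v;
        }
    }
    out
}

fn main() {
    let x = vec![1.0, -2.0, 0.5];
    let probs = [0.05, 0.30, 0.02, 0.25, 0.10, 0.08, 0.12, 0.08];
    let (mut c1, mut c2) = (0, 0);
    let r = routed(&x, &probs, &mut c1);
    let d = dense_masked(&x, &probs, &mut c2);
    assert!(r.iter().zip(&d).all(|(a, b)| (a - b).abs() < 1e-6));
    println!("routed calls = {c1}, dense calls = {c2}"); // 2 vs 8
}
```

The 8-vs-2 call count is where the "~4x compute" figure comes from.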
HuggingFace's TEI solved this the right way: a native Candle implementation with proper MoE routing (PR #596, merged April 2025).
## Proposed approach
Follow the Qwen3 precedent (PR #216):
- New file `src/models/nomic_v2_moe.rs` implementing the full NomicBert+MoE architecture in candle-nn:
  - NomicBert embeddings + RoPE
  - Multi-head attention with QKV bias
  - Standard MLP (non-MoE layers)
  - MoE layer: linear router → softmax → top-2 selection → 8 expert MLPs → weighted sum
  - Alternating standard/MoE layers (MoE every 2nd layer)
  - Mean pooling + L2 normalization
- Feature-gated behind `nomic-v2-moe` (reuses existing candle deps from `qwen3`)
- Loads directly from safetensors on HuggingFace — no custom ONNX export needed
- Tests validated against PyTorch reference outputs (cosine similarity > 0.999)
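As a rough illustration of the surrounding plumbing, here is a stdlib-only Rust sketch of the router softmax plus the final mean pooling and L2 normalization (toy 4-dim vectors and a 4-way router for brevity; the real model uses 768 dims and 8 experts, and the candle implementation would use tensor ops instead):

```rust
// Softmax over raw router logits -> routing probabilities per token.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|l| (l - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

// Mean-pool token embeddings (seq_len x dim) into one sentence vector.
fn mean_pool(tokens: &[Vec<f32>]) -> Vec<f32> {
    let dim = tokens[0].len();
    let mut out = vec![0.0; dim];
    for t in tokens {
        for (o, v) in out.iter_mut().zip(t) {
            *o += v;
        }
    }
    let n = tokens.len() as f32;
    out.iter_mut().for_each(|o| *o /= n);
    out
}

// L2-normalize in place so the final embedding has unit norm.
fn l2_normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        v.iter_mut().for_each(|x| *x /= norm);
    }
}

fn main() {
    // Two toy 4-dim token embeddings standing in for transformer outputs.
    let tokens = vec![vec![1.0, 0.0, 2.0, -1.0], vec![3.0, 2.0, 0.0, 1.0]];
    let probs = softmax(&[0.1, 2.0, -1.0, 0.5]); // toy 4-way router
    assert!((probs.iter().sum::<f32>() - 1.0).abs() < 1e-6);
    let mut emb = mean_pool(&tokens);
    l2_normalize(&mut emb);
    let norm: f32 = emb.iter().map(|x| x * x).sum::<f32>().sqrt();
    assert!((norm - 1.0).abs() < 1e-6); // unit-norm output embedding
}
```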
## Model specs

| | v1.5 (current) | v2-moe (proposed) |
|---|---|---|
| Architecture | Standard transformer | MoE (8 experts, top-2) |
| Total params | 137M | 475M |
| Active params | 137M | 305M |
| Dimensions | 768 (Matryoshka) | 768 (Matryoshka) |
| Max context | 8192 tokens | 512 tokens |
| Languages | English-focused | ~100 |
| HF format | ONNX available | Safetensors only |
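On the validation criterion mentioned above (cosine similarity > 0.999 against PyTorch reference outputs), the check itself is small; a plain-Rust helper might look like this (the vectors here are placeholders, not real model outputs):

```rust
// Cosine similarity between two embedding vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

fn main() {
    // Identical vectors -> similarity 1.0; the real test would compare the
    // candle output against a saved PyTorch reference embedding.
    let reference = vec![0.1, -0.4, 0.7, 0.2];
    let candidate = vec![0.1, -0.4, 0.7, 0.2];
    assert!(cosine_similarity(&reference, &candidate) > 0.999);
}
```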
## References
## Scope
I'm prepared to implement this and open a PR. Wanted to check:
- Does the Candle approach align with where you want the library to go?
- Any preference on the feature flag naming (`nomic-v2-moe`, `nomic-moe`, etc.)?
- Should I target the same candle version pinned by the `qwen3` feature?
Thanks for maintaining this library — it's the backbone of local embeddings in Rust.