feat: Add nomic-embed-text-v2-moe support via Candle backend #227

@samvallad33

Description

Summary

Nomic released nomic-embed-text-v2-moe in February 2025 — the first general-purpose Mixture of Experts embedding model. It's been out for a year, outperforms v1.5 on BEIR and MIRACL, supports ~100 languages, and keeps the same 768-dim Matryoshka output. No one has added it to a Rust embedding library yet.

I'd like to implement this for fastembed-rs and am happy to open a PR. Posting this issue first to align on the approach.

Why Candle, not ONNX

The v2-moe architecture uses dynamic expert routing (8 experts, top-2 per token) that cannot be cleanly exported to ONNX. The MoE gating layer calls .tolist() and uses Python control flow to dispatch tokens to experts — this gets baked as constants during JIT tracing, producing incorrect results on different inputs.
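For concreteness, the dynamic top-2 dispatch described above can be sketched in plain Rust (a hypothetical per-token sketch with stand-in expert functions; the real implementation would use candle-nn tensors and layers):

```rust
// Hypothetical per-token top-2 MoE gating over 8 experts, in plain Rust.
// Expert MLPs are stand-ins; a real model would use candle-nn layers.

const TOP_K: usize = 2;

/// Softmax over router logits for one token.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Select the top-k experts and renormalize their gate weights to sum to 1.
fn top_k_gates(probs: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = probs.iter().cloned().enumerate().collect();
    indexed.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    indexed.truncate(k);
    let total: f32 = indexed.iter().map(|&(_, p)| p).sum();
    indexed.into_iter().map(|(i, p)| (i, p / total)).collect()
}

/// Route one token's hidden state: weighted sum of the selected experts' outputs.
/// Only the top-k experts run, which is exactly the data-dependent control flow
/// that JIT tracing bakes into constants during ONNX export.
fn moe_forward(
    hidden: &[f32],
    router_logits: &[f32],
    experts: &[fn(&[f32]) -> Vec<f32>],
) -> Vec<f32> {
    let gates = top_k_gates(&softmax(router_logits), TOP_K);
    let mut out = vec![0.0f32; hidden.len()];
    for (idx, w) in gates {
        for (o, e) in out.iter_mut().zip(experts[idx](hidden)) {
            *o += w * e;
        }
    }
    out
}
```

Because which experts run depends on the input, a traced graph records only the expert indices seen during export, which is why the exported ONNX model misbehaves on other inputs.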

The only known ONNX workaround (documented here) runs all 8 experts unconditionally on every token, then masks. It works, but at ~4x the compute cost — defeating the purpose of MoE.

HuggingFace's TEI solved this the right way: a native Candle implementation with proper MoE routing (PR #596, merged April 2025).

Proposed approach

Follow the Qwen3 precedent (PR #216):

  1. New file src/models/nomic_v2_moe.rs implementing the full NomicBert+MoE architecture in candle-nn:
    • NomicBert embeddings + RoPE
    • Multi-head attention with QKV bias
    • Standard MLP (non-MoE layers)
    • MoE layer: linear router → softmax → top-2 selection → 8 expert MLPs → weighted sum
    • Alternating standard/MoE layers (MoE every 2nd layer)
    • Mean pooling + L2 normalization
  2. Feature-gated behind nomic-v2-moe (reuses existing candle deps from qwen3)
  3. Loads directly from safetensors on HuggingFace — no custom ONNX export needed
  4. Tests validated against PyTorch reference outputs (cosine similarity > 0.999)
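The pooling head from step 1 and the cosine-similarity check from step 4 are simple enough to sketch directly (plain Rust with hypothetical helper names, operating on `Vec<f32>` rather than candle tensors):

```rust
// Hypothetical helpers: mean pooling over token embeddings, L2 normalization,
// and the cosine-similarity metric used to validate against PyTorch outputs.

/// Mean-pool a [seq_len][dim] matrix of token embeddings into one [dim] vector.
fn mean_pool(token_embeds: &[Vec<f32>]) -> Vec<f32> {
    let dim = token_embeds[0].len();
    let n = token_embeds.len() as f32;
    let mut pooled = vec![0.0f32; dim];
    for tok in token_embeds {
        for (p, v) in pooled.iter_mut().zip(tok) {
            *p += v / n;
        }
    }
    pooled
}

/// Scale a vector to unit L2 norm.
fn l2_normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    v.iter().map(|x| x / norm).collect()
}

/// Cosine similarity; for unit vectors this reduces to the dot product.
fn cosine_sim(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}
```

The validation tests would embed the same inputs through both the Candle port and the PyTorch reference, then assert `cosine_sim(rust_out, torch_out) > 0.999` per input.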

Model specs

| | v1.5 (current) | v2-moe (proposed) |
|---|---|---|
| Architecture | Standard transformer | MoE (8 experts, top-2) |
| Total params | 137M | 475M |
| Active params | 137M | 305M |
| Dimensions | 768 (Matryoshka) | 768 (Matryoshka) |
| Max context | 8192 tokens | 512 tokens |
| Languages | English-focused | ~100 |
| HF format | ONNX available | Safetensors only |

Scope

I'm prepared to implement this and open a PR. Before starting, I wanted to check:

  1. Does the Candle approach align with where you want the library to go?
  2. Any preference on the feature flag naming (nomic-v2-moe, nomic-moe, etc.)?
  3. Should I target the same candle version pinned by the qwen3 feature?
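On question 2, a hypothetical sketch of the feature wiring, assuming it mirrors how the existing `qwen3` feature pulls in its candle dependencies (exact dependency names and optionality are guesses, not fastembed-rs's actual manifest):

```toml
# Cargo.toml (sketch) — assumed feature layout mirroring the qwen3 feature
[features]
qwen3 = ["dep:candle-core", "dep:candle-nn"]
nomic-v2-moe = ["dep:candle-core", "dep:candle-nn"]
```

Whichever flag name you prefer, the intent is that the MoE model adds no new dependencies beyond what `qwen3` already brings in.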

Thanks for maintaining this library — it's the backbone of local embeddings in Rust.
