feat: hybrid search RAG (BM25 + vector) for all vector DB providers #5644

Open
Saswatsusmoy wants to merge 2 commits into Mintplex-Labs:master from Saswatsusmoy:feat/hybrid-search-rag

@Saswatsusmoy

Resolves #4338.

Summary

Adds opt-in hybrid retrieval combining BM25 keyword scoring with semantic vector similarity. Backward compatible — existing workspaces are unaffected unless they explicitly switch the search mode.

  • Strategy/orchestrator pattern: providers opt in to native hybrid via capabilities(); orchestrator falls back to a universal app-side BM25 strategy for providers without native support.
  • Native path: LanceDB (FTS index + vector, fused via Reciprocal Rank Fusion).
  • Universal fallback: works against any provider — Chroma, Pinecone, Qdrant, Weaviate, Milvus, PGVector, AstraDB, ChromaCloud, Zilliz all gain hybrid retrieval with no per-provider work.
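The dispatch rule described in the bullets above can be sketched roughly as follows. This is an illustration only: `capabilities()` and `performHybridSearch` are names from the PR, but the `{ nativeHybrid }` flag shape and the strategy labels are assumptions, not the PR's actual contract.

```javascript
// Hypothetical sketch: only capabilities() and performHybridSearch are names
// from the PR. The { nativeHybrid } flag and the returned strategy labels
// are assumptions made for illustration.
function selectStrategy(provider) {
  const caps =
    typeof provider?.capabilities === "function" ? provider.capabilities() : {};
  const hasNative =
    caps?.nativeHybrid === true &&
    typeof provider.performHybridSearch === "function";
  // Providers without native support fall through to the app-side BM25 path.
  return hasNative ? "native" : "appSideBM25";
}

const lanceLike = {
  capabilities: () => ({ nativeHybrid: true }),
  performHybridSearch: async () => ({ sources: [], contextTexts: [] }),
};
const pineconeLike = { capabilities: () => ({ nativeHybrid: false }) };
const legacyProvider = {}; // no capabilities() at all

console.log(selectStrategy(lanceLike));
console.log(selectStrategy(pineconeLike));
console.log(selectStrategy(legacyProvider));
```

Defaulting to the fallback whenever the capability is absent is what lets providers gain hybrid retrieval without per-provider work.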

Workspace settings

  • New vectorSearchMode option: "hybrid" (existing modes default and rerank are unchanged).
  • New hybridSearchAlpha (0..1) — weights semantic vs keyword. UI slider appears when Hybrid is selected.

Production hardening

  • Env-driven config (11 knobs prefixed HYBRID_SEARCH_*): pool sizing, fusion algorithm, RRF k, BM25 k1/b, cache size + TTL, log level, telemetry toggle.
  • BM25 LRU+TTL cache keyed on namespace + FNV-1a hash of candidate pool IDs. LanceDB invalidates on document add/delete; others rely on TTL.
  • Smart tokenizer — preserves URLs, snake/kebab/camel/Pascal identifiers, version strings (v2.1.0-beta), and long alphanumeric IDs (UUIDs, hashes).
  • Multilingual stopwords for en, es, fr, de, it, pt, nl, ru, zh, ja, ar — merged single set avoids language detection at query time.
  • Score calibration — sources carry score, hybridScore, denseScore, and sparseRank as separate fields. similarityThreshold filters on dense score, not the rank-based fused score, preserving existing operator intent.
  • Typed errors: HybridSearchError with code, context, and cause (Node-native error chaining). Structured logger with level filtering. Non-blocking telemetry events: hybrid_search_executed and hybrid_search_failed.
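As a rough illustration of the cache key described above (namespace plus an FNV-1a hash of the candidate pool IDs), a minimal sketch could look like this. The function names `fnv1a32` and `buildCacheKey`, the 32-bit hash width, and the separator are assumptions; the real key format lives in cache.js and may differ.

```javascript
// 32-bit FNV-1a over the UTF-8 bytes of a string.
function fnv1a32(str) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (const byte of Buffer.from(str, "utf8")) {
    hash ^= byte;
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by FNV prime, keep uint32
  }
  return hash >>> 0;
}

// Hypothetical key builder: sorting the IDs first makes the key
// independent of the order in which the dense search returned them.
function buildCacheKey(namespace, ids) {
  const sorted = [...ids].map(String).sort();
  return `${namespace}:${fnv1a32(sorted.join("\u0000")).toString(16)}`;
}

console.log(buildCacheKey("workspace-a", ["doc-2", "doc-1"]));
console.log(buildCacheKey("workspace-a", ["doc-1", "doc-2"]));
console.log(buildCacheKey("workspace-b", ["doc-1", "doc-2"]));
```

Prefixing the namespace keeps identical ID sets in different workspaces from sharing an index.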

Schema

Adds workspaces.hybridSearchAlpha (REAL, default 0.5). Additive migration — no data backfill, safe to roll back.

Test plan

49 unit tests across 5 suites, all passing:

  • RRF fusion — single list ordering, multi-list boost, weight handling, k constant effect
  • Weighted fusion — alpha extremes, normalization across wildly different score scales, empty inputs
  • BM25 — unmatched queries, empty corpus, multi-term ranking, item reference preservation
  • Tokenizer — URL/identifier/version/long-ID preservation, multilingual stopword removal, unicode handling, custom stopword override
  • BM25 cache — order-invariant keys, namespace isolation, hit/miss/invalidation, hit rate stats
  • Orchestrator — strategy dispatch, native delegation, app-side fallback fusion, pool expansion semantics, post-fusion threshold, score field annotation, cache reuse across calls
  • Error handling — provider failures wrapped with context, original error preserved via cause, native path errors wrapped consistently
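To make the "normalization across wildly different score scales" case concrete, here is a minimal sketch of min-max normalized weighted fusion. It is not the PR's weighted.js: the function names and the choice to score a document missing from one list as 0 on that side are assumptions.

```javascript
// Map each id to its min-max normalized score in [0, 1].
function minMaxNormalize(list) {
  if (list.length === 0) return new Map();
  const scores = list.map((r) => r.score);
  const min = Math.min(...scores);
  const range = Math.max(...scores) - min;
  // If all scores are equal, treat every entry as a full match.
  return new Map(
    list.map((r) => [r.id, range === 0 ? 1 : (r.score - min) / range])
  );
}

// alpha = 1 is pure semantic, alpha = 0 is pure keyword.
function weightedFusionSketch(dense, sparse, alpha) {
  const d = minMaxNormalize(dense);
  const s = minMaxNormalize(sparse);
  const ids = new Set([...d.keys(), ...s.keys()]);
  return [...ids]
    .map((id) => ({
      id,
      score: alpha * (d.get(id) ?? 0) + (1 - alpha) * (s.get(id) ?? 0),
    }))
    .sort((a, b) => b.score - a.score);
}

// Cosine-like dense scores and unbounded BM25 scores on very different scales.
const denseList = [
  { id: "a", score: 0.75 },
  { id: "b", score: 0.5 },
  { id: "c", score: 0.25 },
];
const sparseList = [
  { id: "b", score: 12 },
  { id: "c", score: 3 },
];
console.log(weightedFusionSketch(denseList, sparseList, 0.5));
```

With alpha at 0.5, document b wins because it is mid-ranked semantically but the top keyword match.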

To verify locally:

  • cd server && yarn install && NODE_ENV=test npx jest __tests__/utils/HybridSearch/
  • cd server && yarn prisma:generate && yarn prisma:migrate to apply the additive migration
  • Open Workspace Settings → Vector Database → set Search Preference to Hybrid (Keyword + Semantic) and adjust the slider; verify retrieval pulls in keyword matches that pure semantic search misses

Architecture docs

server/utils/HybridSearch/README.md covers strategy contract, fusion semantics, config knobs, cache invalidation expectations, and how to add native hybrid for a new provider.

Follow-ups (intentionally out of scope)

  • Native hybrid implementations for Weaviate (hybrid operator), Qdrant (sparse + Query API), Milvus/Zilliz (BM25 sparse + dense), PGVector (tsvector + ts_rank), AstraDB ($hybrid), and Pinecone (sparse-dense index — needs migration UX).
  • Cache invalidation hooks for non-LanceDB providers (TTL is the safety net today).
  • Shadow-mode rollout / A-B comparison tooling.

Adds opt-in hybrid retrieval combining BM25 keyword scoring with semantic
vector similarity. Resolves the request for hybrid search support in
Mintplex-Labs#4338.

Architecture
- Strategy/orchestrator pattern: providers opt in to native hybrid via
  capabilities(); orchestrator falls back to a universal app-side BM25
  strategy for providers without native support.
- Native path implemented for LanceDB (FTS index + vector, fused via RRF).
- Universal fallback works against any provider — no per-provider work
  required to enable hybrid search across Chroma, Pinecone, Qdrant,
  Weaviate, Milvus, PGVector, AstraDB, ChromaCloud, Zilliz.

Workspace settings
- New vectorSearchMode option: "hybrid" (existing: default, rerank).
- New hybridSearchAlpha (0..1) — weights semantic vs keyword. UI slider
  appears when Hybrid is selected.
- Backward compatible: existing workspaces stay on "default", no behavior
  change unless opted in.

Production hardening
- 11 env-overridable knobs (HYBRID_SEARCH_*): pool sizing, fusion choice,
  RRF k, BM25 k1/b, cache size + TTL, log level, telemetry toggle.
- BM25 LRU+TTL cache keyed on namespace + FNV-1a hash of candidate pool.
  LanceDB invalidates on document add/delete; others rely on TTL.
- Custom tokenizer preserves URLs, snake/kebab/camel/Pascal identifiers,
  version strings (v2.1.0-beta), and long alphanumeric IDs.
- Multilingual stopwords (en, es, fr, de, it, pt, nl, ru, zh, ja, ar).
- Score calibration: sources carry score, hybridScore, denseScore, and
  sparseRank as separate fields. similarityThreshold filters on dense
  score, not the rank-based fused score.
- Typed HybridSearchError with code/context/cause; structured logger;
  non-blocking telemetry events (hybrid_search_executed/_failed).

Schema
- Adds workspaces.hybridSearchAlpha (REAL, default 0.5). Additive
  migration; no data backfill.

Tests
- 49 unit tests across 5 suites: RRF + weighted fusion math, BM25
  scoring + tokenization, multilingual stopwords, identifier preservation,
  orchestrator strategy dispatch, error wrapping, cache hit/miss
  semantics, namespace invalidation.

Follow-ups intentionally out of scope
- Native hybrid implementations for Weaviate, Qdrant, Milvus, PGVector,
  AstraDB, Pinecone sparse-dense.
- Cache invalidation hooks for non-LanceDB providers (TTL is the safety
  net today).
- Shadow-mode rollout / A-B comparison tooling.
Copilot AI review requested due to automatic review settings May 17, 2026 05:08

Copilot AI left a comment

Pull request overview

Adds opt-in hybrid (BM25 keyword + vector) retrieval as a new vectorSearchMode = "hybrid" workspace option. Introduces a strategy/orchestrator module (server/utils/HybridSearch/*) that routes to a provider's native hybrid implementation when available (LanceDB) and otherwise to a universal app-side BM25 fallback over an expanded dense-search pool. All chat handlers are migrated from calling VectorDb.performSimilaritySearch directly to searchWorkspace(VectorDb, workspace, …).

Changes:

  • New HybridSearch module: RRF + weighted fusion, custom tokenizer/BM25, LRU+TTL BM25 cache, typed errors, structured logger, telemetry, env-driven config.
  • LanceDB gains a native performHybridSearch (FTS + vector fused via RRF) plus capabilities(); cache is invalidated on add/delete.
  • Schema/migration add workspaces.hybridSearchAlpha (REAL, default 0.5), workspace model exposes it as writable with clamp validation, and the workspace settings UI gets a slider when hybrid is selected.

Reviewed changes

Copilot reviewed 32 out of 33 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| server/utils/HybridSearch/index.js | Orchestrator: strategy selection, error wrapping, telemetry emission. |
| server/utils/HybridSearch/dispatch.js | Single entry point reading workspace.vectorSearchMode. |
| server/utils/HybridSearch/config.js | Env-driven runtime config getters. |
| server/utils/HybridSearch/cache.js | Singleton LRU+TTL cache for BM25 indexes keyed by namespace + FNV1a of IDs. |
| server/utils/HybridSearch/errors.js | HybridSearchError with code/context/cause. |
| server/utils/HybridSearch/logger.js | Minimal level-filtered logger writing via console.log. |
| server/utils/HybridSearch/telemetry.js | Lazy adapter to Telemetry model with fallback. |
| server/utils/HybridSearch/fusion/rrf.js | Reciprocal Rank Fusion. |
| server/utils/HybridSearch/fusion/weighted.js | Min-max normalized weighted fusion. |
| server/utils/HybridSearch/strategies/native.js | Delegates to provider's performHybridSearch. |
| server/utils/HybridSearch/strategies/appSideBM25.js | Universal hybrid fallback over expanded dense pool. |
| server/utils/HybridSearch/tokenizers/bm25.js | Smart tokenizer + BM25 scoring. |
| server/utils/HybridSearch/tokenizers/stopwords.js | Multilingual stopword set. |
| server/utils/HybridSearch/README.md | Architecture, config, and extension docs. |
| server/utils/vectorDbProviders/base.js | Adds capabilities() + default performHybridSearch throw. |
| server/utils/vectorDbProviders/lance/index.js | Native hybrid implementation, FTS index ensure, cache invalidation hooks. |
| server/utils/chats/{stream,openaiCompatible,embed,apiChatHandler}.js, telegramBot/chat/stream.js, agents/aibitat/plugins/memory.js | Migrated retrieval calls to searchWorkspace. |
| server/endpoints/api/workspace/index.js | API search uses searchWorkspace. |
| server/models/workspace.js | Adds hybridSearchAlpha writable + clamped validator; allows hybrid mode. |
| server/prisma/schema.prisma, server/prisma/migrations/20260517120000_hybrid_search/migration.sql | Additive column with default 0.5. |
| frontend/src/utils/types.js | Casts hybridSearchAlpha to float. |
| frontend/src/pages/WorkspaceSettings/.../VectorSearchMode/index.jsx | Adds hybrid option + alpha slider; relaxes provider gating. |
| server/tests/utils/HybridSearch/{bm25,cache,fusion,orchestrator,tokenizer}.test.js | New unit suites. |
Files not reviewed (1)
  • server/utils/HybridSearch/tokenizers/stopwords.js: Language not supported
Comments suppressed due to low confidence (3)

server/utils/HybridSearch/strategies/appSideBM25.js:110

  • When metadata?.score is not a number (which is the case for several providers — e.g. Pinecone/Qdrant/Weaviate paths normalize differently, and any source without a numeric score in metadata), denseScore is set to null. The later filter if (dScore !== null && dScore < similarityThreshold) continue; then silently bypasses the operator's similarityThreshold entirely. That means switching to hybrid mode can effectively disable the threshold filter on any provider that does not embed a score field in source metadata, which contradicts the PR description ("similarityThreshold filters on dense score, not the rank-based fused score, preserving existing operator intent"). Consider treating missing denseScore as something that still gets compared (e.g. by relying on the dense search's own threshold filtering, or by inheriting the score from the upstream pool entry).
    const denseScore =
      typeof metadata?.score === "number" ? metadata.score : null;
    return {
      id: String(id),
      text,
      denseScore,
      denseRank: idx,
      source: src,
      contextText: dense.contextTexts[idx] ?? text,
    };
  });

  const denseRanked = pool.map((p) => ({
    id: p.id,
    score: p.denseScore ?? 1 - p.denseRank / pool.length,
    item: p,
  }));

  const bm25Build0 = Date.now();
  const bm25 = bm25Cache.getOrBuild(
    namespace,
    pool.map((p) => ({ id: p.id, text: p.text, item: p }))
  );
  const bm25BuildMs = Date.now() - bm25Build0;
  const sparseRanked = bm25.score(input);
  const sparseRankMap = new Map(sparseRanked.map((r, i) => [r.id, i]));

  const fuse0 = Date.now();
  const fused =
    fusion === "weighted"
      ? weightedFusion(denseRanked, sparseRanked, { alpha: hybridAlpha })
      : reciprocalRankFusion([denseRanked, sparseRanked], {
          weights: [hybridAlpha, 1 - hybridAlpha],
          k: config.rrfK,
        });
  const fuseMs = Date.now() - fuse0;

  const survivors = [];
  for (const { item, score } of fused) {
    const dScore = item.denseScore;
    if (dScore !== null && dScore < similarityThreshold) continue;
    survivors.push({ item, score });
    if (survivors.length >= topN) break;
  }
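The bypass described in this comment can be reproduced in isolation. The sketch below wraps the quoted loop in a hypothetical `filterSurvivors` helper and adds an assumed `strictMissing` option that drops unscored candidates whenever a non-zero threshold is set. That conservative variant is only one possible reading of the suggestion, not the PR's fix.

```javascript
// Hypothetical helper around the survivor loop quoted above.
// strictMissing is an assumed option, not part of the PR.
function filterSurvivors(
  fused,
  similarityThreshold,
  topN,
  { strictMissing = false } = {}
) {
  const survivors = [];
  for (const { item, score } of fused) {
    const dScore = item.denseScore;
    if (dScore === null) {
      // Original behaviour: unscored items always pass.
      // Strict variant: drop them when the operator set a threshold.
      if (strictMissing && similarityThreshold > 0) continue;
    } else if (dScore < similarityThreshold) {
      continue;
    }
    survivors.push({ item, score });
    if (survivors.length >= topN) break;
  }
  return survivors;
}

const fusedSample = [
  { item: { id: "no-score", denseScore: null }, score: 0.03 },
  { item: { id: "weak", denseScore: 0.1 }, score: 0.02 },
  { item: { id: "strong", denseScore: 0.9 }, score: 0.01 },
];
console.log(filterSurvivors(fusedSample, 0.5, 4).map((s) => s.item.id));
console.log(
  filterSurvivors(fusedSample, 0.5, 4, { strictMissing: true }).map(
    (s) => s.item.id
  )
);
```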

server/utils/HybridSearch/strategies/appSideBM25.js:89

  • When metadata.id/metadata.vectorId are absent, the pool falls back to pool-${idx} as an identity. This makes the BM25 cache key (which is derived from the sorted set of pool IDs in cache.js#buildKey) non-deterministic across calls: the same content in a different dense-rank order will produce the same key but bind different documents to pool-0, pool-1, ... So a cache hit will return a BM25 index whose id-to-document mapping no longer matches the current pool's documents, producing wrong rankings. Either skip caching when synthetic IDs are used, or derive a content-stable id (e.g. hash of text).
  const pool = dense.sources.map((src, idx) => {
    const metadata = src?.metadata ?? src;
    const text = metadata?.text ?? dense.contextTexts[idx] ?? "";
    const id = metadata?.id ?? metadata?.vectorId ?? `pool-${idx}`;
    const denseScore =
      typeof metadata?.score === "number" ? metadata.score : null;
    return {
      id: String(id),
      text,
      denseScore,
      denseRank: idx,
      source: src,
      contextText: dense.contextTexts[idx] ?? text,
    };
  });

  const denseRanked = pool.map((p) => ({
    id: p.id,
    score: p.denseScore ?? 1 - p.denseRank / pool.length,
    item: p,
  }));

  const bm25Build0 = Date.now();
  const bm25 = bm25Cache.getOrBuild(
    namespace,
    pool.map((p) => ({ id: p.id, text: p.text, item: p }))
  );
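The second remedy suggested here, a content-stable ID, might be sketched as below. The helper name `stablePoolId`, the SHA-256 choice, the `text-` prefix, and the 16-character truncation are all assumptions for illustration.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical replacement for the `pool-${idx}` fallback: when the provider
// supplies no id, derive one from the chunk text so the same content always
// maps to the same id regardless of its dense rank.
function stablePoolId(metadata, text) {
  const explicit = metadata?.id ?? metadata?.vectorId;
  if (explicit !== undefined && explicit !== null) return String(explicit);
  const digest = createHash("sha256").update(text, "utf8").digest("hex");
  return `text-${digest.slice(0, 16)}`;
}

const firstCall = ["alpha chunk", "beta chunk"].map((t) => stablePoolId({}, t));
const secondCall = ["beta chunk", "alpha chunk"].map((t) => stablePoolId({}, t));
console.log(firstCall, secondCall);
```

Because the ID follows the text rather than the position, a cache hit on the sorted ID set would bind each ID to the same document in both calls.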

server/utils/HybridSearch/strategies/appSideBM25.js:58

  • For non-native-hybrid providers (Pinecone, Qdrant, Weaviate, Chroma, etc.), the app-side fallback issues a performSimilaritySearch with similarityThreshold: 0 and a pool of up to 100 (default) chunks. Most of those providers do an embedding round-trip per query — so flipping a workspace into hybrid mode silently widens every query's result set 5×–25× and may surface chunks the provider was previously hiding (irrelevant content, sensitive material previously below threshold). Worth either documenting this behavior explicitly in the workspace settings UI hint, or noting in the README that operators should re-validate their threshold semantics when enabling hybrid on non-LanceDB providers.
  const t0 = Date.now();
  const dense = await provider.performSimilaritySearch({
    namespace,
    input,
    LLMConnector,
    similarityThreshold: 0,
    topN: candidateTopN,
    filterIdentifiers,
  });
  const denseMs = Date.now() - t0;
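The size of that widening follows from the pool formula used in appSideBM25.js, `Math.min(poolMax, Math.max(topN, topN * poolMultiplier))`. The sketch below works through a few values; `poolMax = 100` is the default quoted in the comment, while `poolMultiplier = 5` is an assumed value chosen only to illustrate the arithmetic.

```javascript
// poolMax = 100 matches the default mentioned in the review comment;
// poolMultiplier = 5 is an assumption for illustration.
function candidatePoolSize(topN, { poolMultiplier = 5, poolMax = 100 } = {}) {
  return Math.min(poolMax, Math.max(topN, topN * poolMultiplier));
}

for (const topN of [4, 10, 30]) {
  const pool = candidatePoolSize(topN);
  console.log(
    `topN=${topN} -> pool=${pool} (${(pool / topN).toFixed(1)}x wider)`
  );
}
```

Under these assumed values the widening factor is the multiplier until the cap is reached, after which it shrinks as topN grows.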

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

});
} catch (e) {
// Index already exists or column missing — both are tolerable here.
if (!/already exists|exist/i.test(e?.message || "")) {
if (!namespace || !input || !LLMConnector)
throw new Error("Invalid request to appSideBM25Strategy.");

const candidateTopN = Math.min(poolMax, Math.max(topN, topN * poolMultiplier));
Comment on lines +120 to +137
    const results = new Array(this.docs.length);
    for (let i = 0; i < this.docs.length; i++) {
      let score = 0;
      const dl = this.docLengths[i];
      const tf = this.termFreqs[i];
      const lenNorm = 1 - this.b + this.b * (dl / this.avgDocLength);
      for (const term of qTokens) {
        const f = tf.get(term);
        if (!f) continue;
        const df = this.docFreqs.get(term) || 0;
        const idf = Math.log(1 + (this.N - df + 0.5) / (df + 0.5));
        const denom = f + this.k1 * lenNorm;
        score += idf * ((f * (this.k1 + 1)) / denom);
      }
      results[i] = { id: this.docs[i].id, score, item: this.docs[i].item };
    }

    return results.filter((r) => r.score > 0).sort((a, b) => b.score - a.score);
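For readers who want to run the scoring loop above, here is a self-contained sketch that builds the fields it relies on (docLengths, termFreqs, docFreqs, avgDocLength) and applies the same formula. The class name `TinyBM25`, the whitespace tokenizer, and the defaults k1 = 1.2 and b = 0.75 are assumptions. The PR uses its own tokenizer and env-driven parameters.

```javascript
// Simplified tokenizer for the sketch only (not the PR's smart tokenizer).
const simpleTokenize = (text) =>
  text.toLowerCase().split(/\s+/).filter(Boolean);

class TinyBM25 {
  constructor(docs, { k1 = 1.2, b = 0.75 } = {}) {
    this.docs = docs;
    this.k1 = k1;
    this.b = b;
    this.N = docs.length;
    this.termFreqs = [];
    this.docLengths = [];
    this.docFreqs = new Map();
    for (const doc of docs) {
      const tokens = simpleTokenize(doc.text);
      const tf = new Map();
      for (const t of tokens) tf.set(t, (tf.get(t) || 0) + 1);
      // Document frequency counts each term once per document.
      for (const t of tf.keys())
        this.docFreqs.set(t, (this.docFreqs.get(t) || 0) + 1);
      this.termFreqs.push(tf);
      this.docLengths.push(tokens.length);
    }
    const total = this.docLengths.reduce((sum, n) => sum + n, 0);
    this.avgDocLength = this.N > 0 ? total / this.N : 1;
  }

  score(query) {
    const qTokens = simpleTokenize(query);
    const results = [];
    for (let i = 0; i < this.docs.length; i++) {
      let score = 0;
      const tf = this.termFreqs[i];
      const lenNorm =
        1 - this.b + this.b * (this.docLengths[i] / this.avgDocLength);
      for (const term of qTokens) {
        const f = tf.get(term);
        if (!f) continue;
        const df = this.docFreqs.get(term) || 0;
        const idf = Math.log(1 + (this.N - df + 0.5) / (df + 0.5));
        score += idf * ((f * (this.k1 + 1)) / (f + this.k1 * lenNorm));
      }
      results.push({ id: this.docs[i].id, score });
    }
    return results.filter((r) => r.score > 0).sort((a, b) => b.score - a.score);
  }
}

const bm25Sketch = new TinyBM25([
  { id: "a", text: "hybrid search combines bm25 and vectors" },
  { id: "b", text: "vectors only" },
  { id: "c", text: "hybrid hybrid search" },
]);
console.log(bm25Sketch.score("hybrid search"));
```

Document c outranks a because it repeats a query term and is shorter than the average document. Document b matches nothing and is filtered out.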
for (const t of preserved) if (t.length >= minLength) out.push(t);
for (const t of rest) {
if (t.length < minLength) continue;
if (stopwords.has(t)) continue;
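The two-pass shape of this fragment (preserved tokens first, then the remaining words filtered by length and stopwords) can be sketched as follows. The regex, the tiny stopword set, and the function name are assumptions. The PR's tokenizer also covers camel/Pascal identifiers and long IDs, which this sketch omits.

```javascript
// Assumed patterns for "preserved" tokens: URLs, version strings such as
// v2.1.0-beta, and snake_case / kebab-case identifiers.
const PRESERVE_PATTERN =
  /https?:\/\/[^\s]+|v?\d+\.\d+\.\d+(?:-[0-9A-Za-z.]+)?|[A-Za-z][A-Za-z0-9]*(?:[_-][A-Za-z0-9]+)+/g;

function tokenizeSketch(
  text,
  { stopwords = new Set(["the", "to", "a", "of"]), minLength = 2 } = {}
) {
  const preserved = text.match(PRESERVE_PATTERN) ?? [];
  // Blank out preserved spans so they are not split into fragments below.
  const rest = text
    .replace(PRESERVE_PATTERN, " ")
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u)
    .filter(Boolean);

  const out = [];
  for (const t of preserved) if (t.length >= minLength) out.push(t);
  for (const t of rest) {
    if (t.length < minLength) continue;
    if (stopwords.has(t)) continue;
    out.push(t);
  }
  return out;
}

console.log(
  tokenizeSketch(
    "Upgrade the hybrid_search module to v2.1.0-beta via https://example.com/docs"
  )
);
```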
Comment on lines +65 to +98
{selection === "hybrid" && (
  <div className="mt-3">
    <label htmlFor="hybridSearchAlpha" className="block input-label">
      Hybrid Weight (semantic ↔ keyword)
    </label>
    <div className="flex items-center gap-3 mt-2">
      <span className="text-white text-opacity-60 text-xs w-20">
        Keyword
      </span>
      <input
        type="range"
        min={0}
        max={1}
        step={0.05}
        name="hybridSearchAlpha"
        value={alpha}
        onChange={(e) => {
          setAlpha(parseFloat(e.target.value));
          setHasChanges(true);
        }}
        className="flex-1"
      />
      <span className="text-white text-opacity-60 text-xs w-20 text-right">
        Semantic
      </span>
      <span className="text-white text-sm w-12 text-right">
        {alpha.toFixed(2)}
      </span>
    </div>
    <p className="text-white text-opacity-60 text-xs font-medium py-1.5">
      0 = pure keyword (BM25). 1 = pure semantic (vector). 0.5 = balanced.
    </p>
  </div>
)}
Comment on lines +510 to +513
    const fused = reciprocalRankFusion([denseRanked, sparseRanked], {
      weights: [hybridAlpha, 1 - hybridAlpha],
      k: hybridConfig.rrfK,
    });
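For context on what this call computes, a weighted Reciprocal Rank Fusion can be sketched as below: each document receives, per list, its list weight divided by (k + rank), with 1-based ranks. This is the textbook formulation with weights added, not a copy of the PR's rrf.js. The default k = 60 and the function name are assumptions.

```javascript
// Each list is assumed to be sorted best-first; only rank position matters,
// so raw dense and BM25 scores never need to share a scale.
function rrfSketch(lists, { weights, k = 60 } = {}) {
  const w = weights ?? lists.map(() => 1);
  const fused = new Map();
  lists.forEach((list, listIdx) => {
    list.forEach((entry, rankIdx) => {
      const current = fused.get(entry.id) ?? {
        id: entry.id,
        item: entry.item,
        score: 0,
      };
      current.score += w[listIdx] / (k + rankIdx + 1);
      fused.set(entry.id, current);
    });
  });
  return [...fused.values()].sort((a, b) => b.score - a.score);
}

const denseOrder = [{ id: "a" }, { id: "b" }, { id: "c" }];
const sparseOrder = [{ id: "c" }, { id: "a" }];
const alphaSample = 0.5;
console.log(
  rrfSketch([denseOrder, sparseOrder], {
    weights: [alphaSample, 1 - alphaSample],
  }).map((r) => r.id)
);
```

Because fused scores are rank-based and small (on the order of 1/k), they are not comparable to a cosine-style similarityThreshold, which is why the PR filters on the dense score instead.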
Comment on lines +30 to +33
    if (Object.keys(fields).length === 0) {
      console.log(`${tag} ${message}`);
    } else {
      console.log(`${tag} ${message}`, fields);
- ensureFTSIndex was passing {config: {type: "fts"}} which the SDK silently
  accepts but creates a non-inverted index. fullTextSearch then fails at
  query time with "Index is not an inverted index". Use the typed factory
  lancedb.Index.fts() instead. Verified end-to-end against a real LanceDB
  table — hybrid path now returns calibrated dense + sparse + fused scores.
- Apply prettier formatting flagged by eslint --fix across HybridSearch
  modules and stopwords list (no semantic changes).
Development

Successfully merging this pull request may close these issues.

[FEAT]: RAG with Hybrid Search

3 participants