feat: hybrid search RAG (BM25 + vector) for all vector DB providers by Saswatsusmoy · Pull Request #5644 · Mintplex-Labs/anything-llm

Saswatsusmoy · 2026-05-17T05:08:48Z

Resolves #4338.

Summary

Adds opt-in hybrid retrieval combining BM25 keyword scoring with semantic vector similarity. Backward compatible — existing workspaces are unaffected unless they explicitly switch the search mode.

Strategy/orchestrator pattern: providers opt in to native hybrid via capabilities(); orchestrator falls back to a universal app-side BM25 strategy for providers without native support.
Native path: LanceDB (FTS index + vector, fused via Reciprocal Rank Fusion).
Universal fallback: works against any provider — Chroma, Pinecone, Qdrant, Weaviate, Milvus, PGVector, AstraDB, ChromaCloud, Zilliz all gain hybrid retrieval with no per-provider work.
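The strategy selection described above can be sketched roughly as follows. This is a minimal illustration, not the PR's orchestrator: the `nativeHybrid` flag name and the `selectStrategy` helper are hypothetical, and only `capabilities()` and `performHybridSearch` are names taken from this PR.

```javascript
// Hypothetical sketch of native-vs-fallback dispatch.
// Assumption: capabilities() returns an object with a boolean `nativeHybrid` flag.
function selectStrategy(provider) {
  const caps =
    typeof provider.capabilities === "function" ? provider.capabilities() : {};
  const hasNative =
    caps.nativeHybrid === true &&
    typeof provider.performHybridSearch === "function";
  // Providers that do not advertise native hybrid use the app-side BM25 path.
  return hasNative ? "native" : "appSideBM25";
}

// Stand-in providers used only for illustration.
const lanceLike = {
  capabilities: () => ({ nativeHybrid: true }),
  performHybridSearch: async () => ({ sources: [], contextTexts: [] }),
};
const chromaLike = {
  performSimilaritySearch: async () => ({ sources: [], contextTexts: [] }),
};

console.log(selectStrategy(lanceLike), selectStrategy(chromaLike));
```

Keeping the decision in one place means a provider gains native hybrid simply by advertising the capability and implementing the method.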

Workspace settings

New vectorSearchMode option: "hybrid" (existing modes default and rerank are unchanged).
New hybridSearchAlpha (0..1) — weights semantic vs keyword. UI slider appears when Hybrid is selected.
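To make the effect of hybridSearchAlpha concrete, here is a small sketch of alpha-weighted fusion over min-max normalized scores. It assumes entries shaped like `{ id, score }`; the function names are illustrative and the PR's actual weightedFusion may handle ties and edge cases differently.

```javascript
// Min-max normalize a ranked list into a Map of id -> value in [0, 1].
function minMax(list) {
  if (list.length === 0) return new Map();
  const scores = list.map((r) => r.score);
  const lo = Math.min(...scores);
  const hi = Math.max(...scores);
  // Assumption: when every score is equal, treat each entry as 1.
  return new Map(
    list.map((r) => [r.id, hi === lo ? 1 : (r.score - lo) / (hi - lo)])
  );
}

// alpha = 1 is pure semantic, alpha = 0 is pure keyword.
function weightedFusionSketch(dense, sparse, { alpha = 0.5 } = {}) {
  const d = minMax(dense);
  const s = minMax(sparse);
  const ids = new Set([...d.keys(), ...s.keys()]);
  return [...ids]
    .map((id) => ({
      id,
      score: alpha * (d.get(id) ?? 0) + (1 - alpha) * (s.get(id) ?? 0),
    }))
    .sort((a, b) => b.score - a.score);
}

const dense = [{ id: "a", score: 0.9 }, { id: "b", score: 0.5 }];
const sparse = [{ id: "b", score: 12 }, { id: "c", score: 3 }];
console.log(weightedFusionSketch(dense, sparse, { alpha: 1 })[0].id);
console.log(weightedFusionSketch(dense, sparse, { alpha: 0 })[0].id);
```

Normalizing first is what lets cosine-style scores and unbounded BM25 scores be mixed on one scale.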

Production hardening

Env-driven config (11 knobs prefixed HYBRID_SEARCH_*): pool sizing, fusion algorithm, RRF k, BM25 k1/b, cache size + TTL, log level, telemetry toggle.
BM25 LRU+TTL cache keyed on namespace + FNV-1a hash of candidate pool IDs. LanceDB invalidates on document add/delete; others rely on TTL.
Smart tokenizer — preserves URLs, snake/kebab/camel/Pascal identifiers, version strings (v2.1.0-beta), and long alphanumeric IDs (UUIDs, hashes).
Multilingual stopwords for en, es, fr, de, it, pt, nl, ru, zh, ja, ar — merged single set avoids language detection at query time.
Score calibration — sources carry score, hybridScore, denseScore, and sparseRank as separate fields. similarityThreshold filters on dense score, not the rank-based fused score, preserving existing operator intent.
Typed errors — HybridSearchError with code, context, and cause (Node-native error chaining). Structured logger with level filtering. Non-blocking telemetry events: hybrid_search_executed and hybrid_search_failed.
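A typed error with code, context, and cause could look roughly like the sketch below. The error code string and the `wrapProviderError` helper are assumptions made for illustration; only the class name and its three fields come from this PR. It relies on the `cause` option of the `Error` constructor (Node 16.9+).

```javascript
class HybridSearchError extends Error {
  constructor(message, { code = "HYBRID_SEARCH_FAILED", context = {}, cause } = {}) {
    // Pass `cause` through the native Error options so error chaining works.
    super(message, cause === undefined ? undefined : { cause });
    this.name = "HybridSearchError";
    this.code = code; // hypothetical machine-readable code
    this.context = context; // e.g. namespace, provider name
  }
}

// Hypothetical helper: wrap a provider failure without losing the original error.
function wrapProviderError(err, context) {
  if (err instanceof HybridSearchError) return err;
  return new HybridSearchError("Hybrid search failed", {
    code: "PROVIDER_ERROR",
    context,
    cause: err,
  });
}

const root = new Error("connection refused");
const wrapped = wrapProviderError(root, { namespace: "docs" });
console.log(wrapped.code, wrapped.cause === root);
```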

Schema

Adds workspaces.hybridSearchAlpha (REAL, default 0.5). Additive migration — no data backfill, safe to roll back.

Test plan

49 unit tests across 5 suites, all passing:

RRF fusion — single list ordering, multi-list boost, weight handling, k constant effect
Weighted fusion — alpha extremes, normalization across wildly different score scales, empty inputs
BM25 — unmatched queries, empty corpus, multi-term ranking, item reference preservation
Tokenizer — URL/identifier/version/long-ID preservation, multilingual stopword removal, unicode handling, custom stopword override
BM25 cache — order-invariant keys, namespace isolation, hit/miss/invalidation, hit rate stats
Orchestrator — strategy dispatch, native delegation, app-side fallback fusion, pool expansion semantics, post-fusion threshold, score field annotation, cache reuse across calls
Error handling — provider failures wrapped with context, original error preserved via cause, native path errors wrapped consistently
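The RRF behaviours listed above (single-list ordering, multi-list boost, weights, the k constant) can be illustrated with a compact sketch. It follows the call shape that appears in this PR's diff (`lists`, `weights`, `k`), but the body is an assumption, and the default k = 60 is the value commonly used for RRF rather than a confirmed default of this PR.

```javascript
// Reciprocal Rank Fusion: score(d) = sum over lists of weight / (k + rank),
// where rank is the 1-based position of d in that list.
function rrfSketch(lists, { weights = lists.map(() => 1), k = 60 } = {}) {
  const fused = new Map();
  lists.forEach((list, li) => {
    list.forEach((entry, idx) => {
      const acc =
        fused.get(entry.id) ?? { id: entry.id, item: entry.item, score: 0 };
      acc.score += weights[li] / (k + idx + 1);
      fused.set(entry.id, acc);
    });
  });
  return [...fused.values()].sort((a, b) => b.score - a.score);
}

const denseList = [{ id: "a" }, { id: "b" }, { id: "c" }];
const sparseList = [{ id: "b" }, { id: "d" }];

// "b" appears in both lists, so it is boosted above "a".
console.log(rrfSketch([denseList, sparseList]).map((r) => r.id));
// With the sparse list weighted to zero, the dense order is preserved.
console.log(rrfSketch([denseList, sparseList], { weights: [1, 0] }).map((r) => r.id));
```

Because RRF only looks at ranks, the fused score is not comparable to a similarity score, which is why the threshold is applied to the dense score instead.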

To verify locally:

cd server && yarn install && NODE_ENV=test npx jest __tests__/utils/HybridSearch/
cd server && yarn prisma:generate && yarn prisma:migrate to apply the additive migration
Open Workspace Settings → Vector Database → set Search Preference to Hybrid (Keyword + Semantic) and adjust the slider; verify retrieval pulls in keyword matches that pure semantic search misses

Architecture docs

server/utils/HybridSearch/README.md covers strategy contract, fusion semantics, config knobs, cache invalidation expectations, and how to add native hybrid for a new provider.

Follow-ups (intentionally out of scope)

Native hybrid implementations for Weaviate (hybrid operator), Qdrant (sparse + Query API), Milvus/Zilliz (BM25 sparse + dense), PGVector (tsvector + ts_rank), AstraDB ($hybrid), and Pinecone (sparse-dense index; needs migration UX).
Cache invalidation hooks for non-LanceDB providers (TTL is the safety net today).
Shadow-mode rollout / A-B comparison tooling.
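For readers unfamiliar with why TTL works as a safety net, here is a minimal LRU+TTL cache sketch with a namespace invalidation hook. The class name, the `namespace:hash` key layout, and the default sizes are assumptions for illustration and are not taken from cache.js.

```javascript
class LruTtlCache {
  constructor({ maxEntries = 64, ttlMs = 60000, now = Date.now } = {}) {
    this.maxEntries = maxEntries;
    this.ttlMs = ttlMs;
    this.now = now; // injectable clock, handy for tests
    this.map = new Map(); // Map preserves insertion order, oldest first
  }
  get(key) {
    const hit = this.map.get(key);
    if (!hit) return undefined;
    if (this.now() - hit.at > this.ttlMs) {
      this.map.delete(key); // expired: stale entries age out on their own
      return undefined;
    }
    this.map.delete(key); // re-insert to mark as most recently used
    this.map.set(key, hit);
    return hit.value;
  }
  set(key, value) {
    this.map.delete(key);
    this.map.set(key, { value, at: this.now() });
    if (this.map.size > this.maxEntries) {
      this.map.delete(this.map.keys().next().value); // evict least recently used
    }
  }
  // Explicit hook a provider could call on document add/delete.
  invalidateNamespace(namespace) {
    for (const key of [...this.map.keys()]) {
      if (key.startsWith(`${namespace}:`)) this.map.delete(key);
    }
  }
}
```

Providers without an invalidation hook would simply never call `invalidateNamespace`, so a stale index lives at most one TTL window.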

Adds opt-in hybrid retrieval combining BM25 keyword scoring with semantic vector similarity. Resolves the request for hybrid search support in Mintplex-Labs#4338.

Architecture
- Strategy/orchestrator pattern: providers opt in to native hybrid via capabilities(); orchestrator falls back to a universal app-side BM25 strategy for providers without native support.
- Native path implemented for LanceDB (FTS index + vector, fused via RRF).
- Universal fallback works against any provider; no per-provider work is required to enable hybrid search across Chroma, Pinecone, Qdrant, Weaviate, Milvus, PGVector, AstraDB, ChromaCloud, Zilliz.

Workspace settings
- New vectorSearchMode option: "hybrid" (existing: default, rerank).
- New hybridSearchAlpha (0..1) weights semantic vs keyword. UI slider appears when Hybrid is selected.
- Backward compatible: existing workspaces stay on "default", no behavior change unless opted in.

Production hardening
- 11 env-overridable knobs (HYBRID_SEARCH_*): pool sizing, fusion choice, RRF k, BM25 k1/b, cache size + TTL, log level, telemetry toggle.
- BM25 LRU+TTL cache keyed on namespace + FNV-1a hash of candidate pool. LanceDB invalidates on document add/delete; others rely on TTL.
- Custom tokenizer preserves URLs, snake/kebab/camel/Pascal identifiers, version strings (v2.1.0-beta), and long alphanumeric IDs.
- Multilingual stopwords (en, es, fr, de, it, pt, nl, ru, zh, ja, ar).
- Score calibration: sources carry score, hybridScore, denseScore, and sparseRank as separate fields. similarityThreshold filters on dense score, not the rank-based fused score.
- Typed HybridSearchError with code/context/cause; structured logger; non-blocking telemetry events (hybrid_search_executed/_failed).

Schema
- Adds workspaces.hybridSearchAlpha (REAL, default 0.5). Additive migration; no data backfill.

Tests
- 49 unit tests across 5 suites: RRF + weighted fusion math, BM25 scoring + tokenization, multilingual stopwords, identifier preservation, orchestrator strategy dispatch, error wrapping, cache hit/miss semantics, namespace invalidation.

Follow-ups intentionally out of scope
- Native hybrid implementations for Weaviate, Qdrant, Milvus, PGVector, AstraDB, Pinecone sparse-dense.
- Cache invalidation hooks for non-LanceDB providers (TTL is the safety net today).
- Shadow-mode rollout / A-B comparison tooling.

Copilot

Pull request overview

Adds opt-in hybrid (BM25 keyword + vector) retrieval as a new vectorSearchMode = "hybrid" workspace option. Introduces a strategy/orchestrator module (server/utils/HybridSearch/*) that routes to a provider's native hybrid implementation when available (LanceDB) and otherwise to a universal app-side BM25 fallback over an expanded dense-search pool. All chat handlers are migrated from calling VectorDb.performSimilaritySearch directly to searchWorkspace(VectorDb, workspace, …).

Changes:

New HybridSearch module: RRF + weighted fusion, custom tokenizer/BM25, LRU+TTL BM25 cache, typed errors, structured logger, telemetry, env-driven config.
LanceDB gains a native performHybridSearch (FTS + vector fused via RRF) plus capabilities(); cache is invalidated on add/delete.
Schema/migration add workspaces.hybridSearchAlpha (REAL, default 0.5), workspace model exposes it as writable with clamp validation, and the workspace settings UI gets a slider when hybrid is selected.
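The clamp validation mentioned above might look something like this sketch. The function name and the choice to fall back to 0.5 on unparseable input are assumptions; the PR only states that the value is clamped and that the column default is 0.5.

```javascript
// Hypothetical validator: coerce to a number and clamp into [0, 1].
function clampHybridAlpha(value, fallback = 0.5) {
  const n = Number.parseFloat(value);
  if (!Number.isFinite(n)) return fallback; // e.g. null, "", "abc"
  return Math.min(1, Math.max(0, n));
}

console.log(clampHybridAlpha("0.7"), clampHybridAlpha(5), clampHybridAlpha(-2));
```

Clamping on the server keeps stored values valid even when a request bypasses the UI slider.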

Reviewed changes

Copilot reviewed 32 out of 33 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
server/utils/HybridSearch/index.js	Orchestrator: strategy selection, error wrapping, telemetry emission.
server/utils/HybridSearch/dispatch.js	Single entry point reading `workspace.vectorSearchMode`.
server/utils/HybridSearch/config.js	Env-driven runtime config getters.
server/utils/HybridSearch/cache.js	Singleton LRU+TTL cache for BM25 indexes keyed by namespace + FNV1a of IDs.
server/utils/HybridSearch/errors.js	`HybridSearchError` with code/context/cause.
server/utils/HybridSearch/logger.js	Minimal level-filtered logger writing via `console.log`.
server/utils/HybridSearch/telemetry.js	Lazy adapter to Telemetry model with fallback.
server/utils/HybridSearch/fusion/rrf.js	Reciprocal Rank Fusion.
server/utils/HybridSearch/fusion/weighted.js	Min-max normalized weighted fusion.
server/utils/HybridSearch/strategies/native.js	Delegates to provider's `performHybridSearch`.
server/utils/HybridSearch/strategies/appSideBM25.js	Universal hybrid fallback over expanded dense pool.
server/utils/HybridSearch/tokenizers/bm25.js	Smart tokenizer + BM25 scoring.
server/utils/HybridSearch/tokenizers/stopwords.js	Multilingual stopword set.
server/utils/HybridSearch/README.md	Architecture, config, and extension docs.
server/utils/vectorDbProviders/base.js	Adds `capabilities()` + default `performHybridSearch` throw.
server/utils/vectorDbProviders/lance/index.js	Native hybrid implementation, FTS index ensure, cache invalidation hooks.
server/utils/chats/{stream,openaiCompatible,embed,apiChatHandler}.js, telegramBot/chat/stream.js, agents/aibitat/plugins/memory.js	Migrated retrieval calls to `searchWorkspace`.
server/endpoints/api/workspace/index.js	API search uses `searchWorkspace`.
server/models/workspace.js	Adds `hybridSearchAlpha` writable + clamped validator; allows `hybrid` mode.
server/prisma/schema.prisma, server/prisma/migrations/20260517120000_hybrid_search/migration.sql	Additive column with default 0.5.
frontend/src/utils/types.js	Casts `hybridSearchAlpha` to float.
frontend/src/pages/WorkspaceSettings/.../VectorSearchMode/index.jsx	Adds hybrid option + alpha slider; relaxes provider gating.
server/__tests__/utils/HybridSearch/{bm25,cache,fusion,orchestrator,tokenizer}.test.js	New unit suites.

Files not reviewed (1)

server/utils/HybridSearch/tokenizers/stopwords.js: Language not supported

Comments suppressed due to low confidence (3)

server/utils/HybridSearch/strategies/appSideBM25.js:110

When metadata?.score is not a number (which is the case for several providers — e.g. Pinecone/Qdrant/Weaviate paths normalize differently, and any source without a numeric score in metadata), denseScore is set to null. The later filter if (dScore !== null && dScore < similarityThreshold) continue; then silently bypasses the operator's similarityThreshold entirely. That means switching to hybrid mode can effectively disable the threshold filter on any provider that does not embed a score field in source metadata, which contradicts the PR description ("similarityThreshold filters on dense score, not the rank-based fused score, preserving existing operator intent"). Consider treating missing denseScore as something that still gets compared (e.g. by relying on the dense search's own threshold filtering, or by inheriting the score from the upstream pool entry).

    const denseScore =
      typeof metadata?.score === "number" ? metadata.score : null;
    return {
      id: String(id),
      text,
      denseScore,
      denseRank: idx,
      source: src,
      contextText: dense.contextTexts[idx] ?? text,
    };
  });

  const denseRanked = pool.map((p) => ({
    id: p.id,
    score: p.denseScore ?? 1 - p.denseRank / pool.length,
    item: p,
  }));

  const bm25Build0 = Date.now();
  const bm25 = bm25Cache.getOrBuild(
    namespace,
    pool.map((p) => ({ id: p.id, text: p.text, item: p }))
  );
  const bm25BuildMs = Date.now() - bm25Build0;
  const sparseRanked = bm25.score(input);
  const sparseRankMap = new Map(sparseRanked.map((r, i) => [r.id, i]));

  const fuse0 = Date.now();
  const fused =
    fusion === "weighted"
      ? weightedFusion(denseRanked, sparseRanked, { alpha: hybridAlpha })
      : reciprocalRankFusion([denseRanked, sparseRanked], {
          weights: [hybridAlpha, 1 - hybridAlpha],
          k: config.rrfK,
        });
  const fuseMs = Date.now() - fuse0;

  const survivors = [];
  for (const { item, score } of fused) {
    const dScore = item.denseScore;
    if (dScore !== null && dScore < similarityThreshold) continue;
    survivors.push({ item, score });
    if (survivors.length >= topN) break;
  }
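The bypass described in this comment can be reproduced in isolation. The first function mirrors the condition quoted above; the second is a purely hypothetical stricter variant, shown only to make the trade-off visible, and is not a fix proposed in the PR.

```javascript
// Mirrors the reviewed condition: a null dense score skips the threshold check.
function passesAsReviewed(denseScore, similarityThreshold) {
  return !(denseScore !== null && denseScore < similarityThreshold);
}

// Hypothetical alternative: when a threshold is set, unknown scores are rejected.
function passesStrict(denseScore, similarityThreshold) {
  if (similarityThreshold <= 0) return true;
  return typeof denseScore === "number" && denseScore >= similarityThreshold;
}

console.log(passesAsReviewed(null, 0.75)); // the unscored chunk survives
console.log(passesStrict(null, 0.75)); // the unscored chunk is dropped
```

The strict variant would return nothing for providers that never report a score, so inheriting a score from the upstream dense search, as the comment suggests, may be the more practical direction.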

server/utils/HybridSearch/strategies/appSideBM25.js:89

When metadata.id/metadata.vectorId are absent, the pool falls back to pool-${idx} as an identity. This makes the BM25 cache key (which is derived from the sorted set of pool IDs in cache.js#buildKey) non-deterministic across calls: the same content in a different dense-rank order will produce the same key but bind different documents to pool-0, pool-1, ... So a cache hit will return a BM25 index whose id-to-document mapping no longer matches the current pool's documents, producing wrong rankings. Either skip caching when synthetic IDs are used, or derive a content-stable id (e.g. hash of text).

  const pool = dense.sources.map((src, idx) => {
    const metadata = src?.metadata ?? src;
    const text = metadata?.text ?? dense.contextTexts[idx] ?? "";
    const id = metadata?.id ?? metadata?.vectorId ?? `pool-${idx}`;
    const denseScore =
      typeof metadata?.score === "number" ? metadata.score : null;
    return {
      id: String(id),
      text,
      denseScore,
      denseRank: idx,
      source: src,
      contextText: dense.contextTexts[idx] ?? text,
    };
  });

  const denseRanked = pool.map((p) => ({
    id: p.id,
    score: p.denseScore ?? 1 - p.denseRank / pool.length,
    item: p,
  }));

  const bm25Build0 = Date.now();
  const bm25 = bm25Cache.getOrBuild(
    namespace,
    pool.map((p) => ({ id: p.id, text: p.text, item: p }))
  );
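One way to derive the content-stable id suggested in this comment is to hash the chunk text when no explicit id exists. The sketch below uses a 32-bit FNV-1a over UTF-16 code units (identical to the byte-wise hash for ASCII input); `stablePoolId` and the `text-` prefix are hypothetical names, not code from the PR.

```javascript
// 32-bit FNV-1a, returned as 8 hex characters.
function fnv1a32(str) {
  let h = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits
  }
  return h.toString(16).padStart(8, "0");
}

// Prefer provider ids; otherwise fall back to a hash of the text
// instead of the position-dependent `pool-${idx}`.
function stablePoolId(metadata, text) {
  const explicit = metadata?.id ?? metadata?.vectorId;
  return explicit != null ? String(explicit) : `text-${fnv1a32(text)}`;
}

console.log(stablePoolId({}, "hello world"));
console.log(stablePoolId({ vectorId: "abc-123" }, "hello world"));
```

With this fallback the same chunk maps to the same id regardless of its dense rank, at the cost of a small collision risk inherent to a 32-bit hash.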

server/utils/HybridSearch/strategies/appSideBM25.js:58

For non-native-hybrid providers (Pinecone, Qdrant, Weaviate, Chroma, etc.), the app-side fallback issues a performSimilaritySearch with similarityThreshold: 0 and a pool of up to 100 (default) chunks. Most of those providers do an embedding round-trip per query — so flipping a workspace into hybrid mode silently widens every query's result set 5×–25× and may surface chunks the provider was previously hiding (irrelevant content, sensitive material previously below threshold). Worth either documenting this behavior explicitly in the workspace settings UI hint, or noting in the README that operators should re-validate their threshold semantics when enabling hybrid on non-LanceDB providers.

  const t0 = Date.now();
  const dense = await provider.performSimilaritySearch({
    namespace,
    input,
    LLMConnector,
    similarityThreshold: 0,
    topN: candidateTopN,
    filterIdentifiers,
  });
  const denseMs = Date.now() - t0;


+      });
+    } catch (e) {
+      // Index already exists or column missing — both are tolerable here.
+      if (!/already exists|exist/i.test(e?.message || "")) {


+  if (!namespace || !input || !LLMConnector)
+    throw new Error("Invalid request to appSideBM25Strategy.");
+
+  const candidateTopN = Math.min(poolMax, Math.max(topN, topN * poolMultiplier));
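A worked example of the pool sizing expression above. The pool maximum of 100 matches the default mentioned in the review; the multiplier of 25 is an assumption chosen for illustration, since the PR text does not state its default.

```javascript
// Same expression as in the diff, extracted into a standalone function.
function candidatePoolSize(topN, { poolMultiplier, poolMax }) {
  return Math.min(poolMax, Math.max(topN, topN * poolMultiplier));
}

const assumed = { poolMultiplier: 25, poolMax: 100 }; // multiplier is hypothetical

console.log(candidatePoolSize(4, assumed)); // 4 * 25 = 100, a 25x wider pool
console.log(candidatePoolSize(20, assumed)); // capped at 100, a 5x wider pool
console.log(candidatePoolSize(200, assumed)); // capped at 100, below topN itself
```

The last case shows that when topN exceeds poolMax the candidate pool is smaller than the requested topN, so fewer than topN results can come back.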


+    const results = new Array(this.docs.length);
+    for (let i = 0; i < this.docs.length; i++) {
+      let score = 0;
+      const dl = this.docLengths[i];
+      const tf = this.termFreqs[i];
+      const lenNorm = 1 - this.b + this.b * (dl / this.avgDocLength);
+      for (const term of qTokens) {
+        const f = tf.get(term);
+        if (!f) continue;
+        const df = this.docFreqs.get(term) || 0;
+        const idf = Math.log(1 + (this.N - df + 0.5) / (df + 0.5));
+        const denom = f + this.k1 * lenNorm;
+        score += idf * ((f * (this.k1 + 1)) / denom);
+      }
+      results[i] = { id: this.docs[i].id, score, item: this.docs[i].item };
+    }
+
+    return results.filter((r) => r.score > 0).sort((a, b) => b.score - a.score);
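Read as a formula, the scoring loop above appears to be the usual Okapi BM25 form with a smoothed IDF. The symbols are shorthand introduced here: f_{t,d} is `f`, |d| is `dl`, avgdl is `this.avgDocLength`, N is `this.N`, and df_t is `df`.

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t)\,
  \frac{f_{t,d}\,(k_1 + 1)}
       {f_{t,d} + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - \mathrm{df}_t + 0.5}{\mathrm{df}_t + 0.5}\right)
```

Terms absent from a document contribute nothing, and documents whose total is zero are dropped by the final filter, so the sparse list contains only documents with at least one matching term.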


+  for (const t of preserved) if (t.length >= minLength) out.push(t);
+  for (const t of rest) {
+    if (t.length < minLength) continue;
+    if (stopwords.has(t)) continue;
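The two loops above suggest a two-pass tokenizer: protected tokens are extracted first, then the remainder is split and stopword-filtered. The sketch below follows that shape under stated assumptions: the regular expression, the tiny stopword set, and the function name are illustrative, and it covers only URLs, version strings, and snake/kebab identifiers.

```javascript
// Assumption: one combined pattern for URLs, versions, and snake/kebab identifiers.
const PRESERVE =
  /https?:\/\/\S+|v?\d+\.\d+(?:\.\d+)*(?:-[a-z0-9.]+)?|[a-z0-9]+(?:[_-][a-z0-9]+)+/gi;

function tokenizeSketch(
  text,
  { stopwords = new Set(["the", "a", "is", "to"]), minLength = 2 } = {}
) {
  const preserved = [];
  // Pass 1: pull out protected tokens whole and blank them in the remainder.
  const remainder = text.replace(PRESERVE, (match) => {
    preserved.push(match.toLowerCase());
    return " ";
  });
  // Pass 2: split the remainder on anything that is not a letter or digit.
  const rest = remainder.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);

  const out = [];
  for (const t of preserved) if (t.length >= minLength) out.push(t);
  for (const t of rest) {
    if (t.length < minLength) continue;
    if (stopwords.has(t)) continue;
    out.push(t);
  }
  return out;
}

console.log(
  tokenizeSketch("Upgrade to v2.1.0-beta via https://example.com/docs using snake_case_name")
);
```

Preserved tokens skip the stopword check, which matches the ordering in the diff, where only the `rest` loop consults the stopword set.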


+      {selection === "hybrid" && (
+        <div className="mt-3">
+          <label htmlFor="hybridSearchAlpha" className="block input-label">
+            Hybrid Weight (semantic ↔ keyword)
+          </label>
+          <div className="flex items-center gap-3 mt-2">
+            <span className="text-white text-opacity-60 text-xs w-20">
+              Keyword
+            </span>
+            <input
+              type="range"
+              min={0}
+              max={1}
+              step={0.05}
+              name="hybridSearchAlpha"
+              value={alpha}
+              onChange={(e) => {
+                setAlpha(parseFloat(e.target.value));
+                setHasChanges(true);
+              }}
+              className="flex-1"
+            />
+            <span className="text-white text-opacity-60 text-xs w-20 text-right">
+              Semantic
+            </span>
+            <span className="text-white text-sm w-12 text-right">
+              {alpha.toFixed(2)}
+            </span>
+          </div>
+          <p className="text-white text-opacity-60 text-xs font-medium py-1.5">
+            0 = pure keyword (BM25). 1 = pure semantic (vector). 0.5 = balanced.
+          </p>
+        </div>
+      )}


+    const fused = reciprocalRankFusion([denseRanked, sparseRanked], {
+      weights: [hybridAlpha, 1 - hybridAlpha],
+      k: hybridConfig.rrfK,
+    });


+  if (Object.keys(fields).length === 0) {
+    console.log(`${tag} ${message}`);
+  } else {
+    console.log(`${tag} ${message}`, fields);


- ensureFTSIndex was passing {config: {type: "fts"}}, which the SDK silently accepts but which creates a non-inverted index. fullTextSearch then fails at query time with "Index is not an inverted index". Use the typed factory lancedb.Index.fts() instead. Verified end-to-end against a real LanceDB table; the hybrid path now returns calibrated dense + sparse + fused scores.
- Apply prettier formatting flagged by eslint --fix across HybridSearch modules and stopwords list (no semantic changes).

Copilot AI review requested due to automatic review settings May 17, 2026 05:08

Copilot started reviewing on behalf of Saswatsusmoy May 17, 2026 05:09

Copilot AI reviewed May 17, 2026


timothycarambat added the blocked label May 18, 2026

Saswatsusmoy wants to merge 2 commits into Mintplex-Labs:master from Saswatsusmoy:feat/hybrid-search-rag

3 participants
