This document is the technical reference for the multi-analyzer indexing pipeline and the multi-field weighted query pipeline used by the MCP Lucene Server.
For general feature overview and user-facing documentation, see README.md.
The MCP Lucene Server uses a multi-analyzer pipeline for indexing and a multi-field weighted query pipeline for searching. Documents are indexed with multiple shadow fields using different analyzers to optimize different search patterns (exact matching, leading wildcards, lemmatization, transliteration). At query time, searches combine these fields using weighted OR queries to balance precision and recall.
Before indexing, each document passes through the extraction pipeline:
- Apache Tika extraction: Tika extracts raw text content and metadata from PDFs, Office documents, and other supported formats
- Content normalization: The extracted text undergoes multiple normalization steps:
  - NFKC Unicode normalization (full-width characters mapped to standard equivalents)
  - Ligature expansion (PDF ligatures such as ﬁ and ﬂ expanded to separate characters)
  - Removal of invalid characters (U+FFFD replacement character, zero-width characters, control characters)
  - Whitespace normalization (multiple spaces collapsed to a single space)
- Language detection: Tika's OptimaizeLangDetector identifies the document language (ISO 639-1 code: en, de, fr, etc.)
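The normalization steps listed above can be approximated with the JDK alone. The following is a minimal sketch, not the server's actual code: the class name `ContentNormalizerSketch`, the exact set of zero-width characters, and the choice to keep tabs and newlines until the whitespace step are all assumptions made for illustration.

```java
import java.text.Normalizer;

// Hypothetical helper; names and exact character sets are illustrative assumptions.
final class ContentNormalizerSketch {

    static String normalize(String raw) {
        // NFKC maps full-width forms to their standard equivalents and expands
        // compatibility ligatures (for example the "ffi" ligature) to plain letters.
        String s = Normalizer.normalize(raw, Normalizer.Form.NFKC);
        // Drop the replacement character and common zero-width characters.
        s = s.replaceAll("[\\uFFFD\\u200B-\\u200D\\uFEFF]", "");
        // Drop ASCII control characters except tab and newline (assumed policy).
        s = s.replaceAll("[\\p{Cntrl}&&[^\\t\\n]]", "");
        // Collapse runs of spaces and tabs into a single space.
        s = s.replaceAll("[ \\t]+", " ");
        return s.trim();
    }
}
```

Because NFKC already expands ligatures, the sketch needs no separate ligature table; a real pipeline may still want one for glyphs that NFKC does not cover.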
Each document is indexed with the following fields:
| Field | Analyzer | Stored | Term Vectors | Faceted | Purpose |
|---|---|---|---|---|---|
| `file_path` | None (StringField) | Yes | No | No | Unique document ID |
| `file_name` | UnicodeNormalizingAnalyzer | Yes | No | No | Searchable filename |
| `content` | UnicodeNormalizingAnalyzer | Yes | Yes (positions+offsets) | No | Primary search + highlighting |
| `content_reversed` | ReverseUnicodeNormalizingAnalyzer | No | No | No | Leading wildcard optimization |
| `content_lemma_de` | OpenNLPLemmatizingAnalyzer (de) | No | No | No | German lemmatization |
| `content_lemma_en` | OpenNLPLemmatizingAnalyzer (en) | No | No | No | English lemmatization |
| `content_translit_de` | GermanTransliteratingAnalyzer | No | No | No | Umlaut digraph transliteration |
| `file_extension` | None (StringField) | Yes | No | Yes | Facet + filter |
| `file_type` | None (StringField) | Yes | No | Yes | Facet + filter |
| `language` | None (StringField) | Yes | No | Yes | Facet + filter |
| `author` | UnicodeNormalizingAnalyzer (TextField) | Yes | No | Yes | Facet + filter |
| `title` | UnicodeNormalizingAnalyzer | Yes | No | No | Searchable |
| `keywords` | UnicodeNormalizingAnalyzer | Yes | No | No | Searchable |
| `creator` | UnicodeNormalizingAnalyzer | Yes | No | No | Searchable |
| `subject` | UnicodeNormalizingAnalyzer | Yes | No | No | Searchable |
| `file_size` | LongPoint | Yes (StoredField) | No | No | Range queries |
| `created_date` | LongPoint | Yes (StoredField) | No | No | Range queries |
| `modified_date` | LongPoint | Yes (StoredField) | No | No | Range queries |
| `indexed_date` | LongPoint | Yes (StoredField) | No | No | Range queries |
| `content_hash` | None (StringField) | Yes | No | No | Change detection |
Important: ALL content-based shadow fields (content_reversed, content_lemma_de, content_lemma_en, content_translit_de) are indexed for ALL documents regardless of detected language. This enables mixed-language content matching and robust cross-language search.
Each analyzer applies a specific token processing chain to the input text.
Used for: content, file_name, title, keywords, author, creator, subject
Token chain:
StandardTokenizer → LowerCaseFilter → ICUFoldingFilter
What it does:
- `StandardTokenizer`: Splits text into tokens using Unicode text segmentation rules
- `LowerCaseFilter`: Converts all tokens to lowercase
- `ICUFoldingFilter`: Applies Unicode normalization (NFKC), diacritic folding, and ligature expansion
Examples:
| Input | Output Tokens |
|---|---|
| "Der Vertrag wurde unterschrieben." | [der, vertrag, wurde, unterschrieben] |
| "Müller & Partner GmbH" | [muller, partner, gmbh] (umlauts folded, & stripped by StandardTokenizer) |
| "file_résumé.pdf" | [file_resume, pdf] (diacritics removed) |
| "café" | [cafe] (accent removed) |
| "naïve" | [naive] (diaeresis removed) |
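To make the lowercase-and-fold behaviour in the table concrete, here is a rough JDK-only approximation. It is a sketch under stated assumptions: it does not use Lucene or ICU, the class `FoldingSketch` is hypothetical, decomposition plus mark stripping stands in for `ICUFoldingFilter`, and a simple letter/digit/underscore regex stands in for `StandardTokenizer`, so edge cases will differ from the real analyzer.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the analyzer chain; not the Lucene implementation.
final class FoldingSketch {
    // Assumed token shape: runs of letters, digits, and underscores.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}_]+");

    // Decompose, strip combining marks (diacritics), then lowercase.
    static String fold(String text) {
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFKD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    // Fold the whole input first, then extract tokens.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(fold(text));
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }
}
```

Calling `FoldingSketch.analyze` on the inputs in the table yields the same token lists for these examples, which is the only claim the sketch makes.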
Used for: content_reversed
Token chain:
StandardTokenizer → LowerCaseFilter → ICUFoldingFilter → ReverseStringFilter
What it does:
- Applies the same normalization as UnicodeNormalizingAnalyzer, then reverses each token
Examples:
| Input | Output Tokens |
|---|---|
| "Arbeitsvertrag" | [gartrevstiebra] |
| "Kaufvertrag" | [gartrevfuak] |
| "Mietvertrag" | [gartrevteim] |
Why it exists: Enables efficient leading wildcard queries. *vertrag is internally rewritten as gartrev* on the content_reversed field, avoiding costly full-index scans.
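The reason a reversed field helps can be shown with a sorted set standing in for Lucene's term dictionary. This is an illustrative sketch only: `ReversedTermDictionarySketch` is a hypothetical class, and a `TreeSet` range scan is used as a simplified model of a prefix lookup over sorted terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical model of a term dictionary holding reversed tokens.
final class ReversedTermDictionarySketch {
    private final TreeSet<String> reversedTerms = new TreeSet<>();

    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    // Index time: store the lowercased token in reversed form.
    void add(String token) {
        reversedTerms.add(reverse(token.toLowerCase(Locale.ROOT)));
    }

    // Query time: a suffix search becomes a prefix range scan on reversed terms.
    List<String> endingWith(String suffix) {
        String prefix = reverse(suffix.toLowerCase(Locale.ROOT));
        SortedSet<String> range = reversedTerms.subSet(prefix, prefix + Character.MAX_VALUE);
        List<String> hits = new ArrayList<>();
        for (String reversed : range) {
            hits.add(reverse(reversed));
        }
        return hits;
    }
}
```

After adding "Arbeitsvertrag", "Kaufvertrag", "Mietvertrag", and "Vertragsklausel", `endingWith("vertrag")` touches only the contiguous range of terms starting with `gartrev` and never visits `vertragsklausel`.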
Used for: content_lemma_de (German), content_lemma_en (English)
This analyzer has two modes with different tokenization strategies:
Sentence-aware mode (indexing): Uses OpenNLPTokenizer with sentence detection for accurate POS tagging on long texts
Token chain (indexing):
OpenNLPTokenizer(sentenceModel, tokenizerModel)
→ OpenNLPPOSFilter(posModel)
→ OpenNLPLemmatizerFilter(lemmatizerModel)
→ TypeTokenFilter(drop "." and "SYM")
→ CompoundLemmaSplittingFilter
→ LowerCaseFilter
→ ICUFoldingFilter
Simple mode (query time): Uses StandardTokenizer without sentence detection for short queries
Token chain (query time):
StandardTokenizer
→ OpenNLPPOSFilter(posModel)
→ OpenNLPLemmatizerFilter(lemmatizerModel)
→ TypeTokenFilter(drop "." and "SYM")
→ CompoundLemmaSplittingFilter
→ LowerCaseFilter
→ ICUFoldingFilter
Key pipeline details:
- `TypeTokenFilter`: Drops punctuation tokens (OpenNLP POS type `"."` = all punctuation) and symbol tokens (type `"SYM"` = `&` etc.). StandardTokenizer would strip these, but OpenNLPTokenizer retains them as separate tokens because it's trained on Universal Dependencies data.
- `CompoundLemmaSplittingFilter`: Splits German UD compound lemmas on `+`. German contractions like "im" get lemmatized to "in+der", "zum" to "zu+der", "beim" to "bei+der". This filter splits them into individual searchable tokens.
- LowerCaseFilter + ICUFoldingFilter applied AFTER lemmatization: The lemmatizer requires original casing and POS tags to work correctly, so normalization comes last.
- Independent caches per mode: Sentence-aware and query-time modes use separate lemmatizer caches because different POS tagging contexts can produce different lemmas for the same token.
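The compound-lemma split followed by late lowercasing can be sketched on plain strings. This is a simplified illustration, not the actual `TokenFilter`: the class `CompoundLemmaSplitSketch` is hypothetical, it operates on an already-lemmatized list, and it ignores token positions and offsets, which a real Lucene filter would have to maintain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical list-based model of splitting UD compound lemmas such as "in+der".
final class CompoundLemmaSplitSketch {

    static List<String> splitAndNormalize(List<String> lemmas) {
        List<String> out = new ArrayList<>();
        for (String lemma : lemmas) {
            // Split on '+' so each part becomes its own searchable token.
            for (String part : lemma.split("\\+")) {
                if (!part.isEmpty()) {
                    // Lowercasing happens only after lemmatization, as described above.
                    out.add(part.toLowerCase(Locale.ROOT));
                }
            }
        }
        return out;
    }
}
```

For the lemma sequence `[er, sein, in+der, Haus]` the sketch returns `[er, sein, in, der, haus]`, matching the "Er ist im Haus." example.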
German examples (sentence-aware mode):
| Input | Output Tokens |
|---|---|
| "Der Vertrag wurde unterschrieben." | [der, vertrag, werden, unterschreiben] |
| "Die Häuser sind renoviert." | [der, haus, sein, renovieren] |
| "Er ging nach Hause." | [er, gehen, nach, haus] |
| "Er ist im Haus." | [er, sein, in, der, haus] (compound "im"→"in+der" split) |
| "Forschung & Entwicklung sind wichtig." | [forschung, entwicklung, sein, wichtig] (& filtered as SYM) |
| "R&D Abteilung" | [r&d, abteilung] (embedded & preserved by tokenizer) |
English examples (sentence-aware mode):
| Input | Output Tokens |
|---|---|
| "The contracts were signed." | [the, contract, be, sign] |
| "She ran to the bus." | [she, run, to, the, bus] |
| "Multiple analyses were performed." | [multiple, analysis, be, perform] |
Query mode examples (single words):
| Input | Output Tokens |
|---|---|
| "Vertrages" | [vertrag] |
| "Häuser" | [haus] |
| "ran" | [run] |
| "paid" | [pay] |
Used for: content_translit_de
Token chain:
MappingCharFilter(ae→ä, oe→ö, ue→ü, Ae→Ä, Oe→Ö, Ue→Ü, AE→Ä, OE→Ö, UE→Ü)
→ StandardTokenizer
→ LowerCaseFilter
→ ICUFoldingFilter
What it does:
- `MappingCharFilter`: Maps ASCII digraphs to German umlauts before tokenization
- Then applies standard tokenization and normalization
Examples:
| Input | CharFilter Output | Final Tokens |
|---|---|---|
| "Mueller" | "Müller" | [muller] |
| "Müller" | "Müller" (no change) | [muller] |
| "Kaese" | "Käse" | [kase] |
| "Goethe" | "Göthe" | [gothe] |
| "blue" | "blü" (known false positive) | [blu] (acceptable at low boost) |
Why it exists: Handles the German convention of writing umlauts as ASCII digraphs. Enables queries like "Mueller" to match documents containing "Müller".
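The digraph mapping followed by folding can be reproduced on single tokens with the JDK. The sketch below is an approximation under assumptions: `GermanTranslitSketch` is a hypothetical class, sequential `String.replace` calls stand in for `MappingCharFilter`, and decomposition plus mark stripping stands in for `ICUFoldingFilter`.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical token-level model of the transliterating analyzer.
final class GermanTranslitSketch {
    // Digraph to umlaut mappings from the token chain above (umlauts written as escapes).
    private static final String[][] MAPPINGS = {
        {"ae", "\u00E4"}, {"oe", "\u00F6"}, {"ue", "\u00FC"},
        {"Ae", "\u00C4"}, {"Oe", "\u00D6"}, {"Ue", "\u00DC"},
        {"AE", "\u00C4"}, {"OE", "\u00D6"}, {"UE", "\u00DC"},
    };

    // Char-filter stage: replace ASCII digraphs with umlauts.
    static String transliterate(String text) {
        String s = text;
        for (String[] mapping : MAPPINGS) {
            s = s.replace(mapping[0], mapping[1]);
        }
        return s;
    }

    // Full chain for one token: transliterate, strip diacritics, lowercase.
    static String analyzeToken(String token) {
        String decomposed = Normalizer.normalize(transliterate(token), Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }
}
```

Both "Mueller" and "Müller" end up as `muller`, which is what lets the two spellings match; "blue" becoming `blu` shows the false positive noted in the table.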
Two PerFieldAnalyzerWrapper instances route fields to appropriate analyzers:
Index analyzer (sentence-aware OpenNLP for long texts):
Default analyzer: UnicodeNormalizingAnalyzer
Field routing:
content_reversed → ReverseUnicodeNormalizingAnalyzer
content_lemma_de → OpenNLPLemmatizingAnalyzer("de", sentenceAware=true)
content_lemma_en → OpenNLPLemmatizingAnalyzer("en", sentenceAware=true)
content_translit_de → GermanTransliteratingAnalyzer
(all other fields) → UnicodeNormalizingAnalyzer
Query analyzer (simple mode for short queries):
Default analyzer: UnicodeNormalizingAnalyzer
Field routing:
content_reversed → ReverseUnicodeNormalizingAnalyzer
content_lemma_de → OpenNLPLemmatizingAnalyzer("de", sentenceAware=false)
content_lemma_en → OpenNLPLemmatizingAnalyzer("en", sentenceAware=false)
content_translit_de → GermanTransliteratingAnalyzer
(all other fields) → UnicodeNormalizingAnalyzer
When a search request is received, it passes through the following steps:
ProximityExpandingQueryParser parses the query string targeting the content field. This parser extends Lucene's QueryParser with automatic phrase expansion and adaptive prefix scoring (see sections 3.3 and 3.4).
rewriteLeadingWildcards() normalizes and rewrites the parsed query:
- Lowercasing: Wildcard/prefix terms are lowercased for fields in `LOWERCASE_WILDCARD_FIELDS` (because Lucene's QueryParser does NOT apply the analyzer to wildcard terms, but the index stores lowercased tokens)
- Leading wildcard → reversed field: Queries on `content` field:
  - `*vertrag` → `WildcardQuery("content_reversed", "gartrev*")`
  - `*vertrag*` → `OR(content:*vertrag*, content_reversed:gartrev*)`
  - `vertrag*` → unchanged (already efficient)
- Recurses into BooleanQuery sub-clauses to rewrite nested wildcards
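The three rewrite cases above can be expressed as a small string-level function. This is a sketch rather than the real `rewriteLeadingWildcards()`: the class `LeadingWildcardRewriteSketch` is hypothetical, it returns a query string instead of Lucene `Query` objects, and it handles a single wildcard term on the `content` field without recursing into Boolean clauses.

```java
import java.util.Locale;

// Hypothetical string-level model of the leading-wildcard rewrite.
final class LeadingWildcardRewriteSketch {

    static String rewrite(String pattern) {
        // Wildcard terms bypass the analyzer, so lowercase them explicitly.
        String p = pattern.toLowerCase(Locale.ROOT);
        boolean leading = p.startsWith("*");
        boolean trailing = p.length() > 1 && p.endsWith("*");
        if (!leading) {
            // Trailing-only or plain term: already efficient, left on the content field.
            return "content:" + p;
        }
        // Strip the leading '*' (and the trailing one, if present), then reverse.
        String core = p.substring(1, trailing ? p.length() - 1 : p.length());
        String reversed = new StringBuilder(core).reverse() + "*";
        if (trailing) {
            // Infix pattern: OR of the forward pattern and the reversed prefix.
            return "(content:" + p + " OR content_reversed:" + reversed + ")";
        }
        return "content_reversed:" + reversed;
    }
}
```

For example, `rewrite("*Vertrag")` gives `content_reversed:gartrev*`, while `rewrite("vertrag*")` stays on the `content` field.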
buildStemmedQuery() creates a weighted OR query combining multiple fields:
BooleanQuery(minShouldMatch=1):
content query (boost 2.0) — SHOULD
content_lemma_de query (dynamic boost) — SHOULD
content_lemma_en query (dynamic boost) — SHOULD
content_translit_de query (boost 0.5) — SHOULD
Field boosts:
- The `content` query (unstemmed, exact) always gets the highest boost (2.0)
- Lemma field boosts are computed dynamically from the language distribution in the index: `boost = 0.3 + 0.7 * (langCount / totalDocs)`. If 80% of documents are German, the German lemma field gets boost ~0.86.
- The transliteration field always gets boost 0.5 regardless of language
- If a `language eq "xx"` filter is present, only that language's lemma field is included (at boost 1.0)
- Each shadow field query is also run through `rewriteLeadingWildcards()` for consistent normalization
A separate unstemmed query on the content field is used for highlighting. This ensures **bold** markers appear on the exact matching terms in the stored content, not on stemmed/lemmatized forms.
Filters are classified and applied:
- Positive facet filters → DrillSideways (shows alternative facet values)
- Negative filters → MUST_NOT clauses
- Range filters → LongPoint range queries as FILTER clauses
- String term filters → TermQuery FILTER clauses
- LongPoint exact filters → LongPoint.newExactQuery FILTER clauses
UnifiedHighlighter with IndividualPassageFormatter produces markdown bold (**term**) highlighted passages from the stored content field.
| Query Type | Syntax | Example | Notes |
|---|---|---|---|
| Simple terms | `word1 word2` | `contract signed` | Implicit AND between terms |
| Phrase query | `"exact phrase"` | `"signed contract"` | Preserves word order; auto-expanded (see 3.3) |
| Boolean AND | `term1 AND term2` | `contract AND payment` | Both terms required |
| Boolean OR | `term1 OR term2` | `contract OR agreement` | Either term matches |
| Boolean NOT | `NOT term` | `NOT draft` | Excludes documents with term |
| Trailing wildcard | `prefix*` | `contract*` | Matches contracts, contracting, etc. |
| Leading wildcard | `*suffix` | `*vertrag` | Optimized via reversed field |
| Infix wildcard | `*infix*` | `*vertrag*` | OR of forward + reversed |
| Single char wildcard | `te?t` | `te?t` | Matches test, text, etc. |
| Fuzzy search | `term~N` | `contract~2` | Levenshtein edit distance (default: 2) |
| Proximity search | `"term1 term2"~N` | `"contract signed"~5` | Terms within N words |
| Field-specific | `field:term` | `title:report` | Search specific field |
| Grouping | `(a OR b) AND c` | `(contract OR agreement) AND signed` | Logical grouping |
| Range (numeric) | `field:[from TO to]` | `modified_date:[1609459200000 TO *]` | Numeric range |
| Boost | `term^N` | `contract^3` | Boost term importance |
When you search for an exact multi-word phrase, the ProximityExpandingQueryParser automatically expands it to include near-matches:
User query: "Domain Design"
Parsed as: ("Domain Design")^2.0 OR ("Domain Design"~3)
How it works:
- Exact phrase match gets a 2.0x boost → always ranked highest
- Proximity match allows up to 3 words between terms → catches variations
- Single-word phrases are NOT expanded (no benefit)
- If user specifies slop (`"Domain Design"~5`), expansion is skipped (user intent honored)
Real-world example:
Query: "Domain Design"
Results by score:
1. "...Domain Design..." → Score: 0.698 (exact match, boosted)
2. "...Domain-driven Design..." → Score: 0.136 (1 word gap)
3. "...Domain Effective Design..." → Score: 0.136 (1 word gap)
4. "...Domain Very Effective Design..." → Score: 0.086 (2 word gap)
Configuration defaults:
- `DEFAULT_PROXIMITY_SLOP = 3`
- `DEFAULT_EXACT_BOOST = 2.0f`
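The expansion rules can be illustrated at the query-string level. This is a minimal sketch, assuming a hypothetical `PhraseExpansionSketch` class that builds the expanded query text; the real `ProximityExpandingQueryParser` works on Lucene `Query` objects, and its internals are not shown in this document.

```java
// Hypothetical string-level model of automatic phrase expansion.
final class PhraseExpansionSketch {
    static final int DEFAULT_PROXIMITY_SLOP = 3;
    static final float DEFAULT_EXACT_BOOST = 2.0f;

    // userSlop is null when the user did not write "~N" after the phrase.
    static String expand(String phrase, Integer userSlop) {
        String quoted = "\"" + phrase + "\"";
        if (userSlop != null) {
            // Explicit slop: honor the user's intent and skip expansion.
            return quoted + "~" + userSlop;
        }
        if (phrase.trim().split("\\s+").length < 2) {
            // Single-word phrase: expansion brings no benefit.
            return quoted;
        }
        // Boosted exact phrase OR the same phrase with the default slop.
        return "(" + quoted + ")^" + DEFAULT_EXACT_BOOST
                + " OR (" + quoted + "~" + DEFAULT_PROXIMITY_SLOP + ")";
    }
}
```

`expand("Domain Design", null)` reproduces the parsed form shown earlier, and `expand("Domain Design", 5)` leaves the user's slop untouched.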
Important: Phrase slop and term order. Lucene's PhraseQuery with slop > 0 does allow out-of-order matching, but reversed terms cost more slop than in-order gaps. The edit distance per consecutive term pair is |actual_position_diff - expected_position_diff|. For example, with a query "Tom and Jerry":
| Document text | Edit distance | Matches (slop=3)? |
|---|---|---|
| "Tom and Jerry" | 0 | Yes (exact) |
| "Tom liked and admired Jerry" | 2 | Yes |
| "Tom Jerry and" | 3 | Yes (borderline) |
| "Jerry and Tom" (full reversal) | 4 | No |
A complete reversal of a 3-term phrase requires slop=4, exceeding the default slop=3. Two-term reversals (slop cost = 2) are within budget. This is acceptable because the AI client can issue non-phrase queries (Tom AND Jerry) when term order is irrelevant, and the multi-field weighted OR query provides recall for individual terms regardless of order.
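The per-pair formula above can be checked against the table with a few lines of code. The sketch below implements only the formula as stated in this section; it is not Lucene's sloppy phrase matcher, whose internal computation differs and may assign different costs in some cases. The class `PhraseSlopSketch` is hypothetical and assumes each query term occurs exactly once in the document text.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical calculator for the per-pair slop formula described above.
final class PhraseSlopSketch {

    static int slopCost(String phrase, String document) {
        String[] terms = phrase.toLowerCase(Locale.ROOT).split("\\s+");
        List<String> doc = Arrays.asList(document.toLowerCase(Locale.ROOT).split("\\s+"));
        int cost = 0;
        for (int i = 0; i + 1 < terms.length; i++) {
            // Actual distance between consecutive query terms in the document.
            int actualDiff = doc.indexOf(terms[i + 1]) - doc.indexOf(terms[i]);
            // Expected distance in the phrase is 1 (adjacent, in order).
            cost += Math.abs(actualDiff - 1);
        }
        return cost;
    }

    static boolean matches(String phrase, String document, int slop) {
        return slopCost(phrase, document) <= slop;
    }
}
```

With the phrase "Tom and Jerry", the four document texts in the table produce costs 0, 2, 3, and 4, so only the full reversal falls outside slop 3.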
Prefix queries use real BM25 scoring when the prefix is specific enough:
- >= 4 characters (`vertrag*`, `design*`): BM25 scoring via `TopTermsBlendedFreqScoringRewrite(50)`. Shorter, more frequent terms rank higher.
- < 4 characters (`ver*`, `de*`): Constant score for performance (too many matching terms)
Example:
Query: vertrag* (>= 4 chars, BM25 scoring)
1. "vertrag" → Score: 2.8 (short, frequent)
2. "vertrags" → Score: 1.9
3. "vertragsklausel" → Score: 1.2 (long, rare)
Query: ver* (< 4 chars, constant score)
1. "verarbeiten" → Score: 1.0
2. "vertrag" → Score: 1.0
3. "vereinfachen" → Score: 1.0
Configuration:
- `MIN_PREFIX_LENGTH_FOR_SCORING = 4`
- `MAX_SCORED_PREFIX_TERMS = 50`
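The length threshold amounts to a single decision per prefix query. The following sketch shows that decision only; `PrefixScoringSketch` and its `Mode` enum are hypothetical names, and in the real parser the outcome would select a Lucene rewrite method rather than return an enum.

```java
// Hypothetical model of choosing a scoring mode from the prefix length.
final class PrefixScoringSketch {
    static final int MIN_PREFIX_LENGTH_FOR_SCORING = 4;
    static final int MAX_SCORED_PREFIX_TERMS = 50;

    enum Mode { BM25_TOP_TERMS, CONSTANT_SCORE }

    static Mode modeFor(String prefixQuery) {
        // Measure the prefix without its trailing '*'.
        String prefix = prefixQuery.endsWith("*")
                ? prefixQuery.substring(0, prefixQuery.length() - 1)
                : prefixQuery;
        return prefix.length() >= MIN_PREFIX_LENGTH_FOR_SCORING
                ? Mode.BM25_TOP_TERMS   // scored over at most MAX_SCORED_PREFIX_TERMS terms
                : Mode.CONSTANT_SCORE;  // too many matching terms to score cheaply
    }
}
```

A four-character prefix such as `desi*` is the shortest input that is scored; `ver*` and `de*` fall back to a constant score.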
German compound words can be found using wildcard patterns:
- `*vertrag` → finds Arbeitsvertrag, Kaufvertrag, Mietvertrag (via reversed field)
- `vertrag*` → finds Vertragsbedingungen, Vertragsklausel (trailing wildcard)
- `*vertrag*` → finds both (OR of forward + reversed)
Leading wildcards are rewritten to prefix queries on content_reversed, so they run at roughly the cost of a trailing wildcard.
Dynamic boost formula: boost = 0.3 + 0.7 * (langCount / totalDocs)
Example with 1000 docs (800 German, 150 English, 50 other):
content → boost 2.0 (always fixed)
content_lemma_de → boost 0.3 + 0.7 * (800/1000) = 0.86
content_lemma_en → boost 0.3 + 0.7 * (150/1000) = 0.405
content_translit_de → boost 0.5 (always fixed)
With explicit language eq "de" filter:
content → boost 2.0
content_lemma_de → boost 1.0 (only this language included)
content_translit_de → boost 0.5
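The worked example above can be reproduced with a short calculation. This is a sketch under assumptions: `FieldBoostSketch` is a hypothetical class, the fallback for an empty index is an invented policy, and the method returns a field-to-boost map rather than building the actual `BooleanQuery`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical calculator for the per-field boosts described above.
final class FieldBoostSketch {

    static double lemmaBoost(long langCount, long totalDocs) {
        if (totalDocs == 0) {
            return 0.3; // assumption: an empty index falls back to the base boost
        }
        return 0.3 + 0.7 * ((double) langCount / totalDocs);
    }

    // languageFilter is null when no language filter is present.
    static Map<String, Double> boosts(Map<String, Long> docsPerLanguage,
                                      long totalDocs, String languageFilter) {
        Map<String, Double> out = new LinkedHashMap<>();
        out.put("content", 2.0);
        for (String lang : new String[] {"de", "en"}) {
            String field = "content_lemma_" + lang;
            if (languageFilter == null) {
                out.put(field, lemmaBoost(docsPerLanguage.getOrDefault(lang, 0L), totalDocs));
            } else if (languageFilter.equals(lang)) {
                out.put(field, 1.0); // only the filtered language's lemma field is kept
            }
        }
        out.put("content_translit_de", 0.5);
        return out;
    }
}
```

Called with 800 German and 150 English documents out of 1000 and no filter, the sketch yields approximately 0.86 and 0.405 for the two lemma fields; with the filter set to `de`, the English lemma field is omitted.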
The multi-field approach affects ranking as follows:
Exact match on content: Hits both the content field (boost 2.0) AND lemma field(s) → highest combined score
Lemmatized match only: Document found only via lemma field (e.g., searching "Vertrag" finds "Vertrages") → lower score (lemma field boost only)
Transliteration match only: Document found via content_translit_de (e.g., "Mueller" finds "Müller") → lowest boost (0.5)
Highlighting: Always performed on the unstemmed content field using the highlight query. Documents matched only via stemmed fields get a fallback passage (no bold markers).
The server tracks the index schema version to detect when changes require reindexing:
- `DocumentIndexer.SCHEMA_VERSION` (currently 8) is the single source of truth
- Stored in Lucene commit user data (`schema_version` and `software_version` keys)
- On startup, `LuceneIndexService.init()` detects mismatches → triggers automatic re-index
- Any change to analyzer chains, field additions/removals, or indexing options requires a SCHEMA_VERSION bump
What triggers a schema version bump:
- Adding or removing indexed fields
- Changing field analyzers
- Modifying field indexing options (stored, term vectors, etc.)
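The startup mismatch check can be pictured as a comparison between the stored commit user data and the compiled-in version. This is a sketch only: `SchemaVersionCheckSketch` is a hypothetical class, the commit user data is modeled as a plain `Map`, and treating a missing or unparsable `schema_version` as a mismatch is an assumption, not documented behavior.

```java
import java.util.Map;

// Hypothetical model of the schema version check performed at startup.
final class SchemaVersionCheckSketch {
    static final int SCHEMA_VERSION = 8;

    static boolean needsReindex(Map<String, String> commitUserData) {
        String stored = commitUserData.get("schema_version");
        if (stored == null) {
            return true; // assumption: no recorded version means the index is stale
        }
        try {
            return Integer.parseInt(stored) != SCHEMA_VERSION;
        } catch (NumberFormatException e) {
            return true; // assumption: unreadable version is treated as a mismatch
        }
    }
}
```

An index committed with `schema_version` 7 would trigger a re-index under this check, while one committed with 8 would be opened as-is.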
Version information: Use the getIndexStats MCP tool to see the current schema version, software version, and build timestamp.
Each shadow field optimizes a specific search pattern:
- `content`: Exact matching and highlighting
- `content_reversed`: Leading wildcard optimization (`*vertrag`)
- `content_lemma_de` / `content_lemma_en`: Morphological variant matching (run→ran, Haus→Häuser)
- `content_translit_de`: ASCII digraph to umlaut mapping (Mueller→Müller)
This multi-field approach allows combining multiple search strategies in a single query without sacrificing precision or performance.
Shadow fields are Store.NO because:
- They're used only for searching, never for highlighting
- Reduces index size (no duplicate stored content)
- Highlighting uses the original `content` field for accurate term position markup
- Index time: Long document texts benefit from sentence-aware tokenization for accurate POS tagging
- Query time: Short query strings don't need sentence detection; StandardTokenizer is faster
- Both modes produce compatible tokens for matching, but with different tokenization strategies
Indexing throughput (approximate, 10-page PDF):
- Content extraction (Tika): ~100-200ms
- UnicodeNormalizingAnalyzer: ~5-10ms
- ReverseUnicodeNormalizingAnalyzer: ~5-10ms
- OpenNLPLemmatizingAnalyzer (sentence-aware): ~100-500ms
- GermanTransliteratingAnalyzer: ~5-10ms
Query processing (approximate):
- Query parsing: <1ms
- Leading wildcard rewriting: <1ms
- Stemmed query building: 1-5ms
- Search execution: 10-100ms (varies with index size and query complexity)
- Highlighting: 5-50ms (varies with passage count and document length)
For implementation details, see:
- `DocumentIndexer.java` - Field schema and indexing logic
- `LuceneIndexService.java` - Query processing and search execution
- `ProximityExpandingQueryParser.java` - Automatic phrase expansion
- Analyzer implementations in `src/main/java/com/bitplan/lucene/analyzer/`