Releases: lyonzin/knowledge-rag
v3.3.2 — Type Validation & Bounds Checking
Fixes
Full hardening of the YAML config loader after a rigorous audit:
- Type validation on all config values — wrong types (string where int, string where list, int where bool) now warn and fall back to defaults
- Bounds validation — chunk_size (min 100), chunk_overlap (non-negative, < chunk_size), default_results, max_results, embedding_dim, reranker_top_k_multiplier
- keyword_routes string values detected and removed — previously redteam: "pentest" caused character-level matching ("p", "e", "n", etc.)
- reranker_enabled string coercion ("yes" → True with warning)
- supported_formats: [] falls back to defaults with warning
- Version synced across init.py, config.py, server.py, pyproject.toml
- Error handling in knowledge-rag init (PermissionError, OSError)
- Broken README anchor fixed
- Duplicate keyword removed
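The validation rules above can be sketched roughly as follows. This is a minimal illustration, not the project's loader: the `DEFAULTS` values, the `TRUTHY` set, and the function names `validate` and `clean_routes` are assumptions made up for the example. Only the rules themselves (type fallback, chunk_size minimum of 100, chunk_overlap below chunk_size, string coercion for booleans, dropping string-valued keyword_routes) come from the notes.

```python
import logging

log = logging.getLogger("config_sketch")

# Hypothetical defaults, for illustration only.
DEFAULTS = {"chunk_size": 1000, "chunk_overlap": 200, "reranker_enabled": True}
TRUTHY = {"yes", "true", "on", "1"}  # assumed set of accepted spellings


def validate(raw):
    """Return a config whose values have the expected types and ranges."""
    cfg = dict(DEFAULTS)
    for key, default in DEFAULTS.items():
        if key not in raw:
            continue
        value = raw[key]
        if isinstance(default, bool) and isinstance(value, str):
            # Coerce strings such as "yes" to a bool, but warn about it.
            log.warning("%s: coercing string %r to bool", key, value)
            cfg[key] = value.strip().lower() in TRUTHY
        elif type(value) is type(default):
            # Exact type check: bool is a subclass of int, so isinstance()
            # would accept True as a chunk_size.
            cfg[key] = value
        else:
            log.warning("%s: expected %s, got %r; using default",
                        key, type(default).__name__, value)

    # Bounds: chunk_size >= 100 and 0 <= chunk_overlap < chunk_size.
    if cfg["chunk_size"] < 100:
        log.warning("chunk_size below 100; using default")
        cfg["chunk_size"] = DEFAULTS["chunk_size"]
    if not 0 <= cfg["chunk_overlap"] < cfg["chunk_size"]:
        log.warning("chunk_overlap out of range; adjusting")
        # Cap the default so it stays below a small but valid chunk_size.
        cfg["chunk_overlap"] = min(DEFAULTS["chunk_overlap"], cfg["chunk_size"] - 1)
    return cfg


def clean_routes(routes):
    """Drop keyword_routes entries whose value is not a list of strings.

    Iterating over a plain string yields single characters, which is how
    a value like "pentest" ends up matching "p", "e", "n", and so on.
    """
    cleaned = {}
    for category, keywords in routes.items():
        if isinstance(keywords, list) and all(isinstance(k, str) for k in keywords):
            cleaned[category] = keywords
        else:
            log.warning("keyword_routes[%s] is not a list of strings; removed", category)
    return cleaned
```

For example, `validate({"chunk_size": "big"})` keeps the default chunk size, and `clean_routes({"redteam": "pentest"})` returns an empty dict.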
Upgrade
pip install --upgrade knowledge-rag
v3.3.1 — Hotfix: YAML null safety + presets in pip
Fixes
- YAML null values no longer crash the server — Writing category_mappings: (without a value) in config.yaml now safely falls back to defaults instead of crashing with TypeError: argument of type 'NoneType' is not iterable
- Presets now included in pip install — knowledge-rag init exports config template, all 4 presets, and creates a documents/ directory in the current folder
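To illustrate the null-safety fix: a YAML key written without a value parses to Python's None, and a membership test such as `x in None` raises the TypeError quoted above. A minimal sketch of a guard follows; the helper name `get_mapping` and the example default are hypothetical, not the project's code.

```python
def get_mapping(config, key, default):
    """Return config[key] if it is a dict, otherwise a copy of the default."""
    # `config or {}` also covers an empty file, which parses to None.
    value = (config or {}).get(key)
    if not isinstance(value, dict):
        # Covers a missing key and an explicit YAML null (Python None).
        return dict(default)
    return value


# A YAML parser turns a bare `category_mappings:` line into None.
parsed = {"category_mappings": None}
mappings = get_mapping(parsed, "category_mappings", {"security": "security"})
found = "security" in mappings  # safe: mappings is always a dict
```

An explicitly empty mapping (`{}`) is a dict, so it passes through unchanged rather than being replaced by the defaults.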
New
knowledge-rag init CLI command — One command to set up a fresh knowledge base:
pip install knowledge-rag
knowledge-rag init
cp presets/developer.yaml config.yaml
# Add your docs to documents/
Upgrade
pip install --upgrade knowledge-rag
v3.3.0 — YAML Configuration System
What's New
YAML Configuration System
All settings are now customizable via config.yaml — no more editing Python code. Categories, keyword routing, query expansions, models, chunking, and paths are all configurable through a single YAML file.
Domain Presets
Four ready-to-use presets ship with the project:
| Preset | Categories | Keywords | Expansions | Best For |
|---|---|---|---|---|
| cybersecurity | 8 | 200+ | 69 | Red/Blue Team, CTFs, threat hunting |
| developer | 9 | 150+ | 50+ | Full-stack, APIs, DevOps, cloud |
| research | 9 | 100+ | 40+ | Academic papers, thesis, datasets |
| general | 0 | 0 | 0 | Blank slate, pure semantic search |
cp presets/developer.yaml config.yaml # Ready to go
Generic Use Support
With empty mappings ({}), the system operates as a domain-agnostic semantic search engine. No security-specific logic unless you want it.
Backwards Compatible
No config.yaml? The system uses built-in defaults — identical behavior to v3.2.x. Zero migration required.
Changes
- NEW: YAML configuration system — fully customizable via config.yaml
- NEW: Domain presets — cybersecurity, developer, research, general
- NEW: config.example.yaml — documented template with explanations for every field
- NEW: Categories, keyword routing, and query expansions now user-configurable
- NEW: Empty config = pure semantic search with zero domain logic
- NEW: Warning log for empty files during indexing (previously silent skip)
- IMPROVED: README rewritten — full configuration reference, preset docs, updated structure
- IMPROVED: pyyaml added as dependency
Upgrade
git pull origin master
pip install pyyaml # New dependency
# Optionally: cp presets/cybersecurity.yaml config.yaml
No breaking changes. Existing installations work without any config file.
v3.2.4 — Symlink Support
What's New
- Symlink support — documents/ directory now follows symbolic links recursively (#13)
- Circular symlink protection — realpath deduplication prevents infinite recursion loops
- Stricter _has_documents() detection — validates against supported formats only (ignores .gitkeep, temp files, etc.)
Changes
| File | Change |
|---|---|
| mcp_server/config.py | _has_documents() → os.walk(followlinks=True) + format filter |
| mcp_server/ingestion.py | parse_directory() → os.walk + seen_dirs loop protection |
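The combination described above (os.walk with followlinks=True, a set of already-seen real paths, and a format filter) might look like the sketch below. The function name `walk_documents` and the `supported` tuple are assumptions for the example; only the os.walk and realpath approach is taken from the notes.

```python
import os


def walk_documents(root, supported=(".md", ".txt", ".pdf")):
    """Yield supported files under root, following symlinks without looping."""
    seen_dirs = set()
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        real = os.path.realpath(dirpath)
        if real in seen_dirs:
            # Already visited through another path: prune this subtree
            # so a circular symlink cannot recurse forever.
            dirnames[:] = []
            continue
        seen_dirs.add(real)
        for name in sorted(filenames):
            # The format filter ignores files such as .gitkeep.
            if name.lower().endswith(supported):
                yield os.path.join(dirpath, name)
```

Pruning works because os.walk is top-down by default, so clearing `dirnames` in place stops the descent into that directory.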
Full Changelog: v3.2.3...v3.2.4
v3.2.3 — BASE_DIR smart detection for pip install
Fix
- BASE_DIR now checks for actual files inside documents/ (not just directory existence)
- Prevents false positive when site-packages/documents/ exists as empty dir
- Supports KNOWLEDGE_RAG_DIR env var for explicit override
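A rough sketch of this resolution order follows: explicit env var first, then the package directory if it really contains documents, then the current working directory. The function names and the `SUPPORTED` set are hypothetical; KNOWLEDGE_RAG_DIR and the documents/ folder are the only names taken from the notes.

```python
import os
from pathlib import Path

SUPPORTED = {".md", ".txt", ".pdf"}  # assumed subset of the supported formats


def has_documents(base):
    """True only if base/documents contains at least one supported file."""
    docs = Path(base) / "documents"
    if not docs.is_dir():
        return False
    return any(p.is_file() and p.suffix.lower() in SUPPORTED
               for p in docs.rglob("*"))


def resolve_base_dir(package_dir, cwd=None, env=None):
    """Pick the base directory: env override, package dir, then cwd."""
    env = os.environ if env is None else env
    override = env.get("KNOWLEDGE_RAG_DIR")
    if override:
        return Path(override)
    if has_documents(package_dir):
        return Path(package_dir)
    # An empty site-packages/documents/ no longer counts as a match.
    return Path(cwd) if cwd is not None else Path.cwd()
```

The `cwd` and `env` parameters exist only to make the function easy to exercise without touching the real process environment.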
Upgrade
pip install --upgrade knowledge-rag
v3.2.2 — pip install plug-and-play fix
Fixes
pip install knowledge-rag now truly plug-and-play
BASE_DIR was resolving to site-packages/ when installed from PyPI, causing documents/ to not be found. Now falls back to current working directory.
Supports KNOWLEDGE_RAG_DIR env var for explicit override.
category="aar" accepted by search_knowledge
The validator was rejecting aar as a category because it only checked keyword_routes keys. Now uses category_mappings values too.
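In other words, the set of accepted categories is now the union of both sources. A minimal sketch, assuming keyword_routes maps category names to keyword lists and category_mappings maps folder names to category names (the example data and the function name are hypothetical):

```python
def valid_categories(keyword_routes, category_mappings):
    """Categories a search filter may name: route keys plus mapping targets."""
    return set(keyword_routes) | set(category_mappings.values())


# Hypothetical config: "aar" has a folder mapping but no keyword route.
routes = {"redteam": ["pentest", "exploit"]}
mappings = {"aar": "aar", "security": "redteam"}
allowed = valid_categories(routes, mappings)
```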
Upgrade
pip install --upgrade knowledge-rag
v3.2.1 — Auto-Recovery from Corrupted ChromaDB
Fix: Auto-Recovery on Startup
If ChromaDB gets corrupted (crash during indexing, power loss, etc.), the server now automatically detects and recovers instead of crashing with a segfault loop.
What was happening
- A crash during indexing left the SQLite DB in a corrupted state
- Next startup: segfault → crash → restart → segfault (infinite loop)
- Required manual deletion of data/chroma_db/ to fix
What happens now
- Server detects corruption on startup
- Automatically deletes corrupted data
- Recreates fresh collection
- Logs [RECOVERY] messages so you know it happened
- Zero manual intervention needed
Also handles
- Embedding function conflicts (e.g., switching models)
- Orphaned UUID directories from partial rebuilds
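The detect, delete, recreate flow can be sketched generically as below. This is not the server's implementation: `open_with_recovery` and the `open_store` callable are hypothetical stand-ins for whatever opens the vector store, and the sketch only covers failures that surface as Python exceptions (a hard crash in native code cannot be caught this way).

```python
import logging
import shutil
from pathlib import Path

log = logging.getLogger("recovery_sketch")


def open_with_recovery(db_dir, open_store):
    """Open the store at db_dir; on failure wipe the directory and retry once."""
    try:
        return open_store(db_dir)
    except Exception as exc:  # broad on purpose: corruption surfaces in many ways
        log.warning("[RECOVERY] store at %s is unusable (%s); rebuilding", db_dir, exc)
        # Remove the corrupted data, then start again from an empty directory.
        shutil.rmtree(db_dir, ignore_errors=True)
        Path(db_dir).mkdir(parents=True, exist_ok=True)
        log.warning("[RECOVERY] created a fresh, empty store")
        return open_store(db_dir)
```

The retry happens exactly once, so a store that still fails after the rebuild raises normally instead of looping.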
Upgrade
pip install --upgrade knowledge-rag
v3.2.0 — Parallel Search + Adjacent Chunk Retrieval
New Features
Parallel BM25 + Semantic Search
Both search engines now run simultaneously in threads. ~50% latency reduction in hybrid mode.
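As a rough illustration of the idea, the two searches can be submitted to a thread pool so that total latency is close to the slower of the two rather than their sum. The function name `run_both` and the two search callables are hypothetical placeholders, and how the results are fused afterwards is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def run_both(query, bm25_search, semantic_search):
    """Run keyword and semantic search concurrently; return both hit lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        bm25_future = pool.submit(bm25_search, query)
        semantic_future = pool.submit(semantic_search, query)
        # result() blocks until each search finishes and re-raises its errors.
        return bm25_future.result(), semantic_future.result()
```

Threads help here when the searches spend their time in I/O or in native code that releases the GIL; pure-Python work would not speed up this way.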
Adjacent Chunk Retrieval
Matched chunks are automatically expanded with surrounding context. When a chunk matches your query, the system fetches the chunks immediately before and after it (from the same document) and merges them into a single expanded result.
- Results include context_expanded: true when adjacent chunks were merged
- Content grows from ~650 chars to ~1500 chars per result (more context for the LLM)
- Zero impact on retrieval precision — the matching still happens on the original chunk
Inspired by PrivateGPT's SentenceWindow pattern and Kotaemon's parallel retrieval.
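A minimal sketch of the expansion step, assuming each hit carries a document id and a chunk index and that chunks are stored in order per document. The field names `doc_id` and `chunk_index` and the `window` parameter are assumptions for the example; `context_expanded` is the flag named above. Overlap between neighbouring chunks is ignored here.

```python
def expand_with_neighbors(hit, chunks_by_doc, window=1):
    """Merge a matched chunk with its neighbours from the same document."""
    chunks = chunks_by_doc[hit["doc_id"]]
    i = hit["chunk_index"]
    # Clamp the window at the start and end of the document.
    start = max(0, i - window)
    end = min(len(chunks), i + window + 1)
    expanded = dict(hit)  # keep score and other fields from the original match
    expanded["content"] = "\n".join(chunks[start:end])
    expanded["context_expanded"] = end - start > 1
    return expanded
```

The match score is copied unchanged, which mirrors the point that retrieval still happens on the original chunk.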
Upgrade
pip install --upgrade knowledge-rag
v3.1.1 — Chunker Bugfix, AAR Category, CVE Aliases
Fixes
Markdown Chunker (critical quality fix)
- Code-block protection: # comments inside code fences no longer split as markdown headers
- Split by ##/### only: # (H1) was catching shell comments and code — now ignored
- Min chunk size 100 chars: Header-only chunks (32-53 chars of junk) now merge with next section
- Result: c2-operations doc goes from 32 chunks (12 junk) → 17 chunks (0 junk)
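The three rules above can be sketched as a small chunker. This is an illustrative reimplementation under stated assumptions, not the project's code: the function name `chunk_markdown`, the regexes, and the choice to attach a trailing short section to the previous chunk are all made up for the example.

```python
import re

HEADER = re.compile(r"^#{2,3} ")            # ## or ### only; a lone # is ignored
FENCE = re.compile(r"^\s*(`{3,}|~{3,})")    # opening or closing code fence


def chunk_markdown(text, min_chars=100):
    """Split on ##/### headers outside code fences, merging tiny sections."""
    sections, current, in_fence = [], [], False
    for line in text.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
        # Start a new section only on a real header outside a code fence.
        if not in_fence and HEADER.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # Merge sections shorter than min_chars into the section that follows.
    chunks, pending = [], ""
    for section in sections:
        merged = f"{pending}\n{section}" if pending else section
        if len(merged) < min_chars:
            pending = merged
        else:
            chunks.append(merged)
            pending = ""
    if pending:
        # Trailing short section: attach it to the previous chunk if any.
        if chunks:
            chunks[-1] = f"{chunks[-1]}\n{pending}"
        else:
            chunks.append(pending)
    return chunks
```

A header-only section such as a bare `## Tiny` line therefore never becomes its own chunk; it travels with the section that follows it.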
New
- AAR category: documents/aar/ maps to category "aar" (was "general")
- 14 CVE aliases: PrintNightmare↔CVE-2021-34527, EternalBlue↔MS17-010, PwnKit↔CVE-2021-4034, Log4Shell↔CVE-2021-44228, ZeroLogon↔CVE-2020-1472, PetitPotam, CertiFried, noPac, ProxyLogon, ProxyShell
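One way to picture bidirectional aliases is a lookup table indexed in both directions and applied as query expansion. The sketch below uses only the five pairs spelled out above; the function name `expand_query` and the append-to-query strategy are assumptions, not the project's implementation.

```python
ALIAS_PAIRS = [
    ("PrintNightmare", "CVE-2021-34527"),
    ("EternalBlue", "MS17-010"),
    ("PwnKit", "CVE-2021-4034"),
    ("Log4Shell", "CVE-2021-44228"),
    ("ZeroLogon", "CVE-2020-1472"),
]

# Index both directions, case-insensitively.
ALIASES = {}
for name, ident in ALIAS_PAIRS:
    ALIASES[name.lower()] = ident
    ALIASES[ident.lower()] = name


def expand_query(query):
    """Append the alias of every known name or identifier in the query."""
    extra = []
    for token in query.split():
        alias = ALIASES.get(token.strip(".,;:()").lower())
        # Skip aliases already present in the query or already added.
        if alias and alias not in extra and alias.lower() not in query.lower():
            extra.append(alias)
    return " ".join([query, *extra])
```

With this table, a query that mentions only the nickname also carries the identifier into the keyword search, and the other way round.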
Upgrade
pip install --upgrade knowledge-rag
After upgrade, run reindex_documents(full_rebuild=true) to reprocess all documents with the fixed chunker.
v3.1.0 — DOCX/XLSX/PPTX/CSV, File Watcher, MMR
Knowledge RAG v3.1.0
New Features
Office Document Support (4 new formats)
- DOCX: Paragraphs, tables, heading structure preserved as markdown
- XLSX: All sheets extracted as searchable text tables
- PPTX: Slide-by-slide text extraction
- CSV: Native parsing, zero extra deps
- Total: 9 formats (was 5)
File Watcher
Documents directory monitored in real-time via watchdog. Auto-reindexes with 5-second debounce when you add, modify, or delete files.
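The debounce part can be sketched with the standard library alone. The `Debouncer` class below is a hypothetical helper, not the project's watcher: every call to `trigger()` restarts the countdown, so a burst of file events results in a single callback after things go quiet.

```python
import threading


class Debouncer:
    """Run callback once, `delay` seconds after the last call to trigger()."""

    def __init__(self, delay, callback):
        self.delay = delay
        self.callback = callback
        self._timer = None
        self._lock = threading.Lock()

    def trigger(self):
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()  # a newer event supersedes the pending run
            self._timer = threading.Timer(self.delay, self.callback)
            self._timer.daemon = True  # do not keep the process alive
            self._timer.start()
```

With a 5-second delay, a file-system event handler would call `trigger()` on every add, modify, or delete event and pass the reindex function as the callback.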
MMR Result Diversification
Maximal Marginal Relevance is applied after reranking to reduce redundant results: if the top 5 all came from the same document, MMR promotes results from other sources. Lambda=0.7 (relevance-heavy).
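For reference, a small pure-Python sketch of MMR selection with lambda = 0.7. The candidate layout (id, relevance score, embedding) and the function names are assumptions for the example; the scoring rule is the standard MMR trade-off between relevance and similarity to already-selected results.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(candidates, top_k, lam=0.7):
    """Select top_k ids from (doc_id, relevance, embedding) tuples using MMR."""
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < top_k:
        def score(item):
            _, relevance, emb = item
            # Redundancy is the highest similarity to anything already chosen.
            redundancy = max((cosine(emb, s[2]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _, _ in selected]
```

With lam=1.0 the function degenerates to plain relevance ordering; lowering it gives more weight to diversity.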
pip install
pip install knowledge-rag
No clone needed. Models download automatically.