Releases: lyonzin/knowledge-rag

v3.3.2 — Type Validation & Bounds Checking

06 Apr 02:22

Fixes

Full hardening of the YAML config loader after rigorous audit:

  • Type validation on all config values — wrong types (string where int, string where list, int where bool) now warn and fall back to defaults
  • Bounds validation — chunk_size (min 100), chunk_overlap (non-negative, < chunk_size), default_results, max_results, embedding_dim, reranker_top_k_multiplier
  • keyword_routes string values detected and removed — previously redteam: "pentest" caused character-level matching ("p", "e", "n", etc.)
  • reranker_enabled string coercion ("yes" → True, with warning)
  • supported_formats: [] falls back to defaults with warning
  • Version synced across __init__.py, config.py, server.py, pyproject.toml
  • Error handling in knowledge-rag init (PermissionError, OSError)
  • Broken README anchor fixed
  • Duplicate keyword removed
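The validation rules above can be sketched roughly as follows. This is a minimal illustration, not the project's loader: `validate_config`, the `DEFAULTS` values, the accepted truthy/falsy words, and the overlap clamp are all assumptions made for the example. Only the rules themselves (min chunk_size of 100, non-negative overlap below chunk_size, "yes" coerced to True, empty supported_formats falling back, string keyword_routes values dropped) come from the notes.

```python
import copy
import logging

log = logging.getLogger("config_sketch")

# Hypothetical defaults; the real values live in the project's config module.
DEFAULTS = {
    "chunk_size": 1000,
    "chunk_overlap": 200,
    "reranker_enabled": True,
    "supported_formats": [".md", ".txt", ".pdf"],
    "keyword_routes": {},
}
TRUE_WORDS = {"yes", "true", "on"}    # assumed set of accepted strings
FALSE_WORDS = {"no", "false", "off"}  # assumed set of accepted strings


def _is_int(value):
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_config(raw):
    """Return DEFAULTS overridden by the entries of raw that pass validation."""
    cfg = copy.deepcopy(DEFAULTS)
    raw = raw if isinstance(raw, dict) else {}

    size = raw.get("chunk_size", cfg["chunk_size"])
    if _is_int(size) and size >= 100:
        cfg["chunk_size"] = size
    else:
        log.warning("chunk_size=%r is invalid, using default", size)

    overlap = raw.get("chunk_overlap", cfg["chunk_overlap"])
    if _is_int(overlap) and 0 <= overlap < cfg["chunk_size"]:
        cfg["chunk_overlap"] = overlap
    else:
        # Assumption: clamp the fallback so it stays below chunk_size.
        cfg["chunk_overlap"] = min(DEFAULTS["chunk_overlap"], cfg["chunk_size"] // 2)
        log.warning("chunk_overlap=%r is invalid, using %d", overlap, cfg["chunk_overlap"])

    flag = raw.get("reranker_enabled", cfg["reranker_enabled"])
    if isinstance(flag, bool):
        cfg["reranker_enabled"] = flag
    elif isinstance(flag, str) and flag.strip().lower() in TRUE_WORDS | FALSE_WORDS:
        cfg["reranker_enabled"] = flag.strip().lower() in TRUE_WORDS
        log.warning("reranker_enabled=%r coerced to %s", flag, cfg["reranker_enabled"])
    else:
        log.warning("reranker_enabled=%r is invalid, using default", flag)

    formats = raw.get("supported_formats", cfg["supported_formats"])
    if isinstance(formats, list) and formats:
        cfg["supported_formats"] = formats
    else:
        log.warning("supported_formats=%r is invalid or empty, using defaults", formats)

    routes = raw.get("keyword_routes", {})
    if isinstance(routes, dict):
        for category, keywords in routes.items():
            if isinstance(keywords, list):
                cfg["keyword_routes"][category] = keywords
            else:
                # Iterating a string yields characters, hence the old "p", "e", "n" matches.
                log.warning("keyword_routes[%r] is not a list, dropping it", category)
    return cfg
```

Dropping the string-valued route rather than wrapping it in a list mirrors the "detected and removed" wording above; a loader could just as reasonably coerce it.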

Upgrade

pip install --upgrade knowledge-rag

v3.3.1 — Hotfix: YAML null safety + presets in pip

06 Apr 02:03

Fixes

  • YAML null values no longer crash the server — Writing category_mappings: (without a value) in config.yaml now safely falls back to defaults instead of crashing with TypeError: argument of type 'NoneType' is not iterable
  • Presets now included in pip install: knowledge-rag init exports the config template and all 4 presets, and creates a documents/ directory in the current folder
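For context on the null-safety fix: a YAML key written without a value loads as None, and a membership test against None raises the TypeError quoted above. A minimal sketch of the fallback, assuming a hypothetical helper `section` and a made-up default mapping (neither is taken from the project):

```python
DEFAULT_CATEGORY_MAPPINGS = {"aar": "aar"}  # hypothetical default, for illustration


def section(loaded, key, default):
    """Return loaded[key], or a copy of default when the key is missing or null."""
    # An empty config file loads as None, so guard the top level as well.
    value = (loaded or {}).get(key)
    return dict(default) if value is None else value


# What a YAML loader yields for a bare `category_mappings:` line.
loaded = {"category_mappings": None}
mappings = section(loaded, "category_mappings", DEFAULT_CATEGORY_MAPPINGS)
print("aar" in mappings)  # True
```

The check is `value is None` rather than a truthiness test so that an explicit empty mapping ({}) is preserved instead of being replaced by the defaults.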

New

  • knowledge-rag init CLI command — One command to set up a fresh knowledge base:
    pip install knowledge-rag
    knowledge-rag init
    cp presets/developer.yaml config.yaml
    # Add your docs to documents/

Upgrade

pip install --upgrade knowledge-rag

v3.3.0 — YAML Configuration System

06 Apr 01:33

What's New

YAML Configuration System

All settings are now customizable via config.yaml — no more editing Python code. Categories, keyword routing, query expansions, models, chunking, and paths are all configurable through a single YAML file.

Domain Presets

Four ready-to-use presets ship with the project:

| Preset        | Categories | Keywords | Expansions | Best For                            |
| cybersecurity | 8          | 200+     | 69         | Red/Blue Team, CTFs, threat hunting |
| developer     | 9          | 150+     | 50+        | Full-stack, APIs, DevOps, cloud     |
| research      | 9          | 100+     | 40+        | Academic papers, thesis, datasets   |
| general       | 0          | 0        | 0          | Blank slate, pure semantic search   |

cp presets/developer.yaml config.yaml   # Ready to go

Generic Use Support

With empty mappings ({}), the system operates as a domain-agnostic semantic search engine. No security-specific logic unless you want it.

Backwards Compatible

No config.yaml? The system uses built-in defaults — identical behavior to v3.2.x. Zero migration required.

Changes

  • NEW: YAML configuration system — fully customizable via config.yaml
  • NEW: Domain presets — cybersecurity, developer, research, general
  • NEW: config.example.yaml — documented template with explanations for every field
  • NEW: Categories, keyword routing, and query expansions now user-configurable
  • NEW: Empty config = pure semantic search with zero domain logic
  • NEW: Warning log for empty files during indexing (previously silent skip)
  • IMPROVED: README rewritten — full configuration reference, preset docs, updated structure
  • IMPROVED: pyyaml added as dependency

Upgrade

git pull origin master
pip install pyyaml    # New dependency
# Optionally: cp presets/cybersecurity.yaml config.yaml

No breaking changes. Existing installations work without any config file.

v3.2.4 — Symlink Support

03 Apr 22:30
0ecbb43

What's New

  • Symlink support: documents/ directory now follows symbolic links recursively (#13)
  • Circular symlink protection: realpath deduplication prevents infinite recursion loops
  • Stricter _has_documents() detection — validates against supported formats only (ignores .gitkeep, temp files, etc.)

Changes

| File                    | Change                                                       |
| mcp_server/config.py    | _has_documents(): os.walk(followlinks=True) + format filter  |
| mcp_server/ingestion.py | parse_directory(): os.walk + seen_dirs loop protection       |
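The combination of os.walk(followlinks=True) and a seen_dirs set can be sketched as below. The function names `walk_documents` and `has_documents` and the `SUPPORTED` set are illustrative assumptions, not the project's actual code; only the technique (follow links, deduplicate on realpath, filter by format) comes from the notes.

```python
import os

SUPPORTED = {".md", ".txt", ".pdf"}  # hypothetical subset of the supported formats


def walk_documents(root):
    """List supported files under root, following symlinks without looping."""
    seen_dirs = set()
    found = []
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        real = os.path.realpath(dirpath)
        if real in seen_dirs:
            # Already visited through another path: prune so os.walk stops descending.
            dirnames[:] = []
            continue
        seen_dirs.add(real)
        for name in sorted(filenames):
            if os.path.splitext(name)[1].lower() in SUPPORTED:
                found.append(os.path.join(dirpath, name))
    return found


def has_documents(root):
    # Files such as .gitkeep do not count, because of the format filter.
    return bool(walk_documents(root))
```

Clearing `dirnames` in place is what stops the descent; os.walk consults that list before recursing when walking top-down.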

Full Changelog: v3.2.3...v3.2.4

v3.2.3 — BASE_DIR smart detection for pip install

22 Mar 23:33

Fix

  • BASE_DIR now checks for actual files inside documents/ (not just directory existence)
  • Prevents false positive when site-packages/documents/ exists as empty dir
  • Supports KNOWLEDGE_RAG_DIR env var for explicit override
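One way the detection could be ordered is sketched here. `resolve_base_dir` and `has_documents` are hypothetical names, and the precedence (env var first, then the package directory if it really holds files, then the working directory) is an assumption based on the bullets above rather than a copy of the implementation.

```python
import os
from pathlib import Path


def has_documents(base):
    """True only if base/documents exists and contains at least one file."""
    docs = base / "documents"
    return docs.is_dir() and any(p.is_file() for p in docs.rglob("*"))


def resolve_base_dir(package_dir, cwd, env=os.environ):
    # 1. An explicit override always wins.
    override = env.get("KNOWLEDGE_RAG_DIR")
    if override:
        return Path(override)
    # 2. Use the package location only when it actually holds documents,
    #    so an empty site-packages/documents/ is not a false positive.
    if has_documents(package_dir):
        return package_dir
    # 3. Otherwise fall back to the current working directory.
    return cwd
```

In practice `package_dir` would be derived from the module's own location and `cwd` from `Path.cwd()`; they are parameters here only to keep the sketch testable.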

Upgrade

pip install --upgrade knowledge-rag

v3.2.2 — pip install plug-and-play fix

22 Mar 23:10

Fixes

pip install knowledge-rag now truly plug-and-play

BASE_DIR was resolving to site-packages/ when installed from PyPI, causing documents/ to not be found. Now falls back to current working directory.

Supports KNOWLEDGE_RAG_DIR env var for explicit override.

category="aar" accepted by search_knowledge

The validator was rejecting aar as a category because it only checked keyword_routes keys. Now uses category_mappings values too.
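As a rough illustration of the validator change, assuming keyword_routes maps category to keywords and category_mappings maps folder to category (the helper name and sample data are made up):

```python
def valid_categories(keyword_routes, category_mappings):
    """Union of categories that have keyword routes and categories folders map to."""
    return set(keyword_routes) | set(category_mappings.values())


# Illustrative data only: "aar" has a folder mapping but no keyword routes.
keyword_routes = {"redteam": ["pentest", "exploit"]}
category_mappings = {"aar": "aar", "security/redteam": "redteam"}
print("aar" in valid_categories(keyword_routes, category_mappings))  # True
```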

Upgrade

pip install --upgrade knowledge-rag

v3.2.1 — Auto-Recovery from Corrupted ChromaDB

22 Mar 12:00

Fix: Auto-Recovery on Startup

If ChromaDB gets corrupted (crash during indexing, power loss, etc.), the server now automatically detects and recovers instead of crashing with a segfault loop.

What was happening

  • A crash during indexing left the SQLite DB in a corrupted state
  • Next startup: segfault → crash → restart → segfault (infinite loop)
  • Required manual deletion of data/chroma_db/ to fix

What happens now

  • Server detects corruption on startup
  • Automatically deletes corrupted data
  • Recreates fresh collection
  • Logs [RECOVERY] messages so you know it happened
  • Zero manual intervention needed

Also handles

  • Embedding function conflicts (e.g., switching models)
  • Orphaned UUID directories from partial rebuilds
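The detect, delete, recreate flow can be sketched generically. This is an assumption-laden outline, not the server's code: `open_with_recovery` and the `open_collection` callback are hypothetical, and a real segfault cannot be caught with try/except, so the actual detection presumably happens before the corrupted store is loaded.

```python
import logging
import shutil
from pathlib import Path

log = logging.getLogger("recovery_sketch")


def open_with_recovery(db_dir, open_collection):
    """Open the store; on failure wipe db_dir and retry once with a fresh one."""
    try:
        return open_collection(db_dir)
    except Exception as exc:  # broad on purpose: any open failure triggers recovery
        log.warning("[RECOVERY] store at %s failed to open (%s), rebuilding", db_dir, exc)
        shutil.rmtree(db_dir, ignore_errors=True)
        Path(db_dir).mkdir(parents=True, exist_ok=True)
        # A second failure propagates instead of looping forever.
        return open_collection(db_dir)
```

After a recovery like this the index is empty, so documents need to be reindexed before search returns results.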

Upgrade

pip install --upgrade knowledge-rag

v3.2.0 — Parallel Search + Adjacent Chunk Retrieval

20 Mar 13:20

New Features

Parallel BM25 + Semantic Search

Both search engines now run simultaneously in threads. ~50% latency reduction in hybrid mode.
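A minimal sketch of running the two searches concurrently with the standard library; `hybrid_search` and the two callables are placeholders, not the project's API:

```python
from concurrent.futures import ThreadPoolExecutor


def hybrid_search(query, bm25_search, semantic_search):
    """Run both searches at the same time and return (bm25_hits, semantic_hits)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        bm25_future = pool.submit(bm25_search, query)
        semantic_future = pool.submit(semantic_search, query)
        # result() blocks until each search finishes and re-raises its exceptions.
        return bm25_future.result(), semantic_future.result()
```

Threads help here to the extent that the underlying work releases the GIL (I/O or native code such as embedding inference); two pure-Python searches would see little gain.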

Adjacent Chunk Retrieval

Matched chunks are automatically expanded with surrounding context. When a chunk matches your query, the system fetches the chunks immediately before and after it (from the same document) and merges them into a single expanded result.

  • Results include context_expanded: true when adjacent chunks were merged
  • Content grows from ~650 chars to ~1500 chars per result (more context for the LLM)
  • Zero impact on retrieval precision — the matching still happens on the original chunk
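The expansion step can be illustrated as follows, assuming chunks are stored in a dict keyed by (document id, chunk index). That storage layout and the function name are assumptions for the example; only the context_expanded flag comes from the notes.

```python
def expand_with_neighbors(chunks, doc_id, index):
    """Merge the matched chunk with its previous and next chunk from the same document."""
    parts = []
    for i in (index - 1, index, index + 1):
        text = chunks.get((doc_id, i))  # neighbors may be missing at document edges
        if text is not None:
            parts.append(text)
    return {
        "content": "\n".join(parts),
        "context_expanded": len(parts) > 1,
    }
```

Matching still runs against the original chunk; the expansion only changes what is returned to the caller.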

Inspired by PrivateGPT's SentenceWindow pattern and Kotaemon's parallel retrieval.

Upgrade

pip install --upgrade knowledge-rag

Full Changelog

v3.1.1...v3.2.0

v3.1.1 — Chunker Bugfix, AAR Category, CVE Aliases

20 Mar 12:43

Fixes

Markdown Chunker (critical quality fix)

  • Code-block protection: # comments inside code fences no longer split as markdown headers
  • Split by ##/### only: # (H1) was catching shell comments and code — now ignored
  • Min chunk size 100 chars: Header-only chunks (32-53 chars of junk) now merge with next section
  • Result: c2-operations doc goes from 32 chunks (12 junk) → 17 chunks (0 junk)
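The three chunker rules above (skip headers inside code fences, split on ##/### only, merge chunks under 100 chars into the next section) can be sketched like this. It is a simplified stand-in written for illustration, not the project's chunker, and `chunk_markdown` is a hypothetical name.

```python
import re

HEADER = re.compile(r"^#{2,3} ")            # H2 and H3 only; H1 is ignored
FENCE_MARKERS = ("`" * 3, "~" * 3)          # opening/closing code fences


def chunk_markdown(text, min_chars=100):
    """Split on ##/### headers outside code fences, merging undersized chunks forward."""
    sections, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence
        elif not in_fence and HEADER.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks, carry = [], ""
    for section in sections:
        merged = carry + "\n" + section if carry else section
        if len(merged) < min_chars:
            carry = merged          # too small: hold it and prepend to the next section
        else:
            chunks.append(merged)
            carry = ""
    if carry:
        # Trailing undersized text has no next section, so attach it to the last chunk.
        if chunks:
            chunks[-1] += "\n" + carry
        else:
            chunks.append(carry)
    return chunks
```

The fence toggle is deliberately naive (it does not track fence length or nesting), which is enough to show why shell comments inside code blocks stop being treated as headers.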

New

  • AAR category: documents/aar/ maps to category "aar" (was "general")
  • 14 CVE aliases: PrintNightmare↔CVE-2021-34527, EternalBlue↔MS17-010, PwnKit↔CVE-2021-4034, Log4Shell↔CVE-2021-44228, ZeroLogon↔CVE-2020-1472, PetitPotam, CertiFried, noPac, ProxyLogon, ProxyShell

Upgrade

pip install --upgrade knowledge-rag

After upgrade, run reindex_documents(full_rebuild=true) to reprocess all documents with the fixed chunker.

Full Changelog

v3.1.0...v3.1.1

v3.1.0 — DOCX/XLSX/PPTX/CSV, File Watcher, MMR

19 Mar 20:33

Knowledge RAG v3.1.0

New Features

Office Document Support (4 new formats)

  • DOCX: Paragraphs, tables, heading structure preserved as markdown
  • XLSX: All sheets extracted as searchable text tables
  • PPTX: Slide-by-slide text extraction
  • CSV: Native parsing, zero extra deps
  • Total: 9 formats (was 5)

File Watcher

Documents directory monitored in real-time via watchdog. Auto-reindexes with 5-second debounce when you add, modify, or delete files.
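The debounce behavior can be shown with the standard library alone. This sketch leaves out watchdog itself; `Debouncer` is a hypothetical helper whose `trigger` method would be called from a file-system event handler, with a 5-second delay in the setup described above.

```python
import threading


class Debouncer:
    """Run callback once, delay seconds after the most recent trigger()."""

    def __init__(self, delay, callback):
        self._delay = delay
        self._callback = callback
        self._timer = None
        self._lock = threading.Lock()

    def trigger(self):
        with self._lock:
            # Each new event cancels the pending run and restarts the countdown.
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._delay, self._callback)
            self._timer.daemon = True
            self._timer.start()
```

A burst of saves from an editor therefore produces a single reindex once the directory has been quiet for the full delay.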

MMR Result Diversification

Maximal Marginal Relevance (MMR) is applied after reranking to reduce redundant results: if the top 5 all come from the same document, MMR promotes results from other sources. Lambda=0.7 (relevance-heavy).
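For readers unfamiliar with MMR, here is a small pure-Python sketch of the standard greedy selection, using cosine similarity on plain lists. The function names and vectors are illustrative; how the project computes similarity is not stated in these notes.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query_vec, doc_vecs, k, lam=0.7):
    """Greedily pick k document indices, trading relevance against redundancy."""
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []

    def score(i):
        # Redundancy is the similarity to the closest already-selected document.
        redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
        return lam * relevance[i] - (1 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Documents 0 and 1 are duplicates; document 2 is slightly less relevant but different.
docs = [[1.0, 0.6], [1.0, 0.6], [0.5, 1.0]]
print(mmr([1.0, 1.0], docs, k=2))           # [0, 2]
print(mmr([1.0, 1.0], docs, k=2, lam=1.0))  # [0, 1]
```

With lam=1.0 the redundancy term vanishes and the selection reduces to plain relevance ranking, which is why the duplicate comes back in the second call.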

pip install

pip install knowledge-rag

No clone needed. Models download automatically.

Full Changelog

v3.0.0...v3.1.0