
# LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval


LeanRAG is an efficient, open-source framework for Retrieval-Augmented Generation, leveraging knowledge graph structures with semantic aggregation and hierarchical retrieval to generate context-aware, concise, and high-fidelity responses.

Update (API-first refactor): All LLM interactions now use a unified OpenAI-compatible client (OpenAIChatClient). Legacy local model orchestration (Ollama / vLLM) and the older InstanceManager have been removed. The CommonKG folder is retained temporarily as a legacy reference and will be deprecated; prefer the GraphRAG (default) extraction path.

## ✨ Features

  • API-First Architecture: Cloud-based LLMs and embeddings (OpenAI/xAI, Together AI); no local GPU required.
  • Modern Database Stack: Qdrant for vectors, Neo4j for graphs, SQLite for relational data.
  • Semantic Aggregation: Clusters entities into semantically coherent summaries and constructs explicit relations to form a navigable aggregation-level knowledge network.
  • Hierarchical, Structure-Guided Retrieval: Initiates retrieval from fine-grained entities and traverses up the knowledge graph to gather rich, highly relevant evidence efficiently.
  • Reduced Redundancy: Optimizes retrieval paths to cut redundant information; LeanRAG achieves roughly 46% lower retrieval redundancy than flat retrieval baselines in benchmark evaluations.
  • Benchmark Performance: Demonstrates superior performance across multiple QA benchmarks, with improved response quality and retrieval efficiency.

๐Ÿ›๏ธ Architecture Overview

*Figure: Overview of LeanRAG*

LeanRAG's processing pipeline follows these core stages with a modern database architecture:

### Database Layer

  • Qdrant: Vector storage for efficient similarity search on embeddings
  • Neo4j: Graph database for knowledge graph storage and Cypher-based queries
  • SQLite: Lightweight relational storage for metadata and intermediate results
  • API Services: Cloud-based LLMs and embedding models (no local GPU requirements)

### Processing Stages

  1. Semantic Aggregation: Group low-level entities into clusters; generate summary nodes and build adjacency relations among them for efficient navigation.
  2. Knowledge Graph Construction: Construct a multi-layer graph where nodes represent entities and aggregated summaries, with explicit inter-node relations for graph-based traversal.
  3. Query Processing & Hierarchical Retrieval: Anchor queries at the most relevant fine-grained entities ("bottom-up"), then traverse upward through the semantic aggregation graph to collect evidence spans.
  4. Redundancy-Aware Synthesis: Streamline retrieval paths and avoid overlapping content, ensuring concise evidence aggregation before generating responses.
  5. Generation: Use the retrieved, well-structured evidence as input to an LLM to produce coherent, accurate, and contextually grounded answers.
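The bottom-up traversal in stages 3 and 4 can be illustrated with a toy sketch. Everything here is invented for illustration: the node names, the naive term-overlap scoring (a stand-in for vector similarity against Qdrant), and the dictionary-based hierarchy (a stand-in for the Neo4j graph) are not LeanRAG's actual data structures.

```python
# Toy sketch of LeanRAG-style bottom-up retrieval (illustrative only).

def bottom_up_retrieve(query_terms, entities, parent_of, summaries, top_k=2):
    """Anchor at fine-grained entities, then climb to summary nodes."""
    # 1. Anchor: score entities by naive term overlap (a stand-in for
    #    embedding similarity search).
    scored = sorted(
        entities,
        key=lambda e: -len(set(e.lower().split()) & set(query_terms)),
    )
    anchors = scored[:top_k]

    # 2. Traverse upward through the aggregation hierarchy, collecting
    #    each node's evidence exactly once (redundancy-aware).
    evidence, seen = [], set()
    for node in anchors:
        while node is not None:
            if node not in seen:
                seen.add(node)
                evidence.append(summaries.get(node, node))
            node = parent_of.get(node)
    return evidence

entities = ["gradient descent", "backpropagation", "transformer attention"]
parent_of = {
    "gradient descent": "optimization",
    "backpropagation": "optimization",
    "transformer attention": "architectures",
    "optimization": "machine learning",
    "architectures": "machine learning",
}
summaries = {
    "optimization": "Summary: methods for minimizing training loss.",
    "machine learning": "Summary: learning patterns from data.",
}

print(bottom_up_retrieve({"gradient", "descent"}, entities, parent_of, summaries))
```

Note how the second anchor contributes only its own node: its ancestors were already collected on the first climb, which is the redundancy reduction the pipeline description refers to.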

๏ฟฝ๏ธ Command Line Interface (CLI)

LeanRAG provides a unified CLI tool for easy workflow management:

### Installation

The CLI is included with LeanRAG. No additional installation required.

### Available Commands

```shell
# Check system status
leanrag check

# Chunk documents
leanrag chunk datasets/mix/mix.jsonl --strategy semantic --chunk-size 1024

# Extract triples and entities
leanrag extract output/mix/mix_chunk.json

# Build knowledge graph
leanrag build output/mix/

# Query the knowledge graph
leanrag query "What is machine learning?" output/mix/ --top-k 5

# Run complete pipeline (with guidance)
leanrag pipeline datasets/mix/mix.jsonl --query "What is AI?"
```

### Command Reference

#### `leanrag check`

Validate environment setup and database connectivity.

#### `leanrag chunk <input_path> [options]`

Chunk documents for processing. Supports individual files or entire directories.

  • --strategy: semantic, hybrid, or fixed_token (default: semantic)
  • --chunk-size: Maximum tokens per chunk (default: 1024)
  • --overlap: Token overlap between chunks (default: 128)
  • --output-dir: Output directory (default: output)

Examples:

```shell
# Chunk a single JSONL file
leanrag chunk datasets/mix/mix.jsonl --strategy semantic --chunk-size 1024

# Chunk a single PDF file
leanrag chunk document.pdf --strategy semantic --chunk-size 1024

# Chunk all supported files in a directory (recursively)
leanrag chunk datasets/ --strategy semantic --chunk-size 1024
```

#### `leanrag extract <chunk_path> [options]`

Extract triples and entities from chunks (GraphRAG path). Supports individual files or entire directories.

  • --output-dir: Output directory (default: output)

Examples:

```shell
# Extract from a single chunk file
leanrag extract output/mix/mix_chunk.json

# Extract from all chunk files in a directory
leanrag extract output/
```

#### `leanrag build <working_dir> [options]`

Build/update knowledge graph from new entities and relationships in SQLite database.

  • --config: Configuration file path

Note: Only processes entities/relationships marked `is_new=1`, then marks them as processed (`is_new=0`).
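The `is_new` bookkeeping can be illustrated with an in-memory SQLite sketch. The table and column names below (other than `is_new` itself) are assumptions for illustration, not LeanRAG's actual schema:

```python
import sqlite3

# Illustrative sketch of incremental graph builds via an is_new flag.
# Schema (table "entities", columns name/description) is hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE entities (name TEXT, description TEXT, is_new INTEGER)")
con.executemany(
    "INSERT INTO entities VALUES (?, ?, ?)",
    [("alpha", "already in the graph", 0), ("beta", "freshly extracted", 1)],
)

# 1. Select only unprocessed rows for the graph build.
new_rows = con.execute("SELECT name FROM entities WHERE is_new = 1").fetchall()

# 2. After merging them into the graph, mark them as processed.
con.execute("UPDATE entities SET is_new = 0 WHERE is_new = 1")
con.commit()

print(new_rows)  # only "beta" was pending
remaining = con.execute(
    "SELECT COUNT(*) FROM entities WHERE is_new = 1"
).fetchone()[0]
print(remaining)  # nothing left to process
```

Running `leanrag build` a second time therefore does no extra work until new extractions land in the database.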

#### `leanrag query <query> <working_dir> [options]`

Query the knowledge graph.

  • --top-k: Number of top entities to retrieve (default: 10)
  • --chunks-file: Path to chunks file (auto-detected if not provided)

#### `leanrag pipeline <input_file> [options]`

Run the complete pipeline with guidance.

  • --query: Optional query to run after building
  • --output-dir: Output directory

## 🚀 Getting Started

### Prerequisites

  • Python 3.10+
  • Conda for environment management (optional)
  • Neo4j (graph database) or Docker
  • Qdrant (vector database) or Docker
  • API keys for LLM services (OpenAI/xAI, Together AI)

### Installation

  1. Clone the repository:

     ```shell
     git clone https://github.com/RaZzzyz/LeanRAG.git
     cd LeanRAG
     ```

  2. Create a virtual environment:

     ```shell
     conda create -n leanrag python=3.11
     conda activate leanrag
     # Or using venv
     python -m venv leanrag-env
     source leanrag-env/bin/activate  # On Windows: leanrag-env\Scripts\activate
     ```

  3. Install the required dependencies:

     ```shell
     pip install -r requirements.txt
     ```

     Note: LeanRAG uses API-based models exclusively. No local model weights or GPUs are required; all processing happens via cloud APIs.

  4. Set up the databases (see the Database Architecture section below).

  5. Configure environment variables by copying and editing `.env`:

     ```shell
     cp .env.example .env  # If the example exists, otherwise create .env
     # Edit .env with your API keys and database URLs
     ```

๐Ÿ—„๏ธ Database Architecture

LeanRAG uses a modern, API-first database architecture optimized for cloud deployment:

### Vector Storage: Qdrant

  • Purpose: High-performance vector similarity search for embeddings
  • Version: 1.15.1+
  • Features: Cosine similarity, efficient indexing, scalable
  • Configuration: Set `QDRANT_URL`, `QDRANT_API_KEY`, `QDRANT_COLLECTION` in `.env`

### Graph Database: Neo4j

  • Purpose: Native graph storage and traversal for knowledge graphs
  • Version: 5.23+ Community Edition
  • Features: Real-time graph analytics, ACID transactions, Cypher queries
  • Configuration: Set `GRAPH_URI`, `GRAPH_USER`, `GRAPH_PASSWORD` in `.env`

### Relational Data: SQLite

  • Purpose: Lightweight storage for metadata and intermediate results
  • Benefits: No server setup required, file-based, ACID compliant
  • Location: leanrag.db in working directory

### API-Based Models

  • LLM: OpenAI-compatible API (xAI, OpenAI, etc.)
  • Embeddings: Together AI API for text embeddings
  • Benefits: No local GPU requirements, scalable, cost-effective

### Setup Instructions

  1. Install Neo4j (for graph operations):

     ```shell
     # Using Docker (recommended)
     docker run -d \
       --name leanrag-neo4j \
       -p 7474:7474 -p 7687:7687 \
       -v ./neo4j_data:/data \
       -v ./neo4j_logs:/logs \
       -e NEO4J_AUTH=neo4j/test123456 \
       neo4j:5.23-community
     ```

  2. Install Qdrant (for vector search):

     ```shell
     # Using Docker
     docker run -p 6333:6333 qdrant/qdrant
     ```

  3. Configure environment (`.env` file):

     ```shell
     # Qdrant vector database
     QDRANT_URL=http://localhost:6333
     QDRANT_API_KEY=
     QDRANT_COLLECTION=leanrag-vectors

     # Neo4j graph database
     GRAPH_URI=bolt://localhost:7687
     GRAPH_USER=neo4j
     GRAPH_PASSWORD=test123456

     # API services
     TOGETHER_API_KEY=your_key_here
     OPENAI_API_KEY=your_xai_key_here
     OPENAI_BASE_URL=https://api.x.ai/v1
     ```

## 💻 Usage Workflow

Here's a typical pipeline flow using the LeanRAG CLI:

### Step 1: Document Chunking

```shell
# Chunk a single file
leanrag chunk datasets/mix/mix.jsonl --strategy semantic --chunk-size 1024

# Or chunk an entire directory (processes all .pdf and .jsonl files recursively)
leanrag chunk datasets/ --strategy semantic --chunk-size 1024
```

Options:

  • --strategy: Choose semantic (recommended), hybrid, or fixed_token
  • --chunk-size: Maximum tokens per chunk (default: 1024)
  • --overlap: Token overlap between chunks (default: 128)

Output: `output/<dataset_name>/<dataset_name>_chunk.json` with enhanced chunk metadata
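The overlap arithmetic behind the `fixed_token` strategy can be sketched as follows; the whitespace-free token list and windowing details are illustrative assumptions, not the chunker's actual implementation (which tokenizes with `tiktoken` when available):

```python
# Sketch of fixed-size chunking with overlap: each window starts
# (chunk_size - overlap) tokens after the previous one, so the tail of
# one chunk reappears at the head of the next for context continuity.

def fixed_token_chunks(tokens, chunk_size=1024, overlap=128):
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

tokens = [f"tok{i}" for i in range(10)]
chunks = fixed_token_chunks(tokens, chunk_size=4, overlap=1)
print([c[0] for c in chunks])  # windows start at tok0, tok3, tok6, tok9
```

With the CLI defaults (`--chunk-size 1024`, `--overlap 128`), consecutive chunks would share 128 tokens and advance 896 tokens at a time.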


### Step 2: Extract Triples and Entity Descriptions

#### GraphRAG Extraction

```shell
# Extract from a single chunk file
leanrag extract output/mix/mix_chunk.json

# Or extract from all chunk files in a directory (recursively)
leanrag extract output/
```

Note: Entities and relationships are stored in the SQLite database with an `is_new` flag


### Step 3: Build/Update the Knowledge Graph

```shell
# Build graph from new entities and relationships in SQLite
leanrag build output/
```

Note: Only processes entities/relationships marked `is_new=1`, then marks them as processed


### Step 4: Query the Knowledge Graph

```shell
leanrag query "What is machine learning?" output/mix/ --top-k 5
```

Returns context-aware answers with evidence from the knowledge graph.


### Alternative: Complete Pipeline

For new users, run the guided pipeline:

```shell
leanrag pipeline datasets/mix/mix.jsonl --query "What is AI?"
```

This provides step-by-step guidance and automatically sets up the workflow.

### Manual Workflow (Advanced)

If you prefer manual control, you can still call core Python scripts:

```shell
# Chunk documents
python file_chunk.py

# Extract triples (configure LLM endpoints first)
python GraphExtraction/chunk.py

# Build graph
python build_graph.py

# Query
python query_graph.py
```

## Acknowledgement

We gratefully acknowledge the use of the following open-source projects in our work:

  • nano-graphrag: a simple, easy-to-hack GraphRAG implementation

  • HiRAG: a novel hierarchy entity aggregation and optimized retrieval RAG method

## 📄 Citation

If you find LeanRAG useful, please cite our paper:

```bibtex
@misc{zhang2025leanragknowledgegraphbasedgenerationsemantic,
      title={LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval},
      author={Yaoze Zhang and Rong Wu and Pinlong Cai and Xiaoman Wang and Guohang Yan and Song Mao and Ding Wang and Botian Shi},
      year={2025},
      eprint={2508.10391},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2508.10391},
}
```

## Progress Bars & Optional Dependencies

`tqdm` and `tiktoken` are optional. If they are not installed, the code falls back gracefully (token counting becomes heuristic; progress bars are disabled automatically). To disable progress bars explicitly, set:

```shell
export PROGRESS=0
```
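The graceful-fallback pattern can be sketched like this; the roughly-4-characters-per-token heuristic is an assumption for illustration, not necessarily the exact fallback LeanRAG uses:

```python
import os

# Optional-dependency fallback sketch: prefer tiktoken / tqdm when
# installed, degrade to cheap substitutes when they are missing.
try:
    import tiktoken
    _enc = tiktoken.get_encoding("cl100k_base")

    def count_tokens(text: str) -> int:
        return len(_enc.encode(text))
except ImportError:
    def count_tokens(text: str) -> int:
        # Heuristic: ~4 characters per token for English text.
        return max(1, len(text) // 4)

try:
    from tqdm import tqdm
except ImportError:
    def tqdm(iterable, **kwargs):  # no-op stand-in for the progress bar
        return iterable

# PROGRESS=0 disables bars even when tqdm is installed.
show_progress = os.environ.get("PROGRESS", "1") != "0"

print(count_tokens("LeanRAG counts tokens even without tiktoken."))
```

Either way, calling code can use `count_tokens` and `tqdm` unconditionally without checking which implementation it got.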

## Environment Variables (LLM)

The chat client looks for (in order): `OPENAI_API_KEY`, `OPENAI_BASE_URL` (defaults to `https://api.openai.com/v1`), and a model via `MODEL_LLM` or `OPENAI_MODEL`. Set one of the model env vars, e.g.:

```shell
export OPENAI_API_KEY=sk-...
export MODEL_LLM=grok-4-fast-reasoning
```

Embeddings still use the Together API (`TOGETHER_API_KEY`).
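The lookup order can be sketched as a small resolver. The variable names and default base URL come from this section; the resolver function itself is illustrative, not LeanRAG's actual client code:

```python
import os

# Hypothetical resolver mirroring the documented env-var lookup order.
def resolve_llm_config(env=os.environ):
    api_key = env.get("OPENAI_API_KEY")
    base_url = env.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    model = env.get("MODEL_LLM") or env.get("OPENAI_MODEL")
    if not api_key or not model:
        raise RuntimeError("OPENAI_API_KEY and MODEL_LLM/OPENAI_MODEL are required")
    return {"api_key": api_key, "base_url": base_url, "model": model}

cfg = resolve_llm_config({
    "OPENAI_API_KEY": "sk-example",
    "MODEL_LLM": "grok-4-fast-reasoning",
})
print(cfg["base_url"])  # falls back to the default endpoint
```

Note that `MODEL_LLM` wins over `OPENAI_MODEL` when both are set, and `OPENAI_BASE_URL` is the only variable with a built-in default.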
