# LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval
LeanRAG is an efficient, open-source framework for Retrieval-Augmented Generation, leveraging knowledge graph structures with semantic aggregation and hierarchical retrieval to generate context-aware, concise, and high-fidelity responses.
Update (API-first refactor): All LLM interactions now use a unified OpenAI-compatible client (`OpenAIChatClient`). Legacy local model orchestration (Ollama / vLLM) and the older `InstanceManager` have been removed. The `CommonKG` folder is retained temporarily as a legacy reference and will be deprecated; prefer the GraphRAG (default) extraction path.
- API-First Architecture: Cloud-based LLMs and embeddings (OpenAI/xAI, Together AI) - no local GPU requirements
- Modern Database Stack: Qdrant for vectors, Neo4j for graphs, SQLite for relational data
- Semantic Aggregation: Clusters entities into semantically coherent summaries and constructs explicit relations to form a navigable aggregation-level knowledge network.
- Hierarchical, Structure-Guided Retrieval: Initiates retrieval from fine-grained entities and traverses up the knowledge graph to gather rich, highly relevant evidence efficiently.
- Reduced Redundancy: Optimizes retrieval paths to significantly reduce redundant informationโLeanRAG achieves ~46% lower retrieval redundancy compared to flat retrieval baselines (based on benchmark evaluations).
- Benchmark Performance: Demonstrates superior performance across multiple QA benchmarks with improved response quality and retrieval efficiency.
LeanRAG's processing pipeline follows these core stages with a modern database architecture:
- Qdrant: Vector storage for efficient similarity search on embeddings
- Neo4j: Graph database for knowledge graph storage and Cypher-based queries
- SQLite: Lightweight relational storage for metadata and intermediate results
- API Services: Cloud-based LLMs and embedding models (no local GPU requirements)
1. **Semantic Aggregation**: Group low-level entities into clusters; generate summary nodes and build adjacency relations among them for efficient navigation.
2. **Knowledge Graph Construction**: Construct a multi-layer graph where nodes represent entities and aggregated summaries, with explicit inter-node relations for graph-based traversal.
3. **Query Processing & Hierarchical Retrieval**: Anchor queries at the most relevant detailed entities ("bottom-up"), then traverse upward through the semantic aggregation graph to collect evidence spans.
4. **Redundancy-Aware Synthesis**: Streamline retrieval paths and avoid overlapping content, ensuring concise evidence aggregation before generating responses.
5. **Generation**: Use retrieved, well-structured evidence as input to an LLM to produce coherent, accurate, and contextually grounded answers.
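The bottom-up retrieval and redundancy-aware steps above can be sketched in plain Python. The graph, node names, and traversal policy here are toy assumptions for illustration, not LeanRAG's actual data structures:

```python
# Toy sketch of bottom-up hierarchical retrieval over a two-level graph.
# Leaf entities link to an aggregation-level summary node; evidence is
# collected once per node so shared summaries are not duplicated.
from collections import OrderedDict

# Each fine-grained entity points to its (hypothetical) summary node.
PARENT = {
    "neural network": "machine learning",
    "gradient descent": "machine learning",
    "knowledge graph": "information retrieval",
}

# Evidence text attached to each node (entity or summary).
EVIDENCE = {
    "neural network": "A neural network is a layered function approximator.",
    "gradient descent": "Gradient descent iteratively minimizes a loss.",
    "machine learning": "Summary: methods that learn patterns from data.",
    "knowledge graph": "A knowledge graph stores entities and relations.",
    "information retrieval": "Summary: finding relevant items in collections.",
}

def retrieve(anchor_entities):
    """Anchor at fine-grained entities, then climb to summary nodes,
    deduplicating nodes reached via multiple paths (redundancy-aware)."""
    seen = OrderedDict()  # preserves order, dedupes revisited nodes
    for entity in anchor_entities:
        node = entity
        while node is not None:
            if node not in seen:
                seen[node] = EVIDENCE[node]
            node = PARENT.get(node)  # traverse upward; None at the top
    return list(seen.values())

# Both anchors share one parent summary, which is collected only once.
print(retrieve(["neural network", "gradient descent"]))
```

Note how the shared "machine learning" summary appears once even though both anchor entities lead to it; this is the path-overlap pruning the pipeline description refers to.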
## Command Line Interface (CLI)

LeanRAG provides a unified CLI tool for easy workflow management:
The CLI is included with LeanRAG. No additional installation required.
```bash
# Check system status
leanrag check

# Chunk documents
leanrag chunk datasets/mix/mix.jsonl --strategy semantic --chunk-size 1024

# Extract triples and entities
leanrag extract output/mix/mix_chunk.json

# Build knowledge graph
leanrag build output/mix/

# Query the knowledge graph
leanrag query "What is machine learning?" output/mix/ --top-k 5

# Run complete pipeline (with guidance)
leanrag pipeline datasets/mix/mix.jsonl --query "What is AI?"
```

### `leanrag check`

Validate environment setup and database connectivity.
### `leanrag chunk`

Chunk documents for processing. Supports individual files or entire directories.
- `--strategy`: `semantic`, `hybrid`, or `fixed_token` (default: `semantic`)
- `--chunk-size`: Maximum tokens per chunk (default: 1024)
- `--overlap`: Token overlap between chunks (default: 128)
- `--output-dir`: Output directory (default: `output`)
Examples:
```bash
# Chunk a single JSONL file
leanrag chunk datasets/mix/mix.jsonl --strategy semantic --chunk-size 1024

# Chunk a single PDF file
leanrag chunk document.pdf --strategy semantic --chunk-size 1024

# Chunk all supported files in a directory (recursively)
leanrag chunk datasets/ --strategy semantic --chunk-size 1024
```

### `leanrag extract`

Extract triples and entities from chunks (GraphRAG path). Supports individual files or entire directories.
- `--output-dir`: Output directory (default: `output`)
Examples:
```bash
# Extract from a single chunk file
leanrag extract output/mix/mix_chunk.json

# Extract from all chunk files in a directory
leanrag extract output/
```

### `leanrag build`

Build or update the knowledge graph from new entities and relationships in the SQLite database.
- `--config`: Configuration file path
Note: Only processes entities/relationships marked as `is_new=1`, then marks them as processed (`is_new=0`).
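The `is_new` bookkeeping described in the note can be sketched with an in-memory SQLite database. The table and column names below are illustrative; the actual `leanrag.db` schema may differ:

```python
# Sketch of the is_new=1 -> is_new=0 workflow using stdlib sqlite3.
# Table and column names are hypothetical, not the real leanrag.db schema.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities (name TEXT, is_new INTEGER DEFAULT 1)")
conn.executemany(
    "INSERT INTO entities (name, is_new) VALUES (?, ?)",
    [("alan turing", 1), ("enigma", 1), ("bletchley park", 0)],
)

# The 'build' step picks up only unprocessed rows...
new_rows = conn.execute(
    "SELECT name FROM entities WHERE is_new = 1"
).fetchall()
# ...would push them into the graph store here, then marks them processed.
conn.execute("UPDATE entities SET is_new = 0 WHERE is_new = 1")
conn.commit()

print([name for (name,) in new_rows])  # only the two new entities
```

Running `leanrag build` a second time would then find no rows with `is_new = 1` and do no graph work, which is why repeated builds are cheap.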
### `leanrag query`

Query the knowledge graph.
- `--top-k`: Number of top entities to retrieve (default: 10)
- `--chunks-file`: Path to chunks file (auto-detected if not provided)
### `leanrag pipeline`

Run the complete pipeline with guidance.
- `--query`: Optional query to run after building
- `--output-dir`: Output directory
## Getting Started

- Python 3.10+
- Conda for environment management (optional)
- Neo4j (graph database) or Docker
- Qdrant (vector database) or Docker
- API keys for LLM services (OpenAI/xAI, Together AI)
1. Clone the repository:

   ```bash
   git clone https://github.com/RaZzzyz/LeanRAG.git
   cd LeanRAG
   ```

2. Create a virtual environment:

   ```bash
   conda create -n leanrag python=3.11
   conda activate leanrag
   # Or using venv
   python -m venv leanrag-env
   source leanrag-env/bin/activate  # On Windows: leanrag-env\Scripts\activate
   ```

3. Install the required dependencies:

   ```bash
   pip install -r requirements.txt
   ```

   Note: LeanRAG uses API-based models exclusively. There are no local model weights or GPU requirements; all processing happens via cloud APIs.

4. Set up the databases (see the Database Architecture section).

5. Configure environment variables by copying and editing `.env`:

   ```bash
   cp .env.example .env  # If the example exists, otherwise create .env
   # Edit .env with your API keys and database URLs
   ```
## Database Architecture

LeanRAG uses a modern, API-first database architecture optimized for cloud deployment:
### Qdrant

- Purpose: High-performance vector similarity search for embeddings
- Version: 1.15.1+
- Features: Cosine similarity, efficient indexing, scalable
- Configuration: Set `QDRANT_URL`, `QDRANT_API_KEY`, `QDRANT_COLLECTION` in `.env`
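Qdrant handles the similarity search at scale; as a quick illustration of the cosine metric listed among its features, here is a plain-Python version over toy vectors (real embeddings come from the embedding API, and real search goes through Qdrant):

```python
# Plain-Python illustration of cosine-similarity ranking, the metric the
# vector collection uses. The three-dimensional vectors are toy examples.
import math

def cosine(a, b):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

corpus = {
    "chunk-1": [0.9, 0.1, 0.0],
    "chunk-2": [0.1, 0.9, 0.1],
    "chunk-3": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

# Rank chunks by similarity to the query, best first (top-k with k=2).
top2 = sorted(corpus, key=lambda cid: cosine(query, corpus[cid]), reverse=True)[:2]
print(top2)
```

A vector database replaces this linear scan with an approximate nearest-neighbor index, which is what makes the retrieval stage fast at corpus scale.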
### Neo4j

- Purpose: Native graph storage and traversal for knowledge graphs
- Version: 5.23+ Community Edition
- Features: Real-time graph analytics, ACID transactions, Cypher queries
- Configuration: Set `GRAPH_URI`, `GRAPH_USER`, `GRAPH_PASSWORD` in `.env`
### SQLite

- Purpose: Lightweight storage for metadata and intermediate results
- Benefits: No server setup required, file-based, ACID compliant
- Location: `leanrag.db` in the working directory
### API Services

- LLM: OpenAI-compatible API (xAI, OpenAI, etc.)
- Embeddings: Together AI API for text embeddings
- Benefits: No local GPU requirements, scalable, cost-effective
1. Install Neo4j (for graph operations):

   ```bash
   # Using Docker (recommended)
   docker run -d \
     --name leanrag-neo4j \
     -p 7474:7474 -p 7687:7687 \
     -v ./neo4j_data:/data \
     -v ./neo4j_logs:/logs \
     -e NEO4J_AUTH=neo4j/test123456 \
     neo4j:5.23-community
   ```

2. Install Qdrant (for vector search):

   ```bash
   # Using Docker
   docker run -p 6333:6333 qdrant/qdrant
   ```

3. Configure environment (`.env` file):

   ```bash
   # Qdrant vector database
   QDRANT_URL=http://localhost:6333
   QDRANT_API_KEY=
   QDRANT_COLLECTION=leanrag-vectors

   # Neo4j graph database
   GRAPH_URI=bolt://localhost:7687
   GRAPH_USER=neo4j
   GRAPH_PASSWORD=test123456

   # API services
   TOGETHER_API_KEY=your_key_here
   OPENAI_API_KEY=your_xai_key_here
   OPENAI_BASE_URL=https://api.x.ai/v1
   ```
Here's a typical pipeline flow using the LeanRAG CLI:
```bash
# Chunk a single file
leanrag chunk datasets/mix/mix.jsonl --strategy semantic --chunk-size 1024

# Or chunk an entire directory (processes all .pdf and .jsonl files recursively)
leanrag chunk datasets/ --strategy semantic --chunk-size 1024
```

Options:

- `--strategy`: Choose `semantic` (recommended), `hybrid`, or `fixed_token`
- `--chunk-size`: Maximum tokens per chunk (default: 1024)
- `--overlap`: Token overlap between chunks (default: 128)

Output: `output/<dataset_name>/<dataset_name>_chunk.json` with enhanced chunk metadata
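The `--chunk-size` / `--overlap` semantics can be sketched as a sliding window. For simplicity this sketch treats whitespace-separated words as "tokens"; the real pipeline counts tokens with `tiktoken` when it is installed:

```python
# Sketch of fixed-size chunking with overlap, mirroring the --chunk-size
# and --overlap options. "Tokens" here are whitespace words, an
# approximation of the real tokenizer-based counting.

def chunk_tokens(tokens, chunk_size, overlap):
    """Slide a chunk_size window over tokens, stepping by chunk_size - overlap
    so consecutive chunks share `overlap` tokens of context."""
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks

words = ("LeanRAG chunks documents before extraction so each piece "
         "fits the model context window").split()
for piece in chunk_tokens(words, chunk_size=6, overlap=2):
    print(" ".join(piece))
```

The overlap keeps sentences that straddle a chunk boundary visible in both neighbors, which is why a non-zero `--overlap` default (128) is used.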
GraphRAG Extraction
```bash
# Extract from a single chunk file
leanrag extract output/mix/mix_chunk.json

# Or extract from all chunk files in a directory (recursively)
leanrag extract output/
```

Note: Entities and relationships are stored in the SQLite database with an `is_new` flag.
```bash
# Build graph from new entities and relationships in SQLite
leanrag build output/
```

Note: Only processes entities/relationships marked as `is_new=1`, then marks them as processed.
```bash
leanrag query "What is machine learning?" output/mix/ --top-k 5
```

Returns context-aware answers with evidence from the knowledge graph.
For new users, run the guided pipeline:
```bash
leanrag pipeline datasets/mix/mix.jsonl --query "What is AI?"
```

This provides step-by-step guidance and automatically sets up the workflow.
If you prefer manual control, you can still call core Python scripts:
```bash
# Chunk documents
python file_chunk.py

# Extract triples (configure LLM endpoints first)
python GraphExtraction/chunk.py

# Build graph
python build_graph.py

# Query
python query_graph.py
```

We gratefully acknowledge the use of the following open-source projects in our work:
- nano-graphrag: a simple, easy-to-hack GraphRAG implementation
- HiRAG: a novel hierarchical entity aggregation and optimized retrieval RAG method
If you find LeanRAG useful, please cite our paper:
```bibtex
@misc{zhang2025leanragknowledgegraphbasedgenerationsemantic,
  title={LeanRAG: Knowledge-Graph-Based Generation with Semantic Aggregation and Hierarchical Retrieval},
  author={Yaoze Zhang and Rong Wu and Pinlong Cai and Xiaoman Wang and Guohang Yan and Song Mao and Ding Wang and Botian Shi},
  year={2025},
  eprint={2508.10391},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2508.10391},
}
```
`tqdm` and `tiktoken` are optional. If they are not installed, the code falls back gracefully (token counting becomes heuristic; progress bars are disabled automatically). To disable progress bars explicitly, set:

```bash
export PROGRESS=0
```
The chat client looks for (in order): `OPENAI_API_KEY`, `OPENAI_BASE_URL` (defaults to `https://api.openai.com/v1`), and a model via `MODEL_LLM` or `OPENAI_MODEL`. Set one of the model env vars, e.g.:

```bash
export OPENAI_API_KEY=sk-...
export MODEL_LLM=grok-4-fast-reasoning
```
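The lookup order described above can be sketched as follows. This is a hypothetical mirror of the documented precedence, not the actual `OpenAIChatClient` internals:

```python
# Sketch of the documented env-var resolution order: OPENAI_API_KEY is
# required, OPENAI_BASE_URL falls back to the OpenAI default, and the
# model comes from MODEL_LLM first, then OPENAI_MODEL.
import os

def resolve_chat_config(env=os.environ):
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required")
    base_url = env.get("OPENAI_BASE_URL", "https://api.openai.com/v1")
    model = env.get("MODEL_LLM") or env.get("OPENAI_MODEL")
    if not model:
        raise RuntimeError("set MODEL_LLM or OPENAI_MODEL")
    return {"api_key": api_key, "base_url": base_url, "model": model}

cfg = resolve_chat_config({"OPENAI_API_KEY": "sk-test",
                           "MODEL_LLM": "grok-4-fast-reasoning"})
print(cfg["base_url"], cfg["model"])
```

Because `OPENAI_BASE_URL` is just a default, pointing it at another OpenAI-compatible endpoint (such as `https://api.x.ai/v1` from the `.env` example) switches providers without code changes.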
Embeddings still use the Together AI API (`TOGETHER_API_KEY`).
