Code Chunker

A pragmatic multi-language code parser optimized for LLM applications and RAG systems.

Features

Multi-language support: Python, JavaScript, TypeScript, Solidity, Go, Rust
Optimized for LLMs: Provides structured output ideal for language models
Lightweight: Minimal dependencies, fast parsing
Configurable: Adjust chunk sizes, confidence thresholds, and more
Easy to use: Simple API with both file and directory parsing
Incremental parsing: Efficiently update parse results when code changes
Enhanced language support:
- TypeScript/React: Component, Hook, and Context detection
- Solidity: Smart contract metadata extraction (visibility, modifiers, payable)
- Go: Concurrency pattern detection (goroutines, channels, mutexes)

Installation

pip install code-chunker

Quick Start

from code_chunker import CodeChunker

# Initialize the chunker
chunker = CodeChunker()

# Parse a code string
code = """
def hello_world():
    print("Hello, World!")
"""

result = chunker.parse(code, language='python')

# Print the chunks
for chunk in result.chunks:
    print(f"{chunk.type.value}: {chunk.name} (lines {chunk.start_line}-{chunk.end_line})")

# Parse a file
result = chunker.parse_file('example.py')

# Parse a directory
results = chunker.parse_directory('src/')

Configuration

from code_chunker import CodeChunker, ChunkerConfig

config = ChunkerConfig(
    max_chunk_size=2000,
    min_chunk_size=100,
    include_comments=True,
    confidence_threshold=0.8
)

chunker = CodeChunker(config=config)

Incremental Parsing

Incremental parsing allows you to efficiently update parse results when code changes, without reparsing the entire file.

from code_chunker import CodeChunker, IncrementalParser

# Initialize the incremental parser
incremental_parser = IncrementalParser()

# First parse (full parse)
result1 = incremental_parser.full_parse("path/to/file.py")

# After file changes, perform an incremental parse
result2 = incremental_parser.incremental_parse("path/to/file.py")

# Compare the results
print(f"Full parse chunks: {len(result1.chunks)}")
print(f"Incremental parse chunks: {len(result2.chunks)}")

Enhanced Language Support

TypeScript/React Support

Code Chunker provides specialized support for React components, hooks, and contexts:

from code_chunker import CodeChunker, ChunkerConfig, get_config_for_use_case

# Get React-optimized configuration
config = ChunkerConfig(**get_config_for_use_case('typescript', 'react'))
chunker = CodeChunker(config=config)

# Parse React component
result = chunker.parse(react_code, language='typescript')

# Filter for React components
components = [chunk for chunk in result.chunks if chunk.type.value == 'component']
for component in components:
    print(f"Component: {component.name} (type: {component.metadata.get('component_type')})")

Solidity Smart Contract Support

Enhanced metadata extraction for smart contracts:

from code_chunker import CodeChunker, ChunkerConfig, get_config_for_use_case

# Get Solidity-optimized configuration
config = ChunkerConfig(**get_config_for_use_case('solidity', 'contract'))
chunker = CodeChunker(config=config)

# Parse Solidity contract
result = chunker.parse(contract_code, language='solidity')

# Find payable functions
payable_functions = [
    chunk for chunk in result.chunks 
    if chunk.type.value == 'function' and chunk.metadata.get('is_payable', False)
]

Go Concurrency Pattern Detection

Automatically detect concurrency patterns in Go code:

from code_chunker import CodeChunker, ChunkerConfig, get_config_for_use_case

# Get Go-optimized configuration
config = ChunkerConfig(**get_config_for_use_case('go', 'performance'))
chunker = CodeChunker(config=config)

# Parse Go code
result = chunker.parse(go_code, language='go')

# Find functions with goroutines
concurrent_funcs = [
    chunk for chunk in result.chunks 
    if chunk.type.value in ['function', 'method'] 
    and 'goroutines' in chunk.metadata.get('concurrency_patterns', {})
]

Supported Languages

Python (.py)
JavaScript (.js, .jsx)
TypeScript (.ts, .tsx)
Solidity (.sol)
Go (.go)
Rust (.rs)

Examples

The examples/ directory contains several examples demonstrating different features:

Basic Usage

Simple parsing examples:

python examples/basic_usage.py

Advanced Usage

Custom configuration and analysis:

python examples/advanced_usage.py

Incremental Parsing

Efficient parsing of code changes:

python examples/incremental_parsing.py

RAG Integration

Integration with RAG systems:

python examples/rag_integration.py

Edge Cases

Testing various edge cases across languages:

python examples/edge_cases.py

Performance Analysis

Analyze parsing performance:

python examples/performance_analysis.py

Code Quality Analysis

Analyze code quality metrics:

python examples/quality_analysis.py <file_path>

Visualization

Generate code structure visualization:

python examples/visualization.py <file_path>

API Reference

CodeChunker

The main class for parsing code.

chunker = CodeChunker(config=None)

Methods

parse(code: str, language: str) -> ParseResult: Parse a code string
parse_file(file_path: Union[str, Path]) -> ParseResult: Parse a file
parse_directory(directory: Union[str, Path], recursive: bool = True, extensions: Optional[List[str]] = None) -> List[ParseResult]: Parse a directory

IncrementalParser

For efficient incremental parsing.

parser = IncrementalParser(chunker=None)

Methods

full_parse(file_path: str) -> ParseResult: Perform a full parse and cache the result
parse_incremental(file_path: str, changes: List[Tuple[int, int, str]]) -> ParseResult: Parse incrementally based on changes
invalidate_cache(file_path: Optional[str] = None) -> None: Invalidate cache for a file or all files

How Incremental Parsing Works

Initial Parse: The first parse of a file is a full parse, which is cached
Change Detection: When changes are made, only affected code regions are identified
Selective Reparsing: Only affected chunks are reparsed, preserving the rest
Result Merging: Updated chunks are merged with unchanged chunks
Smart Caching: Results are cached for future incremental updates

ParseResult

The result of parsing code.

Attributes

language: str: The programming language
file_path: Optional[str]: Path to the source file
chunks: List[CodeChunk]: List of code chunks
imports: List[Import]: List of imports
exports: List[str]: List of exports
raw_code: str: The original code

CodeChunk

Represents a piece of code.

Attributes

type: ChunkType: The type of chunk (function, class, etc.)
name: Optional[str]: The name of the chunk
code: str: The actual code
start_line: int: Starting line number
end_line: int: Ending line number
language: str: Programming language
confidence: float: Confidence score (0-1)
metadata: Dict[str, Any]: Additional metadata

Dependencies

For basic usage: No external dependencies
For performance analysis: psutil
For visualization: Modern web browser to view generated HTML

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Development Setup

Clone the repository
Install development dependencies:
```
pip install -e ".[dev]"
```
Run tests:
```
pytest
```
Format code:
```
black code_chunker/
```

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

If you find this project helpful, consider supporting its development:

⭐ Star this repository
🐛 Report bugs and suggest features
🤝 Submit pull requests
💰 EVM(ETH, ARB, BNB, OP..etc): 0x8f74959530dba14394b27faac92955aa96927e8b

Acknowledgments

Thanks to all contributors and the open-source community for their support.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
.github/workflows		.github/workflows
code_chunker		code_chunker
examples		examples
tests		tests
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE		LICENSE
README.md		README.md
basic_usage_structure.html		basic_usage_structure.html
pyproject.toml		pyproject.toml
setup.py		setup.py
test_file.py		test_file.py
test_file_structure.html		test_file_structure.html

License

JimAiMoment/code-chunker

Folders and files

Latest commit

History

Repository files navigation

Code Chunker

Features

Installation

Quick Start

Configuration

Incremental Parsing

Enhanced Language Support

TypeScript/React Support

Solidity Smart Contract Support

Go Concurrency Pattern Detection

Supported Languages

Examples

Basic Usage

Advanced Usage

Incremental Parsing

RAG Integration

Edge Cases

Performance Analysis

Code Quality Analysis

Visualization

API Reference

CodeChunker

Methods

IncrementalParser

Methods

How Incremental Parsing Works

ParseResult

Attributes

CodeChunk

Attributes

Dependencies

Contributing

Development Setup

License

Support

Acknowledgments

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages