A pragmatic multi-language code parser optimized for LLM applications and RAG systems.
- Multi-language support: Python, JavaScript, TypeScript, Solidity, Go, Rust
- Optimized for LLMs: Provides structured output ideal for language models
- Lightweight: Minimal dependencies, fast parsing
- Configurable: Adjust chunk sizes, confidence thresholds, and more
- Easy to use: Simple API with both file and directory parsing
- Incremental parsing: Efficiently update parse results when code changes
- Enhanced language support:
  - TypeScript/React: Component, Hook, and Context detection
  - Solidity: Smart contract metadata extraction (visibility, modifiers, payable)
  - Go: Concurrency pattern detection (goroutines, channels, mutexes)
pip install code-chunker

from code_chunker import CodeChunker
# Initialize the chunker
chunker = CodeChunker()
# Parse a code string
code = """
def hello_world():
    print("Hello, World!")
"""
result = chunker.parse(code, language='python')
# Print the chunks
for chunk in result.chunks:
    print(f"{chunk.type.value}: {chunk.name} (lines {chunk.start_line}-{chunk.end_line})")
# Parse a file
result = chunker.parse_file('example.py')
# Parse a directory
results = chunker.parse_directory('src/')

from code_chunker import CodeChunker, ChunkerConfig

config = ChunkerConfig(
    max_chunk_size=2000,
    min_chunk_size=100,
    include_comments=True,
    confidence_threshold=0.8
)
chunker = CodeChunker(config=config)

Incremental parsing allows you to efficiently update parse results when code changes, without reparsing the entire file.
from code_chunker import CodeChunker, IncrementalParser
# Initialize the incremental parser
incremental_parser = IncrementalParser()
# First parse (full parse)
result1 = incremental_parser.full_parse("path/to/file.py")
# After file changes, perform an incremental parse
result2 = incremental_parser.incremental_parse("path/to/file.py")
# Compare the results
print(f"Full parse chunks: {len(result1.chunks)}")
print(f"Incremental parse chunks: {len(result2.chunks)}")

Code Chunker provides specialized support for React components, hooks, and contexts:
from code_chunker import CodeChunker, ChunkerConfig, get_config_for_use_case
# Get React-optimized configuration
config = ChunkerConfig(**get_config_for_use_case('typescript', 'react'))
chunker = CodeChunker(config=config)
# Parse React component
result = chunker.parse(react_code, language='typescript')
# Filter for React components
components = [chunk for chunk in result.chunks if chunk.type.value == 'component']
for component in components:
    print(f"Component: {component.name} (type: {component.metadata.get('component_type')})")

Enhanced metadata extraction for smart contracts:
from code_chunker import CodeChunker, ChunkerConfig, get_config_for_use_case
# Get Solidity-optimized configuration
config = ChunkerConfig(**get_config_for_use_case('solidity', 'contract'))
chunker = CodeChunker(config=config)
# Parse Solidity contract
result = chunker.parse(contract_code, language='solidity')
# Find payable functions
payable_functions = [
    chunk for chunk in result.chunks
    if chunk.type.value == 'function' and chunk.metadata.get('is_payable', False)
]

Automatically detect concurrency patterns in Go code:
from code_chunker import CodeChunker, ChunkerConfig, get_config_for_use_case
# Get Go-optimized configuration
config = ChunkerConfig(**get_config_for_use_case('go', 'performance'))
chunker = CodeChunker(config=config)
# Parse Go code
result = chunker.parse(go_code, language='go')
# Find functions with goroutines
concurrent_funcs = [
    chunk for chunk in result.chunks
    if chunk.type.value in ['function', 'method']
    and 'goroutines' in chunk.metadata.get('concurrency_patterns', {})
]

Supported languages and file extensions:
- Python (.py)
- JavaScript (.js, .jsx)
- TypeScript (.ts, .tsx)
- Solidity (.sol)
- Go (.go)
- Rust (.rs)
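
The extension list above lends itself to a simple lookup when deciding which `language` value to pass to `parse`. The following is a minimal sketch, not the library's own detection logic: `EXTENSION_TO_LANGUAGE` and `guess_language` are hypothetical names, and the `'javascript'` and `'rust'` language strings are assumed by analogy with the `'python'`, `'typescript'`, `'solidity'`, and `'go'` values used in the examples.

```python
from pathlib import Path
from typing import Optional

# Hypothetical lookup table built from the extension list above.
# The library's internal mapping may be organized differently.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "javascript",   # assumed language string
    ".jsx": "javascript",  # assumed language string
    ".ts": "typescript",
    ".tsx": "typescript",
    ".sol": "solidity",
    ".go": "go",
    ".rs": "rust",         # assumed language string
}


def guess_language(file_path: str) -> Optional[str]:
    """Return a language name for the path, or None for unsupported extensions."""
    suffix = Path(file_path).suffix.lower()  # case-insensitive match
    return EXTENSION_TO_LANGUAGE.get(suffix)


if __name__ == "__main__":
    for path in ["src/app.tsx", "contracts/Token.sol", "README.md"]:
        print(path, "->", guess_language(path))
```

Returning `None` for unknown extensions lets the caller decide whether to skip the file or fall back to a default language.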
The examples/ directory contains several examples demonstrating different features:
Simple parsing examples:
python examples/basic_usage.py

Custom configuration and analysis:
python examples/advanced_usage.py

Efficient parsing of code changes:
python examples/incremental_parsing.py

Integration with RAG systems:
python examples/rag_integration.py

Testing various edge cases across languages:
python examples/edge_cases.py

Analyze parsing performance:
python examples/performance_analysis.py

Analyze code quality metrics:
python examples/quality_analysis.py <file_path>

Generate code structure visualization:
python examples/visualization.py <file_path>

CodeChunker is the main class for parsing code.
chunker = CodeChunker(config=None)

- parse(code: str, language: str) -> ParseResult: Parse a code string
- parse_file(file_path: Union[str, Path]) -> ParseResult: Parse a file
- parse_directory(directory: Union[str, Path], recursive: bool = True, extensions: Optional[List[str]] = None) -> List[ParseResult]: Parse a directory
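
To make the `recursive` and `extensions` parameters of `parse_directory` concrete, here is a rough sketch of how file discovery with those two options could behave. It uses only `pathlib`; `collect_source_files` and `DEFAULT_EXTENSIONS` are hypothetical names for illustration, and the actual traversal order and defaults inside the library are not documented here.

```python
from pathlib import Path
from typing import List, Optional, Union

# Assumed default: every extension from the supported-languages list.
DEFAULT_EXTENSIONS = [".py", ".js", ".jsx", ".ts", ".tsx", ".sol", ".go", ".rs"]


def collect_source_files(
    directory: Union[str, Path],
    recursive: bool = True,
    extensions: Optional[List[str]] = None,
) -> List[Path]:
    """List files under `directory` whose suffix is in `extensions`."""
    root = Path(directory)
    wanted = {ext.lower() for ext in (extensions or DEFAULT_EXTENSIONS)}
    # rglob descends into subdirectories; glob stays at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        path for path in candidates
        if path.is_file() and path.suffix.lower() in wanted
    )
```

Each returned path could then be handed to `parse_file`, which is presumably close to what a directory parse amounts to.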
IncrementalParser provides efficient incremental parsing.
parser = IncrementalParser(chunker=None)

- full_parse(file_path: str) -> ParseResult: Perform a full parse and cache the result
- parse_incremental(file_path: str, changes: List[Tuple[int, int, str]]) -> ParseResult: Parse incrementally based on changes
- invalidate_cache(file_path: Optional[str] = None) -> None: Invalidate cache for a file or all files

Incremental parsing works in five stages:
- Initial Parse: The first parse of a file is a full parse, which is cached
- Change Detection: When changes are made, only affected code regions are identified
- Selective Reparsing: Only affected chunks are reparsed, preserving the rest
- Result Merging: Updated chunks are merged with unchanged chunks
- Smart Caching: Results are cached for future incremental updates
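
The stages above can be illustrated with a small, self-contained sketch. This is not the library's implementation: it swaps in content hashing for change detection (the documented `parse_incremental` takes an explicit change list instead), and `ToyIncrementalParser`, `split_blocks`, and the dictionary result shape are all hypothetical stand-ins.

```python
import hashlib
import re
from typing import Dict, List


def split_blocks(code: str) -> List[str]:
    """Toy chunker: split source at lines that start a top-level def or class."""
    parts = re.split(r"(?m)^(?=def |class )", code)
    return [part for part in parts if part.strip()]


def digest(text: str) -> str:
    """Content hash used as the cache key for a block."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class ToyIncrementalParser:
    def __init__(self) -> None:
        self._cache: Dict[str, dict] = {}  # block digest -> parsed block
        self.reparsed = 0                  # blocks parsed during the last call

    def _parse_block(self, block: str) -> dict:
        match = re.match(r"(def|class)\s+(\w+)", block)
        if match is None:
            # Leading module-level code such as imports.
            return {"kind": "module", "name": None, "code": block}
        return {"kind": match.group(1), "name": match.group(2), "code": block}

    def parse(self, code: str) -> List[dict]:
        self.reparsed = 0
        merged: List[dict] = []
        fresh_cache: Dict[str, dict] = {}
        for block in split_blocks(code):
            key = digest(block)
            parsed = self._cache.get(key)  # change detection via hash lookup
            if parsed is None:             # selective reparsing
                parsed = self._parse_block(block)
                self.reparsed += 1
            fresh_cache[key] = parsed
            merged.append(parsed)          # result merging, in source order
        self._cache = fresh_cache          # cache for the next update
        return merged


if __name__ == "__main__":
    parser = ToyIncrementalParser()
    source = "def a():\n    return 1\n\ndef b():\n    return 2\n"
    parser.parse(source)
    parser.parse(source.replace("return 2", "return 3"))
    print("blocks reparsed after edit:", parser.reparsed)
```

Hashing whole blocks keeps the sketch short, at the cost of rehashing every block on each call; tracking edited line ranges, as the `changes` argument suggests, avoids that work.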
ParseResult is the result of parsing code.

- language: str: The programming language
- file_path: Optional[str]: Path to the source file
- chunks: List[CodeChunk]: List of code chunks
- imports: List[Import]: List of imports
- exports: List[str]: List of exports
- raw_code: str: The original code
CodeChunk represents a piece of code.

- type: ChunkType: The type of chunk (function, class, etc.)
- name: Optional[str]: The name of the chunk
- code: str: The actual code
- start_line: int: Starting line number
- end_line: int: Ending line number
- language: str: Programming language
- confidence: float: Confidence score (0-1)
- metadata: Dict[str, Any]: Additional metadata
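
As an illustration of how these fields might feed a RAG pipeline, the sketch below filters chunks by confidence and turns them into plain records ready for embedding. The `CodeChunk` and `ChunkType` definitions here are stand-ins that mirror the documented fields so the example runs without the library; `to_documents`, the record layout, and the `0.8` default are assumptions, not part of the package.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class ChunkType(Enum):
    # Stand-in enum; only two of the chunk types are modelled here.
    FUNCTION = "function"
    CLASS = "class"


@dataclass
class CodeChunk:
    # Stand-in mirroring the documented attributes.
    type: ChunkType
    name: Optional[str]
    code: str
    start_line: int
    end_line: int
    language: str
    confidence: float
    metadata: Dict[str, Any] = field(default_factory=dict)


def to_documents(chunks: List[CodeChunk], min_confidence: float = 0.8) -> List[Dict[str, Any]]:
    """Keep chunks at or above the threshold and build embedding-ready records."""
    documents = []
    for chunk in chunks:
        if chunk.confidence < min_confidence:
            continue  # drop low-confidence chunks
        documents.append({
            "id": f"{chunk.language}:{chunk.name}:{chunk.start_line}-{chunk.end_line}",
            "text": chunk.code,
            "metadata": {"type": chunk.type.value, "name": chunk.name, **chunk.metadata},
        })
    return documents


if __name__ == "__main__":
    sample = [
        CodeChunk(ChunkType.FUNCTION, "hello_world", "def hello_world(): ...", 1, 2, "python", 0.95),
        CodeChunk(ChunkType.CLASS, "Maybe", "class Maybe: ...", 4, 9, "python", 0.5),
    ]
    for doc in to_documents(sample):
        print(doc["id"])
```

With real parse results, the same function could be applied to `result.chunks`, since it only reads the attributes listed above.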
- For basic usage: No external dependencies
- For performance analysis: psutil
- For visualization: Modern web browser to view generated HTML
Contributions are welcome! Please feel free to submit a Pull Request.
- Clone the repository
- Install development dependencies:
  pip install -e ".[dev]"
- Run tests:
  pytest
- Format code:
  black code_chunker/
This project is licensed under the MIT License - see the LICENSE file for details.
If you find this project helpful, consider supporting its development:
- ⭐ Star this repository
- 🐛 Report bugs and suggest features
- 🤝 Submit pull requests
- 💰 EVM (ETH, ARB, BNB, OP, etc.):
  0x8f74959530dba14394b27faac92955aa96927e8b
Thanks to all contributors and the open-source community for their support.