This file helps AI agents quickly understand and contribute to the Context Builder codebase.
Context Builder is a blazing-fast Rust CLI for aggregating entire codebases into single, LLM-friendly markdown files. Published on crates.io under MIT license.
If this is your first time: Read this file, then run cargo run -- --help to see all options.
| Technology | Usage |
|---|---|
| Language | Rust (Edition 2024) |
| Build | Cargo (no npm/bun/node) |
| CLI | clap (derive) |
| Parallelism | rayon (optional, default on) + crossbeam-channel |
| Diffing | similar (unified diffs) |
| File traversal | ignore crate (gitignore-aware) |
| Token counting | tiktoken-rs (cl100k_base) |
| Caching | JSON + fs2 file locking |
| Config | TOML (context-builder.toml) |
| Encoding | encoding_rs (transcoding non-UTF-8) |
| Logging | env_logger |
| Branch | master (not main) |
context-builder/
├── src/
│ ├── main.rs # Entry point — calls lib::run()
│ ├── lib.rs # Core orchestration, run_with_args(), Prompter trait, --init
│ ├── cli.rs # Args struct via clap derive
│ ├── config.rs # Config struct, TOML deserialization
│ ├── config_resolver.rs # Merges CLI args + TOML config (CLI > config > defaults)
│ ├── file_utils.rs # .gitignore-aware traversal, OverrideBuilder for custom ignores
│ ├── tree.rs # BTreeMap file tree (deterministic ordering)
│ ├── state.rs # ProjectState/FileState structured snapshots
│ ├── markdown.rs # Streaming file renderer, binary detection, encoding, parallel
│ ├── cache.rs # JSON-based caching with fs2 locking, old cache migration
│ ├── diff.rs # Per-file unified diffs via similar
│ └── token_count.rs # Real tokenization via tiktoken-rs (cl100k_base, lazy init)
├── tests/ # 10 integration test files
├── benches/ # Criterion benchmark suite
├── scripts/ # generate_samples.rs (benchmark dataset generator)
├── context-builder.toml # Project's own config file
├── Cargo.toml # Crate metadata, dependencies, features
├── DEVELOPMENT.md # Contributor guide
├── BENCHMARKS.md # Performance benchmarking guide
├── CHANGELOG.md # Release history
└── .github/workflows/ci.yml # CI: fmt, clippy, build, test, security audit (ubuntu/win/macos)
# Build
cargo build
# Run
cargo run -- --help
cargo run -- -d . -o out.md -f rs -f toml
cargo run -- --preview # File tree only, no output
cargo run -- --init # Create config file with auto-detected filters
# Test (MUST use single thread — tests share CWD)
cargo test -- --test-threads=1
# Lint (must pass -D warnings)
cargo clippy --all-targets --all-features -- -D warnings
# Format
cargo fmt --all-
Promptertrait — Abstracts user confirmation (overwrite/processing). Tests useMockPrompter/TestPrompter. Never add stdin reads in library code. -
Streaming writes —
markdown.rsprocesses files line-by-line for low memory. Withparallelfeature, uses crossbeam channels for concurrent processing. -
Structured state — v0.5.0 replaced fragile text-based cache parsing with JSON
ProjectStatesnapshots for reliable auto-diff. -
Deterministic output —
BTreeMapeverywhere ensures identical output across runs. -
Config precedence — CLI args > TOML config > defaults, with explicit detection in
config_resolver.rs.
| Feature | Default | Purpose |
|---|---|---|
parallel |
✅ | Rayon for parallel file processing |
samples-bin |
❌ | Exposes generate_samples binary for benchmarking |
| Variable | Purpose |
|---|---|
CB_SILENT |
"1" suppresses user-facing prints (benchmarks set this) |
CB_BENCH_MEDIUM |
"1" enables heavier benchmark datasets |
CB_BENCH_DATASET_DIR |
External benchmark dataset root |
RUST_LOG |
Controls env_logger verbosity (e.g., RUST_LOG=info) |
- Error handling — Use
io::Result. Prefer returning errors over panicking.unwrap()/expect()OK in tests, NOT in library code. - Cross-platform — Normalize path separators in tests for string comparisons.
- New CLI flags — Add in
cli.rs, update tests in same file, propagate throughrun_with_args. - Language detection — Keep simple and deterministic; add mappings in one place.
- Binary detection — Lightweight: NUL byte check + UTF-8 validity.
- Logging — Use
log::{info, warn, error}. Letenv_loggercontrol emission.
- Unit tests: Inline
#[cfg(test)]modules in every source file - Integration tests (10 files in
tests/):test_auto_diff.rs— Auto-diff workflow (largest test file)test_determinism.rs— Output determinism verificationtest_config_resolution.rs— CLI/config merge behaviortest_cwd_independence.rs— Path independencetest_comprehensive_edge_cases.rs— Edge casescli_integration.rs— End-to-end CLI teststest_binary_file_autodiff.rs,test_parallel_memory.rs,test_phase4_integration.rs,diff_integration.rs
- Benchmarks: Criterion suite at
benches/context_bench.rs
Critical: Tests MUST run with --test-threads=1 (CI enforces this). Many tests use set_current_dir() which is process-global. Use #[serial] attribute where order matters.
- Year in tests: Watch for hardcoded year strings in timestamp assertions. Use dynamic
Utc::now().format("%Y")instead. - CWD mutation: Tests that
set_current_dir()must restore the original directory in all code paths (including panics). - Config from CWD:
load_config()reads from CWD.load_config_from_path()reads from explicit root. Prefer the latter in tests. - Cache collisions: Cache keys are project-path + config hash. Different configs = different cache files.
cargo fmt --all && cargo clippy --all-targets --all-features -- -D warnings && cargo test -- --test-threads=1- Bump
versioninCargo.toml, add entry toCHANGELOG.md git commit -am "chore(release): vX.Y.Z" && git tag vX.Y.Z && git push && git push --tagscargo publish