jurasofish · jurasofish · Mar 2, 2025 · Mar 2, 2025 · Mar 2, 2025 · Mar 2, 2025
diff --git a/.gitignore b/.gitignore
@@ -158,3 +158,5 @@ cython_debug/
 #  and can be added to the global gitignore or merged into this file.  For a more nuclear
 #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
 .idea/
+
+.DS_Store
diff --git a/README.md b/README.md
@@ -1,38 +1,39 @@
-# MCPunk 🤖
+# [MCPunk 🤖](https://github.com/jurasofish/mcpunk)
 
-Homepage https://github.com/jurasofish/mcpunk
+**Chat with your codebase without embeddings by giving the LLM tools to search your code intelligently.**
 
-MCPunk provides tools for [Roaming RAG](https://arcturus-labs.com/blog/2024/11/21/roaming-rag--make-_the-model_-find-the-answers/)
-with [Model Context Protocol](https://github.com/modelcontextprotocol).
+MCPunk lets you explore and understand codebases through conversation. It works by:
 
-MCPunk is built with the following in mind
+1. Breaking files into logical chunks (functions, classes, markdown sections)
+2. Giving the LLM tools to search and query these chunks
+3. Letting the LLM find the specific code it needs to answer your questions
 
-- **Context is King** - LLMs can be great but only if provided with appropriate
-  context: not too long, focused and relevant.
-- **Human in the Loop** - **You** can see exactly what data the LLM has considered
-  and how it found it, **You** can jump into chat and direct things wherever you want.
-- **Tools are Next** - LLMs have landed. And the vibe is that improvements are stagnating.
-  The next wave of LLM-based productivity will come from the tools that use LLMs well.
+No embeddings, no complex configuration - just clear, auditable searching that you can see and guide.
+It works great with Claude Desktop, or any other [MCP](https://github.com/modelcontextprotocol) client.
+
+[GitHub Repository](https://github.com/jurasofish/mcpunk)
 
-**Core functionality** allows your LLM to configure a project (e.g. a directory
-containing Python files). The files in this project are automatically "chunked".
-Each chunk is e.g. a Python function or a markdown section.
-The LLM can then query the entire project for chunks with specific names or with
-specific text in their contents. The LLM can then fetch the entire contents
-of individual chunks.
+Built with the following in mind
 
-All this with no SaaS, no pricing, nothing (well you need a claude SaaS sub).
-Just you and Claude Desktop, with all tools running on your local machine after
-one tiny snippet in `claude_desktop_config.json`.
+- **Context is King** - LLMs can be great but only if provided with appropriate context.
+- **Context is Precious** - LLMs need context, but they can't handle too much.
+  A travesty! MCPunk is RAG that inherently provides the LLM contextual
+  hints, allowing the LLM to really narrow in on only the relevant content.
+- **Human in the Loop** - **You** can see exactly what data the LLM has considered
+  and how it found it, **You** can jump into chat and direct things wherever you want.
 
-# Getting started
+# Setup
 
-First, [install uv](https://docs.astral.sh/uv/getting-started/installation/).
+_These are instructions for Claude Desktop, but MCPunk can be used anywhere MCP is used._
 
-Next, put the following in your `claude_desktop_config.json`.
-[Details about `claude_desktop_config.json` including location](https://glama.ai/blog/2024-11-25-model-context-protocol-quickstart#testing-mcp-using-claude-desktop).
+1. **[Install uv](https://docs.astral.sh/uv/getting-started/installation/)**
+1. **Put the snippet below in your `claude_desktop_config.json`**
+   ([Details about `claude_desktop_config.json` including location](https://glama.ai/blog/2024-11-25-model-context-protocol-quickstart#testing-mcp-using-claude-desktop))
+1. **Restart Claude Desktop** and you should see the tools available after a small
+   delay, like the screengrab below
+1. **Chat away**: "hey pal can you set up the ~/git/mcpunk project and tell me how python files are chunked"
 
-`"command": "uvx",` might not work, and you may need to use e.g. `"command": "/Users/michael/.local/bin/uvx",`
+(_`"command": "uvx",` might not work, and you may need to use e.g. `"command": "/Users/michael/.local/bin/uvx",`_)
 
 ```json
 {
@@ -45,13 +46,123 @@ Next, put the following in your `claude_desktop_config.json`.
 }
 ```
 
-Next, start claude desktop and you should see the tools available after a small delay
 ![](assets/setup.png)
 
-Next, ask it to set up a project like "hey pal can you set up the ~/git/mcpunk project".
-Then start asking away, like "What sorts of chunks come from Python files?"
+# Usage Examples
+
+### Answer Questions About Your Codebase
+
+- **[User]** Hey pal can you please set up the ~/git/mcpunk repo, then help me
+  understand how a user of mcpunk would get the git diff of the currently
+  checked out branch in the repo that they have configured with mcpunk
+- **[Claude]** Call `configure_project`
+- **[Claude]** Call `list_all_files_in_project`
+- **[Claude]** Call `find_files_by_chunk_content`
+- **[Claude]** ... makes a total of 21 tool calls with MCPunk.
+- **[Claude]** Now I understand how the git diff functionality works in mcpunk.
+  Let me summarize how a user would get the git diff of the currently checked out branch:
+  To get a git diff of the currently checked out branch in a repo configured with mcpunk:
+  1. First, the user needs to configure mcpunk to work with their project by using the configure_project tool
+  2. To see what branches were recently checked out (to identify what to diff against)
+  3. Then, to get the diff between the current branch (HEAD) and a reference branch
+  This will display the differences between the current branch (HEAD) and the specified reference branch.
+  The ref parameter should be the name of the base branch (like "main" or "develop") that you want to compare against.
+- **[User]** Great job pal!
+
+### PR Review
+
+- **[user]** Hey pal can you please set up the ~/git/mcpunk repo and have a
+  look at the current branch versus `scratch/1.5`
+- **[Claude]** Sets up ~/git/mcpunk and calls `diff_with_ref` for ref `scratch/1.5`
+- **[Claude]** I see you've made several changes, primarily adding a new test
+  file `tests/test_git_analysis.py`...
+- **[user]** how does this fit into the rest of the codebase?
+- **[Claude]** Makes 5 tool calls for listing files by chunks, listing chunks,
+  fetching chunk details
+- **[Claude]** I see - the code parses git reflog output to track recently
+  checked out branches. The main function get_recent_branches takes a repo path
+  and returns the last N branches that were checked out, using _branches_from_reflog
+  as a helper to parse the raw reflog output.
+- **[User]** If you were in a rush and had to direct someone to review this work,
+  what would you direct them to look at? Please outline discrete pieces of work
+- **[Claude]**
+  - The `split("to ")[-1]` in _branches_from_reflog looks fragile
+  - Test data only shows simple branch names, what about spaces/special chars?
+  - Is example_reflog representative of real git reflog output?
+  - Are we testing against actual git behavior?
+- **[User]** You're the best!
+
+### Work across multiple codebases
+
+You can just ask your LLM to set up multiple projects, and it can freely query
+across them. Handy if one depends on the other, and they're in different repos.
+In this case the LLM should recognise this via imports.
+
+# What is MCPunk & Other Background
+
+MCPunk is an MCP server that provides tools to
+
+- Configure a project, which is a directory of files. When configured, the files
+  will be split into logical chunks. MCPunk is built for code, but really
+  it could be applied to any documents, even images if you want to.
+- Search for files in a project containing specific text
+- Search for chunks in a file containing specific text
+- View the full contents of a specific chunk
+
+Along with this, it provides a few chunkers built in. The most mature is the
+Python chunker.
+
+MCPunk doesn't have to be used for conversation. It can be used as part of code
+review in a CI pipeline, for example. It's really general RAG.
+
+```mermaid
+sequenceDiagram
+    participant User
+    participant Claude as Claude Desktop
+    participant MCPunk as MCPunk Server
+    participant Files as File System
+
+    Note over User,Files: Setup Phase
+    User->>Claude: Ask question about codebase
+    Claude->>MCPunk: configure_project(root_path, project_name)
+    MCPunk->>Files: Scan files in root directory
+
+    Note over MCPunk,Files: Chunking Process
+    MCPunk->>MCPunk: For each file, apply appropriate chunker:
+    MCPunk->>MCPunk: - PythonChunker: functions, classes, imports
+    MCPunk->>MCPunk: - MarkdownChunker: sections by headings
+    MCPunk->>MCPunk: - VueChunker: template/script/style sections
+    MCPunk->>MCPunk: - WholeFileChunker: fallback
+    MCPunk->>MCPunk: Split chunks >10K chars into parts
+
+    MCPunk-->>Claude: Project configured with N files
+
+    Note over User,Files: Navigation Phase<br>(LLM freely uses all these tools repeatedly to drill in)
+    Claude->>MCPunk: list_all_files_in_project(project_name)
+    MCPunk-->>Claude: File tree structure
+
+    Claude->>MCPunk: find_files_by_chunk_content(project_name, "search term")
+    MCPunk-->>Claude: Files containing matching chunks
+
+    Claude->>MCPunk: find_matching_chunks_in_file(project_name, file_path, "search term")
+    MCPunk-->>Claude: List of matching chunk IDs in file
+
+    Claude->>MCPunk: chunk_details(chunk_id)
+    MCPunk-->>Claude: Full content of specific chunk
+
+    Claude->>User: Answer based on relevant code chunks
+
+    Note over User,Files: Optional Git Analysis
+    Claude->>MCPunk: list_most_recently_checked_out_branches(project_name)
+    MCPunk->>Files: Parse git reflog
+    MCPunk-->>Claude: List of recent branches
+
+    Claude->>MCPunk: diff_with_ref(project_name, "main")
+    MCPunk->>Files: Generate git diff
+    MCPunk-->>Claude: Diff between HEAD and reference
+```
 
-# Roaming RAG
+### Roaming RAG Crash Course
 
 See
 
@@ -75,29 +186,44 @@ Compared to more traditional "vector search" RAG:
 - You can see exactly what the LLM is searching for, and it's generally
   obvious if it's searching poorly and you can help it out by suggesting
   improved search terms.
+- Requires exact search matching. MCPunk is NOT providing fuzzy search of any kind.
 
-#### Chunks
+### Chunks
 
 A chunk is a subsection of a file. For example,
 
 - A single python function
 - A markdown section
 - All the imports from a Python file
-- The diff of one file out of a multi-file diff
 
 Chunks are created from a file by [chunkers](mcpunk/file_chunkers.py),
 and MCPunk comes with a handful built in.
 
 When a project is set up in MCPunk, it goes through all files and applies
-the first matching chunker to it. The LLM can then use tools to (1) query for files
+the first applicable chunker to it. The LLM can then use tools to (1) query for files
 containing chunks with specific text in them, (2) query all chunks in a specific
 file, and (3) fetch the full contents of a chunk.
 
 This basic foundation enables claude to effectively navigate relatively large
 codebases by starting with a broad search for relevant files and narrowing in
 on relevant areas.
 
-#### Custom Chunkers
+Built-in chunkers:
+
+- `PythonChunker` chunks things into classes, functions, file-level imports,
+  and file-level statements (e.g. globals). Applicable to files ending in `.py`
+- `VueChunker` chunks into 'template', 'script', 'style' chunks - or whatever
+  top-level `<blah>....</blah>` items exist. Applicable to files ending in `.vue`
+- `MarkdownChunker` chunks things into markdown sections (by heading).
+  Applicable to files ending in `.md`
+- `WholeFileChunker` fallback chunker that creates a single chunk for the entire file.
+  Applicable to any file.
+
+Any chunk over 10k characters long (configurable) is automatically split into
+multiple chunks, with names suffixed with `part1`, `part2`, etc. This helps
+avoid blowing out context while still allowing reasonable navigation of chunks.
+
+### Custom Chunkers
 
 Each type of file (e.g. Python vs C) needs a custom chunker.
 MCPunk comes with some [built in](mcpunk/file_chunkers.py).
@@ -116,55 +242,6 @@ to advertise that they have custom chunkers for MCPunk to use, like pytest's
 plugin system, but there are currently no plans to implement this (unless
 someone wants to do it).
 
-# Common Usage Patterns
-
-### Answer Questions About Your Codebase
-
-- **[User]** Hey pal can you please set up the ~/git/mcpunk repo, then help me
-  understand how a user of mcpunk would get the git diff of the currently
-  checked out branch in the repo that they have configured with mcpunk
-- **[Claude]** Call `configure_project`
-- **[Claude]** Call `list_all_files_in_project`
-- **[Claude]** Call `find_files_by_chunk_content`
-- **[Claude]** ... makes a total of 21 tool calls with MCPunk.
-- **[Claude]** Now I understand how the git diff functionality works in mcpunk.
-  Let me summarize how a user would get the git diff of the currently checked out branch:
-  To get a git diff of the currently checked out branch in a repo configured with mcpunk:
-  1. First, the user needs to configure mcpunk to work with their project by using the configure_project tool
-  2. To see what branches were recently checked out (to identify what to diff against)
-  3. Then, to get the diff between the current branch (HEAD) and a reference branch
-  This will display the differences between the current branch (HEAD) and the specified reference branch.
-  The ref parameter should be the name of the base branch (like "main" or "develop") that you want to compare against.
-- **[User]** Great job pal!
-
-### PR Review
-
-- **[user]** Hey pal can you please set up the ~/git/mcpunk repo and have a
-  look at the current branch versus `scratch/1.5`
-- **[Claude]** Sets up ~/git/mcpunk and calls `diff_with_ref` for ref `scratch/1.5`
-- **[Claude]** I see you've made several changes, primarily adding a new test
-  file `tests/test_git_analysis.py`...
-- **[user]** how does this fit into the rest of the codebase?
-- **[Claude]** Makes 5 tool calls for listing files by chunks, listing chunks,
-  fetching chunk details
-- **[Claude]** I see - the code parses git reflog output to track recently
-  checked out branches. The main function get_recent_branches takes a repo path
-  and returns the last N branches that were checked out, using _branches_from_reflog
-  as a helper to parse the raw reflog output.
-- **[User]** If you were in a rush and had to direct someone to review this work,
-  what would you direct them to look at? Please outline discrete pieces of work
-- **[Claude]**
-  - The `split("to ")[-1]` in _branches_from_reflog looks fragile
-  - Test data only shows simple branch names, what about spaces/special chars?
-  - Is example_reflog representative of real git reflog output?
-  - Are we testing against actual git behavior?
-
-### Work across multiple codebases
-
-You can just ask your LLM to set up multiple projects, and it can freely query
-across them. Handy if one depends on the other, and they're in different repos.
-In this case the LLM should recognise this via imports.
-
 # Limitations
 
 - Sometimes LLM is poor at searching. e.g. search for "dependency", missing
@@ -178,53 +255,60 @@ In this case the LLM should recognise this via imports.
   written with massive codebases in mind - you will see things like all data
   stored in memory, searching done by iterating over all data, various
   things that are screaming out for basic optimisation.
+- Small projects are probably better off with all the code concatenated and
+  thrown into context. MCPunk is really only appropriate where this is impractical.
+- In some cases it would obviously be better to allow the LLM to grab an entire
+  file rather than have it pick out chunks one at a time. MCPunk has no mechanism
+  for this. In practice, I have not found this to be a big issue.
 
 # Configuration
 
-Various things can be configured via environment variables.
+Various things can be configured via environment variables prefixed with `MCPUNK_`.
 For available options, see [settings.py](mcpunk/settings.py) - these are loaded
 from env vars via [Pydantic Settings](https://github.com/pydantic/pydantic-settings).
 
-# Roadmap
+For example, to configure the `include_chars_in_response` option:
 
-MCPunk is at a minimum usable state right now.
+```json
+{
+  "mcpServers": {
+    "MCPunk": {
+      "command": "uvx",
+      "args": ["mcpunk"],
+      "env": {
+        "MCPUNK_INCLUDE_CHARS_IN_RESPONSE": "false"
+      }
+    }
+  }
+}
+```
 
-**Critical Planned functionality**
+# Roadmap & State of Development
 
-- Add a bunch of prompts to help with using MCPunk. Without real "explain how to make
-  a pancake to an alien"-type prompts things do fall a little flat.
+MCPunk is considered near feature complete.
+It has not had broad use, and as a user it is likely you will run into bugs or
+rough edges. Bug reports welcomed at https://github.com/jurasofish/mcpunk/issues
 
-**High up on the roadmap**
+**Roadmap Ideas**
 
+- Add a bunch of prompts to help with using MCPunk. Without real "explain how to make
+  a pancake to an alien"-type prompts things do fall a little flat.
+- Include module-level comments when extracting python module-level statements.
 - Possibly stemming for search
 - Change the whole "project" concept to not need files to actually exist - this
   leads to allowing "virtual" files inside the project.
   - Consider changing files from having a path to having a URI, so coule be like
     `file://...` / `http[s]://` / `gitdiff://` / etc arbitrary URIs
-- Integrate with web searching
-  - Flow like (1) LLM says "search for Caridina water parameters" (2) tool
-    does web search and grabs the 10 highest pages and converts to markdown and
-    chunks them and puts them in the virtual filesystem (3) LLM queries for chunks
-    etc like usual.
 - Chunking of git diffs. Currently, there's a tool to fetch an entire diff. This
   might be very large. Instead, the tool could be changed to `add_diff_to_project`
   and it puts files under the `gitdiff://` URI or under some fake path
-- Include module-level comments when extracting python module-level statements.
 - Caching of a project, so it doesn't need to re-parse all files every time you
   restart MCP client. This may be tricky as changes to the code in a chunker
-  will make cache invalid.
-- Handle changed files sensibly, so you don't need to restart MCP client
-  and re-add project on any file changes
-- Ability to edit files - why not? Can do it like aider where LLM produces a diff.
+  will make cache invalid. Likely not to be prioritised, since it's not that
+  slow for my use cases.
 - Ability for users to provide custom code to perform chunking, perhaps
   similar to [pytest plugins](https://docs.pytest.org/en/stable/how-to/writing_plugins.html#making-your-plugin-installable-by-others)
-
-**Just ideas**
-
 - Something like tree sitter could possibly be used for a more generic chunker
-- Better handling of large chunks
-  - Configurable Max response size for chunks
-  - Log warning for any chunk over max size when initialising project
 - Tracking of characters sent/received, ideally by chat.
 - State, logging, etc by chat
 

diff --git a/assets/setup.png b/assets/setup.png