Feature/file support#39

Open
coloboxp wants to merge 2 commits into shinpr:main from coloboxp:feature/file-support

Conversation

@coloboxp

Summary

  • Add Office (docx/pptx/xlsx/xls), source code, and config formats (json/yaml/toml/ini/settings).
  • Add custom parser config with clear load/parse error hints.
  • Add a companion CLI for bulk ingest to avoid MCP timeouts, useful for overnight or other long-running jobs.
  • Skip common dependency/build folders (node_modules, target, bin, obj, dist, etc.) by default in directory scans (MCP + CLI).
  • Update README + changelog.

Why?

  • MCP tool timeouts make large folder ingestion unreliable.
  • Users need a simple way to add new/custom formats without changing core code.

MCP configuration (example)

{
  "mcpServers": {
    "local-rag": {
      "command": "npx",
      "args": ["-y", "mcp-local-rag"],
      "env": {
        "BASE_DIR": "/Users/me",
        "DB_PATH": "/Users/me/.local/share/mcp-local-rag/lancedb",
        "CACHE_DIR": "/Users/me/.cache/mcp-local-rag/models",
        "MODEL_NAME": "Xenova/all-MiniLM-L6-v2",
        "MCP_LOCAL_RAG_PARSERS": "/Users/me/.config/mcp-local-rag/file_parsers.json"
      }
    }
  }
}

CLI demo

  npx mcp-local-rag ingest --path /Users/me/Desktop
  npx mcp-local-rag ingest --path /Users/me/Desktop --extensions .pdf,.md
  npx mcp-local-rag ingest --path /Users/me/Desktop --exclude node_modules,dist

Setting a custom parser (works in MCP + CLI)

Edit ~/.config/mcp-local-rag/file_parsers.json

  {
    ".note": {
      "module": "/Users/me/.config/mcp-local-rag/parsers/note-parser.js",
      "export": "parseFile"
    }
  }

Example of custom parser

note-parser.js

  import fs from "node:fs/promises"

  export async function parseFile(filePath) {
    // Do your custom parsing here and return the extracted text as a string.
    const raw = await fs.readFile(filePath, "utf8")
    return raw
  }

Set the file_parsers.json catalog for MCP:

MCP_LOCAL_RAG_PARSERS=/Users/me/.config/mcp-local-rag/file_parsers.json

CLI

npx mcp-local-rag ingest --path /Users/me/Desktop --parsers /Users/me/.config/mcp-local-rag/file_parsers.json

Tests

  • pnpm run check:all

@shinpr
Owner

shinpr commented Feb 1, 2026

Hi, thanks for this contribution! I can see the effort that went into addressing MCP timeout issues and expanding file support. That said, I have some concerns about scope and security that I'd like to discuss before we can move forward.

Could we split this PR?

This PR bundles several independent features together, which makes it harder to review and increases the risk of issues slipping through. Would you be open to splitting it into smaller, focused PRs?

Suggested split:

  1. PR A: Bulk Ingest CLI + Directory Ingest + Default Excludes
  2. PR B: (see discussion below) Extended file format support

This would make it easier to merge the parts that are ready while we discuss the others.

Feature-by-Feature Feedback

1. Bulk Ingest CLI

Status: Happy to accept with some changes

The CLI addresses a real pain point—MCP tool timeouts make large-scale ingestion frustrating. I'd love to see this merged, but there are a few things we need to address first.

Changes needed:

  1. BASE_DIR consistency

    The CLI currently sets baseDir to whatever path is passed in (line 819). This creates a potential mismatch: if someone ingests files via CLI outside the MCP's BASE_DIR, they won't be able to re-ingest those files through MCP later.

    I think we need to either:

    • Document this clearly in the README (something like "make sure --base-dir matches your MCP's BASE_DIR")
    • Or require --base-dir explicitly and warn if it's not set
  2. Duplicated ingest logic (SSoT)

    The ingest pipeline (parse → chunk → embed → insert) is now in two places: the CLI and RAGServer.handleIngestFile. This will become a maintenance burden over time.

    Could we extract a shared ingestFile() function that both can call? This way, bug fixes and improvements only need to happen in one place.

  3. Tests

    We'll need unit tests for the CLI—at minimum for argument parsing and the main ingest flow.

  4. README updates

    The README should document the CLI usage and the BASE_DIR considerations mentioned above. Would you mind adding that?
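On the SSoT point above, the extracted helper could look roughly like this. This is only a sketch: `parse`, `chunk`, `embed`, and `insert` are stand-ins for the real internals, injected here so the CLI and RAGServer.handleIngestFile can share one pipeline.

```javascript
// ingest-core.js (hypothetical): a single shared pipeline that both the CLI
// and RAGServer.handleIngestFile could call. The dependency names below
// (parse, chunk, embed, insert) are illustrative, not the actual internals.
async function ingestFile(filePath, deps) {
  const { parse, chunk, embed, insert } = deps
  const text = await parse(filePath)        // parse  -> plain text
  const chunks = chunk(text)                // chunk  -> semantic segments
  const vectors = await embed(chunks)       // embed  -> one vector per chunk
  await insert(filePath, chunks, vectors)   // insert -> write to vector store
  return { filePath, chunkCount: chunks.length }
}
```

With this shape, a bug fix in chunking or embedding lands in one place, and each entry point only supplies its own wiring.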

2. Directory Ingest (MCP Tool)

Status: Would prefer a different approach

I see what this is trying to achieve—letting users ingest a whole folder at once is convenient. But I'm hesitant about overloading ingest_file to handle both files and directories. It makes the MCP tool surface harder to reason about, both for users and for LLMs that call these tools.

Concerns:

  • The tool description becomes long and tries to explain two different behaviors
  • LLMs might get confused about when to use file paths vs directory paths
  • Single responsibility: one tool doing two different things

Suggestion:

Would you consider creating a separate ingest_directory tool instead? This keeps each tool focused and makes the API clearer.
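To illustrate the split, a separate tool could advertise something like the following. The field values here are a sketch of a possible MCP tool definition, not a final API:

```json
{
  "name": "ingest_directory",
  "description": "Recursively ingest all supported files under a directory (within BASE_DIR)",
  "inputSchema": {
    "type": "object",
    "properties": {
      "path": { "type": "string", "description": "Directory to ingest" },
      "extensions": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Optional allow-list, e.g. [\".pdf\", \".md\"]"
      }
    },
    "required": ["path"]
  }
}
```

Each tool's description then explains exactly one behavior, which is easier for both users and LLM callers.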

3. Default Excludes

Status: Good idea, but needs visibility

Automatically skipping node_modules, dist, target, etc. is sensible—nobody wants to accidentally ingest 50,000 files from their dependencies folder.

One concern:

Right now, the exclusions happen silently. If someone asks to ingest /Users/me/project and we skip node_modules, they have no way of knowing that happened. This could be confusing.

Could we include something in the response like "Processed 150 files (skipped 3 directories: node_modules, dist, .git)"? That way users know what happened.
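One possible shape for that summary (a hypothetical helper; the actual response format is up to you):

```javascript
// Hypothetical helper: build the ingest response message so directories
// skipped by the default excludes are visible instead of dropped silently.
function formatIngestSummary(processedCount, skippedDirs) {
  if (skippedDirs.length === 0) {
    return `Processed ${processedCount} files`
  }
  return `Processed ${processedCount} files ` +
    `(skipped ${skippedDirs.length} directories: ${skippedDirs.join(", ")})`
}

console.log(formatIngestSummary(150, ["node_modules", "dist", ".git"]))
// Processed 150 files (skipped 3 directories: node_modules, dist, .git)
```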

4. Custom Parser Support

Status: Not planned for now

I've thought about this one quite a bit. On the surface, custom parsers seem useful—they'd let users add support for any file format without changing the core code. But I have reservations.

Why I'm hesitant:

  1. Low demand, high risk

    Most users are fine with PDF/DOCX/TXT/MD. The users who need exotic formats are a small minority, and they have an alternative: convert to text first, then use ingest_data.

  2. Security concerns

    Custom parsers can execute arbitrary code with full filesystem and network access. This conflicts with our project's emphasis on privacy and local-only operation. The README promises "Path restriction: Only files within BASE_DIR are accessible" and "Local only: No network requests after model download"—custom parsers could violate both.

  3. Support burden

    When someone's custom parser breaks, the first place they'll look is our issue tracker. We can't debug or guarantee code we didn't write.

  4. No sandboxing

    If we were to add this, I'd want proper isolation—restricted I/O, timeouts, maybe vm or isolated-vm. But that's significant work for a feature with limited demand.

For now, I'd prefer to leave this out. If there's strong community demand in the future, we could revisit with a properly sandboxed design.

5. Excel Support (XLSX/XLS)

Status: Not a good fit for this project

I understand why Excel support is appealing—spreadsheets are everywhere. But I don't think it fits well with what this project is trying to do.

My concerns:

  1. Semantic chunking doesn't work well with tabular data

    Our chunker is designed for prose—it looks for topic boundaries and groups related sentences. Spreadsheet data like Date,Product,Amount\n2024-01-01,Apple,100 doesn't have that structure. The chunks would be essentially meaningless.

  2. Package quality issues

    The xlsx package on npm is outdated (last update: March 2022) and has known security vulnerabilities. ExcelJS is better maintained, but this feels like the wrong dependency to add for a feature that doesn't fit the use case well.

  3. Alternative exists

    Users who really need this can export to CSV and ingest as text, or wait for a custom parser feature (if we add it in the future with proper sandboxing).

6. PowerPoint Support (PPTX)

Status: Not production-ready

The implementation uses regex to parse XML (/<a:t[^>]*>(.*?)<\/a:t>/g), which is fragile. It will break on complex PPTX files with nested elements or unusual structures.

If we were to add PPTX support, it would need a proper XML parser. But given the unclear use case for semantic search over presentation slides, I don't think it's worth the effort right now.

7. Source Code Support (.ts, .py, .go, etc.)

Status: Doesn't align with this project's scope

This project is designed for document search—PDFs, Word docs, technical specs. Code search is a fundamentally different problem.

Why it doesn't fit:

  • Semantic search on code produces poor results. Variable names and function signatures don't embed well.
  • Tools like grep, ripgrep, and IDE search are much better suited for code.
  • We'd be adding 40+ extensions without proper parsing—just reading files as text.

I think code search deserves its own specialized tool, not an extension of a document RAG system.

8. Config File Support (JSON, YAML, TOML, INI)

Status: Doesn't align with this project's scope

Similar to source code—config files are structured data, not prose. Semantic chunking doesn't produce meaningful results.

If we wanted to support these properly, we'd need to parse the structure and extract values in a way that makes sense for search. Just reading them as text doesn't achieve that.

Summary

| Feature | Status | Notes |
| --- | --- | --- |
| Bulk Ingest CLI | ✅ Accept with changes | Fix BASE_DIR handling, extract shared logic, add tests, update README |
| Directory Ingest | 🔄 Redesign needed | Create separate ingest_directory tool |
| Default Excludes | ✅ Accept with changes | Add visibility (report what was skipped) |
| Custom Parser | ❌ Not planned | Low demand, high risk, no sandboxing |
| Excel | ❌ Not a fit | Tabular data doesn't work with semantic chunking |
| PowerPoint | ❌ Not ready | Regex XML parsing isn't robust enough |
| Source Code | ❌ Out of scope | Code search needs different tools |
| Config Files | ❌ Out of scope | Structured data doesn't chunk well |

Suggested Path Forward

If you're willing to split this PR, I'd be happy to work with you on:

  1. First PR: Bulk CLI + Directory tool (as separate ingest_directory) + Default excludes with visibility

    This addresses the real pain point (MCP timeouts) while keeping the scope focused.

  2. Future discussion: If there's community interest in format expansion, we could discuss a properly designed plugin system with sandboxing. But I'd want to see demand first.

Thanks again for the contribution. I know this is a lot of feedback—happy to discuss any of these points further.
