Feature/file support#39

Open
coloboxp wants to merge 2 commits into shinpr:main from coloboxp:feature/file-support

Conversation

@coloboxp

Summary

  • Add Office (docx/pptx/xlsx/xls), source code, and config formats (json/yaml/toml/ini/settings).
  • Add custom parser config with clear load/parse error hints.
  • Add a companion CLI for bulk ingest to avoid MCP timeouts, useful for overnight or other long-running jobs.
  • Skip common dependency/build folders (node_modules, target, bin, obj, dist, etc.) by default in directory scans (MCP + CLI).
  • Update README + changelog.

Why?

  • MCP tool timeouts make large folder ingestion unreliable.
  • Users need a simple way to add new/custom formats without changing core code.

MCP configuration (example)

{
  "mcpServers": {
    "local-rag": {
      "command": "npx",
      "args": ["-y", "mcp-local-rag"],
      "env": {
        "BASE_DIR": "/Users/me",
        "DB_PATH": "/Users/me/.local/share/mcp-local-rag/lancedb",
        "CACHE_DIR": "/Users/me/.cache/mcp-local-rag/models",
        "MODEL_NAME": "Xenova/all-MiniLM-L6-v2",
        "MCP_LOCAL_RAG_PARSERS": "/Users/me/.config/mcp-local-rag/file_parsers.json"
      }
    }
  }
}

CLI demo

  npx mcp-local-rag ingest --path /Users/me/Desktop
  npx mcp-local-rag ingest --path /Users/me/Desktop --extensions .pdf,.md
  npx mcp-local-rag ingest --path /Users/me/Desktop --exclude node_modules,dist

Setting a custom parser (works in MCP + CLI)

Edit ~/.config/mcp-local-rag/file_parsers.json

  {
    ".note": {
      "module": "/Users/me/.config/mcp-local-rag/parsers/note-parser.js",
      "export": "parseFile"
    }
  }

Example of custom parser

note-parser.js

  import fs from "node:fs/promises"

  export async function parseFile(filePath) {
    // Do your custom parsing here and return the extracted text as a string.
    const raw = await fs.readFile(filePath, "utf8")
    return raw
  }

Set the file_parsers.json catalog for MCP:

MCP_LOCAL_RAG_PARSERS=/Users/me/.config/mcp-local-rag/file_parsers.json

CLI

npx mcp-local-rag ingest --path /Users/me/Desktop --parsers /Users/me/.config/mcp-local-rag/file_parsers.json

Tests

  • pnpm run check:all

@shinpr
Owner

shinpr commented Feb 1, 2026

Hi, thanks for this contribution! I can see the effort that went into addressing MCP timeout issues and expanding file support. That said, I have some concerns about scope and security that I'd like to discuss before we can move forward.

Could we split this PR?

This PR bundles several independent features together, which makes it harder to review and increases the risk of issues slipping through. Would you be open to splitting it into smaller, focused PRs?

Suggested split:

  1. PR A: Bulk Ingest CLI + Directory Ingest + Default Excludes
  2. PR B: (see discussion below) Extended file format support

This would make it easier to merge the parts that are ready while we discuss the others.

Feature-by-Feature Feedback

1. Bulk Ingest CLI

Status: Happy to accept with some changes

The CLI addresses a real pain point—MCP tool timeouts make large-scale ingestion frustrating. I'd love to see this merged, but there are a few things we need to address first.

Changes needed:

  1. BASE_DIR consistency

    The CLI currently sets baseDir to whatever path is passed in (line 819). This creates a potential mismatch: if someone ingests files via CLI outside the MCP's BASE_DIR, they won't be able to re-ingest those files through MCP later.

    I think we need to either:

    • Document this clearly in the README (something like "make sure --base-dir matches your MCP's BASE_DIR")
    • Or require --base-dir explicitly and warn if it's not set
  2. Duplicated ingest logic (SSoT)

    The ingest pipeline (parse → chunk → embed → insert) is now in two places: the CLI and RAGServer.handleIngestFile. This will become a maintenance burden over time.

    Could we extract a shared ingestFile() function that both can call? This way, bug fixes and improvements only need to happen in one place.

  3. Tests

    We'll need unit tests for the CLI—at minimum for argument parsing and the main ingest flow.

  4. README updates

    The README should document the CLI usage and the BASE_DIR considerations mentioned above. Would you mind adding that?
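On the SSoT point above, the extracted helper could look roughly like this. This is only a sketch: `parse`, `chunk`, `embed`, and `insert` are stand-ins for the real internals, injected here so the CLI and RAGServer.handleIngestFile can share one pipeline.

```javascript
// ingest-core.js (hypothetical): a single shared pipeline that both the CLI
// and RAGServer.handleIngestFile could call. The dependency names below
// (parse, chunk, embed, insert) are illustrative, not the actual internals.
async function ingestFile(filePath, deps) {
  const { parse, chunk, embed, insert } = deps
  const text = await parse(filePath)        // parse  -> plain text
  const chunks = chunk(text)                // chunk  -> semantic segments
  const vectors = await embed(chunks)       // embed  -> one vector per chunk
  await insert(filePath, chunks, vectors)   // insert -> write to vector store
  return { filePath, chunkCount: chunks.length }
}
```

With this shape, a bug fix in chunking or embedding lands in one place, and each entry point only supplies its own wiring.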

2. Directory Ingest (MCP Tool)

Status: Would prefer a different approach

I see what this is trying to achieve—letting users ingest a whole folder at once is convenient. But I'm hesitant about overloading ingest_file to handle both files and directories. It makes the MCP tool surface harder to reason about, both for users and for LLMs that call these tools.

Concerns:

  • The tool description becomes long and tries to explain two different behaviors
  • LLMs might get confused about when to use file paths vs directory paths
  • Single responsibility: one tool doing two different things

Suggestion:

Would you consider creating a separate ingest_directory tool instead? This keeps each tool focused and makes the API clearer.
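To illustrate the split, a separate tool could advertise something like the following. The field values here are a sketch of a possible MCP tool definition, not a final API:

```json
{
  "name": "ingest_directory",
  "description": "Recursively ingest all supported files under a directory (within BASE_DIR)",
  "inputSchema": {
    "type": "object",
    "properties": {
      "path": { "type": "string", "description": "Directory to ingest" },
      "extensions": {
        "type": "array",
        "items": { "type": "string" },
        "description": "Optional allow-list, e.g. [\".pdf\", \".md\"]"
      }
    },
    "required": ["path"]
  }
}
```

Each tool's description then explains exactly one behavior, which is easier for both users and LLM callers.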

3. Default Excludes

Status: Good idea, but needs visibility

Automatically skipping node_modules, dist, target, etc. is sensible—nobody wants to accidentally ingest 50,000 files from their dependencies folder.

One concern:

Right now, the exclusions happen silently. If someone asks to ingest /Users/me/project and we skip node_modules, they have no way of knowing that happened. This could be confusing.

Could we include something in the response like "Processed 150 files (skipped 3 directories: node_modules, dist, .git)"? That way users know what happened.
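One possible shape for that summary (a hypothetical helper; the actual response format is up to you):

```javascript
// Hypothetical helper: build the ingest response message so directories
// skipped by the default excludes are visible instead of dropped silently.
function formatIngestSummary(processedCount, skippedDirs) {
  if (skippedDirs.length === 0) {
    return `Processed ${processedCount} files`
  }
  return `Processed ${processedCount} files ` +
    `(skipped ${skippedDirs.length} directories: ${skippedDirs.join(", ")})`
}

console.log(formatIngestSummary(150, ["node_modules", "dist", ".git"]))
// Processed 150 files (skipped 3 directories: node_modules, dist, .git)
```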

4. Custom Parser Support

Status: Not planned for now

I've thought about this one quite a bit. On the surface, custom parsers seem useful—they'd let users add support for any file format without changing the core code. But I have reservations.

Why I'm hesitant:

  1. Low demand, high risk

    Most users are fine with PDF/DOCX/TXT/MD. The users who need exotic formats are a small minority, and they have an alternative: convert to text first, then use ingest_data.

  2. Security concerns

    Custom parsers can execute arbitrary code with full filesystem and network access. This conflicts with our project's emphasis on privacy and local-only operation. The README promises "Path restriction: Only files within BASE_DIR are accessible" and "Local only: No network requests after model download"—custom parsers could violate both.

  3. Support burden

    When someone's custom parser breaks, the first place they'll look is our issue tracker. We can't debug or guarantee code we didn't write.

  4. No sandboxing

    If we were to add this, I'd want proper isolation—restricted I/O, timeouts, maybe vm or isolated-vm. But that's significant work for a feature with limited demand.

For now, I'd prefer to leave this out. If there's strong community demand in the future, we could revisit with a properly sandboxed design.

5. Excel Support (XLSX/XLS)

Status: Not a good fit for this project

I understand why Excel support is appealing—spreadsheets are everywhere. But I don't think it fits well with what this project is trying to do.

My concerns:

  1. Semantic chunking doesn't work well with tabular data

    Our chunker is designed for prose—it looks for topic boundaries and groups related sentences. Spreadsheet data like Date,Product,Amount\n2024-01-01,Apple,100 doesn't have that structure. The chunks would be essentially meaningless.

  2. Package quality issues

    The xlsx package on npm is outdated (last update: March 2022) and has known security vulnerabilities. ExcelJS is better maintained, but this feels like the wrong dependency to add for a feature that doesn't fit the use case well.

  3. Alternative exists

    Users who really need this can export to CSV and ingest as text, or wait for a custom parser feature (if we add it in the future with proper sandboxing).

6. PowerPoint Support (PPTX)

Status: Not production-ready

The implementation uses regex to parse XML (/<a:t[^>]*>(.*?)<\/a:t>/g), which is fragile. It will break on complex PPTX files with nested elements or unusual structures.

If we were to add PPTX support, it would need a proper XML parser. But given the unclear use case for semantic search over presentation slides, I don't think it's worth the effort right now.

7. Source Code Support (.ts, .py, .go, etc.)

Status: Doesn't align with this project's scope

This project is designed for document search—PDFs, Word docs, technical specs. Code search is a fundamentally different problem.

Why it doesn't fit:

  • Semantic search on code produces poor results. Variable names and function signatures don't embed well.
  • Tools like grep, ripgrep, and IDE search are much better suited for code.
  • We'd be adding 40+ extensions without proper parsing—just reading files as text.

I think code search deserves its own specialized tool, not an extension of a document RAG system.

8. Config File Support (JSON, YAML, TOML, INI)

Status: Doesn't align with this project's scope

Similar to source code—config files are structured data, not prose. Semantic chunking doesn't produce meaningful results.

If we wanted to support these properly, we'd need to parse the structure and extract values in a way that makes sense for search. Just reading them as text doesn't achieve that.

Summary

| Feature | Status | Notes |
| --- | --- | --- |
| Bulk Ingest CLI | ✅ Accept with changes | Fix BASE_DIR handling, extract shared logic, add tests, update README |
| Directory Ingest | 🔄 Redesign needed | Create separate ingest_directory tool |
| Default Excludes | ✅ Accept with changes | Add visibility (report what was skipped) |
| Custom Parser | ❌ Not planned | Low demand, high risk, no sandboxing |
| Excel | ❌ Not a fit | Tabular data doesn't work with semantic chunking |
| PowerPoint | ❌ Not ready | Regex XML parsing isn't robust enough |
| Source Code | ❌ Out of scope | Code search needs different tools |
| Config Files | ❌ Out of scope | Structured data doesn't chunk well |

Suggested Path Forward

If you're willing to split this PR, I'd be happy to work with you on:

  1. First PR: Bulk CLI + Directory tool (as separate ingest_directory) + Default excludes with visibility

    This addresses the real pain point (MCP timeouts) while keeping the scope focused.

  2. Future discussion: If there's community interest in format expansion, we could discuss a properly designed plugin system with sandboxing. But I'd want to see demand first.

Thanks again for the contribution. I know this is a lot of feedback—happy to discuss any of these points further.
