Hi, thanks for this contribution! I can see the effort that went into addressing MCP timeout issues and expanding file support. That said, I have some concerns about scope and security that I'd like to discuss before we can move forward.

Could we split this PR?

This PR bundles several independent features together, which makes it harder to review and increases the risk of issues slipping through. Would you be open to splitting it into smaller, focused PRs?

Suggested split:
This would make it easier to merge the parts that are ready while we discuss the others.

Feature-by-Feature Feedback

1. Bulk Ingest CLI

Status: Happy to accept with some changes

The CLI addresses a real pain point: MCP tool timeouts make large-scale ingestion frustrating. I'd love to see this merged, but there are a few things we need to address first.

Changes needed:
2. Directory Ingest (MCP Tool)

Status: Would prefer a different approach

I see what this is trying to achieve: letting users ingest a whole folder at once is convenient. But I'm hesitant about overloading

Concerns:
Suggestion: Would you consider creating a separate

3. Default Excludes

Status: Good idea, but needs visibility

Automatically skipping

One concern: Right now, the exclusions happen silently. If someone asks to ingest

Could we include something in the response like "Processed 150 files (skipped 3 directories: node_modules, dist, .git)"? That way users know what happened.

4. Custom Parser Support

Status: Not planned for now

I've thought about this one quite a bit. On the surface, custom parsers seem useful: they'd let users add support for any file format without changing the core code. But I have reservations.

Why I'm hesitant:
For now, I'd prefer to leave this out. If there's strong community demand in the future, we could revisit with a properly sandboxed design.

5. Excel Support (XLSX/XLS)

Status: Not a good fit for this project

I understand why Excel support is appealing: spreadsheets are everywhere. But I don't think it fits well with what this project is trying to do.

My concerns:
6. PowerPoint Support (PPTX)

Status: Not production-ready

The implementation uses regex to parse XML ( If we were to add PPTX support, it would need a proper XML parser. But given the unclear use case for semantic search over presentation slides, I don't think it's worth the effort right now.

7. Source Code Support (.ts, .py, .go, etc.)

Status: Doesn't align with this project's scope

This project is designed for document search: PDFs, Word docs, technical specs. Code search is a fundamentally different problem.

Why it doesn't fit:
I think code search deserves its own specialized tool, not an extension of a document RAG system.

8. Config File Support (JSON, YAML, TOML, INI)

Status: Doesn't align with this project's scope

Similar to source code: config files are structured data, not prose. Semantic chunking doesn't produce meaningful results. If we wanted to support these properly, we'd need to parse the structure and extract values in a way that makes sense for search. Just reading them as text doesn't achieve that.

Summary
Suggested Path Forward

If you're willing to split this PR, I'd be happy to work with you on:
Thanks again for the contribution. I know this is a lot of feedback; happy to discuss any of these points further.
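As an illustration of the visibility request under Default Excludes, here is a minimal sketch of how such a response line could be assembled. The function name `formatIngestSummary` and its parameters are hypothetical and do not come from the PR; only the output wording follows the example message quoted in the review.

```javascript
// Hypothetical helper: the name and signature are assumptions for illustration,
// not part of the PR or of mcp-local-rag.
// processedCount: number of files actually ingested.
// skippedDirs: names of directories skipped by the default excludes.
function formatIngestSummary(processedCount, skippedDirs) {
  const fileNoun = processedCount === 1 ? "file" : "files";
  const base = `Processed ${processedCount} ${fileNoun}`;

  // De-duplicate while keeping first-seen order, so a directory name that
  // appears at several depths is only reported once.
  const unique = [...new Set(skippedDirs)];
  if (unique.length === 0) {
    return base;
  }

  const dirNoun = unique.length === 1 ? "directory" : "directories";
  return `${base} (skipped ${unique.length} ${dirNoun}: ${unique.join(", ")})`;
}

console.log(formatIngestSummary(150, ["node_modules", "dist", ".git"]));
```

Returning the skipped names in the tool response, rather than only logging them, keeps the information visible to the MCP client that asked for the ingest.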
Summary
node_modules, target, bin, obj, and other build folders.

Why?
MCP configuration (example)
{
  "mcpServers": {
    "local-rag": {
      "command": "npx",
      "args": ["-y", "mcp-local-rag"],
      "env": {
        "BASE_DIR": "/Users/me",
        "DB_PATH": "/Users/me/.local/share/mcp-local-rag/lancedb",
        "CACHE_DIR": "/Users/me/.cache/mcp-local-rag/models",
        "MODEL_NAME": "Xenova/all-MiniLM-L6-v2",
        "MCP_LOCAL_RAG_PARSERS": "/Users/me/.config/mcp-local-rag/file_parsers.json"
      }
    }
  }
}

CLI demo
Setting a custom parser (works in MCP + CLI)
Edit ~/.config/mcp-local-rag/file_parsers.json

{
  ".note": {
    "module": "/Users/me/.config/mcp-local-rag/parsers/note-parser.js",
    "export": "parseFile"
  }
}

Example of custom parser
note-parser.js
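The parser source itself did not survive on this page, so the following is only a sketch of what a module matching the catalog entry above might look like. It assumes the loader calls the named export (`parseFile`) with a file path and expects the extracted text back as a string or a Promise of one; that contract, the `noteToText` helper, and the ".note" comment convention are all assumptions, not something shown in the PR.

```javascript
// note-parser.js: hypothetical custom parser for ".note" files.
// ASSUMPTION: the loader calls the export named in file_parsers.json with a
// file path and uses the returned string as the text to chunk and embed.
// Verify the real contract in the PR before reusing this.
const fs = require("node:fs/promises");

// Pure helper (hypothetical): convert raw ".note" content to plain text.
// The rule that lines starting with "//" are private comments is invented
// for this example.
function noteToText(raw) {
  return raw
    .replace(/\r\n/g, "\n") // normalize Windows line endings
    .split("\n")
    .filter((line) => !line.trimStart().startsWith("//")) // drop comment lines
    .map((line) => line.trimEnd()) // strip trailing whitespace
    .join("\n")
    .trim();
}

// Export referenced by the "export": "parseFile" entry in file_parsers.json.
async function parseFile(filePath) {
  const raw = await fs.readFile(filePath, "utf8");
  return noteToText(raw);
}

module.exports = { parseFile, noteToText };
```

Keeping the text transformation in a pure function separate from the file read makes the parser easy to test without touching the filesystem.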
Set file_parsers.json catalog for MCP
MCP_LOCAL_RAG_PARSERS=/Users/me/.config/mcp-local-rag/file_parsers.json

CLI
Tests
pnpm run check:all