UN-1625 [FIX] Fixing Llama parse extraction issue #203
harini-venkataraman merged 3 commits into main from
Conversation
Summary by CodeRabbit
Release Notes
Walkthrough
Bumped the SDK version to v0.78.1 and changed the llama_parse adapter to aggregate text from all parsed documents (joined by double newlines) instead of returning only the first document's text.
Sequence Diagram(s)
sequenceDiagram
participant Caller
participant LlamaParser
Note over LlamaParser: parse documents -> returns list of Document{text}
Caller->>LlamaParser: _call_parser(documents)
LlamaParser-->>LlamaParser: filter empty texts\nmap to doc.text\njoin with "\n\n"
LlamaParser-->>Caller: response_text (concatenated text)
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 0
🧹 Nitpick comments (1)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1)
92-93: Excellent fix! Now correctly extracts all pages.

The change from using only `documents[0].text` to concatenating all document texts properly addresses the stated issue. The double newline separator provides clear page boundaries.

Consider adding a defensive check to handle cases where `doc.text` might be `None` or empty:

    - response_text = "\n\n".join([doc.text for doc in documents])
    + response_text = "\n\n".join([doc.text for doc in documents if doc.text])

This filters out any `None` or empty text values, making the code more robust.

Please verify with a multi-page document to ensure all pages are correctly extracted and concatenated.
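The filtered-join pattern suggested above can be sketched as follows. Note this is a minimal standalone illustration: `Document` here is a hypothetical stand-in for the parsed-page objects llama_parse returns (anything with a `.text` attribute), and `join_document_texts` is an illustrative helper, not a function from the SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    """Hypothetical stand-in for a parsed-page object with a .text attribute."""
    text: Optional[str] = None


def join_document_texts(documents: list[Document]) -> str:
    """Concatenate page texts with double newlines, skipping None/empty pages."""
    return "\n\n".join(doc.text for doc in documents if doc.text)


# A multi-page result with a None page and an empty page mixed in:
docs = [Document("Page 1"), Document(None), Document(""), Document("Page 2")]
print(join_document_texts(docs))
```

Without the `if doc.text` filter, a `None` page would raise a `TypeError` inside `str.join`, and an empty page would produce a stray run of consecutive blank separators, so the guard costs nothing and removes both failure modes.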
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to Reviews > Disable Cache setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- src/unstract/sdk/__init__.py (1 hunks)
- src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1 hunks)
🔇 Additional comments (1)
src/unstract/sdk/__init__.py (1)
1-1: LGTM! Appropriate version bump for a bug fix.

The patch version increment from v0.78.0 to v0.78.1 is appropriate for this bug fix.
What
PR to fix Llama parse extraction. Currently, only the first page's text is extracted; this change concatenates the text of all parsed pages.
...
Why
Fixes incorrect behaviour when using Llama parse, where only the first page's text was returned.
...
How
...
Relevant Docs
Related Issues or PRs
Dependencies Versions / Env Variables
Notes on Testing
...
Screenshots
...
Checklist
I have read and understood the Contribution Guidelines.