UN-1625 [FIX] Fixing Llama parse extraction issue#203

Merged
harini-venkataraman merged 3 commits into main from fix/llama-parse-page-extraction on Oct 29, 2025
Conversation

@harini-venkataraman
Contributor

What

PR to fix the Llama Parse extraction flow. Currently, only the first page's text is extracted; this change aggregates text from all parsed pages.
...

Why

Corrects the extraction behaviour when using Llama Parse so that multi-page documents are fully extracted.
...

How

...

Relevant Docs

Related Issues or PRs

Dependencies Versions / Env Variables

Notes on Testing

...

Screenshots

...

Checklist

I have read and understood the Contribution Guidelines.

@coderabbitai
Contributor

coderabbitai bot commented Oct 29, 2025

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Document processing in the LlamaParser adapter now aggregates text from all documents instead of only the first document.

Walkthrough

Bumped SDK version to v0.78.1 and changed the llama_parse adapter to aggregate text from all parsed documents (joined by double newlines) instead of returning only the first document's text.

Changes

  • Version bump (src/unstract/sdk/__init__.py): Updated __version__ from "v0.78.0" to "v0.78.1".
  • Document text aggregation (src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py): _call_parser now collects doc.text from all documents, filters out empty texts, and joins them with \n\n for the response instead of using documents[0].text.
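The aggregation change can be sketched as follows. This is an illustrative sketch, not the adapter's actual code: `Document` is a minimal stand-in for the parsed-page objects LlamaParse returns, and `aggregate_document_texts` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Minimal stand-in for a parsed page returned by LlamaParse."""

    text: str


def aggregate_document_texts(documents: list[Document]) -> str:
    """Join the non-empty text of every document with a blank line.

    Before the fix, only documents[0].text was returned, silently dropping
    every page after the first. This joins all pages with "\n\n" and skips
    documents whose text is empty or None.
    """
    return "\n\n".join(doc.text for doc in documents if doc.text)
```

For example, `aggregate_document_texts([Document("page 1"), Document(""), Document("page 2")])` yields `"page 1\n\npage 2"`: the empty page is filtered out and the remaining pages are separated by a blank line.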

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant LlamaParser
    Note over LlamaParser: parse documents -> returns list of Document{text}
    Caller->>LlamaParser: _call_parser(documents)
    LlamaParser-->>LlamaParser: filter empty texts\nmap to doc.text\njoin with "\n\n"
    LlamaParser-->>Caller: response_text (concatenated text)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify handling of empty documents and documents with empty text.
  • Confirm downstream consumers accept the concatenated format and \n\n delimiter.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check: ⚠️ Warning. The pull request description follows the required template with all sections present, but the "How" section is entirely empty (containing only "...") and the "Notes on Testing" section is also incomplete. The description adequately addresses the "What" (fixing Llama Parse to extract multiple pages instead of one) and the "Why" (correcting the behaviour), but the critical "How" section fails to explain the implementation details or how the fix achieves its goal. This is a significant omission for code review purposes, as reviewers need to understand the technical approach taken.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 33.33%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check: ✅ Passed. The pull request title "UN-1625 [FIX] Fixing Llama parse extraction issue" directly corresponds to the main change in the changeset: the adapter previously extracted only the first document's text (documents[0].text) and now correctly aggregates text from all documents. The title is specific, clear, and accurately summarizes the primary fix without being vague or overly broad.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/llama-parse-page-extraction

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between eabb24b and 7ea3782.

📒 Files selected for processing (1)
  • src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1)

92-93: Excellent fix! Now correctly extracts all pages.

The change from using only documents[0].text to concatenating all document texts properly addresses the stated issue. The double newline separator provides clear page boundaries.

Consider adding a defensive check to handle cases where doc.text might be None or empty:

-        response_text = "\n\n".join([doc.text for doc in documents])
+        response_text = "\n\n".join([doc.text for doc in documents if doc.text])

This filters out any None or empty text values, making the code more robust.

Please verify with a multi-page document to ensure all pages are correctly extracted and concatenated.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 0681f03 and eabb24b.

📒 Files selected for processing (2)
  • src/unstract/sdk/__init__.py (1 hunks)
  • src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1 hunks)
🔇 Additional comments (1)
src/unstract/sdk/__init__.py (1)

1-1: LGTM! Appropriate version bump for bug fix.

The patch version increment from v0.78.0 to v0.78.1 is appropriate for this bug fix.

Contributor

@gaya3-zipstack gaya3-zipstack left a comment


Looks good.


@harini-venkataraman harini-venkataraman merged commit a635d13 into main Oct 29, 2025
2 checks passed
@harini-venkataraman harini-venkataraman deleted the fix/llama-parse-page-extraction branch October 29, 2025 11:29
