UN-1625 [FIX] Fixing Llama parse extraction issue#203

Merged
harini-venkataraman merged 3 commits into main from fix/llama-parse-page-extraction on Oct 29, 2025
Conversation

@harini-venkataraman
Contributor

What

PR to fix the Llama Parse extraction flow. Currently, only the first page's text is extracted; this change aggregates text from all parsed pages.
...

Why

Corrects the extraction behaviour when using Llama Parse so that multi-page documents are fully extracted.
...

How

...

Relevant Docs

Related Issues or PRs

Dependencies Versions / Env Variables

Notes on Testing

...

Screenshots

...

Checklist

I have read and understood the Contribution Guidelines.

@coderabbitai
Contributor

coderabbitai bot commented Oct 29, 2025

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Document processing in the LlamaParser adapter now aggregates text from all documents instead of only the first document.

Walkthrough

Bumped SDK version to v0.78.1 and changed the llama_parse adapter to aggregate text from all parsed documents (joined by double newlines) instead of returning only the first document's text.

Changes

  • Version bump (src/unstract/sdk/__init__.py): Updated __version__ from "v0.78.0" to "v0.78.1".
  • Document text aggregation (src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py): _call_parser now collects doc.text from all documents, filters out empty texts, and joins them with \n\n for the response instead of using documents[0].text.
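The aggregation change can be sketched as follows. This is an illustrative sketch, not the adapter's actual code: `Document` is a minimal stand-in for the parsed-page objects LlamaParse returns, and `aggregate_document_texts` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Minimal stand-in for a parsed page returned by LlamaParse."""

    text: str


def aggregate_document_texts(documents: list[Document]) -> str:
    """Join the non-empty text of every document with a blank line.

    Before the fix, only documents[0].text was returned, silently dropping
    every page after the first. This joins all pages with "\n\n" and skips
    documents whose text is empty or None.
    """
    return "\n\n".join(doc.text for doc in documents if doc.text)
```

For example, `aggregate_document_texts([Document("page 1"), Document(""), Document("page 2")])` yields `"page 1\n\npage 2"`: the empty page is filtered out and the remaining pages are separated by a blank line.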

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant LlamaParser
    Note over LlamaParser: parse documents -> returns list of Document{text}
    Caller->>LlamaParser: _call_parser(documents)
    LlamaParser-->>LlamaParser: filter empty texts\nmap to doc.text\njoin with "\n\n"
    LlamaParser-->>Caller: response_text (concatenated text)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Verify handling of empty documents and documents with empty text.
  • Confirm downstream consumers accept the concatenated format and \n\n delimiter.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
  • Description Check: ⚠️ Warning. The pull request description follows the required template with all sections present, but the "How" section is entirely empty (containing only "...") and the "Notes on Testing" section is also incomplete. The description adequately addresses the "What" (fixing Llama Parse to extract multiple pages instead of one) and the "Why" (correcting the behaviour), but the critical "How" section fails to explain the implementation details or how the fix achieves its goal. This is a significant omission for code review purposes, as reviewers need to understand the technical approach taken.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 33.33%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
  • Title Check: ✅ Passed. The pull request title "UN-1625 [FIX] Fixing Llama parse extraction issue" directly corresponds to the main change in the changeset: the adapter previously extracted only the first document's text (documents[0].text) and now correctly aggregates text from all documents. The title is specific, clear, and accurately summarizes the primary fix without being vague or overly broad.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/llama-parse-page-extraction

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between eabb24b and 7ea3782.

📒 Files selected for processing (1)
  • src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1)

92-93: Excellent fix! Now correctly extracts all pages.

The change from using only documents[0].text to concatenating all document texts properly addresses the stated issue. The double newline separator provides clear page boundaries.

Consider adding a defensive check to handle cases where doc.text might be None or empty:

-        response_text = "\n\n".join([doc.text for doc in documents])
+        response_text = "\n\n".join([doc.text for doc in documents if doc.text])

This filters out any None or empty text values, making the code more robust.

Please verify with a multi-page document to ensure all pages are correctly extracted and concatenated.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to Reviews > Disable Cache setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 0681f03 and eabb24b.

📒 Files selected for processing (2)
  • src/unstract/sdk/__init__.py (1 hunks)
  • src/unstract/sdk/adapters/x2text/llama_parse/src/llama_parse.py (1 hunks)
🔇 Additional comments (1)
src/unstract/sdk/__init__.py (1)

1-1: LGTM! Appropriate version bump for bug fix.

The patch version increment from v0.78.0 to v0.78.1 is appropriate for this bug fix.

Contributor

@gaya3-zipstack gaya3-zipstack left a comment


Looks good.


@harini-venkataraman harini-venkataraman merged commit a635d13 into main Oct 29, 2025
2 checks passed
@harini-venkataraman harini-venkataraman deleted the fix/llama-parse-page-extraction branch October 29, 2025 11:29
