fix: add timeout parameter to MinerU subprocess to prevent indefinite hang #254

Merged: LarFii merged 3 commits into HKUDS:main from peterCheng123321:fix/172-mineru-subprocess-timeout on Apr 25, 2026

@peterCheng123321 commented Apr 21, 2026

Summary

Fixes #172.

When MinerU tries to download model weights on first run and the network is unavailable or slow, the subprocess poll loop runs forever. Users in LAN-only or offline environments are left with a hung call and no way to abort it.

Fix: add an optional timeout (seconds) parameter to _run_mineru_command. If the subprocess does not finish within the deadline, it is killed and a TimeoutError is raised with a message pointing users to check their network or pre-download models.
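
A minimal sketch of the deadline check described above, for illustration only. It is not the code from this PR: `run_with_deadline` and `poll_interval` are hypothetical names, and the real `_run_mineru_command` also handles output streaming.

```python
import subprocess
import sys
import time


def run_with_deadline(cmd, timeout=None, poll_interval=0.1):
    """Run cmd and poll until it exits; kill it once `timeout` seconds pass.

    Hypothetical helper: timeout=None keeps the old wait-forever behaviour.
    """
    process = subprocess.Popen(cmd)
    # Monotonic clock, so wall-clock adjustments cannot move the deadline.
    deadline = None if timeout is None else time.monotonic() + timeout
    while process.poll() is None:
        if deadline is not None and time.monotonic() > deadline:
            process.kill()
            process.wait()  # reap the child so it does not linger as a zombie
            raise TimeoutError(
                f"Command did not finish within {timeout} seconds. "
                "Check the network connection or pre-download the models."
            )
        time.sleep(poll_interval)
    return process.returncode


if __name__ == "__main__":
    # A child that sleeps far longer than the deadline stands in for a stalled download.
    slow_child = [sys.executable, "-c", "import time; time.sleep(30)"]
    try:
        run_with_deadline(slow_child, timeout=0.5)
    except TimeoutError as exc:
        print(f"aborted: {exc}")
```

Checking the deadline inside the existing poll loop keeps the change small: no extra thread or signal handler is needed, at the cost of overshooting the deadline by at most one poll interval.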

The parameter flows naturally through **kwargs so no API changes are needed for existing callers:

# Opt in with a 10-minute deadline
await rag.process_document_complete("doc.pdf", timeout=600)

# Or in parse_pdf directly
content_list = parser.parse_pdf("doc.pdf", output_dir="./out", timeout=300)
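
The pass-through can be pictured with a simplified sketch. The three functions below are hypothetical stand-ins, not the project's real signatures; they only show how a keyword argument reaches the innermost call when each layer forwards `**kwargs`.

```python
def run_command(path, timeout=None, **kwargs):
    # Innermost layer: the only place that actually names `timeout`.
    return {"path": path, "timeout": timeout}


def parse_pdf(path, output_dir=None, **kwargs):
    # Middle layer: unaware of `timeout`, it just forwards extra keywords.
    return run_command(path, **kwargs)


def process_document(path, **kwargs):
    # Outermost layer: same forwarding, so its own signature never changes.
    return parse_pdf(path, **kwargs)


print(process_document("doc.pdf", timeout=600))
print(process_document("doc.pdf"))
```

Callers that never pass `timeout` get the default `None`, which is why existing code keeps its current behaviour.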

Test plan

  • Call process_document_complete with timeout=5 on a large PDF that would normally take longer — confirm TimeoutError is raised and the subprocess is killed
  • Call without timeout — confirm existing behaviour is unchanged
  • Verify the error message mentions network/model download as the likely cause

… hang

When MinerU tries to download model weights and the network is unavailable
or slow, the subprocess poll loop runs forever with no way to interrupt it.

Add an optional timeout (seconds) parameter to _run_mineru_command. If the
process does not finish within the deadline the subprocess is killed and a
TimeoutError is raised with a clear message pointing users to check their
network connection or pre-download the models.

The parameter flows naturally from process_document_complete / parse_pdf
through **kwargs, so callers can opt in:

    await rag.process_document_complete("doc.pdf", timeout=600)

Fixes HKUDS#172

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

@Abdeltoto (Contributor) left a comment

Thanks @peterCheng123321 — this is a real pain point for users on slow networks or behind corporate proxies, glad to see it addressed. The implementation is well thought out:

  • ✅ time.monotonic() (not time.time()): correct choice, immune to wall-clock changes.
  • ✅ process.kill() followed by process.wait(): properly reaps the subprocess.
  • ✅ Built-in TimeoutError rather than a custom exception: keeps the surface small.
  • ✅ Default None preserves backward compatibility, opt-in via **kwargs plumbing.
  • ✅ The error message points users at the most likely root cause (model download).

LGTM 👍

Two small suggestions, both non-blocking:

  1. Duplicate import time: the new import time at line ~770 is good (lifts it out of the hot loop), but the original import time inside the loop body still exists in the diff context. If that import is no longer needed inside the loop, removing it would tidy the function. (If it's already removed in the actual file and just hasn't shown in the diff context I'm reading, please ignore.)

  2. Stdout/stderr threads on timeout: when process.kill() fires, the daemon stdout/stderr reader threads keep running until the killed process's pipes close. In practice this resolves quickly, but a stdout_thread.join(timeout=1) and stderr_thread.join(timeout=1) right after the kill would make the shutdown deterministic and avoid any chance of partial output dribbling in after the TimeoutError is raised.
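
To make suggestion 2 concrete, here is a rough sketch of the shutdown order I have in mind. The names `run_and_capture` and `_drain` are made up for illustration and the reader threads are simplified; only the kill, wait, join sequence is the point.

```python
import subprocess
import sys
import threading
import time


def _drain(pipe, sink):
    """Append lines from `pipe` to `sink` until the pipe reaches EOF."""
    for line in iter(pipe.readline, ""):
        sink.append(line)
    pipe.close()


def run_and_capture(cmd, timeout=None):
    process = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    out, err = [], []
    stdout_thread = threading.Thread(
        target=_drain, args=(process.stdout, out), daemon=True
    )
    stderr_thread = threading.Thread(
        target=_drain, args=(process.stderr, err), daemon=True
    )
    stdout_thread.start()
    stderr_thread.start()

    deadline = None if timeout is None else time.monotonic() + timeout
    while process.poll() is None:
        if deadline is not None and time.monotonic() > deadline:
            process.kill()
            process.wait()
            # Bounded joins: each reader gets up to 1 s to see EOF, so no
            # output is appended after the exception reaches the caller.
            stdout_thread.join(timeout=1)
            stderr_thread.join(timeout=1)
            raise TimeoutError(f"command exceeded {timeout} s")
        time.sleep(0.05)

    stdout_thread.join()
    stderr_thread.join()
    return process.returncode, out, err


if __name__ == "__main__":
    print(run_and_capture([sys.executable, "-c", "print('hi')"]))
```

The joins use a timeout rather than blocking forever because a grandchild process could still hold the pipe open after the direct child is killed.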

A test would be nice but honestly hard to add without slowing CI — a subprocess.Popen mock that sleeps longer than the timeout would do it, but I don't think this should block merge.
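
For what it's worth, a mock whose poll() never reports an exit avoids real sleeping almost entirely. The sketch below is only an outline: `run_with_deadline` is a hypothetical stand-in for the function under test, so the patch target and the call would need adapting to the real module.

```python
import subprocess
import time
from unittest import mock


def run_with_deadline(cmd, timeout=None):
    # Hypothetical stand-in for the function under test.
    process = subprocess.Popen(cmd)
    deadline = None if timeout is None else time.monotonic() + timeout
    while process.poll() is None:
        if deadline is not None and time.monotonic() > deadline:
            process.kill()
            process.wait()
            raise TimeoutError(f"timed out after {timeout} s")
        time.sleep(0.01)
    return process.returncode


def test_timeout_kills_process():
    fake_process = mock.Mock()
    fake_process.poll.return_value = None  # the fake process never exits
    with mock.patch("subprocess.Popen", return_value=fake_process):
        try:
            run_with_deadline(["mineru"], timeout=0.05)
        except TimeoutError:
            pass
        else:
            raise AssertionError("expected TimeoutError")
    # The process must be both killed and reaped.
    fake_process.kill.assert_called_once()
    fake_process.wait.assert_called_once()


if __name__ == "__main__":
    test_timeout_kills_process()
    print("ok")
```

With a 0.05 s deadline the test costs well under a second, so it should not slow CI noticeably.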

…meoutError

Addresses review feedback on HKUDS#254:
- import time was inside the try block; move it to module-level with the
  other stdlib imports
- after process.kill()/wait() on timeout, join the stdout/stderr reader
  threads with a 1 s deadline so pipe output stops dribbling in after the
  TimeoutError is raised and shutdown is deterministic

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>

@peterCheng123321 (Author) commented

Thanks for the thorough review @Abdeltoto! Both suggestions applied in the latest commit:

  1. Moved import time to module level with the other stdlib imports
  2. Added stdout_thread.join(timeout=1) and stderr_thread.join(timeout=1) right after process.wait() on timeout so thread teardown is deterministic before the TimeoutError propagates

@LarFii merged commit 5959335 into HKUDS:main on Apr 25, 2026
1 check failed

Successfully merging this pull request may close these issues.

[Bug]:The process got stuck at rag.process_document_complete due to network issues.

4 participants