fix: add timeout parameter to MinerU subprocess to prevent indefinite hang#254
Conversation
… hang
When MinerU tries to download model weights and the network is unavailable
or slow, the subprocess poll loop ran forever with no way to interrupt it.
Add an optional timeout (seconds) parameter to _run_mineru_command. If the
process does not finish within the deadline the subprocess is killed and a
TimeoutError is raised with a clear message pointing users to check their
network connection or pre-download the models.
The parameter flows naturally from process_document_complete / parse_pdf
through **kwargs, so callers can opt in:
await rag.process_document_complete("doc.pdf", timeout=600)
Fixes HKUDS#172
Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Abdeltoto
left a comment
There was a problem hiding this comment.
Thanks @peterCheng123321 — this is a real pain point for users on slow networks or behind corporate proxies, glad to see it addressed. The implementation is well thought out:
- ✅
time.monotonic()(nottime.time()) — correct choice, immune to wall-clock changes. - ✅
process.kill()followed byprocess.wait()— properly reaps the subprocess. - ✅ Built-in
TimeoutErrorrather than a custom exception — keeps the surface small. - ✅ Default
Nonepreserves backward compatibility, opt-in via**kwargsplumbing. - ✅ The error message points users at the most likely root cause (model download).
LGTM 👍
Two small suggestions, both non-blocking:
-
Duplicate
import time: the newimport timeat line ~770 is good (lifts it out of the hot loop), but the originalimport timeinside the loop body still exists in the diff context. If that import is no longer needed inside the loop, removing it would tidy the function. (If it's already removed in the actual file and just hasn't shown in the diff context I'm reading, please ignore.) -
Stdout/stderr threads on timeout: when
process.kill()fires, the daemon stdout/stderr reader threads keep running until the killed process's pipes close. In practice this resolves quickly, but astdout_thread.join(timeout=1)andstderr_thread.join(timeout=1)right after the kill would make the shutdown deterministic and avoid any chance of partial output dribbling in after theTimeoutErroris raised.
A test would be nice but honestly hard to add without slowing CI — a subprocess.Popen mock that sleeps longer than the timeout would do it, but I don't think this should block merge.
…meoutError Addresses review feedback on HKUDS#254: - import time was inside the try block; move it to module-level with the other stdlib imports - after process.kill()/wait() on timeout, join the stdout/stderr reader threads with a 1 s deadline so pipe output stops dribbling in after the TimeoutError is raised and shutdown is deterministic Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
|
Thanks for the thorough review @Abdeltoto! Both suggestions applied in the latest commit:
|
Made-with: Cursor
Summary
Fixes #172.
When MinerU tries to download model weights on first run and the network is unavailable or slow, the subprocess poll loop ran forever with no way to interrupt it. Users in LAN/offline environments were stuck with no way to abort.
Fix: add an optional
timeout(seconds) parameter to_run_mineru_command. If the subprocess does not finish within the deadline, it is killed and aTimeoutErroris raised with a message pointing users to check their network or pre-download models.The parameter flows naturally through
**kwargsso no API changes are needed for existing callers:Test plan
process_document_completewithtimeout=5on a large PDF that would normally take longer — confirmTimeoutErroris raised and the subprocess is killedtimeout— confirm existing behaviour is unchanged