skill-creator: fix run_eval.py crash on Windows when reading from subprocess pipe#1099
skill-creator: fix run_eval.py crash on Windows when reading from subprocess pipe#1099joshuawowk wants to merge 1 commit into
Conversation
…process pipe select.select() on a subprocess stdout pipe raises OSError [WinError 10038] on Windows because Windows' select() only accepts socket fds. The except handler in run_eval converted that into a 'query failed: ...' warning and recorded every result as not-triggered, so the optimization loop reported precision=100% recall=0% regardless of the description being tested. Replace the select-based polling with a small daemon thread that reads stdout into a queue and a queue.get(timeout=1.0) on the main loop. Same time budget, same back-pressure, works identically on POSIX. Verified on Windows 11 (Python 3.13.12, claude 2.1.101) with a 20-query eval set: previously 0 of 10 should-trigger queries triggered; with this patch the run completes end-to-end and trigger rates are real.
|
Independent confirmation of this bug pattern from another Windows Where I hit it: running Symptom: Convergent fix: I patched my local copy before finding this PR. One small implementation delta (yours is cleaner):
Both are correct. Yours is simpler and avoids the redundant poll Adopt-when-merged: I'll remove my local patch in favor of yours |
|
Independently reproduced on Windows 11 (Python 3.13.4,
I built the same fix independently and verified end-to-end on Windows: a should-trigger query is detected (returns early, ~6s), a should-not is not, and a controlled hard-timeout test confirmed the deadline fires, the child is killed, and the reader thread joins cleanly with no deadlock, including through a One thing worth flagging for anyone tracking this: this fix is necessary but not sufficient to make the optimizer usable on Windows. Even with the read loop fixed, the loop still crashes on Aside, unrelated to this PR's code: some context on how community code contributions here have been getting reviewed, for anyone weighing whether to invest effort: #1195 |
|
Third independent Windows reproduction. Env: Win11 Pro 26200 / Python 3.13.13 / uv via Astral / Git Bash (MSYS2) / claude-code 2.1.145. SHA-verified the local The silent-failure aspect @dmwyatt flagged is what cost me the most diagnosis time. Every query returned Confirming @dmwyatt's downstream point from my own workload: my skill description contains em-dashes ( +1 for merge from another affected production user. Happy to test the merged fix against a real multi-iteration optimization workload. |
Summary
run_eval.pyis unusable on Windows: every query gets recorded as "not triggered" regardless of the description being tested, so the optimization loop reportsprecision=100% recall=0%on every iteration. Symptom is a flood ofWarning: query failed: [WinError 10038] An operation was attempted on something that is not a socketlines.Root cause is the read loop that polls the child process's stdout pipe with
select.select:On Windows,
select.select()only accepts socket file descriptors — pipe fds raiseOSError WinError 10038. The exception bubbles up torun_eval's broadexcept Exception, which logs the warning and pushesFalseinto the per-query result list. With every read failing instantly, noclaude -pinvocation ever has its stdout consumed, so trigger detection always fails.Fix
Replace the
select-based poll with a daemon thread that reads stdout into aqueue.Queue, and usequeue.get(timeout=1.0)on the main loop. Same time budget, same back-pressure semantics, works identically on POSIX.The thread is
daemon=True, so it dies when the parent process exits; the existingprocess.kill()in the surroundingfinallycauses the reader'sread()to returnb''and push theNonesentinel cleanly.Reproduction
On any Windows host with Claude Code authenticated:
Pre-fix output (excerpt):
Post-fix on the same eval set:
WinError 10038Test environment
ANTHROPIC_API_KEY)Notes
if process.poll() is not None: ... read remainingearly-exit branch because the reader thread already drains stdout naturally and pushes theNonesentinel as soon as the child closes its stdout (which happens when it exits). The previous branch was an optimization that became redundant.if sys.platform == "win32") but chose the unconditional thread approach because it eliminates the platform-specific code path and the per-platform test surface. Performance is identical for skill-eval workloads (kilobytes per query, single-digit events per second).