
skill-creator: fix run_eval.py crash on Windows when reading from subprocess pipe #1099

Open
joshuawowk wants to merge 1 commit into anthropics:main from joshuawowk:fix/run-eval-windows-pipe-select

@joshuawowk


Summary

run_eval.py is unusable on Windows: every query gets recorded as "not triggered" regardless of the description being tested, so the optimization loop reports precision=100% recall=0% on every iteration. Symptom is a flood of Warning: query failed: [WinError 10038] An operation was attempted on something that is not a socket lines.

Root cause is the read loop that polls the child process's stdout pipe with select.select:

ready, _, _ = select.select([process.stdout], [], [], 1.0)

On Windows, select.select() only accepts socket file descriptors — pipe fds raise OSError WinError 10038. The exception bubbles up to run_eval's broad except Exception, which logs the warning and pushes False into the per-query result list. With every read failing instantly, no claude -p invocation ever has its stdout consumed, so trigger detection always fails.
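The platform difference can be probed in isolation. The following is a minimal sketch, not code from run_eval.py: `probe_select_on_pipe` is a hypothetical helper name, and the child command is only a stand-in for the real `claude -p` invocation. On POSIX the call is expected to succeed; on Windows it is expected to raise the OSError described above.

```python
import select
import subprocess
import sys


def probe_select_on_pipe() -> str:
    """Return "ok" if select() accepts a subprocess pipe, else the error text."""
    # Stand-in child process (assumption); the real script launches `claude -p`.
    process = subprocess.Popen(
        [sys.executable, "-c", "print('hello')"],
        stdout=subprocess.PIPE,
    )
    try:
        # Same call shape as the line quoted in the summary.
        select.select([process.stdout], [], [], 1.0)
        return "ok"
    except OSError as exc:
        # Expected on Windows, where select() only handles WinSock sockets.
        return f"select failed: {exc}"
    finally:
        # Drain the pipe and reap the child so nothing is left behind.
        process.communicate()


if __name__ == "__main__":
    print(probe_select_on_pipe())
```

Because the helper catches the error itself, it makes the failure visible instead of letting a broad handler turn it into a "not triggered" result.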

Fix

Replace the select-based poll with a daemon thread that reads stdout into a queue.Queue, and use queue.get(timeout=1.0) on the main loop. Same time budget, same back-pressure semantics, works identically on POSIX.

chunk_queue: "queue.Queue[bytes | None]" = queue.Queue()

def _reader() -> None:
    try:
        while True:
            chunk = process.stdout.read(8192)
            if not chunk:
                break
            chunk_queue.put(chunk)
    finally:
        chunk_queue.put(None)  # EOF sentinel

reader_thread = threading.Thread(target=_reader, daemon=True)
reader_thread.start()

# main loop:
try:
    chunk = chunk_queue.get(timeout=1.0)
except queue.Empty:
    continue
if chunk is None:
    break  # reader thread saw EOF
buffer += chunk.decode("utf-8", errors="replace")

The thread is daemon=True, so it dies when the parent process exits; the existing process.kill() in the surrounding finally causes the reader's read() to return b'' and push the None sentinel cleanly.
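For readers who want to try the pattern outside run_eval.py, here is a self-contained sketch of the same reader-thread approach. The function name `read_with_deadline`, the `timeout` parameter, and the stand-in child command are assumptions for illustration; the surrounding timeout and cleanup logic in the real script may differ.

```python
import queue
import subprocess
import sys
import threading
import time


def read_with_deadline(cmd: list[str], timeout: float = 10.0) -> str:
    """Collect a child's stdout via a reader thread until EOF or the deadline."""
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    chunk_queue: "queue.Queue[bytes | None]" = queue.Queue()

    def _reader() -> None:
        try:
            while True:
                chunk = process.stdout.read(8192)
                if not chunk:
                    break
                chunk_queue.put(chunk)
        finally:
            chunk_queue.put(None)  # EOF sentinel

    threading.Thread(target=_reader, daemon=True).start()

    buffer = ""
    deadline = time.monotonic() + timeout
    try:
        while time.monotonic() < deadline:
            try:
                chunk = chunk_queue.get(timeout=1.0)
            except queue.Empty:
                continue  # nothing yet; loop back and re-check the deadline
            if chunk is None:
                break  # reader thread saw EOF
            buffer += chunk.decode("utf-8", errors="replace")
    finally:
        # Killing the child closes its end of the pipe, which unblocks read().
        if process.poll() is None:
            process.kill()
        process.wait()
    return buffer


if __name__ == "__main__":
    print(read_with_deadline([sys.executable, "-c", "print('hello')"]))
```

One thing to check when adapting this: if `process.stdout` is a buffered pipe object (the `subprocess.Popen` default), `read(8192)` can block until 8192 bytes have arrived or the child closes stdout, so output may reach the main loop later than it was written. `read1(8192)` returns as soon as any bytes are available.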

Reproduction

On any Windows host with Claude Code authenticated:

# Any non-trivial skill works; smaller is faster.
python -m scripts.run_loop `
  --eval-set my-eval.json `
  --skill-path C:\path\to\my-skill `
  --model claude-sonnet-4-5 `
  --max-iterations 1 `
  --verbose

Pre-fix output (excerpt):

Warning: query failed: [WinError 10038] An operation was attempted on something that is not a socket
Warning: query failed: [WinError 10038] An operation was attempted on something that is not a socket
... (one per query) ...
Train: 12/24 correct, precision=100% recall=0% accuracy=50%
  [FAIL] rate=0/3 expected=True: <every should-trigger query>
  [PASS] rate=0/3 expected=False: <every should-not-trigger query>
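The numbers in that excerpt follow directly from every prediction being False. A small worked example, assuming a train split of 12 should-trigger and 12 should-not-trigger queries (inferred from the 12/24 line) and assuming precision is reported as 100% when nothing is predicted positive; the `score` helper is hypothetical and not taken from the scripts:

```python
def score(expected: list[bool], predicted: list[bool]) -> dict[str, float]:
    """Compute precision, recall, and accuracy for boolean trigger predictions."""
    pairs = list(zip(expected, predicted))
    tp = sum(1 for e, p in pairs if e and p)
    fp = sum(1 for e, p in pairs if not e and p)
    fn = sum(1 for e, p in pairs if e and not p)
    tn = sum(1 for e, p in pairs if not e and not p)
    # Assumption: with no positive predictions, precision defaults to 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall, "accuracy": accuracy}


# Every query recorded as "not triggered", as happens when each read fails.
expected = [True] * 12 + [False] * 12
predicted = [False] * 24
print(score(expected, predicted))
# {'precision': 1.0, 'recall': 0.0, 'accuracy': 0.5}
```

The result does not depend on the skill description at all, which is why the optimization loop sees the same scores on every iteration.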

Post-fix on the same eval set:

  • Run completes without WinError 10038
  • Per-query trigger rates vary by query (no longer uniformly 0)
  • Both true positives and true negatives are reachable

Test environment

  • Windows 11 24H2
  • Python 3.13.12
  • Claude Code 2.1.101 (authenticated to a Max plan, no ANTHROPIC_API_KEY)
  • Eval set with 10 should-trigger and 10 should-not-trigger queries

Notes

  • No changes to behavior on POSIX. The thread approach matches the original poll cadence (1 second wakeups).
  • I deliberately removed the if process.poll() is not None: ... read remaining early-exit branch because the reader thread already drains stdout naturally and pushes the None sentinel as soon as the child closes its stdout (which happens when it exits). The previous branch was an optimization that became redundant.
  • I considered a platform gate (if sys.platform == "win32") but chose the unconditional thread approach because it eliminates the platform-specific code path and the per-platform test surface. Performance is identical for skill-eval workloads (kilobytes per query, single-digit events per second).

…process pipe

select.select() on a subprocess stdout pipe raises OSError [WinError 10038] on Windows because Windows' select() only accepts socket fds. The except handler in run_eval converted that into a 'query failed: ...' warning and recorded every result as not-triggered, so the optimization loop reported precision=100% recall=0% regardless of the description being tested.

Replace the select-based polling with a small daemon thread that reads stdout into a queue and a queue.get(timeout=1.0) on the main loop. Same time budget, same back-pressure, works identically on POSIX. Verified on Windows 11 (Python 3.13.12, claude 2.1.101) with a 20-query eval set: previously 0 of 10 should-trigger queries triggered; with this patch the run completes end-to-end and trigger rates are real.
@SaluxSolutions


Independent confirmation of this bug pattern from another Windows
environment, and an independent convergence on functionally the
same fix.

Where I hit it: running scripts/run_eval.py from the
marketplace-distributed copy of skill-creator at
anthropics/claude-plugins-official/plugins/skill-creator/skills/skill-creator/scripts/run_eval.py,
which mirrors the upstream source-of-truth at
anthropics/skills:skills/skill-creator/scripts/run_eval.py (the diff
base of this PR).

Symptom: OSError [WinError 10038] from select.select() on the
subprocess pipe fd — exactly the bug your PR description identifies.
Confirms it's not a one-off; reproduces against a freshly-installed
marketplace plugin without any local modifications.

Convergent fix: I patched my local copy before finding this PR.
The resulting diff is functionally near-identical to yours — same
imports (queue, threading), same chunk_queue + _reader() thread
pattern replacing select.select + os.read(fileno). The independent
arrival at the same shape is, I think, a useful signal that this is
the right portable approach.

One small implementation delta (yours is cleaner):

  • Yours relies entirely on the EOF sentinel from the reader thread.
  • Mine added a defensive early-break: if process.poll() is not None and output_queue.empty(): break in the main loop.

Both are correct. Yours is simpler and avoids the redundant poll
check — the EOF sentinel handles process termination by definition.
No suggestion attached; just noting the difference so it doesn't
look like a missing case if a reviewer compares the two.

Adopt-when-merged: I'll remove my local patch in favor of yours
once this lands upstream. Happy to test the final version against
a fresh Windows install before/after merge if useful.

@dmwyatt

dmwyatt commented May 24, 2026


Independently reproduced on Windows 11 (Python 3.13.4, uv, Git Bash) against a real claude -p subprocess. Confirming the root cause and that this fix is the right shape.

select.select([process.stdout], ...) raises OSError: [WinError 10038] on the first read of every query, because on Windows select is WinSock-only and rejects pipe handles (Python docs, select.select: "File objects on Windows are not acceptable, but sockets are ... does not handle file descriptors that don't originate from WinSock"). The exception is swallowed by run_eval's broad except, so every query is recorded as not-triggered and the loop reports precision=100% / recall=0% / accuracy=50% on every iteration, tuning against noise. A reader-thread + queue.Queue is the correct stdlib-only, cross-platform replacement (the selectors module shares the same WinSock limitation, so it is not an alternative here).

I built the same fix independently and verified end-to-end on Windows: a should-trigger query is detected (returns early, ~6s), a should-not is not, and a controlled hard-timeout test confirmed the deadline fires, the child is killed, and the reader thread joins cleanly with no deadlock, including through a ProcessPoolExecutor worker. Three small deltas you may or may not want: reading by line (iter(stream.readline, b"")) instead of fixed chunks lets the existing newline-split parser go away; bounding the get with queue.get(timeout=min(remaining, 1.0)) keeps the deadline tight to within the last partial second; and reader.join(timeout=1.0) in the finally avoids leaving a daemon thread per query in a reused pool worker. All optional; your version is correct as-is.
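A sketch of how those optional deltas might fit together, for comparison with the PR's version. The name `read_lines_with_deadline`, its parameters, and the stand-in child command are assumptions; this is not the code from either patch.

```python
import queue
import subprocess
import sys
import threading
import time


def read_lines_with_deadline(cmd: list[str], timeout: float = 10.0) -> list[str]:
    """Collect a child's stdout line by line until EOF or the deadline."""
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    line_queue: "queue.Queue[bytes | None]" = queue.Queue()

    def _reader() -> None:
        try:
            # Delta 1: read by line, so the consumer needs no newline splitting.
            for line in iter(process.stdout.readline, b""):
                line_queue.put(line)
        finally:
            line_queue.put(None)  # EOF sentinel

    reader = threading.Thread(target=_reader, daemon=True)
    reader.start()

    lines: list[str] = []
    deadline = time.monotonic() + timeout
    try:
        while True:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break  # hard deadline reached
            try:
                # Delta 2: never wait past the deadline on the final get.
                line = line_queue.get(timeout=min(remaining, 1.0))
            except queue.Empty:
                continue
            if line is None:
                break  # reader thread saw EOF
            lines.append(line.decode("utf-8", errors="replace").rstrip("\r\n"))
    finally:
        if process.poll() is None:
            process.kill()
        process.wait()
        # Delta 3: give the reader a moment to finish instead of abandoning it.
        reader.join(timeout=1.0)
    return lines


if __name__ == "__main__":
    print(read_lines_with_deadline([sys.executable, "-c", "print('a'); print('b')"]))
```

Reading by line also hands each line to the consumer as soon as the child flushes it, which matters when the main loop wants to return early on a trigger.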

One thing worth flagging for anyone tracking this: this fix is necessary but not sufficient to make the optimizer usable on Windows. Even with the read loop fixed, the loop still crashes on scripts/utils.py:9 (parse_skill_md does (skill_path / "SKILL.md").read_text() with no encoding) for any SKILL.md containing non-cp1252 bytes (an emoji or smart quotes in the description), and the same locale-default-encoding pattern affects roughly seven files. That is a separate bug; PR #1050 covers part of it. Full audit (all affected files/lines) here: #1050 (comment)
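To illustrate the encoding point, here is a minimal sketch assuming the locale default is cp1252, as described above. `read_skill_md` and `decodes_as_cp1252` are hypothetical helpers written for this example, not the functions in scripts/utils.py.

```python
import tempfile
from pathlib import Path


def read_skill_md(skill_path: Path) -> str:
    """Read SKILL.md as UTF-8 regardless of the locale default encoding."""
    return (skill_path / "SKILL.md").read_text(encoding="utf-8")


def decodes_as_cp1252(data: bytes) -> bool:
    """Report whether the bytes survive a cp1252 decode without an error."""
    try:
        data.decode("cp1252")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    # A description containing smart quotes (U+201C and U+201D).
    text = "description: \u201csmart quotes\u201d"
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "SKILL.md").write_text(text, encoding="utf-8")
        print(read_skill_md(Path(tmp)) == text)
    # The UTF-8 bytes of U+201D include 0x9D, which cp1252 leaves undefined.
    print(decodes_as_cp1252(text.encode("utf-8")))
```

Some UTF-8 sequences do decode under cp1252 without raising and produce mojibake instead, so a crash is only one of the possible outcomes of relying on the locale default.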


Aside, unrelated to this PR's code: some context on how community code contributions here have been getting reviewed, for anyone weighing whether to invest effort: #1195

@hiiqbiz-wq


Third independent Windows reproduction.

Env: Win11 Pro 26200 / Python 3.13.13 / uv via Astral / Git Bash (MSYS2) / claude-code 2.1.145. SHA-verified the local run_eval.py is byte-identical to this PR's base — clean upstream repro, no local patches.

The silent-failure aspect @dmwyatt flagged is what cost me the most diagnosis time. Every query returned triggered=False with no surfaced error, so my first read was "my skill description is broken," not "the eval loop is broken." Even before this PR lands, a single print(f"warning: query failed: {e}", file=sys.stderr) at the bare except in run_single_query would shave hours off the next Windows user's debug.
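A sketch of what that suggestion could look like in isolation. The wrapper name `run_query_reporting_errors` and the callable argument are hypothetical; the actual handler in run_single_query may be structured differently.

```python
import sys
from typing import Callable


def run_query_reporting_errors(run_query: Callable[[], bool]) -> bool:
    """Treat a failed query as not-triggered, but report the reason on stderr."""
    try:
        return run_query()
    except Exception as e:
        # Surface the cause so an environment failure is not mistaken for a
        # genuine "did not trigger" result.
        print(f"warning: query failed: {e}", file=sys.stderr)
        return False


if __name__ == "__main__":
    def failing_query() -> bool:
        raise OSError("[WinError 10038] not a socket")

    print(run_query_reporting_errors(failing_query))
```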

Confirming @dmwyatt's downstream point from my own workload: my skill description contains em-dashes (U+2014), so parse_skill_md would have been my next blocker even with this fix in. PR #1050 covers it; these two PRs together are the actual "make skill-creator work on Windows" change set.

+1 for merge from another affected production user. Happy to test the merged fix against a real multi-iteration optimization workload.

4 participants