fix: install missing optuna dep and add LLM request timeout#41

Open
seilk wants to merge 1 commit into NousResearch:main from seilk:fix/missing-optuna-and-request-timeout

Conversation


@seilk seilk commented Apr 26, 2026

Summary

Two unrelated stock-install bugs that together prevent the optimization loop from completing on a fresh clone. Both surfaced while running `evolve_skill --eval-source synthetic --iterations 3` end-to-end.

1. Missing optuna dependency

`evolve_skill.py` falls back to MIPROv2 automatically when GEPA fails to initialize, which currently happens on every DSPy ≥3.0 install because `GEPA.__init__()` no longer accepts `max_steps` (see open PRs #14, #35, #39 for the GEPA-side fix).

MIPROv2 imports `optuna` lazily inside `_optimize_prompt_parameters`. So the fallback runs through Steps 1 and 2 (bootstrap + instruction proposal, already a few minutes of LLM calls and dollars spent) and then crashes:

```
ModuleNotFoundError: No module named 'optuna'
ImportError: MIPROv2 requires optional dependency 'optuna'.
            Install it with `pip install dspy[optuna]`.
```
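The late failure comes from the common optional-dependency pattern: the import runs only once optimization reaches the trial stage, after the expensive bootstrap and proposal steps have already spent money. A rough sketch of the pattern (`require_optional` is illustrative, not DSPy's actual helper):

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency lazily, raising a helpful error
    that names the pip extra providing it."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ImportError(
            f"requires optional dependency {module_name!r}; "
            f"install it with `pip install {extra}`"
        ) from exc
```

Nothing guards the call site, so on a stock install the error only surfaces at the exact point MIPROv2 first needs `optuna`.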

DSPy already declares the right version pin in its `optuna` extra, so the simplest fix is changing the declared dependency from `dspy>=3.0.0` to `dspy[optuna]>=3.0.0`. No version drift risk.
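Assuming the dependency is declared in a `pyproject.toml` (the file name and surrounding entries here are illustrative), the change is a one-line swap:

```toml
[project]
dependencies = [
    "dspy[optuna]>=3.0.0",  # was: "dspy>=3.0.0"
]
```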

2. No request timeout on LLM calls

`litellm.request_timeout` is unset by default, so every DSPy LM call inherits a no-deadline httpx client. When an upstream provider silently drops a long-lived POST without sending a TCP RST (we hit this on a corporate proxy gateway), the Python process keeps the dead socket in ESTABLISHED and waits forever. The optimization loop blocks at 0% CPU on a single hung call. We measured 14+ minutes of total stall before manually killing the run.
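This is the standard failure mode of any socket without a deadline; a minimal stdlib illustration, unrelated to litellm's internals (the silent server stands in for the gateway):

```python
import socket


def recv_with_deadline(timeout_s: float) -> bool:
    """Connect to a server that accepts the connection but never replies
    (standing in for a gateway that silently drops the request) and
    return True if the deadline fired instead of blocking forever."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    client = socket.create_connection(server.getsockname())
    client.settimeout(timeout_s)   # without this, recv() below blocks forever
    conn, _ = server.accept()      # connection established, then silence
    try:
        client.recv(1024)          # no data will ever arrive
        return False
    except socket.timeout:
        return True                # caller can now retry or fail fast
    finally:
        conn.close()
        client.close()
        server.close()
```

With no timeout set, the `recv` call simply never returns, which is exactly the 0%-CPU hang described above.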

This PR sets `litellm.request_timeout` at module import time, with an env-var override:

```python
litellm.request_timeout = float(os.environ.get("LITELLM_REQUEST_TIMEOUT", "90"))
```

90 seconds is generous for sonnet/opus reasoning tokens but short enough to detect a dead socket and let litellm's retry logic recover. Override with `LITELLM_REQUEST_TIMEOUT=N` for slow models.
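The behavior the new tests exercise (default, override, float parsing, fail-fast) can be sketched as a small helper; `resolve_request_timeout` is an illustrative name, not necessarily the PR's actual code:

```python
def resolve_request_timeout(env, default: str = "90") -> float:
    """Return the per-request LLM deadline in seconds.

    Reads LITELLM_REQUEST_TIMEOUT from the given environment mapping and
    fails fast on a non-numeric value instead of silently running with
    no deadline at all.
    """
    raw = env.get("LITELLM_REQUEST_TIMEOUT", default)
    try:
        return float(raw)
    except ValueError as exc:
        raise ValueError(
            f"LITELLM_REQUEST_TIMEOUT must be a number of seconds, got {raw!r}"
        ) from exc


# At module import time the result would be assigned once:
#   litellm.request_timeout = resolve_request_timeout(os.environ)
```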

Verification

```
$ pytest tests/ -q
143 passed, 11 warnings in 1.07s
```

(139 existing + 4 new tests covering: default 90s, env override, float parsing, fail-fast on invalid value.)

End-to-end run of `evolve_skill --skill llm-wiki-extract --eval-source synthetic --iterations 3 --optimizer-model anthropic/claude-sonnet-4-6 --eval-model anthropic/claude-sonnet-4-6` against a flaky upstream gateway now completes the full loop (baseline 35.04 → best 36.18 across 10 trials in 12 min). Before this fix the same command hung indefinitely on a silent gateway drop.

Out of scope

This PR deliberately does NOT touch the three areas that already have multiple competing PRs (among them the GEPA-side `max_steps` fix, #14, #35, #39); adding yet another attempt would just add review noise.

Why ship this separately

The two bugs here have zero overlap with any open PR or issue (verified by searching the issue tracker and reading every open PR diff). Both are install-time/robustness fixes that any of the other open PRs would benefit from having merged underneath them.

Two unrelated stock-install bugs that together prevent the optimization
loop from completing on a fresh clone:

1. **Missing optuna dependency.** `evolve_skill.py` falls back to MIPROv2
   automatically when GEPA fails to initialize (which it currently always
   does on DSPy >=3.0 — see NousResearch#14, NousResearch#35, NousResearch#39). MIPROv2 imports `optuna` at
   `_optimize_prompt_parameters` time, so the fallback crashes with
   `ModuleNotFoundError: No module named 'optuna'` immediately after
   Step 2 finishes proposing instruction candidates. Switching the
   declared dependency from `dspy>=3.0.0` to `dspy[optuna]>=3.0.0` lets
   DSPy itself manage the version pin.

2. **No request timeout on LLM calls.** litellm's default request_timeout
   is unset, so any silent connection drop from the upstream provider
   (we hit this on a corporate/proxy gateway that drops long-lived
   POSTs without a TCP RST) hangs the optimization loop indefinitely
   with the python process holding an established but dead TCP socket
   at 0% CPU. We saw the entire 10-iteration loop block for 14+ minutes
   on a single hung call before manual intervention.

   Setting `litellm.request_timeout` at module import time gives every
   DSPy LM call a per-request deadline. Default 90s (generous for
   sonnet/opus reasoning tokens, short enough to detect a dead socket).
   Override via `LITELLM_REQUEST_TIMEOUT` env var.

Verification: 143 tests pass (139 existing + 4 new tests covering the
timeout default, env override, float parsing, and fail-fast on bad
input). End-to-end run of `evolve_skill --skill llm-wiki-extract
--eval-source synthetic --iterations 3` against a flaky upstream
gateway now completes (before this fix it hung indefinitely).