fix: install missing optuna dep and add LLM request timeout #41

Open · seilk wants to merge 1 commit into NousResearch:main
Conversation
Summary
Two unrelated stock-install bugs that together prevent the optimization loop from completing on a fresh clone. Both surfaced while running `evolve_skill --eval-source synthetic --iterations 3` end-to-end.

1. Missing optuna dependency
`evolve_skill.py` falls back to MIPROv2 automatically when GEPA fails to initialize, which currently happens on every DSPy ≥3.0 install because `GEPA.__init__()` no longer accepts `max_steps` (see open PRs #14, #35, #39 for the GEPA-side fix). MIPROv2 imports `optuna` lazily inside `_optimize_prompt_parameters`, so the fallback runs through Steps 1 and 2 (bootstrap plus instruction proposal, already a few minutes of LLM calls and dollars spent) and then crashes with `ModuleNotFoundError: No module named 'optuna'`.

DSPy already declares the right version pin in its `optuna` extra, so the simplest fix is `dspy>=3.0.0` → `dspy[optuna]>=3.0.0`. No version-drift risk.

2. No request timeout on LLM calls
`litellm.request_timeout` is unset by default, so every DSPy LM call inherits a no-deadline httpx client. When an upstream provider silently drops a long-lived POST without sending a TCP RST (we hit this on a corporate proxy gateway), the Python process keeps the dead socket in `ESTABLISHED` and waits forever: the optimization loop blocks at 0% CPU on a single hung call. We measured 14+ minutes of total stall before manually killing the run.

This PR sets `litellm.request_timeout` at module import time, with an env-var override. 90 seconds is generous for sonnet/opus reasoning tokens but short enough to detect a dead socket and let litellm's retry logic recover. Override with `LITELLM_REQUEST_TIMEOUT=N` for slow models.

Verification
143 tests pass (139 existing + 4 new tests covering: default 90 s, env override, float parsing, fail-fast on invalid value).
End-to-end run of `evolve_skill --skill llm-wiki-extract --eval-source synthetic --iterations 3 --optimizer-model anthropic/claude-sonnet-4-6 --eval-model anthropic/claude-sonnet-4-6` against a flaky upstream gateway now completes the full loop (baseline 35.04 → best 36.18 across 10 trials in 12 min). Before this fix the same command hung indefinitely on a silent gateway drop.

Out of scope
This PR deliberately does NOT touch:

- the `max_steps` API mismatch ("fix: GEPA API compat, constraint validation, and skill text evolution" #14, "fix: correct GEPA arg, constraint input, and rate limit handling in evolve_skill" #35, "fix: ghost-improvement extraction bug + GEPA API + constraint validator + JSON robustness" #39 in flight)

These three areas already have multiple competing PRs; adding a fourth attempt would just add review noise.
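For context on the `max_steps` mismatch above: the failure mode is GEPA construction raising `TypeError`, which is what routes `evolve_skill.py` into the MIPROv2 fallback from item 1. A stand-alone sketch of that control flow; `GEPA`, `MIPROv2`, and `build_optimizer` here are illustrative stubs, not the real DSPy API:

```python
# Stubs standing in for the real DSPy optimizers. On DSPy >= 3.0,
# GEPA.__init__() no longer accepts max_steps, so passing it raises TypeError.
class GEPA:
    def __init__(self, metric):
        self.metric = metric

class MIPROv2:
    def __init__(self, metric):
        self.metric = metric

def build_optimizer(metric, max_steps=10):
    """Sketch of the automatic fallback path described in item 1."""
    try:
        # Unexpected keyword -> TypeError on the new API
        return GEPA(metric=metric, max_steps=max_steps)
    except TypeError:
        # Fallback path; this is the code that ends up needing optuna.
        return MIPROv2(metric=metric)

print(type(build_optimizer(metric=None)).__name__)  # prints "MIPROv2"
```

Because the `TypeError` fires on every DSPy ≥3.0 install, the fallback is effectively the only path, which is why the missing `optuna` extra bites every fresh clone.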
Why ship this separately
The two bugs here have zero overlap with any open PR or issue (verified by searching the issue tracker and reading every open PR diff). Both are install-time / robustness fixes that the other PRs would all benefit from having merged underneath them.