fix: install missing optuna dep and add LLM request timeout#41

Open
seilk wants to merge 1 commit into NousResearch:main from seilk:fix/missing-optuna-and-request-timeout

Conversation


@seilk seilk commented Apr 26, 2026

Summary

Two unrelated stock-install bugs that together prevent the optimization loop from completing on a fresh clone. Both surfaced while running `evolve_skill --eval-source synthetic --iterations 3` end-to-end.

1. Missing optuna dependency

`evolve_skill.py` falls back to MIPROv2 automatically when GEPA fails to initialize, which currently happens on every DSPy ≥3.0 install because `GEPA.__init__()` no longer accepts `max_steps` (see open PRs #14, #35, #39 for the GEPA-side fix).

MIPROv2 imports `optuna` lazily inside `_optimize_prompt_parameters`. So the fallback runs through Steps 1 and 2 (bootstrap + instruction proposal, already a few minutes of LLM calls and dollars spent) and then crashes:

```
ModuleNotFoundError: No module named 'optuna'
ImportError: MIPROv2 requires optional dependency 'optuna'.
            Install it with `pip install dspy[optuna]`.
```
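The late failure comes from the common optional-dependency pattern: the import runs only once optimization reaches the trial stage, after the expensive bootstrap and proposal steps have already spent money. A rough sketch of the pattern (`require_optional` is illustrative, not DSPy's actual helper):

```python
import importlib


def require_optional(module_name: str, extra: str):
    """Import an optional dependency lazily, raising a helpful error
    that names the pip extra providing it."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ImportError(
            f"requires optional dependency {module_name!r}; "
            f"install it with `pip install {extra}`"
        ) from exc
```

Nothing guards the call site, so on a stock install the error only surfaces at the exact point MIPROv2 first needs `optuna`.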

DSPy already declares the right version pin in its `optuna` extra, so the simplest fix is changing the declared dependency from `dspy>=3.0.0` to `dspy[optuna]>=3.0.0`. No version drift risk.
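Assuming the dependency is declared in a `pyproject.toml` (the file name and surrounding entries here are illustrative), the change is a one-line swap:

```toml
[project]
dependencies = [
    "dspy[optuna]>=3.0.0",  # was: "dspy>=3.0.0"
]
```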

2. No request timeout on LLM calls

`litellm.request_timeout` is unset by default, so every DSPy LM call inherits a no-deadline httpx client. When an upstream provider silently drops a long-lived POST without sending a TCP RST (we hit this on a corporate proxy gateway), the Python process keeps the dead socket in ESTABLISHED and waits forever. The optimization loop blocks at 0% CPU on a single hung call. We measured 14+ minutes of total stall before manually killing the run.
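This is the standard failure mode of any socket without a deadline; a minimal stdlib illustration, unrelated to litellm's internals (the silent server stands in for the gateway):

```python
import socket


def recv_with_deadline(timeout_s: float) -> bool:
    """Connect to a server that accepts the connection but never replies
    (standing in for a gateway that silently drops the request) and
    return True if the deadline fired instead of blocking forever."""
    server = socket.socket()
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    client = socket.create_connection(server.getsockname())
    client.settimeout(timeout_s)   # without this, recv() below blocks forever
    conn, _ = server.accept()      # connection established, then silence
    try:
        client.recv(1024)          # no data will ever arrive
        return False
    except socket.timeout:
        return True                # caller can now retry or fail fast
    finally:
        conn.close()
        client.close()
        server.close()
```

With no timeout set, the `recv` call simply never returns, which is exactly the 0%-CPU hang described above.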

This PR sets `litellm.request_timeout` at module import time, with an env-var override:

```python
litellm.request_timeout = float(os.environ.get("LITELLM_REQUEST_TIMEOUT", "90"))
```

90 seconds is generous for sonnet/opus reasoning tokens but short enough to detect a dead socket and let litellm's retry logic recover. Override with `LITELLM_REQUEST_TIMEOUT=N` for slow models.
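The behavior the new tests exercise (default, override, float parsing, fail-fast) can be sketched as a small helper; `resolve_request_timeout` is an illustrative name, not necessarily the PR's actual code:

```python
def resolve_request_timeout(env, default: str = "90") -> float:
    """Return the per-request LLM deadline in seconds.

    Reads LITELLM_REQUEST_TIMEOUT from the given environment mapping and
    fails fast on a non-numeric value instead of silently running with
    no deadline at all.
    """
    raw = env.get("LITELLM_REQUEST_TIMEOUT", default)
    try:
        return float(raw)
    except ValueError as exc:
        raise ValueError(
            f"LITELLM_REQUEST_TIMEOUT must be a number of seconds, got {raw!r}"
        ) from exc


# At module import time the result would be assigned once:
#   litellm.request_timeout = resolve_request_timeout(os.environ)
```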

Verification

```
$ pytest tests/ -q
143 passed, 11 warnings in 1.07s
```

(139 existing + 4 new tests covering: default 90s, env override, float parsing, fail-fast on invalid value.)

End-to-end run of `evolve_skill --skill llm-wiki-extract --eval-source synthetic --iterations 3 --optimizer-model anthropic/claude-sonnet-4-6 --eval-model anthropic/claude-sonnet-4-6` against a flaky upstream gateway now completes the full loop (baseline 35.04 → best 36.18 across 10 trials in 12 min). Before this fix the same command hung indefinitely on a silent gateway drop.

Out of scope

This PR deliberately does NOT touch the three areas that already have multiple competing PRs (among them the GEPA-side `max_steps` fix, #14, #35, #39); adding yet another attempt would just add review noise.

Why ship this separately

The two bugs here have zero overlap with any open PR or issue (verified by searching the issue tracker and reading every open PR diff). Both are install-time/robustness fixes that any of the other open PRs would benefit from having merged underneath them.

Two unrelated stock-install bugs that together prevent the optimization
loop from completing on a fresh clone:

1. **Missing optuna dependency.** `evolve_skill.py` falls back to MIPROv2
   automatically when GEPA fails to initialize (which it currently always
   does on DSPy >=3.0 — see NousResearch#14, NousResearch#35, NousResearch#39). MIPROv2 imports `optuna` at
   `_optimize_prompt_parameters` time, so the fallback crashes with
   `ModuleNotFoundError: No module named 'optuna'` immediately after
   Step 2 finishes proposing instruction candidates. Switching the
   declared dependency from `dspy>=3.0.0` to `dspy[optuna]>=3.0.0` lets
   DSPy itself manage the version pin.

2. **No request timeout on LLM calls.** litellm's default request_timeout
   is unset, so any silent connection drop from the upstream provider
   (we hit this on a corporate/proxy gateway that drops long-lived
   POSTs without a TCP RST) hangs the optimization loop indefinitely
   with the python process holding an established but dead TCP socket
   at 0% CPU. We saw the entire 10-iteration loop block for 14+ minutes
   on a single hung call before manual intervention.

   Setting `litellm.request_timeout` at module import time gives every
   DSPy LM call a per-request deadline. Default 90s (generous for
   sonnet/opus reasoning tokens, short enough to detect a dead socket).
   Override via `LITELLM_REQUEST_TIMEOUT` env var.

Verification: 143 tests pass (139 existing + 4 new tests covering the
timeout default, env override, float parsing, and fail-fast on bad
input). End-to-end run of `evolve_skill --skill llm-wiki-extract
--eval-source synthetic --iterations 3` against a flaky upstream
gateway now completes (before this fix it hung indefinitely).