
fix: integrate critical self-evolution pipeline fixes #42

Open

innoscoutpro wants to merge 25 commits into NousResearch:main from innoscoutpro:innoscout/integrate-critical-fixes

Conversation

@innoscoutpro

Summary

Consolidates the focused, mergeable fixes from the open PR backlog into one tested integration branch.

This PR incorporates/adapts work from the upstream PRs credited in the individual commit messages below (pr/19, pr/21, pr/25, pr/26, among others).

What changed

  • Uses dspy[optuna] and configures a default LiteLLM request timeout to avoid hung optimization runs.
  • Adds Claude Code project transcript import support.
  • Fixes skill constraint validation so frontmatter validation runs on a reassembled full SKILL.md, while size and growth limits still apply to the mutable body.
  • Makes skill text part of the optimizable instruction surface instead of a dead instance attribute.
  • Updates GEPA usage from the removed max_steps API to max_metric_calls.
  • Replaces naive keyword-overlap-only scoring with a deterministic multi-signal metric that returns feedback for GEPA.
  • Adds an explicit no-op gate: a run is only successful if score improves and the evolved artifact differs from baseline (see the sketch after this list).
  • Adds MiniMax provider support via EvolutionConfig.make_lm() and CLI/provider plumbing.
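
A minimal sketch of the no-op gate, using illustrative names rather than the branch's actual function:

    def passes_noop_gate(baseline_score: float, evolved_score: float,
                         baseline_text: str, evolved_text: str) -> bool:
        """A run only counts as successful if the score improved AND
        the evolved artifact actually differs from the baseline."""
        improved = evolved_score > baseline_score
        changed = evolved_text.strip() != baseline_text.strip()
        return improved and changed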

Issues addressed

Closes #10
Closes #11
Closes #12
Closes #34
Closes #38

Partially addresses #33 by improving fitness signal, mutation behavior, dependency/runtime reliability, and no-op gating. Remaining #33 items should be split into follow-ups: benchmark gate, PR proposal/deployment path, and true E2E evolution test.

Helps with #37 by reducing the PR backlog into a single tested integration branch.

Test plan

    python3 -m venv .venv
    . .venv/bin/activate
    python -m pip install -U pip setuptools wheel
    python -m pip install -e '.[dev]'
    pytest -q

Result locally:

177 passed, 11 warnings

Notes

I intentionally did not merge every open PR as-is. Several are duplicates, superseded, stacked, or too broad/generated-artifact-heavy. This branch takes the smallest coherent set that closes the core technical blockers and keeps the suite green.

octo-patch and others added 25 commits April 6, 2026 23:16
- Add MiniMax chat model provider via OpenAI-compatible endpoint
- Add MINIMAX_API_KEY and MINIMAX_BASE_URL config fields to EvolutionConfig
- Add make_lm() helper to EvolutionConfig that routes MiniMax models to
  https://api.minimax.io/v1 with correct temperature (1.0, required by MiniMax)
- Support bare model IDs and prefixed forms (minimax/, openai/)
- Add --use-minimax CLI flag to evolve_skill.py for easy MiniMax selection
- Update dataset_builder.py and fitness.py to use config.make_lm()
- Add 16 unit tests covering MiniMax config and LM routing
- Document MiniMax usage in README

Supported models: MiniMax-M2.7, MiniMax-M2.7-highspeed
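
A minimal sketch of the routing described above; the real `EvolutionConfig.make_lm()` is in this branch, so the helper below is only illustrative (the function shape and config fields are assumptions):

    import dspy

    MINIMAX_BASE_URL = "https://api.minimax.io/v1"

    def make_lm(model: str, minimax_api_key: str) -> dspy.LM:
        # Accept bare IDs ("MiniMax-M2.7") and prefixed forms ("minimax/...", "openai/...").
        bare = model.split("/", 1)[-1]
        if bare.lower().startswith("minimax"):
            return dspy.LM(
                f"openai/{bare}",            # OpenAI-compatible endpoint route
                api_base=MINIMAX_BASE_URL,
                api_key=minimax_api_key,
                temperature=1.0,             # required by MiniMax
            )
        return dspy.LM(model)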
Three bug fixes that prevent the pipeline from running:

1. dataset_builder: LLM returns Python-style dicts (single quotes), not
   valid JSON. Added ast.literal_eval fallback + trailing comma fix so
   synthetic dataset generation doesn't crash on parse.

2. evolve_skill: GEPA API changed in DSPy 3.1.3 — max_steps is now
   max_metric_calls. Fixed the call and added auto='light'.

3. constraints: _check_skill_structure was checking the skill BODY for
   YAML frontmatter, which it never has after splitting. Rewrote to
   validate body structure (headings, procedural content, substance).

One architectural improvement:

4. skill_module: Skill text was passed as an input field, so the
   optimizer could never mutate it. Restructured to embed skill text
   in the instruction template via with_instructions(), allowing
   MIPROv2/GEPA to propose improved skill bodies. Updated extraction
   logic in evolve_skill.py to pull evolved text from the compiled
   predictor's instruction.
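
A minimal sketch of the restructuring in item 4; the signature and field names are illustrative, not the module's actual ones:

    import dspy

    class ApplySkill(dspy.Signature):
        """Placeholder instructions; replaced via with_instructions()."""
        task_input: str = dspy.InputField()
        response: str = dspy.OutputField()

    def build_skill_module(skill_text: str) -> dspy.Predict:
        # Embedding the skill body in signature.instructions makes it part of
        # the optimizable instruction surface; as an InputField it was frozen
        # data the optimizer could never mutate.
        sig = ApplySkill.with_instructions(
            "Apply the following skill when answering:\n\n" + skill_text
        )
        return dspy.Predict(sig)

    # After compilation, the evolved skill text is read back from the compiled
    # predictor's signature.instructions (the extraction path in evolve_skill.py).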
Replaces the single keyword-overlap scorer with a weighted composite of
five independent signals that spread scores across a much wider range:

1. Keyword overlap (25%) - stop-word filtered, F1-style blend
2. Character 3-gram similarity (25%) - Jaccard on char shingles
3. Structural pattern matching (20%) - code blocks, lists, headers
4. Length quality (15%) - proportional to expected output length
5. Content density (15%) - unique token ratio, avg token length, variety

Also:
- Returns dspy.Prediction(score=float, feedback=str) for GEPA
  reflective mutation compatibility
- Feedback string highlights specific weaknesses for optimizer use
- All scoring is deterministic (no LLM calls) for speed during
  optimization loops

Fixes NousResearch#12
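
A condensed sketch of the composite's shape. The weights and the `dspy.Prediction(score=..., feedback=...)` return come from this commit; the signal implementations and the `gold.response` / `pred.response` field names are illustrative:

    import dspy

    STOP = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it"}

    def keyword_overlap(expected, actual):
        e = {t for t in expected.lower().split() if t not in STOP}
        a = {t for t in actual.lower().split() if t not in STOP}
        if not e or not a:
            return 0.0
        p, r = len(e & a) / len(a), len(e & a) / len(e)
        return 2 * p * r / (p + r) if p + r else 0.0      # F1-style blend

    def char_3gram(expected, actual):
        shingle = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
        e, a = shingle(expected), shingle(actual)
        return len(e & a) / len(e | a) if e | a else 0.0  # Jaccard on shingles

    def structure_match(expected, actual):
        markers = ("```", "\n- ", "\n#")                  # code blocks, lists, headers
        return sum((m in expected) == (m in actual) for m in markers) / len(markers)

    def length_quality(expected, actual):
        if not expected or not actual:
            return 0.0
        return min(len(actual) / len(expected), len(expected) / len(actual))

    def content_density(expected, actual):
        toks = actual.lower().split()
        return len(set(toks)) / len(toks) if toks else 0.0

    WEIGHTED = [(keyword_overlap, 0.25), (char_3gram, 0.25), (structure_match, 0.20),
                (length_quality, 0.15), (content_density, 0.15)]

    def multi_signal_metric(gold, pred, trace=None):
        expected, actual = gold.response, pred.response
        parts = {fn.__name__: fn(expected, actual) for fn, _ in WEIGHTED}
        score = sum(w * parts[fn.__name__] for fn, w in WEIGHTED)
        weakest = min(parts, key=parts.get)
        feedback = f"weakest signal: {weakest}={parts[weakest]:.2f}"
        return dspy.Prediction(score=float(score), feedback=feedback)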
…jects

Claude Code stores rich session transcripts (user prompts + assistant
responses + tool calls) at ~/.claude/projects/<encoded-cwd>/<id>.jsonl.
The previous ClaudeCodeImporter only read ~/.claude/history.jsonl, which
is a flat log of *user prompts only* — no assistant responses.

That meant Claude Code was the only sessiondb source that produced
unpaired examples, while Copilot and Hermes both yielded
(task_input, assistant_response) pairs. Downstream consumers
(RelevanceFilter, build_dataset_from_external) already plumb
assistant_response through, so the data shape gap was the only blocker.

Changes:
- Extend ClaudeCodeImporter with PROJECTS_DIR + _extract_from_projects.
- Add _parse_claude_code_session helper, mirroring _parse_copilot_events.
  Handles user/assistant interleaving, tool_use/tool_result skipping,
  multi-block assistant turns, malformed JSON, and secret redaction.
- New `source` arg on extract_messages: "auto" (default, prefers
  projects/, falls back to history.jsonl), "projects", or "history".
- Existing tests updated to pass `source="history"` (now explicit), plus
  12 new tests covering pair extraction, tool-result skipping, secret
  filtering, multi-session walking, limits, and auto fallback.

Verified on real ~/.claude/projects/ data: yields paired examples with
both task_input and assistant_response fields.

Closes the data-quality gap noted in NousResearch#3 for Claude Code users.
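
A minimal sketch of the pairing logic over a single projects/ session file. The JSONL field names below are assumptions about Claude Code's transcript format, not verbatim from the importer:

    import json
    from pathlib import Path

    def parse_session(path: Path) -> list[dict]:
        """Pair each user prompt with the next assistant text turn."""
        pairs, pending_user = [], None
        for line in path.read_text().splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue                      # malformed lines are skipped, not fatal
            msg = event.get("message") or {}
            role, content = msg.get("role"), msg.get("content")
            if role == "user" and isinstance(content, str):
                pending_user = content
            elif role == "assistant" and pending_user is not None:
                if isinstance(content, str):
                    blocks = [{"type": "text", "text": content}]
                elif isinstance(content, list):
                    blocks = content
                else:
                    continue
                # Assistant turns can be multi-block; keep text, skip tool_use/tool_result.
                text = "\n".join(b.get("text", "") for b in blocks
                                 if isinstance(b, dict) and b.get("type") == "text")
                if text.strip():
                    pairs.append({"task_input": pending_user, "assistant_response": text})
                    pending_user = None
        return pairs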
Two unrelated stock-install bugs that together prevent the optimization
loop from completing on a fresh clone:

1. **Missing optuna dependency.** `evolve_skill.py` falls back to MIPROv2
   automatically when GEPA fails to initialize (which it currently always
   does on DSPy >=3.0 — see NousResearch#14, NousResearch#35, NousResearch#39). MIPROv2 imports `optuna` at
   `_optimize_prompt_parameters` time, so the fallback crashes with
   `ModuleNotFoundError: No module named 'optuna'` immediately after
   Step 2 finishes proposing instruction candidates. Switching the
   declared dependency from `dspy>=3.0.0` to `dspy[optuna]>=3.0.0` lets
   DSPy itself manage the version pin.

2. **No request timeout on LLM calls.** litellm's default request_timeout
   is unset, so any silent connection drop from the upstream provider
   (we hit this on a corporate/proxy gateway that drops long-lived
   POSTs without a TCP RST) hangs the optimization loop indefinitely
   with the python process holding an established but dead TCP socket
   at 0% CPU. We saw the entire 10-iteration loop block for 14+ minutes
   on a single hung call before manual intervention.

   Setting `litellm.request_timeout` at module import time gives every
   DSPy LM call a per-request deadline. Default 90s (generous for
   sonnet/opus reasoning tokens, short enough to detect a dead socket).
   Override via `LITELLM_REQUEST_TIMEOUT` env var.
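
   A minimal sketch of the import-time setup (the env var name is from this commit; the fail-fast parsing mirrors the new tests):

    import os
    import litellm

    _DEFAULT_TIMEOUT = 90.0   # generous for reasoning tokens, short enough to catch dead sockets

    _raw = os.environ.get("LITELLM_REQUEST_TIMEOUT")
    try:
        # Set at module import time: every subsequent DSPy LM call inherits this deadline.
        litellm.request_timeout = float(_raw) if _raw is not None else _DEFAULT_TIMEOUT
    except ValueError:
        raise SystemExit(f"LITELLM_REQUEST_TIMEOUT must be a number, got {_raw!r}")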

Verification: 143 tests pass (139 existing + 4 new tests covering the
timeout default, env override, float parsing, and fail-fast on bad
input). End-to-end run of `evolve_skill --skill llm-wiki-extract
--eval-source synthetic --iterations 3` against a flaky upstream
gateway now completes (before this fix it hung indefinitely).
# Conflicts:
#	evolution/skills/evolve_skill.py
# Conflicts:
#	evolution/skills/evolve_skill.py
#	tests/skills/test_evolve_skill.py
… output/

- Pin upper bounds on dspy/openai/pyyaml/click/rich to fence against API
  breaks and supply-chain auto-bumps (security review M4).
- Move pytest/pytest-asyncio to bounded ranges in dev extras.
- New optional [report] extra for generate_report.py's reportlab dep
  (audit C1.M4 — the import would fail without it on a stock install).
- Add output/ and proposals/ to .gitignore (security review C1) — runs
  may write derived skill bodies and metrics that include paraphrased
  session content; never auto-commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…rmes_repo

Addresses security review findings C2, H1, H2, H5, M1.

evolution/core/external_importers.py
- SECRET_PATTERNS: add gho_/ghs_/ghr_, GitLab glpat-, all Slack token
  prefixes (xoxp/xoxa/xoxr/xoxs/xapp/xoxb), AWS ASIA, Google AIza,
  Stripe live/test variants (rk_/pk_), Twilio, SendGrid, Mailgun, JWT
  3-part, all-algo private-key headers, MINIMAX_API_KEY, REDIS_URL,
  HF_TOKEN. Generic api_key/secret/token/credential assignment patterns.
  Existing test cases (177) still pass — patterns relaxed where the test
  suite expected loose matching (short tokens, bare PRIVATE KEY).
- New scrub_secrets(text) helper for defence-in-depth scanning of
  outputs the model may have paraphrased into secret-shaped strings.
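
A minimal sketch of the scrub_secrets shape, with a small illustrative subset of patterns (the branch's SECRET_PATTERNS list is far longer):

    import re

    SECRET_PATTERNS = [
        re.compile(r"\bgh[opsr]_[A-Za-z0-9]{20,}\b"),               # GitHub token prefixes
        re.compile(r"\bAIza[0-9A-Za-z_\-]{30,}\b"),                 # Google API keys
        re.compile(r"\beyJ[\w\-]+\.[\w\-]+\.[\w\-]+\b"),            # 3-part JWT
        re.compile(r"(?i)\b(api_key|secret|token|credential)\s*[:=]\s*\S+"),  # generic assignment
    ]

    def scrub_secrets(text: str, replacement: str = "[REDACTED]") -> str:
        """Defence-in-depth pass over text the model may have paraphrased."""
        for pattern in SECRET_PATTERNS:
            text = pattern.sub(replacement, text)
        return text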

evolution/skills/skill_module.py
- find_skill rejects skill names containing path separators or shell
  metachars (^[A-Za-z0-9_.-]+$ guard) — closes ../traversal vector.
- find_skill resolves and refuses any SKILL.md whose real path lies
  outside the skills/ tree (symlink-escape protection, H5).
- Add SkillModule(treat_as_untrusted=True) preamble that tells the
  optimizer to treat skill body as DATA, not commands. Mitigates
  prompt-injection from third-party transcripts (C2).
- Switch body delimiter from "\n\n---\n" to HTML-comment sentinels
  (HERMES_SKILL_BODY_START/END) so bodies containing markdown horizontal
  rules survive extraction (forward-port of upstream PR NousResearch#39 idea).
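
A minimal sketch of the find_skill guard layers (the name regex is from the commit; `skills_root` and the return convention are assumptions):

    import re
    from pathlib import Path

    _NAME_RE = re.compile(r"^[A-Za-z0-9_.-]+$")

    def find_skill(name: str, skills_root: Path) -> Path | None:
        # Layer 1: reject path separators and shell metacharacters outright.
        if not _NAME_RE.match(name):
            return None
        candidate = skills_root / name / "SKILL.md"
        if not candidate.exists():
            return None
        # Layer 2: resolve symlinks; refuse real paths outside the skills/ tree (H5).
        real = candidate.resolve()
        try:
            real.relative_to(skills_root.resolve())
        except ValueError:
            return None
        return real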

evolution/core/constraints.py
- run_test_suite(hermes_repo) now resolves the path, then refuses to
  invoke pytest unless pyproject.toml + tests/ exist and pyproject
  references hermes-agent. Pytest auto-loads conftest.py, so pointing
  at an untrusted tree was equivalent to RCE (M1).
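
A minimal sketch of that pre-flight check (pytest auto-imports conftest.py, so the target tree must be validated before invocation):

    import subprocess
    from pathlib import Path

    def run_test_suite(hermes_repo: str) -> bool:
        repo = Path(hermes_repo).resolve()
        pyproject = repo / "pyproject.toml"
        # Refuse anything that isn't plausibly the hermes-agent tree (M1).
        if not pyproject.is_file() or not (repo / "tests").is_dir():
            return False
        if "hermes-agent" not in pyproject.read_text():
            return False
        return subprocess.run(["pytest", "-q"], cwd=repo).returncode == 0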

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Addresses upstream issue NousResearch#33 (C1: keyword-only metric) and forward-ports
the more polished pieces of upstream PR NousResearch#25 and PR NousResearch#39 partial.

evolution/core/fitness.py
- Replace conciseness dimension with completeness — judges should
  penalise omissions, not reward brevity. Composite weight now
  0.4 correctness + 0.3 procedure + 0.3 completeness.
- New init_fitness_metric(config, skill_text, use_llm_judge=True) /
  reset_fitness_metric() pair. When use_llm_judge=True, an LLMJudge
  with the completeness rubric is the primary scorer; the deterministic
  multi-signal scorer becomes the fallback. When False (default), the
  metric stays purely deterministic and zero-cost — appropriate for
  fast iteration and for runs the user doesn't want to send to a judge.
- skill_fitness_metric accepts the 5-arg GEPA signature
  (gold, pred, trace, pred_name, pred_trace) so it works with both
  GEPA and the legacy 3-arg metric API.
- Judge failures fall through to deterministic with a "[judge
  unavailable: <ExceptionClass>]" prefix in feedback so users can see
  why scores look heuristic mid-run.
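
A minimal sketch of the dual-signature metric with the judge fallback (helper and module-state names are illustrative):

    import dspy

    _judge = None   # set by init_fitness_metric(..., use_llm_judge=True); None = deterministic only

    def _deterministic_score(gold, pred) -> dspy.Prediction:
        ok = float(gold.response.strip() == pred.response.strip())   # stand-in for the multi-signal scorer
        return dspy.Prediction(score=ok, feedback="deterministic multi-signal score")

    def skill_fitness_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
        """Works with GEPA's 5-arg signature and the legacy 3-arg metric API."""
        if _judge is None:
            return _deterministic_score(gold, pred)
        try:
            return _judge(gold, pred)
        except Exception as exc:
            det = _deterministic_score(gold, pred)
            det.feedback = f"[judge unavailable: {type(exc).__name__}] {det.feedback}"
            return det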

evolution/core/dataset_builder.py
- Replace inline 3-strategy JSON recovery with a 6-strategy
  _try_parse_json_list helper: direct json, ast.literal_eval (safer
  than eval, and still parses Python-literal single-quoted dicts),
  array-extraction-then-parse, ast.literal_eval on extracted candidate,
  trailing-comma-and-quote-fix, markdown-fence stripping, and a
  last-resort per-block scan. Returns None instead of raising so the
  caller can produce a useful error.
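
A condensed sketch of the recovery ladder, in the strategy order the commit describes (regexes are illustrative):

    import ast
    import json
    import re

    def try_parse_json_list(raw: str):
        """Try progressively looser parses; return None instead of raising."""
        candidates = [raw.strip()]
        candidates.append(re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip()))  # fence stripping
        extracted = re.search(r"\[.*\]", raw, re.DOTALL)                         # array extraction
        if extracted:
            candidates.append(extracted.group(0))
        for cand in candidates:
            for parse in (json.loads, ast.literal_eval):   # literal_eval handles single-quoted dicts
                try:
                    out = parse(cand)
                    if isinstance(out, list):
                        return out
                except (ValueError, SyntaxError):
                    pass
            try:
                # Trailing-comma fix: ",]" and ",}" are common LLM artifacts.
                out = json.loads(re.sub(r",\s*([\]}])", r"\1", cand))
                if isinstance(out, list):
                    return out
            except ValueError:
                pass
        return None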

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…estration

Closes the remaining audit items from upstream issue NousResearch#33 and applies the
must-fix orchestration findings from the local code+security review.

evolution/core/benchmark_gate.py (new)
- BenchmarkGate.run_all() returns one BenchmarkResult per enabled gate;
  TBLite runner is a deliberate stub (skipped=True with explanatory
  message) until hermes-agent batch_runner is wired in. Closes NousResearch#33 C3
  structurally — config flag now produces a real, observable result
  instead of being silently ignored.

evolution/skills/evolve_skill.py (rewrite)
- Wire --run-tests: validator.run_test_suite() is called after
  constraint validation; failure rejects the variant. Closes NousResearch#33 C2.
- Wire --create-pr: writes a proposal bundle to
  output/proposals/<skill>/<ts>/ (baseline_skill.md, evolved_skill.md,
  metrics.json, decision.json, diff.patch). Filesystem-only — never
  runs git ops or pushes. Closes NousResearch#33 H1.
- --use-minimax sets the MiniMax model only as a *default* now; user-
  supplied --optimizer-model / --eval-model / --judge-model always win.
  Closes security review H3 (jurisdictional surprise).
- New --consent-external-ingest flag is required for --eval-source
  sessiondb. The pipeline aborts with a red banner otherwise. Enforced
  before --dry-run so users learn about the requirement during setup
  validation. Closes security review H4.
- New --use-llm-judge flag opts in to LLMJudge scoring; default stays
  on the deterministic multi-signal metric.
- New --judge-model flag — defaults to --eval-model when unset.
  Resolves judge_model inconsistency between MiniMax / non-MiniMax
  paths flagged in code review.
- Body extraction uses HTML-comment sentinels and distinguishes
  "no-op" (evolved == baseline) from "extraction failed" (optimizer
  returned no usable output); see the sketch after this list. Closes
  code review NousResearch#6.
- GEPA is constructed with max_metric_calls only (no auto="light"
  conflict). Except clause narrowed to TypeError/AttributeError/
  ImportError so genuine fitness errors bubble. Closes code review NousResearch#4.
- scrub_secrets() runs over the evolved body before write — defence-
  in-depth against the model paraphrasing a leaked secret. Closes
  security review H2.
- Output paths now use config.output_dir consistently; --output-dir
  CLI flag added. Closes code review NousResearch#10.
- Cleaned up dead imports (Panel, FitnessScore, get_hermes_agent_path,
  and the never-referenced `import re as _re`).
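
A minimal sketch of the sentinel extraction and the no-op / extraction-failure distinction (exact sentinel comment formatting is an assumption):

    import re

    START = "<!-- HERMES_SKILL_BODY_START -->"
    END = "<!-- HERMES_SKILL_BODY_END -->"
    _BODY_RE = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)

    def extract_evolved_body(instruction: str, baseline_body: str):
        """Return (body, status) where status is 'ok', 'noop', or 'extraction_failed'."""
        m = _BODY_RE.search(instruction)
        if m is None:
            return None, "extraction_failed"   # optimizer returned no usable output
        body = m.group(1).strip()
        if body == baseline_body.strip():
            return body, "noop"                # ran fine, but nothing changed
        return body, "ok"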

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
177 → 237 passing tests after this commit.

tests/core/test_fitness.py (new)
- Multi-signal scorer: empty/whitespace output, identical text, unrelated
  text, prediction shape, keyword overlap edge cases (identical,
  disjoint, empty expected, stop-word filtering), char n-gram boundaries,
  structural match recall + noise penalty, length-quality boundaries,
  content density, score parsing.
- LLMJudge wiring: uninitialized → deterministic, judge failure → fallback
  with "[judge unavailable: ...]" feedback flag, judge success →
  composite (0.4/0.3/0.3 weighting).
- FitnessScore composite + length-penalty arithmetic.

tests/core/test_benchmark_gate.py (new)
- Disabled config returns no results.
- Enabled config returns one result with skipped=True / passed=True /
  threshold populated and a "not implemented" message.
- Display message formatting for skipped / passed / failed states.

tests/core/test_security_hardening.py (new)
- find_skill rejects path-traversal names, path separators, shell
  metacharacters; accepts legit names; refuses symlink escape from
  skills tree; returns None for missing skills/.
- scrub_secrets redacts Anthropic / JWT / password assignments;
  preserves innocent prose; honours custom replacement.
- SkillModule wraps body with untrusted-data preamble; sentinels are
  always present; bodies containing markdown horizontal rules survive
  round-trip extraction.
- run_test_suite refuses non-existent paths, unrelated projects
  (pyproject doesn't reference hermes-agent), and missing pyproject.toml.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… status

- New "How it integrates with hermes-agent" section answers upstream
  issue NousResearch#18 explicitly: discovery (HERMES_AGENT_REPO / ~/.hermes /
  sibling), the 6-step pipeline, and the deployment story (proposal
  bundle → human review → manual copy or PR; never auto-merge).
- New Privacy & Security section: documents what each --eval-source
  sends where, the --consent-external-ingest gate, the --use-minimax
  precedence rule (user models win), what the secret detector does
  and does not catch, the output/ gitignore caveat, the
  untrusted-data preamble, and the run_test_suite path validation.
- Phase 2-5 table now marks the empty package directories as
  "Stub only" instead of "Planned" — evolution/tools/, prompts/,
  code/, monitor/ each contain only an empty __init__.py and the
  prior wording was misleading.
- Updated Quick Start with --use-llm-judge, --consent-external-ingest,
  --run-tests, --create-pr examples.
- Guardrails section rewritten to reflect what is actually enforced
  (structural integrity on full SKILL.md, body-only size/growth, no-op
  vs extraction-failure distinction, optional pytest gate, optional
  benchmark gate stub, proposal-bundle-only deployment).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
HIGH:
- consent gate on standalone external_importers CLI (was bypassable)
- _load_skill_text validates skill name against [A-Za-z0-9_.-]+
- --source-project filter limits Claude Code mining to one project
- PII_PATTERNS scrub emails, IPs, phone numbers, SSNs
- detailed consent warning enumerates actual data categories sent

MEDIUM:
- repr=False on minimax_api_key to prevent leaks via repr/logs
- validate_model_string() rejects URLs and unknown provider prefixes
- 9 new secret patterns: OPENSSH key, Databricks, DigitalOcean, npm,
  PyPI, Vault, Telegram, Supabase, Vercel + connection strings
  (postgres://, mysql://, mongodb://, redis://, amqp://) with creds
- _check_prompt_injection scans evolved text for known patterns
  (ignore previous, exfiltrate, reveal system prompt, ...); see the
  sketch after this list
- _warn_stale_datasets warns when JSONL files older than 7 days
- MiniMax jurisdiction note in consent text when --use-minimax active
- litellm pinned explicitly (>=1.50.0,<2)
- requirements.lock for reproducible builds
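
A minimal sketch of the injection scan referenced above, with an illustrative subset of the known-pattern list:

    import re

    INJECTION_PATTERNS = [
        re.compile(r"(?i)ignore (all )?previous instructions"),
        re.compile(r"(?i)reveal (the )?system prompt"),
        re.compile(r"(?i)exfiltrate"),
    ]

    def check_prompt_injection(evolved_text: str) -> list[str]:
        """Return the patterns that fired; an empty list means clean."""
        return [p.pattern for p in INJECTION_PATTERNS if p.search(evolved_text)]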

48 new tests in tests/core/test_audit_fixes.py covering every fix.
Full suite: 282 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
pr/26 (Kevin Laithwaite):
- evolution/core/codex_lm.py — DSPy LM wrapping `codex exec --json` for
  GPT-5.4 via ChatGPT Plus OAuth (no API key needed)
- RelevanceFilter three-stage pipeline: LLM-expanded keywords (one call,
  30-50 synonyms/phrases) → substring pre-filter on full corpus → LLM
  scoring. Raises candidate budget from 3x to 8x to let the LLM reject
  borderline cases. Verified upstream: +160% relevant examples found
  (17→44 from 2590-message corpus) and downstream +11.5% holdout gain.

pr/25 (Eric):
- evolution/monitor/progress.py — SQLite-backed run tracker at
  ~/.hermes/evolution_progress.db with start_run/log_event/complete_run/
  fail_run/get_active_run/get_run_history/get_run_events.
- evolve_skill.py wires hooks at: start, skill load, dataset built,
  optimization complete, validation fail, run complete.

pr/19 (Hermes Bot):
- EvolutionConfig.api_base + api_key for vLLM/Ollama/LiteLLM-compatible
  endpoints. Forwarded to dspy.LM only when set; MiniMax routing
  unchanged. api_key uses repr=False for the same leak protection as
  minimax_api_key.
- --api-base / --api-key CLI flags on evolve_skill.

Skipped:
- pr/23 — already fixed differently via validate_skill_constraints in
  evolve_skill.py:58-82 (runs _check_skill_structure on full reassembled
  skill); pr/23's body-only rewrite would conflict.
- pr/25 pick_skill.py — author-specific hardcoded paths, not portable.
- pr/22 — ChatGPT OAuth backend overlaps with CodexLM; defer.
- pr/20, pr/21 — MAD scoring; defer to a separate session.

26 new tests in tests/core/test_upstream_integration.py.
Full suite: 308 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
pr/21 (nopenotagain):
- evolution/core/mad_scoring.py — pure-math Median Absolute Deviation
  utilities (compute_mad, compute_confidence, ConfidenceResult, ...).
- Holdout eval now reports MAD confidence on per-example deltas as a
  statistical sanity check on whether improvement is real:
    confidence = |mean_delta| / MAD(deltas)
    >= 2.0x → "likely real"
    >= 1.0x → "marginal"
    <  1.0x → "within noise"
  Surfaced in the results table and persisted to metrics.json
  (mad_confidence, mad_delta, mad_label).
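
The arithmetic above in runnable form (function names follow the commit; the zero-MAD edge-case handling is an assumption):

    import statistics

    def compute_mad(values: list[float]) -> float:
        med = statistics.median(values)
        return statistics.median([abs(v - med) for v in values])

    def compute_confidence(deltas: list[float]) -> tuple[float, str]:
        """confidence = |mean_delta| / MAD(deltas), labelled per the thresholds above."""
        mad = compute_mad(deltas)
        if mad == 0.0:
            return float("inf"), "likely real"
        conf = abs(statistics.fmean(deltas)) / mad
        label = ("likely real" if conf >= 2.0
                 else "marginal" if conf >= 1.0
                 else "within noise")
        return conf, label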

Skipped from pr/21:
- hermes_judge.py — overlaps with our LLMJudge + adds subprocess complexity
- gpt-5.4 default hijack — kept our gpt-4.1 defaults
- Nous API auto-discovery from ~/.hermes/auth.json — too opinionated
- Multi-trial mad_fitness_metric wiring into the optimizer — adds n_trials×
  LLM cost on every metric call. Post-hoc on holdout deltas gives the same
  statistical signal at a fraction of the cost.

Also defers pr/22 ChatGPT OAuth backend — overlaps with CodexLM (pr/26)
which already provides ChatGPT-via-OAuth via the codex CLI.

19 new MAD math tests in tests/core/test_mad_scoring.py.
Full suite: 327 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Phase 2 of self-evolution. Optimizes the description field of a tool
schema so the agent picks the right tool more reliably for a given task.

Architecture mirrors Phase 1 (skill evolution):

- evolution/tools/tool_module.py — ToolModule wraps a description as the
  optimizable parameter (signature.instructions). Forward pass returns a
  binary "yes/no" tool-pick decision. HTML sentinel delimiters survive
  GEPA wrapper rewrites and markdown horizontal rules. Untrusted-data
  preamble guards against prompt injection in third-party tool registries.
  load_tool_definition validates JSON schema + tool name regex.

- evolution/tools/dataset.py — ToolDatasetBuilder generates contrastive
  synthetic data: positive tasks (where the tool fits) and negative tasks
  (where it does not). Polarity is encoded in EvalExample.category so the
  existing dataset/split infrastructure carries through. Rejects datasets
  with no positives or no negatives — fitness needs contrast to be
  informative.

- evolution/tools/fitness.py — Contrastive metric. Score = 1.0 for
  correct decision (yes on positive / no on negative), else 0.0. Returns
  GEPA-compatible (score, feedback) Prediction with rationale-aware
  feedback so reflective optimization gets actionable signal.
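
  A minimal sketch of the contrastive metric (the 1.0/0.0 scoring and the GEPA-compatible return are from this commit; field names like `category` and `decision` are assumptions):

    import dspy

    def tool_fitness_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
        """Score 1.0 for a correct yes/no pick, 0.0 otherwise, with rationale-aware feedback."""
        expected = "yes" if gold.category == "positive" else "no"   # polarity via EvalExample.category
        decision = pred.decision.strip().lower()
        score = 1.0 if decision.startswith(expected) else 0.0
        if score:
            feedback = f"correct: picked '{decision}' on a {gold.category} task"
        else:
            rationale = getattr(pred, "rationale", "n/a")
            feedback = f"wrong: picked '{decision}', expected '{expected}'; rationale: {rationale}"
        return dspy.Prediction(score=score, feedback=feedback)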

- evolution/tools/evolve_tool.py — CLI mirroring evolve_skill.py:
  --tool-def, --iterations, --optimizer-model, --eval-model, --use-minimax,
  --dry-run, --create-pr, --output-dir, --api-base, --api-key. Reuses
  ConstraintValidator (artifact_type="tool_description"), MAD confidence,
  scrub_secrets, progress tracker, and the proposal-bundle pattern.

Tool definition format is a small JSON file:
    {"name": "search_files", "description": "...", "parameters": {...}}

Constraints enforced on evolved descriptions:
  - max_tool_desc_size (default 500 chars; sent on every API call so
    must stay small)
  - max_prompt_growth (+20%) when compared to baseline
  - non-empty
  - prompt_injection scan
  (size/growth already handled by ConstraintValidator from Phase 1.)

39 new tests in tests/tools/ covering tool_module, dataset, and fitness.
Full suite: 366 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>