
fix: integrate critical self-evolution pipeline fixes #42

Open

innoscoutpro wants to merge 25 commits into NousResearch:main from innoscoutpro:innoscout/integrate-critical-fixes

Conversation

@innoscoutpro

Summary

Consolidates the focused, mergeable fixes from the open PR backlog into one tested integration branch.

This PR incorporates/adapts work from the upstream PRs credited in the individual commit messages below (pr/19, pr/21, pr/25, pr/26, among others).

What changed

  • Uses dspy[optuna] and configures a default LiteLLM request timeout to avoid hung optimization runs.
  • Adds Claude Code project transcript import support.
  • Fixes skill constraint validation so frontmatter validation runs on a reassembled full SKILL.md, while size and growth limits still apply to the mutable body.
  • Makes skill text part of the optimizable instruction surface instead of a dead instance attribute.
  • Updates GEPA usage from the removed max_steps API to max_metric_calls.
  • Replaces naive keyword-overlap-only scoring with a deterministic multi-signal metric that returns feedback for GEPA.
  • Adds an explicit no-op gate: a run is only successful if score improves and the evolved artifact differs from baseline (see the sketch after this list).
  • Adds MiniMax provider support via EvolutionConfig.make_lm() and CLI/provider plumbing.
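
A minimal sketch of the no-op gate, using illustrative names rather than the branch's actual function:

    def passes_noop_gate(baseline_score: float, evolved_score: float,
                         baseline_text: str, evolved_text: str) -> bool:
        """A run only counts as successful if the score improved AND
        the evolved artifact actually differs from the baseline."""
        improved = evolved_score > baseline_score
        changed = evolved_text.strip() != baseline_text.strip()
        return improved and changed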

Issues addressed

Closes #10
Closes #11
Closes #12
Closes #34
Closes #38

Partially addresses #33 by improving fitness signal, mutation behavior, dependency/runtime reliability, and no-op gating. Remaining #33 items should be split into follow-ups: benchmark gate, PR proposal/deployment path, and true E2E evolution test.

Helps with #37 by reducing the PR backlog into a single tested integration branch.

Test plan

    python3 -m venv .venv
    . .venv/bin/activate
    python -m pip install -U pip setuptools wheel
    python -m pip install -e '.[dev]'
    pytest -q

Result locally:

177 passed, 11 warnings

Notes

I intentionally did not merge every open PR as-is. Several are duplicates, superseded, stacked, or too broad/generated-artifact-heavy. This branch takes the smallest coherent set that closes the core technical blockers and keeps the suite green.

octo-patch and others added 25 commits April 6, 2026 23:16
- Add MiniMax chat model provider via OpenAI-compatible endpoint
- Add MINIMAX_API_KEY and MINIMAX_BASE_URL config fields to EvolutionConfig
- Add make_lm() helper to EvolutionConfig that routes MiniMax models to
  https://api.minimax.io/v1 with correct temperature (1.0, required by MiniMax)
- Support bare model IDs and prefixed forms (minimax/, openai/)
- Add --use-minimax CLI flag to evolve_skill.py for easy MiniMax selection
- Update dataset_builder.py and fitness.py to use config.make_lm()
- Add 16 unit tests covering MiniMax config and LM routing
- Document MiniMax usage in README

Supported models: MiniMax-M2.7, MiniMax-M2.7-highspeed
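
A minimal sketch of the routing described above; the real `EvolutionConfig.make_lm()` is in this branch, so the helper below is only illustrative (the function shape and config fields are assumptions):

    import dspy

    MINIMAX_BASE_URL = "https://api.minimax.io/v1"

    def make_lm(model: str, minimax_api_key: str) -> dspy.LM:
        # Accept bare IDs ("MiniMax-M2.7") and prefixed forms ("minimax/...", "openai/...").
        bare = model.split("/", 1)[-1]
        if bare.lower().startswith("minimax"):
            return dspy.LM(
                f"openai/{bare}",            # OpenAI-compatible endpoint route
                api_base=MINIMAX_BASE_URL,
                api_key=minimax_api_key,
                temperature=1.0,             # required by MiniMax
            )
        return dspy.LM(model)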
Three bug fixes that prevent the pipeline from running:

1. dataset_builder: LLM returns Python-style dicts (single quotes), not
   valid JSON. Added ast.literal_eval fallback + trailing comma fix so
   synthetic dataset generation doesn't crash on parse.

2. evolve_skill: GEPA API changed in DSPy 3.1.3 — max_steps is now
   max_metric_calls. Fixed the call and added auto='light'.

3. constraints: _check_skill_structure was checking the skill BODY for
   YAML frontmatter, which it never has after splitting. Rewrote to
   validate body structure (headings, procedural content, substance).

One architectural improvement:

4. skill_module: Skill text was passed as an input field, so the
   optimizer could never mutate it. Restructured to embed skill text
   in the instruction template via with_instructions(), allowing
   MIPROv2/GEPA to propose improved skill bodies. Updated extraction
   logic in evolve_skill.py to pull evolved text from the compiled
   predictor's instruction.
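
A minimal sketch of the restructuring in item 4; the signature and field names are illustrative, not the module's actual ones:

    import dspy

    class ApplySkill(dspy.Signature):
        """Placeholder instructions; replaced via with_instructions()."""
        task_input: str = dspy.InputField()
        response: str = dspy.OutputField()

    def build_skill_module(skill_text: str) -> dspy.Predict:
        # Embedding the skill body in signature.instructions makes it part of
        # the optimizable instruction surface; as an InputField it was frozen
        # data the optimizer could never mutate.
        sig = ApplySkill.with_instructions(
            "Apply the following skill when answering:\n\n" + skill_text
        )
        return dspy.Predict(sig)

    # After compilation, the evolved skill text is read back from the compiled
    # predictor's signature.instructions (the extraction path in evolve_skill.py).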
Replaces the single keyword-overlap scorer with a weighted composite of
five independent signals that spread scores across a much wider range:

1. Keyword overlap (25%) - stop-word filtered, F1-style blend
2. Character 3-gram similarity (25%) - Jaccard on char shingles
3. Structural pattern matching (20%) - code blocks, lists, headers
4. Length quality (15%) - proportional to expected output length
5. Content density (15%) - unique token ratio, avg token length, variety

Also:
- Returns dspy.Prediction(score=float, feedback=str) for GEPA
  reflective mutation compatibility
- Feedback string highlights specific weaknesses for optimizer use
- All scoring is deterministic (no LLM calls) for speed during
  optimization loops

Fixes NousResearch#12
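
A condensed sketch of the composite's shape. The weights and the `dspy.Prediction(score=..., feedback=...)` return come from this commit; the signal implementations and the `gold.response` / `pred.response` field names are illustrative:

    import dspy

    STOP = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it"}

    def keyword_overlap(expected, actual):
        e = {t for t in expected.lower().split() if t not in STOP}
        a = {t for t in actual.lower().split() if t not in STOP}
        if not e or not a:
            return 0.0
        p, r = len(e & a) / len(a), len(e & a) / len(e)
        return 2 * p * r / (p + r) if p + r else 0.0      # F1-style blend

    def char_3gram(expected, actual):
        shingle = lambda s: {s[i:i + 3] for i in range(len(s) - 2)}
        e, a = shingle(expected), shingle(actual)
        return len(e & a) / len(e | a) if e | a else 0.0  # Jaccard on shingles

    def structure_match(expected, actual):
        markers = ("```", "\n- ", "\n#")                  # code blocks, lists, headers
        return sum((m in expected) == (m in actual) for m in markers) / len(markers)

    def length_quality(expected, actual):
        if not expected or not actual:
            return 0.0
        return min(len(actual) / len(expected), len(expected) / len(actual))

    def content_density(expected, actual):
        toks = actual.lower().split()
        return len(set(toks)) / len(toks) if toks else 0.0

    WEIGHTED = [(keyword_overlap, 0.25), (char_3gram, 0.25), (structure_match, 0.20),
                (length_quality, 0.15), (content_density, 0.15)]

    def multi_signal_metric(gold, pred, trace=None):
        expected, actual = gold.response, pred.response
        parts = {fn.__name__: fn(expected, actual) for fn, _ in WEIGHTED}
        score = sum(w * parts[fn.__name__] for fn, w in WEIGHTED)
        weakest = min(parts, key=parts.get)
        feedback = f"weakest signal: {weakest}={parts[weakest]:.2f}"
        return dspy.Prediction(score=float(score), feedback=feedback)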
…jects

Claude Code stores rich session transcripts (user prompts + assistant
responses + tool calls) at ~/.claude/projects/<encoded-cwd>/<id>.jsonl.
The previous ClaudeCodeImporter only read ~/.claude/history.jsonl, which
is a flat log of *user prompts only* — no assistant responses.

That meant Claude Code was the only sessiondb source that produced
unpaired examples, while Copilot and Hermes both yielded
(task_input, assistant_response) pairs. Downstream consumers
(RelevanceFilter, build_dataset_from_external) already plumb
assistant_response through, so the data shape gap was the only blocker.

Changes:
- Extend ClaudeCodeImporter with PROJECTS_DIR + _extract_from_projects.
- Add _parse_claude_code_session helper, mirroring _parse_copilot_events.
  Handles user/assistant interleaving, tool_use/tool_result skipping,
  multi-block assistant turns, malformed JSON, and secret redaction.
- New `source` arg on extract_messages: "auto" (default, prefers
  projects/, falls back to history.jsonl), "projects", or "history".
- Existing tests updated to pass `source="history"` (now explicit), plus
  12 new tests covering pair extraction, tool-result skipping, secret
  filtering, multi-session walking, limits, and auto fallback.

Verified on real ~/.claude/projects/ data: yields paired examples with
both task_input and assistant_response fields.

Closes the data-quality gap noted in NousResearch#3 for Claude Code users.
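
A minimal sketch of the pairing logic over a single projects/ session file. The JSONL field names below are assumptions about Claude Code's transcript format, not verbatim from the importer:

    import json
    from pathlib import Path

    def parse_session(path: Path) -> list[dict]:
        """Pair each user prompt with the next assistant text turn."""
        pairs, pending_user = [], None
        for line in path.read_text().splitlines():
            try:
                event = json.loads(line)
            except json.JSONDecodeError:
                continue                      # malformed lines are skipped, not fatal
            msg = event.get("message") or {}
            role, content = msg.get("role"), msg.get("content")
            if role == "user" and isinstance(content, str):
                pending_user = content
            elif role == "assistant" and pending_user is not None:
                if isinstance(content, str):
                    blocks = [{"type": "text", "text": content}]
                elif isinstance(content, list):
                    blocks = content
                else:
                    continue
                # Assistant turns can be multi-block; keep text, skip tool_use/tool_result.
                text = "\n".join(b.get("text", "") for b in blocks
                                 if isinstance(b, dict) and b.get("type") == "text")
                if text.strip():
                    pairs.append({"task_input": pending_user, "assistant_response": text})
                    pending_user = None
        return pairs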
Two unrelated stock-install bugs that together prevent the optimization
loop from completing on a fresh clone:

1. **Missing optuna dependency.** `evolve_skill.py` falls back to MIPROv2
   automatically when GEPA fails to initialize (which it currently always
   does on DSPy >=3.0 — see NousResearch#14, NousResearch#35, NousResearch#39). MIPROv2 imports `optuna` at
   `_optimize_prompt_parameters` time, so the fallback crashes with
   `ModuleNotFoundError: No module named 'optuna'` immediately after
   Step 2 finishes proposing instruction candidates. Switching the
   declared dependency from `dspy>=3.0.0` to `dspy[optuna]>=3.0.0` lets
   DSPy itself manage the version pin.

2. **No request timeout on LLM calls.** litellm's default request_timeout
   is unset, so any silent connection drop from the upstream provider
   (we hit this on a corporate/proxy gateway that drops long-lived
   POSTs without a TCP RST) hangs the optimization loop indefinitely
   with the python process holding an established but dead TCP socket
   at 0% CPU. We saw the entire 10-iteration loop block for 14+ minutes
   on a single hung call before manual intervention.

   Setting `litellm.request_timeout` at module import time gives every
   DSPy LM call a per-request deadline. Default 90s (generous for
   sonnet/opus reasoning tokens, short enough to detect a dead socket).
   Override via `LITELLM_REQUEST_TIMEOUT` env var.
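
   A minimal sketch of the import-time setup (the env var name is from this commit; the fail-fast parsing mirrors the new tests):

    import os
    import litellm

    _DEFAULT_TIMEOUT = 90.0   # generous for reasoning tokens, short enough to catch dead sockets

    _raw = os.environ.get("LITELLM_REQUEST_TIMEOUT")
    try:
        # Set at module import time: every subsequent DSPy LM call inherits this deadline.
        litellm.request_timeout = float(_raw) if _raw is not None else _DEFAULT_TIMEOUT
    except ValueError:
        raise SystemExit(f"LITELLM_REQUEST_TIMEOUT must be a number, got {_raw!r}")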

Verification: 143 tests pass (139 existing + 4 new tests covering the
timeout default, env override, float parsing, and fail-fast on bad
input). End-to-end run of `evolve_skill --skill llm-wiki-extract
--eval-source synthetic --iterations 3` against a flaky upstream
gateway now completes (before this fix it hung indefinitely).
# Conflicts:
#	evolution/skills/evolve_skill.py
# Conflicts:
#	evolution/skills/evolve_skill.py
#	tests/skills/test_evolve_skill.py
… output/

- Pin upper bounds on dspy/openai/pyyaml/click/rich to fence against API
  breaks and supply-chain auto-bumps (security review M4).
- Move pytest/pytest-asyncio to bounded ranges in dev extras.
- New optional [report] extra for generate_report.py's reportlab dep
  (audit C1.M4 — the import would fail without it on a stock install).
- Add output/ and proposals/ to .gitignore (security review C1) — runs
  may write derived skill bodies and metrics that include paraphrased
  session content; never auto-commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…rmes_repo

Addresses security review findings C2, H1, H2, H5, M1.

evolution/core/external_importers.py
- SECRET_PATTERNS: add gho_/ghs_/ghr_, GitLab glpat-, all Slack token
  prefixes (xoxp/xoxa/xoxr/xoxs/xapp/xoxb), AWS ASIA, Google AIza,
  Stripe live/test variants (rk_/pk_), Twilio, SendGrid, Mailgun, JWT
  3-part, all-algo private-key headers, MINIMAX_API_KEY, REDIS_URL,
  HF_TOKEN. Generic api_key/secret/token/credential assignment patterns.
  Existing test cases (177) still pass — patterns relaxed where the test
  suite expected loose matching (short tokens, bare PRIVATE KEY).
- New scrub_secrets(text) helper for defence-in-depth scanning of
  outputs the model may have paraphrased into secret-shaped strings.
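
A minimal sketch of the scrub_secrets shape, with a small illustrative subset of patterns (the branch's SECRET_PATTERNS list is far longer):

    import re

    SECRET_PATTERNS = [
        re.compile(r"\bgh[opsr]_[A-Za-z0-9]{20,}\b"),               # GitHub token prefixes
        re.compile(r"\bAIza[0-9A-Za-z_\-]{30,}\b"),                 # Google API keys
        re.compile(r"\beyJ[\w\-]+\.[\w\-]+\.[\w\-]+\b"),            # 3-part JWT
        re.compile(r"(?i)\b(api_key|secret|token|credential)\s*[:=]\s*\S+"),  # generic assignment
    ]

    def scrub_secrets(text: str, replacement: str = "[REDACTED]") -> str:
        """Defence-in-depth pass over text the model may have paraphrased."""
        for pattern in SECRET_PATTERNS:
            text = pattern.sub(replacement, text)
        return text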

evolution/skills/skill_module.py
- find_skill rejects skill names containing path separators or shell
  metachars (^[A-Za-z0-9_.-]+$ guard) — closes ../traversal vector.
- find_skill resolves and refuses any SKILL.md whose real path lies
  outside the skills/ tree (symlink-escape protection, H5).
- Add SkillModule(treat_as_untrusted=True) preamble that tells the
  optimizer to treat skill body as DATA, not commands. Mitigates
  prompt-injection from third-party transcripts (C2).
- Switch body delimiter from "\n\n---\n" to HTML-comment sentinels
  (HERMES_SKILL_BODY_START/END) so bodies containing markdown horizontal
  rules survive extraction (forward-port of upstream PR NousResearch#39 idea).
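
A minimal sketch of the find_skill guard layers (the name regex is from the commit; `skills_root` and the return convention are assumptions):

    import re
    from pathlib import Path

    _NAME_RE = re.compile(r"^[A-Za-z0-9_.-]+$")

    def find_skill(name: str, skills_root: Path) -> Path | None:
        # Layer 1: reject path separators and shell metacharacters outright.
        if not _NAME_RE.match(name):
            return None
        candidate = skills_root / name / "SKILL.md"
        if not candidate.exists():
            return None
        # Layer 2: resolve symlinks; refuse real paths outside the skills/ tree (H5).
        real = candidate.resolve()
        try:
            real.relative_to(skills_root.resolve())
        except ValueError:
            return None
        return real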

evolution/core/constraints.py
- run_test_suite(hermes_repo) now resolves the path, then refuses to
  invoke pytest unless pyproject.toml + tests/ exist and pyproject
  references hermes-agent. Pytest auto-loads conftest.py, so pointing
  at an untrusted tree was equivalent to RCE (M1).
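
A minimal sketch of that pre-flight check (pytest auto-imports conftest.py, so the target tree must be validated before invocation):

    import subprocess
    from pathlib import Path

    def run_test_suite(hermes_repo: str) -> bool:
        repo = Path(hermes_repo).resolve()
        pyproject = repo / "pyproject.toml"
        # Refuse anything that isn't plausibly the hermes-agent tree (M1).
        if not pyproject.is_file() or not (repo / "tests").is_dir():
            return False
        if "hermes-agent" not in pyproject.read_text():
            return False
        return subprocess.run(["pytest", "-q"], cwd=repo).returncode == 0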

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Addresses upstream issue NousResearch#33 (C1: keyword-only metric) and forward-ports
the more polished pieces of upstream PR NousResearch#25 and PR NousResearch#39 partial.

evolution/core/fitness.py
- Replace conciseness dimension with completeness — judges should
  penalise omissions, not reward brevity. Composite weight now
  0.4 correctness + 0.3 procedure + 0.3 completeness.
- New init_fitness_metric(config, skill_text, use_llm_judge=True) /
  reset_fitness_metric() pair. When use_llm_judge=True, an LLMJudge
  with the completeness rubric is the primary scorer; the deterministic
  multi-signal scorer becomes the fallback. When False (default), the
  metric stays purely deterministic and zero-cost — appropriate for
  fast iteration and for runs the user doesn't want to send to a judge.
- skill_fitness_metric accepts the 5-arg GEPA signature
  (gold, pred, trace, pred_name, pred_trace) so it works with both
  GEPA and the legacy 3-arg metric API.
- Judge failures fall through to deterministic with a "[judge
  unavailable: <ExceptionClass>]" prefix in feedback so users can see
  why scores look heuristic mid-run.
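
A minimal sketch of the dual-signature metric with the judge fallback (helper and module-state names are illustrative):

    import dspy

    _judge = None   # set by init_fitness_metric(..., use_llm_judge=True); None = deterministic only

    def _deterministic_score(gold, pred) -> dspy.Prediction:
        ok = float(gold.response.strip() == pred.response.strip())   # stand-in for the multi-signal scorer
        return dspy.Prediction(score=ok, feedback="deterministic multi-signal score")

    def skill_fitness_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
        """Works with GEPA's 5-arg signature and the legacy 3-arg metric API."""
        if _judge is None:
            return _deterministic_score(gold, pred)
        try:
            return _judge(gold, pred)
        except Exception as exc:
            det = _deterministic_score(gold, pred)
            det.feedback = f"[judge unavailable: {type(exc).__name__}] {det.feedback}"
            return det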

evolution/core/dataset_builder.py
- Replace inline 3-strategy JSON recovery with a 6-strategy
  _try_parse_json_list helper: direct json, ast.literal_eval (safer
  than eval, and still parses Python-literal single-quoted dicts),
  array-extraction-then-parse, ast.literal_eval on extracted candidate,
  trailing-comma-and-quote-fix, markdown-fence stripping, and a
  last-resort per-block scan. Returns None instead of raising so the
  caller can produce a useful error.
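
A condensed sketch of the recovery ladder, in the strategy order the commit describes (regexes are illustrative):

    import ast
    import json
    import re

    def try_parse_json_list(raw: str):
        """Try progressively looser parses; return None instead of raising."""
        candidates = [raw.strip()]
        candidates.append(re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip()))  # fence stripping
        extracted = re.search(r"\[.*\]", raw, re.DOTALL)                         # array extraction
        if extracted:
            candidates.append(extracted.group(0))
        for cand in candidates:
            for parse in (json.loads, ast.literal_eval):   # literal_eval handles single-quoted dicts
                try:
                    out = parse(cand)
                    if isinstance(out, list):
                        return out
                except (ValueError, SyntaxError):
                    pass
            try:
                # Trailing-comma fix: ",]" and ",}" are common LLM artifacts.
                out = json.loads(re.sub(r",\s*([\]}])", r"\1", cand))
                if isinstance(out, list):
                    return out
            except ValueError:
                pass
        return None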

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…estration

Closes the remaining audit items from upstream issue NousResearch#33 and applies the
must-fix orchestration findings from the local code+security review.

evolution/core/benchmark_gate.py (new)
- BenchmarkGate.run_all() returns one BenchmarkResult per enabled gate;
  TBLite runner is a deliberate stub (skipped=True with explanatory
  message) until hermes-agent batch_runner is wired in. Closes NousResearch#33 C3
  structurally — config flag now produces a real, observable result
  instead of being silently ignored.

evolution/skills/evolve_skill.py (rewrite)
- Wire --run-tests: validator.run_test_suite() is called after
  constraint validation; failure rejects the variant. Closes NousResearch#33 C2.
- Wire --create-pr: writes a proposal bundle to
  output/proposals/<skill>/<ts>/ (baseline_skill.md, evolved_skill.md,
  metrics.json, decision.json, diff.patch). Filesystem-only — never
  runs git ops or pushes. Closes NousResearch#33 H1.
- --use-minimax sets the MiniMax model only as a *default* now; user-
  supplied --optimizer-model / --eval-model / --judge-model always win.
  Closes security review H3 (jurisdictional surprise).
- New --consent-external-ingest flag is required for --eval-source
  sessiondb. The pipeline aborts with a red banner otherwise. Enforced
  before --dry-run so users learn about the requirement during setup
  validation. Closes security review H4.
- New --use-llm-judge flag opts in to LLMJudge scoring; default stays
  on the deterministic multi-signal metric.
- New --judge-model flag — defaults to --eval-model when unset.
  Resolves judge_model inconsistency between MiniMax / non-MiniMax
  paths flagged in code review.
- Body extraction uses HTML-comment sentinels and distinguishes
  "no-op" (evolved == baseline) from "extraction failed" (optimizer
  returned no usable output); see the sketch after this list. Closes
  code review NousResearch#6.
- GEPA is constructed with max_metric_calls only (no auto="light"
  conflict). Except clause narrowed to TypeError/AttributeError/
  ImportError so genuine fitness errors bubble. Closes code review NousResearch#4.
- scrub_secrets() runs over the evolved body before write — defence-
  in-depth against the model paraphrasing a leaked secret. Closes
  security review H2.
- Output paths now use config.output_dir consistently; --output-dir
  CLI flag added. Closes code review NousResearch#10.
- Cleaned up dead imports (Panel, FitnessScore, get_hermes_agent_path,
  and the never-referenced `import re as _re`).
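
A minimal sketch of the sentinel extraction and the no-op / extraction-failure distinction (exact sentinel comment formatting is an assumption):

    import re

    START = "<!-- HERMES_SKILL_BODY_START -->"
    END = "<!-- HERMES_SKILL_BODY_END -->"
    _BODY_RE = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)

    def extract_evolved_body(instruction: str, baseline_body: str):
        """Return (body, status) where status is 'ok', 'noop', or 'extraction_failed'."""
        m = _BODY_RE.search(instruction)
        if m is None:
            return None, "extraction_failed"   # optimizer returned no usable output
        body = m.group(1).strip()
        if body == baseline_body.strip():
            return body, "noop"                # ran fine, but nothing changed
        return body, "ok"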

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
177 → 237 passing tests after this commit.

tests/core/test_fitness.py (new)
- Multi-signal scorer: empty/whitespace output, identical text, unrelated
  text, prediction shape, keyword overlap edge cases (identical,
  disjoint, empty expected, stop-word filtering), char n-gram boundaries,
  structural match recall + noise penalty, length-quality boundaries,
  content density, score parsing.
- LLMJudge wiring: uninitialized → deterministic, judge failure → fallback
  with "[judge unavailable: ...]" feedback flag, judge success →
  composite (0.4/0.3/0.3 weighting).
- FitnessScore composite + length-penalty arithmetic.

tests/core/test_benchmark_gate.py (new)
- Disabled config returns no results.
- Enabled config returns one result with skipped=True / passed=True /
  threshold populated and a "not implemented" message.
- Display message formatting for skipped / passed / failed states.

tests/core/test_security_hardening.py (new)
- find_skill rejects path-traversal names, path separators, shell
  metacharacters; accepts legit names; refuses symlink escape from
  skills tree; returns None for missing skills/.
- scrub_secrets redacts Anthropic / JWT / password assignments;
  preserves innocent prose; honours custom replacement.
- SkillModule wraps body with untrusted-data preamble; sentinels are
  always present; bodies containing markdown horizontal rules survive
  round-trip extraction.
- run_test_suite refuses non-existent paths, unrelated projects
  (pyproject doesn't reference hermes-agent), and missing pyproject.toml.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… status

- New "How it integrates with hermes-agent" section answers upstream
  issue NousResearch#18 explicitly: discovery (HERMES_AGENT_REPO / ~/.hermes /
  sibling), the 6-step pipeline, and the deployment story (proposal
  bundle → human review → manual copy or PR; never auto-merge).
- New Privacy & Security section: documents what each --eval-source
  sends where, the --consent-external-ingest gate, the --use-minimax
  precedence rule (user models win), what the secret detector does
  and does not catch, the output/ gitignore caveat, the
  untrusted-data preamble, and the run_test_suite path validation.
- Phase 2-5 table now marks the empty package directories as
  "Stub only" instead of "Planned" — evolution/tools/, prompts/,
  code/, monitor/ each contain only an empty __init__.py and the
  prior wording was misleading.
- Updated Quick Start with --use-llm-judge, --consent-external-ingest,
  --run-tests, --create-pr examples.
- Guardrails section rewritten to reflect what is actually enforced
  (structural integrity on full SKILL.md, body-only size/growth, no-op
  vs extraction-failure distinction, optional pytest gate, optional
  benchmark gate stub, proposal-bundle-only deployment).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
HIGH:
- consent gate on standalone external_importers CLI (was bypassable)
- _load_skill_text validates skill name against [A-Za-z0-9_.-]+
- --source-project filter limits Claude Code mining to one project
- PII_PATTERNS scrub emails, IPs, phone numbers, SSNs
- detailed consent warning enumerates actual data categories sent

MEDIUM:
- repr=False on minimax_api_key to prevent leaks via repr/logs
- validate_model_string() rejects URLs and unknown provider prefixes
- 9 new secret patterns: OPENSSH key, Databricks, DigitalOcean, npm,
  PyPI, Vault, Telegram, Supabase, Vercel + connection strings
  (postgres://, mysql://, mongodb://, redis://, amqp://) with creds
- _check_prompt_injection scans evolved text for known patterns
  (ignore previous, exfiltrate, reveal system prompt, ...); see the
  sketch after this list
- _warn_stale_datasets warns when JSONL files older than 7 days
- MiniMax jurisdiction note in consent text when --use-minimax active
- litellm pinned explicitly (>=1.50.0,<2)
- requirements.lock for reproducible builds
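
A minimal sketch of the injection scan referenced above, with an illustrative subset of the known-pattern list:

    import re

    INJECTION_PATTERNS = [
        re.compile(r"(?i)ignore (all )?previous instructions"),
        re.compile(r"(?i)reveal (the )?system prompt"),
        re.compile(r"(?i)exfiltrate"),
    ]

    def check_prompt_injection(evolved_text: str) -> list[str]:
        """Return the patterns that fired; an empty list means clean."""
        return [p.pattern for p in INJECTION_PATTERNS if p.search(evolved_text)]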

48 new tests in tests/core/test_audit_fixes.py covering every fix.
Full suite: 282 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
pr/26 (Kevin Laithwaite):
- evolution/core/codex_lm.py — DSPy LM wrapping `codex exec --json` for
  GPT-5.4 via ChatGPT Plus OAuth (no API key needed)
- RelevanceFilter three-stage pipeline: LLM-expanded keywords (one call,
  30-50 synonyms/phrases) → substring pre-filter on full corpus → LLM
  scoring. Raises candidate budget from 3x to 8x to let the LLM reject
  borderline cases. Verified upstream: +160% relevant examples found
  (17→44 from 2590-message corpus) and downstream +11.5% holdout gain.

pr/25 (Eric):
- evolution/monitor/progress.py — SQLite-backed run tracker at
  ~/.hermes/evolution_progress.db with start_run/log_event/complete_run/
  fail_run/get_active_run/get_run_history/get_run_events.
- evolve_skill.py wires hooks at: start, skill load, dataset built,
  optimization complete, validation fail, run complete.

pr/19 (Hermes Bot):
- EvolutionConfig.api_base + api_key for vLLM/Ollama/LiteLLM-compatible
  endpoints. Forwarded to dspy.LM only when set; MiniMax routing
  unchanged. api_key uses repr=False for the same leak protection as
  minimax_api_key.
- --api-base / --api-key CLI flags on evolve_skill.

Skipped:
- pr/23 — already fixed differently via validate_skill_constraints in
  evolve_skill.py:58-82 (runs _check_skill_structure on full reassembled
  skill); pr/23's body-only rewrite would conflict.
- pr/25 pick_skill.py — author-specific hardcoded paths, not portable.
- pr/22 — ChatGPT OAuth backend overlaps with CodexLM; defer.
- pr/20, pr/21 — MAD scoring; defer to a separate session.

26 new tests in tests/core/test_upstream_integration.py.
Full suite: 308 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
pr/21 (nopenotagain):
- evolution/core/mad_scoring.py — pure-math Median Absolute Deviation
  utilities (compute_mad, compute_confidence, ConfidenceResult, ...).
- Holdout eval now reports MAD confidence on per-example deltas as a
  statistical sanity check on whether improvement is real:
    confidence = |mean_delta| / MAD(deltas)
    >= 2.0x → "likely real"
    >= 1.0x → "marginal"
    <  1.0x → "within noise"
  Surfaced in the results table and persisted to metrics.json
  (mad_confidence, mad_delta, mad_label).
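
The arithmetic above in runnable form (function names follow the commit; the zero-MAD edge-case handling is an assumption):

    import statistics

    def compute_mad(values: list[float]) -> float:
        med = statistics.median(values)
        return statistics.median([abs(v - med) for v in values])

    def compute_confidence(deltas: list[float]) -> tuple[float, str]:
        """confidence = |mean_delta| / MAD(deltas), labelled per the thresholds above."""
        mad = compute_mad(deltas)
        if mad == 0.0:
            return float("inf"), "likely real"
        conf = abs(statistics.fmean(deltas)) / mad
        label = ("likely real" if conf >= 2.0
                 else "marginal" if conf >= 1.0
                 else "within noise")
        return conf, label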

Skipped from pr/21:
- hermes_judge.py — overlaps with our LLMJudge + adds subprocess complexity
- gpt-5.4 default hijack — kept our gpt-4.1 defaults
- Nous API auto-discovery from ~/.hermes/auth.json — too opinionated
- Multi-trial mad_fitness_metric wiring into the optimizer — adds n_trials×
  LLM cost on every metric call. Post-hoc on holdout deltas gives the same
  statistical signal at a fraction of the cost.

Also defers pr/22 ChatGPT OAuth backend — overlaps with CodexLM (pr/26)
which already provides ChatGPT-via-OAuth via the codex CLI.

19 new MAD math tests in tests/core/test_mad_scoring.py.
Full suite: 327 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Phase 2 of self-evolution. Optimizes the description field of a tool
schema so the agent picks the right tool more reliably for a given task.

Architecture mirrors Phase 1 (skill evolution):

- evolution/tools/tool_module.py — ToolModule wraps a description as the
  optimizable parameter (signature.instructions). Forward pass returns a
  binary "yes/no" tool-pick decision. HTML sentinel delimiters survive
  GEPA wrapper rewrites and markdown horizontal rules. Untrusted-data
  preamble guards against prompt injection in third-party tool registries.
  load_tool_definition validates JSON schema + tool name regex.

- evolution/tools/dataset.py — ToolDatasetBuilder generates contrastive
  synthetic data: positive tasks (where the tool fits) and negative tasks
  (where it does not). Polarity is encoded in EvalExample.category so the
  existing dataset/split infrastructure carries through. Rejects datasets
  with no positives or no negatives — fitness needs contrast to be
  informative.

- evolution/tools/fitness.py — Contrastive metric. Score = 1.0 for
  correct decision (yes on positive / no on negative), else 0.0. Returns
  GEPA-compatible (score, feedback) Prediction with rationale-aware
  feedback so reflective optimization gets actionable signal.
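
  A minimal sketch of the contrastive metric (the 1.0/0.0 scoring and the GEPA-compatible return are from this commit; field names like `category` and `decision` are assumptions):

    import dspy

    def tool_fitness_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
        """Score 1.0 for a correct yes/no pick, 0.0 otherwise, with rationale-aware feedback."""
        expected = "yes" if gold.category == "positive" else "no"   # polarity via EvalExample.category
        decision = pred.decision.strip().lower()
        score = 1.0 if decision.startswith(expected) else 0.0
        if score:
            feedback = f"correct: picked '{decision}' on a {gold.category} task"
        else:
            rationale = getattr(pred, "rationale", "n/a")
            feedback = f"wrong: picked '{decision}', expected '{expected}'; rationale: {rationale}"
        return dspy.Prediction(score=score, feedback=feedback)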

- evolution/tools/evolve_tool.py — CLI mirroring evolve_skill.py:
  --tool-def, --iterations, --optimizer-model, --eval-model, --use-minimax,
  --dry-run, --create-pr, --output-dir, --api-base, --api-key. Reuses
  ConstraintValidator (artifact_type="tool_description"), MAD confidence,
  scrub_secrets, progress tracker, and the proposal-bundle pattern.

Tool definition format is a small JSON file:
    {"name": "search_files", "description": "...", "parameters": {...}}

Constraints enforced on evolved descriptions:
  - max_tool_desc_size (default 500 chars; sent on every API call so
    must stay small)
  - max_prompt_growth (+20%) when compared to baseline
  - non-empty
  - prompt_injection scan
  (size/growth already handled by ConstraintValidator from Phase 1.)

39 new tests in tests/tools/ covering tool_module, dataset, and fitness.
Full suite: 366 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>