fix: integrate critical self-evolution pipeline fixes #42
Open
innoscoutpro wants to merge 25 commits into NousResearch:main from
Conversation
Add MiniMax chat model provider via OpenAI-compatible endpoint

- Add MINIMAX_API_KEY and MINIMAX_BASE_URL config fields to EvolutionConfig
- Add make_lm() helper to EvolutionConfig that routes MiniMax models to
  https://api.minimax.io/v1 with correct temperature (1.0, required by MiniMax)
- Support bare model IDs and prefixed forms (minimax/, openai/)
- Add --use-minimax CLI flag to evolve_skill.py for easy MiniMax selection
- Update dataset_builder.py and fitness.py to use config.make_lm()
- Add 16 unit tests covering MiniMax config and LM routing
- Document MiniMax usage in README

Supported models: MiniMax-M2.7, MiniMax-M2.7-highspeed
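For readers skimming the diff, a minimal sketch of what this routing could look like; the field and method names are taken from this commit message, and the body is illustrative rather than the repo's actual code:

```python
# Hypothetical sketch of the make_lm() routing described above; names follow
# the commit message, not verified source.
import os
from dataclasses import dataclass, field

import dspy


@dataclass
class EvolutionConfig:
    # repr=False mirrors the leak protection added in a later audit commit.
    minimax_api_key: str = field(default="", repr=False)
    minimax_base_url: str = "https://api.minimax.io/v1"

    def make_lm(self, model: str) -> dspy.LM:
        # Accept bare IDs ("MiniMax-M2.7") and prefixed forms ("minimax/...").
        bare = model.split("/", 1)[-1]
        if bare.lower().startswith("minimax"):
            # MiniMax's OpenAI-compatible endpoint requires temperature=1.0.
            return dspy.LM(
                f"openai/{bare}",
                api_base=self.minimax_base_url,
                api_key=self.minimax_api_key or os.getenv("MINIMAX_API_KEY", ""),
                temperature=1.0,
            )
        return dspy.LM(model)
```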
Three bug fixes that prevent the pipeline from running:

1. dataset_builder: LLM returns Python-style dicts (single quotes), not
   valid JSON. Added ast.literal_eval fallback + trailing-comma fix so
   synthetic dataset generation doesn't crash on parse.
2. evolve_skill: GEPA API changed in DSPy 3.1.3 — max_steps is now
   max_metric_calls. Fixed the call and added auto='light'.
3. constraints: _check_skill_structure was checking the skill BODY for
   YAML frontmatter, which it never has after splitting. Rewrote to
   validate body structure (headings, procedural content, substance).

One architectural improvement:

4. skill_module: Skill text was passed as an input field, so the
   optimizer could never mutate it. Restructured to embed skill text in
   the instruction template via with_instructions(), allowing
   MIPROv2/GEPA to propose improved skill bodies. Updated extraction
   logic in evolve_skill.py to pull evolved text from the compiled
   predictor's instruction.
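A hedged sketch of the fix-1 fallback chain; the actual helper in dataset_builder.py may be structured differently:

```python
# Sketch of the parse fallback from fix 1: JSON first, then Python literals,
# then a trailing-comma repair. Returns None rather than raising.
import ast
import json
import re


def parse_llm_list(text: str):
    """Parse an LLM reply that should be a JSON list of dicts."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # LLMs often emit Python-literal dicts: single quotes, True/False/None.
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        pass
    # Last resort: strip trailing commas before ] or } and retry as JSON.
    try:
        return json.loads(re.sub(r",\s*([\]}])", r"\1", text))
    except json.JSONDecodeError:
        return None  # caller turns this into a useful error
```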
Replaces the single keyword-overlap scorer with a weighted composite of
five independent signals that spread scores across a much wider range:

1. Keyword overlap (25%) - stop-word filtered, F1-style blend
2. Character 3-gram similarity (25%) - Jaccard on char shingles
3. Structural pattern matching (20%) - code blocks, lists, headers
4. Length quality (15%) - proportional to expected output length
5. Content density (15%) - unique token ratio, avg token length, variety

Also:
- Returns dspy.Prediction(score=float, feedback=str) for GEPA
  reflective-mutation compatibility
- Feedback string highlights specific weaknesses for optimizer use
- All scoring is deterministic (no LLM calls) for speed during
  optimization loops

Fixes NousResearch#12
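To make signal 2 concrete, here is an illustrative character 3-gram Jaccard implementation; the repo's exact normalization is an assumption here:

```python
# Jaccard similarity on lowercase character 3-gram shingles.
def char_ngram_similarity(a: str, b: str, n: int = 3) -> float:
    def shingles(s: str) -> set[str]:
        s = s.lower()
        return {s[i:i + n] for i in range(max(len(s) - n + 1, 0))}

    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)
```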
…jects

Claude Code stores rich session transcripts (user prompts + assistant
responses + tool calls) at ~/.claude/projects/<encoded-cwd>/<id>.jsonl.
The previous ClaudeCodeImporter only read ~/.claude/history.jsonl, which
is a flat log of *user prompts only* — no assistant responses. That
meant Claude Code was the only sessiondb source that produced unpaired
examples, while Copilot and Hermes both yielded (task_input,
assistant_response) pairs. Downstream consumers (RelevanceFilter,
build_dataset_from_external) already plumb assistant_response through,
so the data shape gap was the only blocker.

Changes:
- Extend ClaudeCodeImporter with PROJECTS_DIR + _extract_from_projects.
- Add _parse_claude_code_session helper, mirroring _parse_copilot_events.
  Handles user/assistant interleaving, tool_use/tool_result skipping,
  multi-block assistant turns, malformed JSON, and secret redaction.
- New `source` arg on extract_messages: "auto" (default, prefers
  projects/, falls back to history.jsonl), "projects", or "history".
- Existing tests updated to pass `source="history"` (now explicit), plus
  12 new tests covering pair extraction, tool-result skipping, secret
  filtering, multi-session walking, limits, and auto fallback.

Verified on real ~/.claude/projects/ data: yields paired examples with
both task_input and assistant_response fields. Closes the data-quality
gap noted in NousResearch#3 for Claude Code users.
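A hypothetical sketch of the pairing logic, assuming a simple role/content event schema; the real ~/.claude/projects JSONL format may differ:

```python
# Sketch in the spirit of _parse_claude_code_session: pair each user prompt
# with the next assistant text turn, skipping tool blocks and bad lines.
import json
from pathlib import Path
from typing import Iterator


def parse_session(path: Path) -> Iterator[tuple[str, str]]:
    """Yield (task_input, assistant_response) pairs from one session file."""
    pending_user = None
    for line in path.read_text().splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate malformed lines
        if not isinstance(event, dict):
            continue
        role = event.get("role")
        if role == "user":
            pending_user = event.get("content", "")
        elif role == "assistant" and pending_user:
            # Keep only text blocks; skip tool_use / tool_result blocks.
            blocks = event.get("content", [])
            text = "\n".join(
                b.get("text", "") for b in blocks
                if isinstance(b, dict) and b.get("type") == "text"
            )
            if text.strip():
                yield pending_user, text
            pending_user = None
```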
Two unrelated stock-install bugs that together prevent the optimization
loop from completing on a fresh clone:

1. **Missing optuna dependency.** `evolve_skill.py` falls back to
   MIPROv2 automatically when GEPA fails to initialize (which it
   currently always does on DSPy >=3.0 — see NousResearch#14,
   NousResearch#35, NousResearch#39). MIPROv2 imports `optuna` at
   `_optimize_prompt_parameters` time, so the fallback crashes with
   `ModuleNotFoundError: No module named 'optuna'` immediately after
   Step 2 finishes proposing instruction candidates. Switching the
   declared dependency from `dspy>=3.0.0` to `dspy[optuna]>=3.0.0` lets
   DSPy itself manage the version pin.

2. **No request timeout on LLM calls.** litellm's default
   request_timeout is unset, so any silent connection drop from the
   upstream provider (we hit this on a corporate/proxy gateway that
   drops long-lived POSTs without a TCP RST) hangs the optimization
   loop indefinitely, with the Python process holding an established
   but dead TCP socket at 0% CPU. We saw the entire 10-iteration loop
   block for 14+ minutes on a single hung call before manual
   intervention. Setting `litellm.request_timeout` at module import
   time gives every DSPy LM call a per-request deadline. Default 90s
   (generous for sonnet/opus reasoning tokens, short enough to detect a
   dead socket). Override via `LITELLM_REQUEST_TIMEOUT` env var.

Verification: 143 tests pass (139 existing + 4 new tests covering the
timeout default, env override, float parsing, and fail-fast on bad
input). End-to-end run of `evolve_skill --skill llm-wiki-extract
--eval-source synthetic --iterations 3` against a flaky upstream
gateway now completes (before this fix it hung indefinitely).
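A sketch of the import-time guard; the env var name and 90s default come from this commit, while the exact fail-fast wording is assumed:

```python
# Set a global per-request deadline for every litellm (and thus DSPy LM) call
# at module import time, so a dead socket cannot hang the loop forever.
import os

import litellm

_raw = os.getenv("LITELLM_REQUEST_TIMEOUT", "90")
try:
    litellm.request_timeout = float(_raw)
except ValueError:
    # Fail fast on a bad value instead of silently running without a timeout.
    raise SystemExit(f"LITELLM_REQUEST_TIMEOUT must be a number, got {_raw!r}")
```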
# Conflicts:
#   evolution/skills/evolve_skill.py
# Conflicts:
#   evolution/skills/evolve_skill.py
#   tests/skills/test_evolve_skill.py
… output/

- Pin upper bounds on dspy/openai/pyyaml/click/rich to fence against API
  breaks and supply-chain auto-bumps (security review M4).
- Move pytest/pytest-asyncio to bounded ranges in dev extras.
- New optional [report] extra for generate_report.py's reportlab dep
  (audit C1.M4 — the import would fail without it on a stock install).
- Add output/ and proposals/ to .gitignore (security review C1) — runs
  may write derived skill bodies and metrics that include paraphrased
  session content; never auto-commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…rmes_repo

Addresses security review findings C2, H1, H2, H5, M1.

evolution/core/external_importers.py
- SECRET_PATTERNS: add gho_/ghs_/ghr_, GitLab glpat-, all Slack token
  prefixes (xoxp/xoxa/xoxr/xoxs/xapp/xoxb), AWS ASIA, Google AIza,
  Stripe live/test variants (rk_/pk_), Twilio, SendGrid, Mailgun, JWT
  3-part, all-algo private-key headers, MINIMAX_API_KEY, REDIS_URL,
  HF_TOKEN. Generic api_key/secret/token/credential assignment
  patterns. Existing test cases (177) still pass — patterns relaxed
  where the test suite expected loose matching (short tokens, bare
  PRIVATE KEY).
- New scrub_secrets(text) helper for defence-in-depth scanning of
  outputs the model may have paraphrased into secret-shaped strings.

evolution/skills/skill_module.py
- find_skill rejects skill names containing path separators or shell
  metachars (^[A-Za-z0-9_.-]+$ guard) — closes the ../ traversal vector.
- find_skill resolves and refuses any SKILL.md whose real path lies
  outside the skills/ tree (symlink-escape protection, H5).
- Add SkillModule(treat_as_untrusted=True) preamble that tells the
  optimizer to treat the skill body as DATA, not commands. Mitigates
  prompt-injection from third-party transcripts (C2).
- Switch body delimiter from "\n\n---\n" to HTML-comment sentinels
  (HERMES_SKILL_BODY_START/END) so bodies containing markdown
  horizontal rules survive extraction (forward-port of upstream PR
  NousResearch#39 idea).

evolution/core/constraints.py
- run_test_suite(hermes_repo) now resolves the path, then refuses to
  invoke pytest unless pyproject.toml + tests/ exist and pyproject
  references hermes-agent. Pytest auto-loads conftest.py, so pointing
  at an untrusted tree was equivalent to RCE (M1).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
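A hedged sketch of the find_skill guards described here, assuming the function takes a skills root and a name; the repo's actual signature may differ:

```python
# Name regex plus symlink-escape check, per the skill_module bullets above.
import re
from pathlib import Path

SKILL_NAME_RE = re.compile(r"^[A-Za-z0-9_.-]+$")


def find_skill(skills_root: Path, name: str) -> Path | None:
    if not SKILL_NAME_RE.match(name):
        return None  # rejects ../ traversal, path separators, shell metachars
    candidate = skills_root / name / "SKILL.md"
    if not candidate.is_file():
        return None
    # resolve() follows symlinks; refuse anything whose real path escapes skills/.
    if not candidate.resolve().is_relative_to(skills_root.resolve()):
        return None
    return candidate
```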
Addresses upstream issue NousResearch#33 (C1: keyword-only metric) and
forward-ports the more polished pieces of upstream PR NousResearch#25
and parts of PR NousResearch#39.

evolution/core/fitness.py
- Replace the conciseness dimension with completeness — judges should
  penalise omissions, not reward brevity. Composite weight is now
  0.4 correctness + 0.3 procedure + 0.3 completeness.
- New init_fitness_metric(config, skill_text, use_llm_judge=True) /
  reset_fitness_metric() pair. When use_llm_judge=True, an LLMJudge
  with the completeness rubric is the primary scorer; the deterministic
  multi-signal scorer becomes the fallback. When False (default), the
  metric stays purely deterministic and zero-cost — appropriate for
  fast iteration and for runs the user doesn't want to send to a judge.
- skill_fitness_metric accepts the 5-arg GEPA signature (gold, pred,
  trace, pred_name, pred_trace) so it works with both GEPA and the
  legacy 3-arg metric API.
- Judge failures fall through to deterministic scoring with a
  "[judge unavailable: <ExceptionClass>]" prefix in feedback so users
  can see why scores look heuristic mid-run.

evolution/core/dataset_builder.py
- Replace the inline 3-strategy JSON recovery with a 6-strategy
  _try_parse_json_list helper: direct json, ast.literal_eval (safer
  than eval, and parses Python-literal single-quoted dicts),
  array-extraction-then-parse, ast.literal_eval on the extracted
  candidate, trailing-comma-and-quote fixing, markdown-fence stripping,
  plus a last-resort per-block scan. Returns None instead of raising so
  the caller can produce a useful error.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
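A toy sketch of the dual-signature metric: the GEPA extras default to None, so the same function also satisfies the legacy 3-arg API. The scoring body below is a placeholder, not the repo's multi-signal scorer:

```python
import dspy


def skill_fitness_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # GEPA calls with 5 positional args; the legacy API passes only 3.
    expected = str(getattr(gold, "expected_output", "") or "")
    got = str(getattr(pred, "output", "") or "")
    score = 1.0 if expected.strip() and expected.strip() == got.strip() else 0.0
    feedback = "exact match" if score else "outputs differ"
    return dspy.Prediction(score=score, feedback=feedback)
```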
…estration

Closes the remaining audit items from upstream issue NousResearch#33
and applies the must-fix orchestration findings from the local
code+security review.

evolution/core/benchmark_gate.py (new)
- BenchmarkGate.run_all() returns one BenchmarkResult per enabled gate;
  the TBLite runner is a deliberate stub (skipped=True with an
  explanatory message) until hermes-agent batch_runner is wired in.
  Closes NousResearch#33 C3 structurally — the config flag now produces
  a real, observable result instead of being silently ignored.

evolution/skills/evolve_skill.py (rewrite)
- Wire --run-tests: validator.run_test_suite() is called after
  constraint validation; failure rejects the variant. Closes
  NousResearch#33 C2.
- Wire --create-pr: writes a proposal bundle to
  output/proposals/<skill>/<ts>/ (baseline_skill.md, evolved_skill.md,
  metrics.json, decision.json, diff.patch). Filesystem-only — never
  runs git ops or pushes. Closes NousResearch#33 H1.
- --use-minimax sets the MiniMax model only as a *default* now;
  user-supplied --optimizer-model / --eval-model / --judge-model always
  win. Closes security review H3 (jurisdictional surprise).
- New --consent-external-ingest flag is required for --eval-source
  sessiondb. The pipeline aborts with a red banner otherwise. Enforced
  before --dry-run so users learn about the requirement during setup
  validation. Closes security review H4.
- New --use-llm-judge flag opts in to LLMJudge scoring; the default
  stays on the deterministic multi-signal metric.
- New --judge-model flag — defaults to --eval-model when unset.
  Resolves the judge_model inconsistency between MiniMax / non-MiniMax
  paths flagged in code review.
- Body extraction uses HTML-comment sentinels and distinguishes "no-op"
  (evolved == baseline) from "extraction failed" (optimizer returned no
  usable output). Closes code review NousResearch#6.
- GEPA is constructed with max_metric_calls only (no auto="light"
  conflict). The except clause is narrowed to TypeError/AttributeError/
  ImportError so genuine fitness errors bubble. Closes code review
  NousResearch#4.
- scrub_secrets() runs over the evolved body before write —
  defence-in-depth against the model paraphrasing a leaked secret.
  Closes security review H2.
- Output paths now use config.output_dir consistently; --output-dir CLI
  flag added. Closes code review NousResearch#10.
- Cleaned up dead imports (Panel, FitnessScore, get_hermes_agent_path)
  and the stale "import re as _re" that was never referenced.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
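An illustrative sketch of the sentinel extraction and the no-op vs extraction-failure distinction; the sentinel names follow the earlier hardening commit, and the return shape is an assumption:

```python
START = "<!-- HERMES_SKILL_BODY_START -->"
END = "<!-- HERMES_SKILL_BODY_END -->"


def extract_body(instruction: str, baseline_body: str) -> tuple[str | None, str]:
    """Pull the evolved body out of the compiled predictor's instruction."""
    i, j = instruction.find(START), instruction.find(END)
    if i == -1 or j == -1 or j <= i:
        # Optimizer returned no usable output: genuinely failed extraction.
        return None, "extraction failed: optimizer returned no usable output"
    body = instruction[i + len(START):j].strip()
    if body == baseline_body.strip():
        # Valid output that simply did not change anything.
        return body, "no-op: evolved body equals baseline"
    return body, "ok"
```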
177 → 237 passing tests after this commit.

tests/core/test_fitness.py (new)
- Multi-signal scorer: empty/whitespace output, identical text,
  unrelated text, prediction shape, keyword overlap edge cases
  (identical, disjoint, empty expected, stop-word filtering), char
  n-gram boundaries, structural match recall + noise penalty,
  length-quality boundaries, content density, score parsing.
- LLMJudge wiring: uninitialized → deterministic, judge failure →
  fallback with "[judge unavailable: ...]" feedback flag, judge success
  → composite (0.4/0.3/0.3 weighting).
- FitnessScore composite + length-penalty arithmetic.

tests/core/test_benchmark_gate.py (new)
- Disabled config returns no results.
- Enabled config returns one result with skipped=True / passed=True /
  threshold populated and a "not implemented" message.
- Display message formatting for skipped / passed / failed states.

tests/core/test_security_hardening.py (new)
- find_skill rejects path-traversal names, path separators, shell
  metacharacters; accepts legit names; refuses symlink escape from the
  skills tree; returns None for missing skills/.
- scrub_secrets redacts Anthropic / JWT / password assignments;
  preserves innocent prose; honours custom replacement.
- SkillModule wraps the body with the untrusted-data preamble;
  sentinels are always present; bodies containing markdown horizontal
  rules survive round-trip extraction.
- run_test_suite refuses non-existent paths, unrelated projects
  (pyproject doesn't reference hermes-agent), and missing
  pyproject.toml.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… status

- New "How it integrates with hermes-agent" section answers upstream
  issue NousResearch#18 explicitly: discovery (HERMES_AGENT_REPO /
  ~/.hermes / sibling), the 6-step pipeline, and the deployment story
  (proposal bundle → human review → manual copy or PR; never
  auto-merge).
- New Privacy & Security section: documents what each --eval-source
  sends where, the --consent-external-ingest gate, the --use-minimax
  precedence rule (user models win), what the secret detector does and
  does not catch, the output/ gitignore caveat, the untrusted-data
  preamble, and the run_test_suite path validation.
- Phase 2-5 table now marks the empty package directories as "Stub
  only" instead of "Planned" — evolution/tools/, prompts/, code/,
  monitor/ each contain only an empty __init__.py and the prior wording
  was misleading.
- Updated Quick Start with --use-llm-judge, --consent-external-ingest,
  --run-tests, --create-pr examples.
- Guardrails section rewritten to reflect what is actually enforced
  (structural integrity on the full SKILL.md, body-only size/growth,
  the no-op vs extraction-failure distinction, optional pytest gate,
  optional benchmark gate stub, proposal-bundle-only deployment).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
HIGH:
- consent gate on the standalone external_importers CLI (was bypassable)
- _load_skill_text validates the skill name against [A-Za-z0-9_.-]+
- --source-project filter limits Claude Code mining to one project
- PII_PATTERNS scrub emails, IPs, phone numbers, SSNs
- detailed consent warning enumerates the actual data categories sent

MEDIUM:
- repr=False on minimax_api_key to prevent leaks via repr/logs
- validate_model_string() rejects URLs and unknown provider prefixes
- 9 new secret patterns: OPENSSH key, Databricks, DigitalOcean, npm,
  PyPI, Vault, Telegram, Supabase, Vercel + connection strings
  (postgres://, mysql://, mongodb://, redis://, amqp://) with creds
- _check_prompt_injection scans evolved text for known patterns (ignore
  previous, exfiltrate, reveal system prompt, ...)
- _warn_stale_datasets warns when JSONL files are older than 7 days
- MiniMax jurisdiction note in consent text when --use-minimax active
- litellm pinned explicitly (>=1.50.0,<2)
- requirements.lock for reproducible builds

48 new tests in tests/core/test_audit_fixes.py covering every fix.
Full suite: 282 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
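An assumed shape for the PII_PATTERNS scrubbing named above; the commit's real regexes cover more formats (phone numbers, broader IP forms):

```python
# Minimal PII scrubber sketch: pattern/replacement pairs applied in order.
import re

PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "[EMAIL]"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "[IP]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[SSN]"),
]


def scrub_pii(text: str) -> str:
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```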
pr/26 (Kevin Laithwaite):
- evolution/core/codex_lm.py — DSPy LM wrapping `codex exec --json` for
  GPT-5.4 via ChatGPT Plus OAuth (no API key needed)
- RelevanceFilter three-stage pipeline: LLM-expanded keywords (one
  call, 30-50 synonyms/phrases) → substring pre-filter on full corpus →
  LLM scoring. Raises candidate budget from 3x to 8x to let the LLM
  reject borderline cases. Verified upstream: +160% relevant examples
  found (17→44 from 2590-message corpus) and downstream +11.5% holdout
  gain.

pr/25 (Eric):
- evolution/monitor/progress.py — SQLite-backed run tracker at
  ~/.hermes/evolution_progress.db with start_run/log_event/complete_run/
  fail_run/get_active_run/get_run_history/get_run_events.
- evolve_skill.py wires hooks at: start, skill load, dataset built,
  optimization complete, validation fail, run complete.

pr/19 (Hermes Bot):
- EvolutionConfig.api_base + api_key for vLLM/Ollama/LiteLLM-compatible
  endpoints. Forwarded to dspy.LM only when set; MiniMax routing
  unchanged. api_key uses repr=False for the same leak protection as
  minimax_api_key.
- --api-base / --api-key CLI flags on evolve_skill.

Skipped:
- pr/23 — already fixed differently via validate_skill_constraints in
  evolve_skill.py:58-82 (runs _check_skill_structure on the full
  reassembled skill); pr/23's body-only rewrite would conflict.
- pr/25 pick_skill.py — author-specific hardcoded paths, not portable.
- pr/22 — ChatGPT OAuth backend overlaps with CodexLM; defer.
- pr/20, pr/21 — MAD scoring; defer to a separate session.

26 new tests in tests/core/test_upstream_integration.py.
Full suite: 308 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
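A minimal sketch of pr/25's tracker API surface (only start_run and log_event shown); the SQLite schema here is assumed, not the author's:

```python
import sqlite3
import time
from pathlib import Path

DB_PATH = Path.home() / ".hermes" / "evolution_progress.db"


def _connect() -> sqlite3.Connection:
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS runs "
                 "(id INTEGER PRIMARY KEY, skill TEXT, status TEXT, started REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS events (run_id INTEGER, ts REAL, event TEXT)")
    return conn


def start_run(skill: str) -> int:
    with _connect() as conn:
        cur = conn.execute(
            "INSERT INTO runs (skill, status, started) VALUES (?, 'active', ?)",
            (skill, time.time()))
        return cur.lastrowid


def log_event(run_id: int, event: str) -> None:
    with _connect() as conn:
        conn.execute("INSERT INTO events VALUES (?, ?, ?)", (run_id, time.time(), event))
```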
pr/21 (nopenotagain):
- evolution/core/mad_scoring.py — pure-math Median Absolute Deviation
utilities (compute_mad, compute_confidence, ConfidenceResult, ...).
- Holdout eval now reports MAD confidence on per-example deltas as a
statistical sanity check on whether improvement is real:
confidence = |mean_delta| / MAD(deltas)
>= 2.0x → "likely real"
>= 1.0x → "marginal"
< 1.0x → "within noise"
Surfaced in the results table and persisted to metrics.json
(mad_confidence, mad_delta, mad_label).
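A sketch of that computation; the compute_mad and compute_confidence names come from the commit, while the zero-MAD edge handling is an assumption:

```python
import statistics


def compute_mad(values: list[float]) -> float:
    # Median Absolute Deviation: median distance from the median.
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)


def compute_confidence(deltas: list[float]) -> tuple[float, str]:
    mad = compute_mad(deltas)
    if mad == 0.0:
        # All deltas identical: any nonzero shift is trivially consistent.
        return float("inf"), "likely real" if statistics.mean(deltas) else "within noise"
    confidence = abs(statistics.mean(deltas)) / mad
    if confidence >= 2.0:
        return confidence, "likely real"
    if confidence >= 1.0:
        return confidence, "marginal"
    return confidence, "within noise"
```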
Skipped from pr/21:
- hermes_judge.py — overlaps with our LLMJudge + adds subprocess complexity
- gpt-5.4 default hijack — kept our gpt-4.1 defaults
- Nous API auto-discovery from ~/.hermes/auth.json — too opinionated
- Multi-trial mad_fitness_metric wiring into the optimizer — adds n_trials×
LLM cost on every metric call. Post-hoc on holdout deltas gives the same
statistical signal at a fraction of the cost.
Also defers pr/22 ChatGPT OAuth backend — overlaps with CodexLM (pr/26)
which already provides ChatGPT-via-OAuth via the codex CLI.
19 new MAD math tests in tests/core/test_mad_scoring.py.
Full suite: 327 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Phase 2 of self-evolution. Optimizes the description field of a tool
schema so the agent picks the right tool more reliably for a given task.
Architecture mirrors Phase 1 (skill evolution):
- evolution/tools/tool_module.py — ToolModule wraps a description as the
optimizable parameter (signature.instructions). Forward pass returns a
binary "yes/no" tool-pick decision. HTML sentinel delimiters survive
GEPA wrapper rewrites and markdown horizontal rules. Untrusted-data
preamble guards against prompt injection in third-party tool registries.
load_tool_definition validates JSON schema + tool name regex.
- evolution/tools/dataset.py — ToolDatasetBuilder generates contrastive
synthetic data: positive tasks (where the tool fits) and negative tasks
(where it does not). Polarity is encoded in EvalExample.category so the
existing dataset/split infrastructure carries through. Rejects datasets
with no positives or no negatives — fitness needs contrast to be
informative.
- evolution/tools/fitness.py — Contrastive metric. Score = 1.0 for a
correct decision (yes on positive / no on negative), else 0.0. Returns
a GEPA-compatible (score, feedback) Prediction with rationale-aware
feedback so reflective optimization gets actionable signal (see the
sketch after this list).
- evolution/tools/evolve_tool.py — CLI mirroring evolve_skill.py:
--tool-def, --iterations, --optimizer-model, --eval-model, --use-minimax,
--dry-run, --create-pr, --output-dir, --api-base, --api-key. Reuses
ConstraintValidator (artifact_type="tool_description"), MAD confidence,
scrub_secrets, progress tracker, and the proposal-bundle pattern.
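As referenced in the fitness bullet above, a minimal sketch of the contrastive metric; the category and decision field names are assumptions, not verified against evolution/tools/fitness.py:

```python
import dspy


def tool_fitness_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Polarity is carried in EvalExample.category per the dataset bullet.
    expected = "yes" if getattr(gold, "category", "") == "positive" else "no"
    decision = str(getattr(pred, "decision", "")).strip().lower()
    score = 1.0 if decision == expected else 0.0
    feedback = (f"correct {decision!r} on a {getattr(gold, 'category', '?')} task"
                if score else f"picked {decision!r}, expected {expected!r}")
    return dspy.Prediction(score=score, feedback=feedback)
```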
Tool definition format is a small JSON file:
{"name": "search_files", "description": "...", "parameters": {...}}
Constraints enforced on evolved descriptions:
- max_tool_desc_size (default 500 chars; sent on every API call so
must stay small)
- max_prompt_growth (+20%) when compared to baseline
- non-empty
- prompt_injection scan
(size/growth already handled by ConstraintValidator from Phase 1.)
39 new tests in tests/tools/ covering tool_module, dataset, and fitness.
Full suite: 366 passed.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
Consolidates the focused, mergeable fixes from the open PR backlog into one tested integration branch.
This PR incorporates/adapts work from:
- upstream PRs pr/19, pr/21, pr/25, and pr/26 (see the integration commits above)
- the Claude Code importer rework that mines paired transcripts from `~/.claude/projects`

What changed

- Switches the declared dependency to `dspy[optuna]` and configures a default LiteLLM request timeout to avoid hung optimization runs.
- Runs structural validation on the full `SKILL.md`, while size and growth limits still apply to the mutable body.
- Updates the GEPA call from `max_steps` to the new `max_metric_calls` API.
- Adds MiniMax support via `EvolutionConfig.make_lm()` and CLI/provider plumbing.

Issues addressed
Closes #10
Closes #11
Closes #12
Closes #34
Closes #38
Partially addresses #33 by improving fitness signal, mutation behavior, dependency/runtime reliability, and no-op gating. Remaining #33 items should be split into follow-ups: benchmark gate, PR proposal/deployment path, and true E2E evolution test.
Helps with #37 by reducing the PR backlog into a single tested integration branch.
Test plan
Result locally: full suite green (366 passed as of the final integration commit).
Notes
I intentionally did not merge every open PR as-is. Several are duplicates, superseded, stacked, or too broad/generated-artifact-heavy. This branch takes the smallest coherent set that closes the core technical blockers and keeps the suite green.