fix: ghost-improvement extraction bug + GEPA API + constraint validator + JSON robustness#39
steezkelly wants to merge 2 commits into NousResearch:main
Conversation
…sResearch#24, NousResearch#26, NousResearch#35)

- PR NousResearch#24: `skill_module.py` stores the skill body as an InputField → `signature.instructions`
  - `_load_skill_body()` splits frontmatter from body; the body becomes the instruction
  - `_extract_evolved_instructions()` extracts from `signature.instructions` (not the wrapper)
  - `constraint_validator.py`: body/frontmatter separation — validate that the body has substance
  - `dataset_builder.py`: robust JSON parsing with 6 fallback strategies
- PR NousResearch#26: GEPA wiring fix — `reflection_lm` passed to GEPA
- PR NousResearch#35: constraint validator for GEPA args; `max_metric_calls` not mixed with `auto`

Note: GEPA still falls back to MIPROv2 under the DSPy 3.2.0 API — `max_metric_calls` conflicts with `auto='light'`. Use `max_metric_calls` alone (fixed).
…traint validator, JSON parsing robustness

Combined patch applying upstream PRs NousResearch#24/NousResearch#26/NousResearch#35:

- `skill_module.py`: embed skill body in signature instructions via HTML sentinel
- `evolve_skill.py`: HTML sentinel extraction with fallback, GEPA `max_metric_calls` fix, improved messaging
- `constraints.py`: validate YAML frontmatter and substantive body content separately
- `dataset_builder.py`: 6-strategy JSON parser for LLM output resilience
- sentinel collision: replaced `\n\n---\n\n` (which appears in skill bodies) with `<!-- ___SKILL_EVOLUTION_SENTINEL___ -->`
…rmes_repo

Addresses security review findings C2, H1, H2, H5, M1.

evolution/core/external_importers.py
- `SECRET_PATTERNS`: add `gho_`/`ghs_`/`ghr_`, GitLab `glpat-`, all Slack token prefixes (`xoxp`/`xoxa`/`xoxr`/`xoxs`/`xapp`/`xoxb`), AWS `ASIA`, Google `AIza`, Stripe live/test variants (`rk_`/`pk_`), Twilio, SendGrid, Mailgun, 3-part JWTs, private-key headers for all algorithms, `MINIMAX_API_KEY`, `REDIS_URL`, `HF_TOKEN`, plus generic `api_key`/`secret`/`token`/`credential` assignment patterns. The existing 177 test cases still pass — patterns were relaxed where the test suite expected loose matching (short tokens, bare `PRIVATE KEY`).
- New `scrub_secrets(text)` helper for defence-in-depth scanning of outputs the model may have paraphrased into secret-shaped strings.

evolution/skills/skill_module.py
- `find_skill` rejects skill names containing path separators or shell metacharacters (`^[A-Za-z0-9_.-]+$` guard) — closes the `../` traversal vector.
- `find_skill` resolves and refuses any SKILL.md whose real path lies outside the `skills/` tree (symlink-escape protection, H5).
- Add a `SkillModule(treat_as_untrusted=True)` preamble that tells the optimizer to treat the skill body as DATA, not commands. Mitigates prompt injection from third-party transcripts (C2).
- Switch the body delimiter from `\n\n---\n` to HTML-comment sentinels (`HERMES_SKILL_BODY_START`/`END`) so bodies containing markdown horizontal rules survive extraction (forward-port of the upstream PR NousResearch#39 idea).

evolution/core/constraints.py
- `run_test_suite(hermes_repo)` now resolves the path, then refuses to invoke pytest unless `pyproject.toml` and `tests/` exist and the pyproject references `hermes-agent`. Pytest auto-loads `conftest.py`, so pointing it at an untrusted tree was equivalent to RCE (M1).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
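A minimal sketch of the `scrub_secrets` idea described in this commit. The patterns shown are an illustrative subset only — the real `SECRET_PATTERNS` list in `external_importers.py` is far more extensive, and these exact regexes are assumptions, not the shipped ones:

```python
import re

# Illustrative subset of secret-shaped token patterns (not the full list).
SECRET_PATTERNS = [
    re.compile(r"\bgh[osr]_[A-Za-z0-9]{36}\b"),      # GitHub gho_/ghs_/ghr_ tokens
    re.compile(r"\bglpat-[A-Za-z0-9_-]{20,}\b"),     # GitLab personal access tokens
    re.compile(r"\bxox[pbars]-[A-Za-z0-9-]{10,}\b"), # Slack bot/user tokens
    re.compile(r"\bAIza[A-Za-z0-9_-]{35}\b"),        # Google API keys
]

def scrub_secrets(text: str) -> str:
    """Defence-in-depth pass: redact secret-shaped strings that the model
    may have paraphrased into its output."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

The design point is that scrubbing runs on model *output*, not just imported transcripts, so a secret that survives import-time filtering in paraphrased form is still caught before it is persisted.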
Addresses upstream issue NousResearch#33 (C1: keyword-only metric) and forward-ports the more polished pieces of upstream PR NousResearch#25 and parts of PR NousResearch#39.

evolution/core/fitness.py
- Replace the conciseness dimension with completeness — judges should penalise omissions, not reward brevity. The composite weight is now 0.4 correctness + 0.3 procedure + 0.3 completeness.
- New `init_fitness_metric(config, skill_text, use_llm_judge=True)` / `reset_fitness_metric()` pair. When `use_llm_judge=True`, an LLMJudge with the completeness rubric is the primary scorer and the deterministic multi-signal scorer becomes the fallback. When `False` (the default), the metric stays purely deterministic and zero-cost — appropriate for fast iteration and for runs the user doesn't want to send to a judge.
- `skill_fitness_metric` accepts the 5-arg GEPA signature (`gold, pred, trace, pred_name, pred_trace`) so it works with both GEPA and the legacy 3-arg metric API.
- Judge failures fall through to the deterministic scorer with a `[judge unavailable: <ExceptionClass>]` prefix in the feedback so users can see why scores look heuristic mid-run.

evolution/core/dataset_builder.py
- Replace the inline 3-strategy JSON recovery with a 6-strategy `_try_parse_json_list` helper: direct `json`, `ast.literal_eval` (safer than `eval`, and parses Python-literal single-quoted dicts), array extraction then parse, `ast.literal_eval` on the extracted candidate, trailing-comma-and-quote fixing, markdown-fence stripping, and a last-resort per-block scan. Returns `None` instead of raising so the caller can produce a useful error.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
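The dual-signature compatibility described for `skill_fitness_metric` can be sketched as follows. The function name matches the commit; the scoring body is a stub for illustration — the real metric is multi-signal:

```python
from types import SimpleNamespace

def skill_fitness_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Compatible with both the legacy 3-arg call metric(gold, pred, trace)
    and GEPA's 5-arg call metric(gold, pred, trace, pred_name, pred_trace).
    Defaulting the extra positional args makes both arities valid."""
    # Stub scoring: exact-match on an 'answer' attribute (illustrative only).
    return 1.0 if getattr(pred, "answer", None) == getattr(gold, "answer", None) else 0.0

gold = SimpleNamespace(answer="use bisection")
pred = SimpleNamespace(answer="use bisection")
assert skill_fitness_metric(gold, pred) == 1.0                      # legacy 3-arg style
assert skill_fitness_metric(gold, pred, None, "step", None) == 1.0  # GEPA 5-arg style
```

Defaulted trailing parameters are the simplest way to satisfy both callers without inspecting arity at runtime.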
Architecture/maintainability audit after validating the current local GEPA pipeline: I recommend not merging this PR in its current shape as one large stacked change. The root fixes are valid and still important, but the PR currently bundles several separable concerns plus generated output artifacts. Observed split points:
Current recommendation:
Current local evidence: the GEPA pipeline is unblocked locally with full model IDs (for example …
Closing this broad stacked PR: its ghost-improvement extraction fix is superseded by the focused PR #49. The remaining useful pieces here should be split into separate, reviewable PRs rather than merged as one bundle:
Keeping #38 open as the broader tracking issue; #49 now explicitly documents that it addresses only the ghost-improvement extraction portion.
Summary
Fixes 4 interconnected bugs discovered during Hermes skill evolution pipeline testing:
Bug 1: Ghost-improvement — skill body extraction losing 89% of content
Root cause: Extraction used `\n\n---\n` as the separator between the skill body and the wrapper instructions. But skill bodies contain `---` as section dividers (5+ times in the `systematic-debugging` skill), so the split happened at char 1,101, losing 8,919 of 10,020 chars.

Fix: Replace the separator with an HTML comment sentinel that can never appear in rendered markdown.
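A minimal sketch of the sentinel-based embed/extract round trip for Bug 1. The sentinel string is the one named in the commit messages; the helper names are illustrative, not the PR's exact functions:

```python
# HTML comments never appear in rendered markdown bodies, so this string
# cannot collide with a skill's own '---' section dividers.
SENTINEL = "<!-- ___SKILL_EVOLUTION_SENTINEL___ -->"

def embed_skill_body(body: str, wrapper: str) -> str:
    """Join the skill body and wrapper instructions with the sentinel."""
    return f"{body}\n{SENTINEL}\n{wrapper}"

def extract_skill_body(instructions: str) -> str:
    """Split on the sentinel; fall back to the whole text if the
    optimizer rewrote the sentinel away."""
    if SENTINEL in instructions:
        return instructions.split(SENTINEL, 1)[0].rstrip()
    return instructions

# A body full of '---' dividers survives the round trip intact.
body = "# Skill\n\n---\n\nStep 1\n\n---\n\nStep 2"
combined = embed_skill_body(body, "Wrapper instructions")
assert extract_skill_body(combined) == body
```

Splitting on `\n\n---\n` instead would cut this example body at the first divider, which is exactly the 89%-loss failure mode described above.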
Files: `evolution/skills/skill_module.py`, `evolution/skills/evolve_skill.py`

Bug 2: GEPA API conflict in DSPy 3.2.0
Root cause: `max_metric_calls` and `auto="light"` are mutually exclusive in the DSPy 3.2.0 GEPA implementation.

Fix: Removed `auto="light"` from the GEPA call. Added `reflection_lm` for proper meta-optimization.

Bug 3: Constraint validator only checked frontmatter
Root cause: `_check_skill_structure` validated the YAML frontmatter fields but never checked whether the markdown body was substantive.

Fix: Added body validation requiring at least 2 of 3 signals: headings, procedural content, substantial length. Also validates the full reassembled skill.
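The 2-of-3 body check for Bug 3 might look like the following sketch. The thresholds and patterns here are assumptions for illustration, not the PR's exact values:

```python
import re

def body_is_substantive(body: str) -> bool:
    """Require at least 2 of 3 signals: markdown headings, procedural
    content (numbered steps or bullets), and substantial length.
    Thresholds and regexes are illustrative."""
    has_headings = bool(re.search(r"^#{1,6} ", body, re.MULTILINE))
    has_procedure = bool(re.search(r"^(\d+\.|[-*]) ", body, re.MULTILINE))
    is_substantial = len(body.strip()) >= 500
    return sum([has_headings, has_procedure, is_substantial]) >= 2
```

Requiring only 2 of 3 keeps the validator lenient toward legitimate short-but-structured skills while still rejecting empty or trivial bodies.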
Bug 4: JSON parser brittle under LLM malformation
Root cause: DSPy's `dataset_builder` used raw `json.loads()`, which fails on common LLM output patterns (trailing commas, single quotes, markdown fences).

Fix: Added `_try_parse_json()` with 6 fallback strategies including `ast.literal_eval`, regex extraction, trailing-comma fixing, and markdown fence stripping.

Additional improvements
- `OPENROUTER_BASE_URL` env var support for all DSPy LM initialization calls
- `max_skill_size` increased from 15KB to 50KB for evolved skills with few-shot examples
- `run-evolution.sh` helper script for provider selection (Nous/OpenRouter)
- `skill_fitness_metric` fixed for 5-arg GEPA compatibility
- `num_threads=1` to avoid rate limits

Testing: All changes are backward-compatible. Constraint validator additions only fail empty/trivial bodies.
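The layered JSON recovery described under Bug 4 can be sketched as a fallback chain. The helper name follows the PR; the internal ordering and regexes are assumptions condensed for illustration:

```python
import ast
import json
import re

def try_parse_json_list(text):
    """Layered recovery for LLM-emitted JSON lists. Returns the parsed
    value, or None instead of raising so the caller can produce a
    useful error (per the PR description). Internals are a sketch."""
    # Strip markdown fences first so later strategies see bare text.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    # Strategy: direct JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strategy: Python-literal parsing handles single-quoted dicts
    # and trailing commas, without the dangers of eval().
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        pass
    # Strategy: extract the outermost [...] block, fix trailing commas,
    # then retry both parsers on the candidate.
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if match:
        candidate = re.sub(r",\s*([\]}])", r"\1", match.group(0))
        for parser in (json.loads, ast.literal_eval):
            try:
                return parser(candidate)
            except Exception:
                pass
    return None
```

Returning `None` rather than raising keeps the failure visible at the call site, where the builder can report which example could not be recovered.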
Commits stacked:
- `7306d82` Apply merged upstream PRs: skill text as optimizable instruction (fix: runtime bugs + make skill text optimizable by DSPy #24; Real session data evolution: fix GEPA + expand sessiondb filter #26; fix: correct GEPA arg, constraint input, and rate limit handling in evolve_skill #35)
- `4844a40` fix: ghost-improvement bug — HTML sentinel extraction, GEPA API, constraint validator, JSON parsing robustness