
fix: ghost-improvement extraction bug + GEPA API + constraint validator + JSON robustness#39

Closed
steezkelly wants to merge 2 commits into NousResearch:main from steezkelly:fix/ghost-improvement-bug

Conversation

@steezkelly

Summary

Fixes 4 interconnected bugs discovered during Hermes skill evolution pipeline testing:

Bug 1: Ghost-improvement — skill body extraction losing 89% of content

Root cause: Extraction used \n\n---\n as separator between skill body and wrapper instructions. But skill bodies contain --- as section dividers (5+ times in systematic-debugging skill). Split happened at char 1,101, losing 8,919 of 10,020 chars.

Fix: Replace the separator with an HTML comment sentinel that will not occur in normal skill markdown.

Files: evolution/skills/skill_module.py, evolution/skills/evolve_skill.py
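A minimal sketch of the sentinel approach (the constant name and helper functions here are illustrative; the real sentinel lives in skill_module.py):

```python
# Hypothetical sentinel; the actual constant is defined in skill_module.py.
SENTINEL = "<!-- ___SKILL_EVOLUTION_SENTINEL___ -->"

def join_skill(body: str, wrapper: str) -> str:
    """Concatenate skill body and wrapper instructions around the sentinel."""
    return f"{body}\n{SENTINEL}\n{wrapper}"

def extract_body(combined: str) -> str:
    """Split on the sentinel; '---' horizontal rules inside the body survive."""
    return combined.split(SENTINEL, 1)[0].rstrip("\n")

body = "# Debugging\n\n---\n\nStep 1: reproduce.\n\n---\n\nStep 2: bisect."
combined = join_skill(body, "Follow the skill above.")
assert extract_body(combined) == body  # no truncation at the '---' dividers
```

With the old `\n\n---\n` separator, the same input would split at the first horizontal rule and drop everything after it.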

Bug 2: GEPA API conflict in DSPy 3.2.0

Root cause: max_metric_calls + auto="light" are mutually exclusive in DSPy 3.2.0 GEPA implementation.

Fix: Removed auto="light" from GEPA call. Added reflection_lm for proper meta-optimization.
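A hedged sketch of the argument constraint (the keyword names follow DSPy 3.2.0's GEPA signature; the guard function itself is illustrative, not code from this PR):

```python
def build_gepa_kwargs(max_metric_calls=None, auto=None, reflection_lm=None):
    """DSPy 3.2.0's GEPA rejects max_metric_calls combined with auto,
    so pass exactly one budget control and always wire a reflection LM."""
    if max_metric_calls is not None and auto is not None:
        raise ValueError("Pass either max_metric_calls or auto, not both")
    kwargs = {"reflection_lm": reflection_lm}
    if max_metric_calls is not None:
        kwargs["max_metric_calls"] = max_metric_calls
    else:
        kwargs["auto"] = auto
    return kwargs

# As in the fix: budget via max_metric_calls only, no auto="light".
kw = build_gepa_kwargs(max_metric_calls=200, reflection_lm="reflection-model")
assert "auto" not in kw
```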

Bug 3: Constraint validator only checked frontmatter

Root cause: _check_skill_structure validated YAML frontmatter fields but never checked whether the markdown body was substantive.

Fix: Added body validation requiring at least 2 of 3: headings, procedural content, substantial length. Also validates the full reassembled skill.
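A sketch of the 2-of-3 body check (the regexes and the length threshold are illustrative assumptions, not the repo's exact values):

```python
import re

def body_is_substantive(body: str) -> bool:
    """Require at least 2 of 3 signals: markdown headings, procedural
    content (numbered steps or bullets), and substantial length.
    The 500-char threshold is a placeholder, not the repo's value."""
    has_headings = bool(re.search(r"^#{1,6}\s+\S", body, re.MULTILINE))
    has_procedure = bool(re.search(r"^\s*(\d+\.|[-*])\s+\S", body, re.MULTILINE))
    is_substantial = len(body.strip()) >= 500
    return sum([has_headings, has_procedure, is_substantial]) >= 2

assert not body_is_substantive("")  # empty/trivial bodies fail
assert body_is_substantive("# Steps\n" + "1. Do the thing\n" * 40)
```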

Bug 4: JSON parser brittle under LLM malformation

Root cause: DSPy's dataset_builder used raw json.loads() which fails on common LLM output patterns (trailing commas, single quotes, markdown fences).

Fix: Added _try_parse_json() with 6 fallback strategies including ast.literal_eval, regex extraction, trailing-comma fixing, and markdown fence stripping.
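An illustrative fallback chain (the repo's `_try_parse_json()` has 6 strategies; this sketch shows four representative ones under assumed behavior):

```python
import ast
import json
import re

def try_parse_json(raw: str):
    """Parse LLM output that may not be strict JSON; return None on failure."""
    text = raw.strip()
    # Strategy: strip markdown fences like ```json ... ```
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    # Strategy: direct parse
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strategy: remove trailing commas before } or ]
    try:
        return json.loads(re.sub(r",\s*([}\]])", r"\1", text))
    except json.JSONDecodeError:
        pass
    # Strategy: Python literals (handles single-quoted dicts safely)
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return None

assert try_parse_json('{"a": 1,}') == {"a": 1}        # trailing comma
assert try_parse_json("{'a': 1}") == {"a": 1}         # single quotes
assert try_parse_json("```json\n[1, 2, 3]\n```") == [1, 2, 3]  # fenced
```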

Additional improvements

  • OPENROUTER_BASE_URL env var support for all DSPy LM initialization calls
  • max_skill_size increased from 15KB to 50KB for evolved skills with few-shot examples
  • run-evolution.sh helper script for provider selection (Nous/OpenRouter)
  • skill_fitness_metric fixed for 5-arg GEPA compatibility
  • GEPA fallback to MIPROv2 with num_threads=1 to avoid rate limits
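The base-URL support can be sketched as a small helper (the helper name is hypothetical; the env var and OpenRouter's default endpoint are as described):

```python
import os

def resolve_base_url() -> str:
    """Read the LM base URL from OPENROUTER_BASE_URL, defaulting to
    OpenRouter's public endpoint, so every LM init can be redirected
    to a proxy or alternate provider without code changes."""
    return os.environ.get("OPENROUTER_BASE_URL", "https://openrouter.ai/api/v1")

os.environ["OPENROUTER_BASE_URL"] = "http://localhost:8080/v1"
assert resolve_base_url() == "http://localhost:8080/v1"
```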

Testing: All changes are backward-compatible. Constraint validator additions only fail empty/trivial bodies.

Commits stacked:

…sResearch#24, NousResearch#26, NousResearch#35)

- PR NousResearch#24: skill_module.py stores skill body as InputField → signature.instructions
  - _load_skill_body() splits frontmatter from body, body becomes instruction
  - _extract_evolved_instructions() extracts from signature.instructions (not wrapper)
  - constraint_validator.py: body/frontmatter separation — validate body has substance
  - dataset_builder.py: robust JSON parsing with 6 fallback strategies

- PR NousResearch#26: GEPA wiring fix — reflection_lm passed to GEPA

- PR NousResearch#35: constraint validator for GEPA args, max_metric_calls not mixed with auto

Note: GEPA still falls back to MIPROv2 due to DSPy 3.2.0 API — max_metric_calls
conflicts with auto='light'. Use max_metric_calls alone (fixed).
…traint validator, JSON parsing robustness

Combined patch applying upstream PRs NousResearch#24/NousResearch#26/NousResearch#35:
- skill_module.py: embed skill body in signature instructions via HTML sentinel
- evolve_skill.py: HTML sentinel extraction with fallback, GEPA max_metric_calls fix, improved messaging
- constraints.py: validate YAML frontmatter + substantive body content separately
- dataset_builder.py: 6-strategy JSON parser for LLM output resilience
- sentinel collision: replaced \n\n---\n\n (appears in skill bodies) with <!-- ___SKILL_EVOLUTION_SENTINEL___ -->
innoscoutpro added a commit to innoscoutpro/hermes-agent-self-evolution that referenced this pull request Apr 27, 2026
…rmes_repo

Addresses security review findings C2, H1, H2, H5, M1.

evolution/core/external_importers.py
- SECRET_PATTERNS: add gho_/ghs_/ghr_, GitLab glpat-, all Slack token
  prefixes (xoxp/xoxa/xoxr/xoxs/xapp/xoxb), AWS ASIA, Google AIza,
  Stripe live/test variants (rk_/pk_), Twilio, SendGrid, Mailgun, JWT
  3-part, all-algo private-key headers, MINIMAX_API_KEY, REDIS_URL,
  HF_TOKEN. Generic api_key/secret/token/credential assignment patterns.
  Existing test cases (177) still pass — patterns relaxed where the test
  suite expected loose matching (short tokens, bare PRIVATE KEY).
- New scrub_secrets(text) helper for defence-in-depth scanning of
  outputs the model may have paraphrased into secret-shaped strings.
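A sketch of the scrub_secrets shape (the pattern list below is a small illustrative subset; the real SECRET_PATTERNS list is far larger and the exact regexes may differ):

```python
import re

# Illustrative subset of SECRET_PATTERNS.
SECRET_PATTERNS = [
    re.compile(r"\bgh[osr]_[A-Za-z0-9]{20,}\b"),        # GitHub gho_/ghs_/ghr_
    re.compile(r"\bglpat-[A-Za-z0-9_-]{20,}\b"),        # GitLab glpat-
    re.compile(r"\bxox[pbars]-[A-Za-z0-9-]{10,}\b"),    # Slack token prefixes
    re.compile(r"\b(?:sk|rk|pk)_(?:live|test)_[A-Za-z0-9]{10,}\b"),  # Stripe
]

def scrub_secrets(text: str) -> str:
    """Defence-in-depth pass: redact secret-shaped strings from outputs
    the model may have paraphrased rather than copied verbatim."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

assert scrub_secrets("token gho_" + "a" * 24) == "token [REDACTED]"
assert scrub_secrets("no secrets here") == "no secrets here"
```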

evolution/skills/skill_module.py
- find_skill rejects skill names containing path separators or shell
  metachars (^[A-Za-z0-9_.-]+$ guard) — closes ../traversal vector.
- find_skill resolves and refuses any SKILL.md whose real path lies
  outside the skills/ tree (symlink-escape protection, H5).
- Add SkillModule(treat_as_untrusted=True) preamble that tells the
  optimizer to treat skill body as DATA, not commands. Mitigates
  prompt-injection from third-party transcripts (C2).
- Switch body delimiter from "\n\n---\n" to HTML-comment sentinels
  (HERMES_SKILL_BODY_START/END) so bodies containing markdown horizontal
  rules survive extraction (forward-port of upstream PR NousResearch#39 idea).
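The name guard and symlink-escape check can be sketched as follows (a minimal illustration assuming the `^[A-Za-z0-9_.-]+$` guard from the commit; the function body is not the repo's exact code):

```python
import re
from pathlib import Path

SKILL_NAME_RE = re.compile(r"^[A-Za-z0-9_.-]+$")

def find_skill(skills_root: Path, name: str) -> Path:
    """Reject names with path separators or shell metachars, then refuse
    any SKILL.md whose resolved real path lies outside the skills tree
    (closes both ../traversal and symlink-escape vectors)."""
    if not SKILL_NAME_RE.match(name):
        raise ValueError(f"invalid skill name: {name!r}")
    candidate = (skills_root / name / "SKILL.md").resolve()
    if not candidate.is_relative_to(skills_root.resolve()):
        raise ValueError(f"skill path escapes skills tree: {candidate}")
    return candidate
```

Note that a bare `..` passes the character-class guard, which is why the resolved-path containment check is still needed as the second layer.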

evolution/core/constraints.py
- run_test_suite(hermes_repo) now resolves the path, then refuses to
  invoke pytest unless pyproject.toml + tests/ exist and pyproject
  references hermes-agent. Pytest auto-loads conftest.py, so pointing
  at an untrusted tree was equivalent to RCE (M1).
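The precondition can be sketched as a predicate (the helper name and the exact string check are assumptions; the requirements match the commit message):

```python
from pathlib import Path

def safe_to_run_pytest(hermes_repo: Path) -> bool:
    """Pytest auto-loads conftest.py, so invoking it against an untrusted
    tree is arbitrary code execution. Require a pyproject.toml that
    references hermes-agent plus a tests/ directory before running."""
    repo = hermes_repo.resolve()
    pyproject = repo / "pyproject.toml"
    if not pyproject.is_file() or not (repo / "tests").is_dir():
        return False
    return "hermes-agent" in pyproject.read_text(errors="ignore")
```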

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
innoscoutpro added a commit to innoscoutpro/hermes-agent-self-evolution that referenced this pull request Apr 27, 2026
Addresses upstream issue NousResearch#33 (C1: keyword-only metric) and forward-ports
the more polished pieces of upstream PR NousResearch#25 and PR NousResearch#39 partial.

evolution/core/fitness.py
- Replace conciseness dimension with completeness — judges should
  penalise omissions, not reward brevity. Composite weight now
  0.4 correctness + 0.3 procedure + 0.3 completeness.
- New init_fitness_metric(config, skill_text, use_llm_judge=True) /
  reset_fitness_metric() pair. When use_llm_judge=True, an LLMJudge
  with the completeness rubric is the primary scorer; the deterministic
  multi-signal scorer becomes the fallback. When False (default), the
  metric stays purely deterministic and zero-cost — appropriate for
  fast iteration and for runs the user doesn't want to send to a judge.
- skill_fitness_metric accepts the 5-arg GEPA signature
  (gold, pred, trace, pred_name, pred_trace) so it works with both
  GEPA and the legacy 3-arg metric API.
- Judge failures fall through to deterministic with a "[judge
  unavailable: <ExceptionClass>]" prefix in feedback so users can see
  why scores look heuristic mid-run.
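The dual-signature metric can be sketched like this (the `score` stub is a hypothetical stand-in for the deterministic multi-signal scorer; the parameter names and composite weights follow the commit):

```python
def score(gold, pred):
    """Hypothetical stand-in for the deterministic multi-signal scorer,
    returning (correctness, procedure, completeness) in [0, 1]."""
    exact = float(gold == pred)
    return exact, exact, exact

def skill_fitness_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Defaulted trailing parameters let this accept both GEPA's
    5-positional-arg call and the legacy 3-arg (gold, pred, trace) API.
    Composite weights: 0.4 correctness + 0.3 procedure + 0.3 completeness."""
    correctness, procedure, completeness = score(gold, pred)
    return 0.4 * correctness + 0.3 * procedure + 0.3 * completeness

assert abs(skill_fitness_metric("a", "a") - 1.0) < 1e-9      # legacy 3-arg
assert skill_fitness_metric("a", "b", None, "p", None) == 0.0  # GEPA 5-arg
```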

evolution/core/dataset_builder.py
- Replace inline 3-strategy JSON recovery with a 6-strategy
  _try_parse_json_list helper: direct json, ast.literal_eval (safer
  than eval, and it still parses Python-literal single-quoted dicts),
  array-extraction-then-parse, ast.literal_eval on extracted candidate,
  trailing-comma-and-quote-fix, markdown-fence stripping, and a
  last-resort per-block scan. Returns None instead of raising so the
  caller can produce a useful error.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@steezkelly
Author

Architecture/maintainability audit after current local GEPA pipeline validation:

I recommend not merging this PR in its current shape as one large stacked change. The root fixes are valid and still important, but the PR currently bundles several separable concerns plus generated output artifacts.

Observed split points:

  1. Ghost-improvement / skill body extraction

    • Core bug: extraction must read the actual optimizable skill body/instructions, not truncate at markdown --- separators or wrapper text.
    • This is the highest-value fix and should be a small focused PR with regression tests.
  2. Provider/model routing robustness

    • Preserve full provider model IDs such as minimax/minimax-m2.7 instead of collapsing to bare names.
    • This should be separate because it affects credentials/provider routing, not skill extraction.
  3. GEPA API compatibility

    • DSPy GEPA argument constraints (max_metric_calls vs auto, reflection LM wiring, fallback behavior) should be a separate compatibility PR.
  4. Constraint validator / JSON robustness

    • Body substance validation and robust JSON parsing are useful, but separable from the extraction bug.
  5. Generated output/** artifacts

    • These should not be part of the code-fix PR unless maintainers explicitly want golden fixtures. If needed as test fixtures, move to a small named fixture directory and minimize them.

Current recommendation:

  • Keep this PR open as the discovery/audit branch for now.
  • Open a new focused PR first for item 1: ghost-improvement extraction regression + fix, with generated outputs removed.
  • Then follow with provider routing and GEPA API compatibility PRs.
  • Once the split PRs exist, close or supersede this mega-PR.

Current local evidence: the GEPA pipeline is unblocked locally with full model IDs (for example minimax/minimax-m2.7) and evolved skill extraction reads the predictor docstring directly rather than a truncated wrapper. Runs take roughly 3–8 minutes per skill in the working path.

@steezkelly
Author

Closing this broad stacked PR as superseded for the ghost-improvement extraction fix by the focused PR #49.

The remaining useful pieces here should be split into separate, reviewable PRs rather than merged as one bundle:

  • provider/model routing and Nous API integration changes
  • GEPA API/threaded LM compatibility changes
  • constraint validator body-substance checks
  • JSON robustness fixes
  • generated output artifact cleanup

Keeping #38 open as the broader tracking issue; #49 now explicitly documents that it addresses only the ghost-improvement extraction portion.

@steezkelly steezkelly closed this May 5, 2026
@steezkelly steezkelly deleted the fix/ghost-improvement-bug branch May 5, 2026 17:49