fix: ghost-improvement extraction bug + GEPA API + constraint validator + JSON robustness#39
steezkelly wants to merge 2 commits into NousResearch:main
Conversation
…sResearch#24, NousResearch#26, NousResearch#35)

- PR NousResearch#24: `skill_module.py` stores the skill body as an InputField → `signature.instructions`
  - `_load_skill_body()` splits frontmatter from body; the body becomes the instruction
  - `_extract_evolved_instructions()` extracts from `signature.instructions` (not the wrapper)
  - `constraint_validator.py`: body/frontmatter separation — validate that the body has substance
  - `dataset_builder.py`: robust JSON parsing with 6 fallback strategies
- PR NousResearch#26: GEPA wiring fix — `reflection_lm` passed to GEPA
- PR NousResearch#35: constraint validator for GEPA args; `max_metric_calls` not mixed with `auto`

Note: GEPA still falls back to MIPROv2 under the DSPy 3.2.0 API — `max_metric_calls` conflicts with `auto='light'`. Use `max_metric_calls` alone (fixed).
…traint validator, JSON parsing robustness

Combined patch applying upstream PRs NousResearch#24/NousResearch#26/NousResearch#35:

- `skill_module.py`: embed skill body in signature instructions via HTML sentinel
- `evolve_skill.py`: HTML sentinel extraction with fallback, GEPA `max_metric_calls` fix, improved messaging
- `constraints.py`: validate YAML frontmatter and substantive body content separately
- `dataset_builder.py`: 6-strategy JSON parser for LLM output resilience
- sentinel collision: replaced `\n\n---\n\n` (which appears in skill bodies) with `<!-- ___SKILL_EVOLUTION_SENTINEL___ -->`
…rmes_repo

Addresses security review findings C2, H1, H2, H5, M1.

evolution/core/external_importers.py
- `SECRET_PATTERNS`: add `gho_`/`ghs_`/`ghr_`, GitLab `glpat-`, all Slack token prefixes (`xoxp`/`xoxa`/`xoxr`/`xoxs`/`xapp`/`xoxb`), AWS `ASIA`, Google `AIza`, Stripe live/test variants (`rk_`/`pk_`), Twilio, SendGrid, Mailgun, 3-part JWTs, private-key headers for all algorithms, `MINIMAX_API_KEY`, `REDIS_URL`, `HF_TOKEN`, plus generic `api_key`/`secret`/`token`/`credential` assignment patterns. The existing 177 test cases still pass — patterns were relaxed where the test suite expected loose matching (short tokens, bare `PRIVATE KEY`).
- New `scrub_secrets(text)` helper for defence-in-depth scanning of outputs the model may have paraphrased into secret-shaped strings.

evolution/skills/skill_module.py
- `find_skill` rejects skill names containing path separators or shell metacharacters (`^[A-Za-z0-9_.-]+$` guard) — closes the `../` traversal vector.
- `find_skill` resolves and refuses any SKILL.md whose real path lies outside the `skills/` tree (symlink-escape protection, H5).
- Add a `SkillModule(treat_as_untrusted=True)` preamble that tells the optimizer to treat the skill body as DATA, not commands. Mitigates prompt injection from third-party transcripts (C2).
- Switch the body delimiter from `\n\n---\n` to HTML-comment sentinels (`HERMES_SKILL_BODY_START`/`END`) so bodies containing markdown horizontal rules survive extraction (forward-port of the upstream PR NousResearch#39 idea).

evolution/core/constraints.py
- `run_test_suite(hermes_repo)` now resolves the path, then refuses to invoke pytest unless `pyproject.toml` and `tests/` exist and the pyproject references `hermes-agent`. Pytest auto-loads `conftest.py`, so pointing it at an untrusted tree was equivalent to RCE (M1).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
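A minimal sketch of the `scrub_secrets` idea described in this commit. The patterns shown are an illustrative subset only — the real `SECRET_PATTERNS` list in `external_importers.py` is far more extensive, and these exact regexes are assumptions, not the shipped ones:

```python
import re

# Illustrative subset of secret-shaped token patterns (not the full list).
SECRET_PATTERNS = [
    re.compile(r"\bgh[osr]_[A-Za-z0-9]{36}\b"),      # GitHub gho_/ghs_/ghr_ tokens
    re.compile(r"\bglpat-[A-Za-z0-9_-]{20,}\b"),     # GitLab personal access tokens
    re.compile(r"\bxox[pbars]-[A-Za-z0-9-]{10,}\b"), # Slack bot/user tokens
    re.compile(r"\bAIza[A-Za-z0-9_-]{35}\b"),        # Google API keys
]

def scrub_secrets(text: str) -> str:
    """Defence-in-depth pass: redact secret-shaped strings that the model
    may have paraphrased into its output."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

The design point is that scrubbing runs on model *output*, not just imported transcripts, so a secret that survives import-time filtering in paraphrased form is still caught before it is persisted.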
Addresses upstream issue NousResearch#33 (C1: keyword-only metric) and forward-ports the more polished pieces of upstream PR NousResearch#25 and parts of PR NousResearch#39.

evolution/core/fitness.py
- Replace the conciseness dimension with completeness — judges should penalise omissions, not reward brevity. The composite weight is now 0.4 correctness + 0.3 procedure + 0.3 completeness.
- New `init_fitness_metric(config, skill_text, use_llm_judge=True)` / `reset_fitness_metric()` pair. When `use_llm_judge=True`, an LLMJudge with the completeness rubric is the primary scorer and the deterministic multi-signal scorer becomes the fallback. When `False` (the default), the metric stays purely deterministic and zero-cost — appropriate for fast iteration and for runs the user doesn't want to send to a judge.
- `skill_fitness_metric` accepts the 5-arg GEPA signature (`gold, pred, trace, pred_name, pred_trace`) so it works with both GEPA and the legacy 3-arg metric API.
- Judge failures fall through to the deterministic scorer with a `[judge unavailable: <ExceptionClass>]` prefix in the feedback so users can see why scores look heuristic mid-run.

evolution/core/dataset_builder.py
- Replace the inline 3-strategy JSON recovery with a 6-strategy `_try_parse_json_list` helper: direct `json`, `ast.literal_eval` (safer than `eval`, and parses Python-literal single-quoted dicts), array extraction then parse, `ast.literal_eval` on the extracted candidate, trailing-comma-and-quote fixing, markdown-fence stripping, and a last-resort per-block scan. Returns `None` instead of raising so the caller can produce a useful error.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
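The dual-signature compatibility described for `skill_fitness_metric` can be sketched as follows. The function name matches the commit; the scoring body is a stub for illustration — the real metric is multi-signal:

```python
from types import SimpleNamespace

def skill_fitness_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Compatible with both the legacy 3-arg call metric(gold, pred, trace)
    and GEPA's 5-arg call metric(gold, pred, trace, pred_name, pred_trace).
    Defaulting the extra positional args makes both arities valid."""
    # Stub scoring: exact-match on an 'answer' attribute (illustrative only).
    return 1.0 if getattr(pred, "answer", None) == getattr(gold, "answer", None) else 0.0

gold = SimpleNamespace(answer="use bisection")
pred = SimpleNamespace(answer="use bisection")
assert skill_fitness_metric(gold, pred) == 1.0                      # legacy 3-arg style
assert skill_fitness_metric(gold, pred, None, "step", None) == 1.0  # GEPA 5-arg style
```

Defaulted trailing parameters are the simplest way to satisfy both callers without inspecting arity at runtime.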
Architecture/maintainability audit after validating the current local GEPA pipeline: I recommend not merging this PR in its current shape as one large stacked change. The root fixes are valid and still important, but the PR currently bundles several separable concerns plus generated output artifacts. Observed split points:
Current recommendation:
Current local evidence: the GEPA pipeline is unblocked locally with full model IDs (for example …
Closing this broad stacked PR: its ghost-improvement extraction fix is superseded by the focused PR #49. The remaining useful pieces here should be split into separate, reviewable PRs rather than merged as one bundle:
Keeping #38 open as the broader tracking issue; #49 now explicitly documents that it addresses only the ghost-improvement extraction portion.
Summary
Fixes 4 interconnected bugs discovered during Hermes skill evolution pipeline testing:
Bug 1: Ghost-improvement — skill body extraction losing 89% of content
Root cause: Extraction used `\n\n---\n` as the separator between the skill body and the wrapper instructions. But skill bodies contain `---` as section dividers (5+ times in the `systematic-debugging` skill), so the split happened at char 1,101, losing 8,919 of 10,020 chars.

Fix: Replace the separator with an HTML comment sentinel that can never appear in rendered markdown.
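A minimal sketch of the sentinel-based embed/extract round trip for Bug 1. The sentinel string is the one named in the commit messages; the helper names are illustrative, not the PR's exact functions:

```python
# HTML comments never appear in rendered markdown bodies, so this string
# cannot collide with a skill's own '---' section dividers.
SENTINEL = "<!-- ___SKILL_EVOLUTION_SENTINEL___ -->"

def embed_skill_body(body: str, wrapper: str) -> str:
    """Join the skill body and wrapper instructions with the sentinel."""
    return f"{body}\n{SENTINEL}\n{wrapper}"

def extract_skill_body(instructions: str) -> str:
    """Split on the sentinel; fall back to the whole text if the
    optimizer rewrote the sentinel away."""
    if SENTINEL in instructions:
        return instructions.split(SENTINEL, 1)[0].rstrip()
    return instructions

# A body full of '---' dividers survives the round trip intact.
body = "# Skill\n\n---\n\nStep 1\n\n---\n\nStep 2"
combined = embed_skill_body(body, "Wrapper instructions")
assert extract_skill_body(combined) == body
```

Splitting on `\n\n---\n` instead would cut this example body at the first divider, which is exactly the 89%-loss failure mode described above.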
Files: `evolution/skills/skill_module.py`, `evolution/skills/evolve_skill.py`

Bug 2: GEPA API conflict in DSPy 3.2.0
Root cause: `max_metric_calls` and `auto="light"` are mutually exclusive in the DSPy 3.2.0 GEPA implementation.

Fix: Removed `auto="light"` from the GEPA call. Added `reflection_lm` for proper meta-optimization.

Bug 3: Constraint validator only checked frontmatter
Root cause: `_check_skill_structure` validated the YAML frontmatter fields but never checked whether the markdown body was substantive.

Fix: Added body validation requiring at least 2 of 3 signals: headings, procedural content, substantial length. Also validates the full reassembled skill.
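The 2-of-3 body check for Bug 3 might look like the following sketch. The thresholds and patterns here are assumptions for illustration, not the PR's exact values:

```python
import re

def body_is_substantive(body: str) -> bool:
    """Require at least 2 of 3 signals: markdown headings, procedural
    content (numbered steps or bullets), and substantial length.
    Thresholds and regexes are illustrative."""
    has_headings = bool(re.search(r"^#{1,6} ", body, re.MULTILINE))
    has_procedure = bool(re.search(r"^(\d+\.|[-*]) ", body, re.MULTILINE))
    is_substantial = len(body.strip()) >= 500
    return sum([has_headings, has_procedure, is_substantial]) >= 2
```

Requiring only 2 of 3 keeps the validator lenient toward legitimate short-but-structured skills while still rejecting empty or trivial bodies.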
Bug 4: JSON parser brittle under LLM malformation
Root cause: DSPy's `dataset_builder` used raw `json.loads()`, which fails on common LLM output patterns (trailing commas, single quotes, markdown fences).

Fix: Added `_try_parse_json()` with 6 fallback strategies including `ast.literal_eval`, regex extraction, trailing-comma fixing, and markdown fence stripping.

Additional improvements
- `OPENROUTER_BASE_URL` env var support for all DSPy LM initialization calls
- `max_skill_size` increased from 15KB to 50KB for evolved skills with few-shot examples
- `run-evolution.sh` helper script for provider selection (Nous/OpenRouter)
- `skill_fitness_metric` fixed for 5-arg GEPA compatibility
- `num_threads=1` to avoid rate limits

Testing: All changes are backward-compatible. Constraint validator additions only fail empty/trivial bodies.
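The layered JSON recovery described under Bug 4 can be sketched as a fallback chain. The helper name follows the PR; the internal ordering and regexes are assumptions condensed for illustration:

```python
import ast
import json
import re

def try_parse_json_list(text):
    """Layered recovery for LLM-emitted JSON lists. Returns the parsed
    value, or None instead of raising so the caller can produce a
    useful error (per the PR description). Internals are a sketch."""
    # Strip markdown fences first so later strategies see bare text.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    # Strategy: direct JSON.
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Strategy: Python-literal parsing handles single-quoted dicts
    # and trailing commas, without the dangers of eval().
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        pass
    # Strategy: extract the outermost [...] block, fix trailing commas,
    # then retry both parsers on the candidate.
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if match:
        candidate = re.sub(r",\s*([\]}])", r"\1", match.group(0))
        for parser in (json.loads, ast.literal_eval):
            try:
                return parser(candidate)
            except Exception:
                pass
    return None
```

Returning `None` rather than raising keeps the failure visible at the call site, where the builder can report which example could not be recovered.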
Commits stacked:
- `7306d82` Apply merged upstream PRs: skill text as optimizable instruction (fix: runtime bugs + make skill text optimizable by DSPy #24; Real session data evolution: fix GEPA + expand sessiondb filter #26; fix: correct GEPA arg, constraint input, and rate limit handling in evolve_skill #35)
- `4844a40` fix: ghost-improvement bug — HTML sentinel extraction, GEPA API, constraint validator, JSON parsing robustness