feat: LLMJudge metric replaces keyword overlap + progress observability + usage-weighted picker #25
Open
errusch wants to merge 2 commits into NousResearch:main from
Conversation
Three fixes for bugs that prevented the pipeline from running:

1. dataset_builder: the LLM returns Python-style dicts (single quotes), not valid JSON. Added an ast.literal_eval fallback plus a trailing-comma fix so synthetic dataset generation doesn't crash on parse.
2. evolve_skill: the GEPA API changed in DSPy 3.1.3 — max_steps is now max_metric_calls. Fixed the call and added auto='light'.
3. constraints: _check_skill_structure was checking the skill BODY for YAML frontmatter, which it never has after splitting. Rewrote it to validate body structure (headings, procedural content, substance).

One architectural improvement:

4. skill_module: skill text was passed as an input field, so the optimizer could never mutate it. Restructured to embed the skill text in the instruction template via with_instructions(), so MIPROv2/GEPA can propose improved skill bodies (see the sketch below). Updated the extraction logic in evolve_skill.py to pull the evolved text from the compiled predictor's instruction.
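A minimal sketch of the point-4 restructuring, assuming a DSPy-style signature; the ApplySkill signature, its fields, and the helper names are illustrative, not the repo's actual module:

```python
import dspy

class ApplySkill(dspy.Signature):
    """Placeholder instructions; replaced via with_instructions() below."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

def build_skill_predictor(skill_text: str) -> dspy.Predict:
    # Embed the skill body in the instruction template rather than an
    # input field, so MIPROv2/GEPA can mutate it during optimization.
    return dspy.Predict(ApplySkill.with_instructions(skill_text))

def extract_evolved_skill(compiled: dspy.Predict) -> str:
    # After compilation, the (possibly improved) skill text lives in the
    # predictor's signature instructions, not in any example input.
    return compiled.signature.instructions
```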
… picker
Three improvements to the self-evolution pipeline:
1. Replace keyword-overlap fitness with LLM-as-judge scoring
- Conciseness dimension replaced with completeness (40% correctness,
30% procedure following, 30% completeness)
- Judge explicitly penalizes omissions of API refs, examples, edge cases
- Fallback heuristic also adds coverage penalty for terse outputs
- init_fitness_metric() wires the judge into the optimizer's metric fn
- Holdout eval uses LLMJudge with full dimension breakdowns
2. SQLite-backed progress tracking
- evolution/monitor/progress.py: start_run, log_event, complete_run, fail_run
- Every pipeline step now logs events (loading, dataset gen, optimization,
validation, holdout eval with progress, reporting)
- DB at ~/.hermes/evolution_progress.db, queryable via get_active_run()
- Integrates with Hermes dashboard Skills tab for live run visibility
3. Usage-weighted skill picker (pick_skill.py)
- Reads skill_usage.db (from Hermes agent instrumentation) to prioritize
most-used skills
- Never-evolved skills get priority; previously evolved sorted by staleness
- 24h cooldown on recently failed skills
- Supports batch selection: pick_skill.py -n 3
Tested: arxiv skill evolution with LLMJudge correctly rejected a compressed
variant that keyword-overlap had previously rewarded (+31.9% fake improvement).
innoscoutpro added a commit to innoscoutpro/hermes-agent-self-evolution that referenced this pull request on Apr 27, 2026
Addresses upstream issue NousResearch#33 (C1: keyword-only metric) and forward-ports the more polished pieces of upstream PR NousResearch#25 and parts of PR NousResearch#39.

evolution/core/fitness.py
- Replace the conciseness dimension with completeness — judges should penalise omissions, not reward brevity. Composite weight is now 0.4 correctness + 0.3 procedure + 0.3 completeness.
- New init_fitness_metric(config, skill_text, use_llm_judge=True) / reset_fitness_metric() pair. When use_llm_judge=True, an LLMJudge with the completeness rubric is the primary scorer and the deterministic multi-signal scorer becomes the fallback. When False (the default), the metric stays purely deterministic and zero-cost — appropriate for fast iteration and for runs the user doesn't want to send to a judge.
- skill_fitness_metric accepts the 5-arg GEPA signature (gold, pred, trace, pred_name, pred_trace), so it works with both GEPA and the legacy 3-arg metric API.
- Judge failures fall through to the deterministic scorer with a "[judge unavailable: <ExceptionClass>]" prefix in the feedback, so users can see why scores look heuristic mid-run.

evolution/core/dataset_builder.py
- Replace the inline 3-strategy JSON recovery with a 6-strategy _try_parse_json_list helper (sketched below): direct json, ast.literal_eval (safer than eval, and parses Python-literal single-quoted dicts), array extraction then parse, ast.literal_eval on the extracted candidate, a trailing-comma-and-quote fix, markdown-fence stripping, and a last-resort per-block scan. Returns None instead of raising so the caller can produce a useful error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
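A condensed sketch of such a recovery ladder; the fork's actual _try_parse_json_list may split or order the strategies differently, and the regexes here are illustrative assumptions:

```python
import ast
import json
import re

FENCE = "`" * 3  # markdown code-fence marker

def _try_parse_json_list(raw):
    """Best-effort recovery of a JSON list from LLM output.
    Returns None instead of raising so callers can report a useful error."""
    text = raw.strip()
    # Strip markdown fences (with or without a "json" tag) if present.
    if text.startswith(FENCE):
        text = re.sub(r"^%s(?:json)?\s*|\s*%s$" % (FENCE, FENCE), "", text)

    candidates = [text]
    # Also try the outermost [...] span, in case of surrounding prose.
    match = re.search(r"\[.*\]", text, re.DOTALL)
    if match:
        candidates.append(match.group(0))

    for cand in candidates:
        # Strategy: strict JSON first.
        try:
            parsed = json.loads(cand)
            if isinstance(parsed, list):
                return parsed
        except json.JSONDecodeError:
            pass
        # Strategy: Python literals (single-quoted dicts) via
        # ast.literal_eval, which is safe, unlike eval().
        try:
            parsed = ast.literal_eval(cand)
            if isinstance(parsed, list):
                return parsed
        except (ValueError, SyntaxError):
            pass
        # Strategy: drop trailing commas before ] or } and retry as JSON.
        try:
            parsed = json.loads(re.sub(r",\s*([\]}])", r"\1", cand))
            if isinstance(parsed, list):
                return parsed
        except json.JSONDecodeError:
            pass
    return None
```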
Problem
The keyword-overlap fitness metric (skill_fitness_metric) was gaming toward brevity. The optimizer would strip out API reference tables, curl examples, and detailed instructions because shorter outputs had higher keyword-density ratios. The arxiv skill "improved" +31.9% while actually losing all its useful content.

Additionally, evolution runs were completely opaque — no way to observe progress during a long-running optimization.
Changes
1. LLM-as-judge replaces keyword overlap (evolution/core/fitness.py). init_fitness_metric(config, skill_text) wires the judge into the optimizer metric function; see the first sketch below.
2. Progress observability (evolution/monitor/progress.py). start_run, log_event, complete_run, fail_run; DB at ~/.hermes/evolution_progress.db, queryable via get_active_run() and get_run_events(); see the second sketch below.
3. Usage-weighted skill picker (pick_skill.py). Reads skill_usage.db (from Hermes agent instrumentation) to prioritize most-used skills; supports batch selection via pick_skill.py -n 3; see the third sketch below.
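A minimal sketch of change 1. The module-level judge state and the judge_factory hook are assumptions standing in for however fitness.py actually constructs its LLMJudge; the weights and the 5-arg signature come from the PR:

```python
from typing import Any, Callable, Optional

# Composite weights from the PR: correctness 0.4, procedure 0.3, completeness 0.3.
WEIGHTS = {"correctness": 0.4, "procedure_following": 0.3, "completeness": 0.3}

_judge: Optional[Callable[[Any, Any], dict]] = None  # set by init_fitness_metric

def init_fitness_metric(config: dict, skill_text: str,
                        use_llm_judge: bool = True,
                        judge_factory: Optional[Callable] = None) -> None:
    # judge_factory is a hypothetical hook for building the LLMJudge.
    global _judge
    _judge = judge_factory(config, skill_text) if (use_llm_judge and judge_factory) else None

def _fallback_score(gold: Any, pred: Any) -> float:
    # Deterministic heuristic with a coverage penalty for terse outputs
    # (placeholder math; the real multi-signal scorer is richer).
    coverage = min(len(str(pred)) / max(len(str(gold)), 1), 1.0)
    return 0.5 * coverage

def skill_fitness_metric(gold, pred, trace=None,
                         pred_name=None, pred_trace=None) -> float:
    # 5-arg GEPA signature; trailing args are optional, so the legacy
    # 3-arg metric API keeps working.
    if _judge is None:
        return _fallback_score(gold, pred)
    try:
        dims = _judge(gold, pred)  # assumed to return the three dimension scores
    except Exception:
        # Judge failure falls through to the deterministic heuristic.
        return _fallback_score(gold, pred)
    return sum(w * dims[k] for k, w in WEIGHTS.items())
```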
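A sketch of the change-2 tracker; the function names and DB path come from the PR, while the table schema is an assumption:

```python
import sqlite3
import time
from pathlib import Path

DB_PATH = Path.home() / ".hermes" / "evolution_progress.db"

def _conn() -> sqlite3.Connection:
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(DB_PATH)
    conn.execute("""CREATE TABLE IF NOT EXISTS runs
        (id INTEGER PRIMARY KEY, skill TEXT, status TEXT,
         started REAL, finished REAL)""")
    conn.execute("""CREATE TABLE IF NOT EXISTS events
        (run_id INTEGER, ts REAL, step TEXT, detail TEXT)""")
    return conn

def start_run(skill: str) -> int:
    with _conn() as conn:
        cur = conn.execute(
            "INSERT INTO runs (skill, status, started) VALUES (?, 'running', ?)",
            (skill, time.time()))
        return cur.lastrowid

def log_event(run_id: int, step: str, detail: str = "") -> None:
    with _conn() as conn:
        conn.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
                     (run_id, time.time(), step, detail))

def _finish(run_id: int, status: str) -> None:
    with _conn() as conn:
        conn.execute("UPDATE runs SET status = ?, finished = ? WHERE id = ?",
                     (status, time.time(), run_id))

def complete_run(run_id: int) -> None:
    _finish(run_id, "done")

def fail_run(run_id: int) -> None:
    _finish(run_id, "failed")

def get_active_run():
    # Most recent still-running run, or None; feeds the dashboard Skills tab.
    with _conn() as conn:
        return conn.execute(
            "SELECT id, skill, started FROM runs "
            "WHERE status = 'running' ORDER BY started DESC").fetchone()
```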
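And a sketch of the change-3 selection policy; the skill_usage.db table layout and the history structure are assumptions, while the priority rules (never-evolved first, then staleness, 24h cooldown) come from the PR:

```python
import sqlite3
import time

COOLDOWN_S = 24 * 3600  # skip skills that failed evolution in the last 24h

def pick_skills(usage_db: str, history: dict, n: int = 1) -> list:
    # history maps skill -> {"last_evolved": ts or None, "last_failed": ts or None}.
    with sqlite3.connect(usage_db) as conn:
        rows = conn.execute(
            "SELECT skill, COUNT(*) FROM usage GROUP BY skill").fetchall()
    now = time.time()
    ranked = []
    for skill, uses in rows:
        h = history.get(skill, {})
        failed = h.get("last_failed")
        if failed and now - failed < COOLDOWN_S:
            continue  # cooldown on recently failed skills
        evolved = h.get("last_evolved")
        staleness = now - evolved if evolved else 0.0
        # Sort key: never-evolved first, then stalest, then most-used.
        ranked.append((evolved is not None, -staleness, -uses, skill))
    return [skill for *_, skill in sorted(ranked)[:n]]
```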
Testing

arxiv skill evolution with LLMJudge correctly rejected a compressed variant that keyword-overlap had previously rewarded (+31.9% fake improvement).

Files changed

- evolution/core/fitness.py — LLMJudge with completeness dimension, init_fitness_metric, coverage-penalized fallback
- evolution/monitor/progress.py — new SQLite progress tracker
- evolution/monitor/__init__.py — exports for progress module
- evolution/skills/evolve_skill.py — wires in judge init, progress events at every step, LLMJudge holdout eval, dimension breakdowns
- pick_skill.py — usage-weighted skill selector