
feat: LLMJudge metric replaces keyword overlap + progress observability + usage-weighted picker #25

Open
errusch wants to merge 2 commits into NousResearch:main from errusch:feat/llmjudge-metric-observability

Conversation


errusch commented Apr 15, 2026

Problem

The keyword-overlap fitness metric (skill_fitness_metric) was being gamed toward brevity. The optimizer would strip out API reference tables, curl examples, and detailed instructions because shorter outputs had higher keyword-density ratios. The arxiv skill "improved" +31.9% while actually losing all of its useful content.

Additionally, evolution runs were completely opaque — no way to observe progress during a long-running optimization.

Changes

1. LLM-as-judge replaces keyword overlap (evolution/core/fitness.py)

  • Conciseness → Completeness: The third scoring dimension now rewards thoroughness instead of brevity (40% correctness, 30% procedure following, 30% completeness)
  • Judge prompt explicitly instructs: "Do NOT reward brevity over thoroughness. A longer, detailed response that covers all necessary information is better than a terse one that skips important details."
  • init_fitness_metric(config, skill_text) wires the judge into the optimizer metric function
  • Holdout evaluation uses LLMJudge with full per-dimension breakdowns in the results table and metrics.json
  • Fallback heuristic adds a coverage penalty (outputs under 50% of expected length are scaled down); both the composite weighting and this fallback are sketched below
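
A minimal sketch of the composite and fallback. The 40/30/30 weights and the 50% threshold come from the list above; the function names and the linear shape of the penalty are assumptions, not the PR's actual fitness.py code:

```python
# Illustrative only -- names and penalty shape are assumed.
WEIGHTS = {"correctness": 0.40, "procedure": 0.30, "completeness": 0.30}

def composite_score(dims: dict[str, float]) -> float:
    """Combine per-dimension judge scores (each in [0, 1]) into one fitness value."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

def fallback_heuristic(output: str, expected: str, base_score: float) -> float:
    """Deterministic fallback: scale down outputs shorter than 50% of expected."""
    coverage = len(output) / max(len(expected), 1)
    if coverage < 0.5:
        base_score *= coverage / 0.5  # linear penalty below the threshold
    return base_score
```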

2. Progress observability (evolution/monitor/progress.py)

  • SQLite-backed run tracker: start_run, log_event, complete_run, fail_run
  • Every pipeline step logs events: loading, dataset generation, optimization, validation, holdout eval (with progress counter), reporting
  • DB at ~/.hermes/evolution_progress.db, queryable via get_active_run() and get_run_events()
  • Integrates with the Hermes dashboard Skills tab for live run visibility (a minimal tracker sketch follows this list)
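
A minimal sketch of such a tracker using only the standard library. The function names and DB path come from the list above; the schema and column names are assumptions:

```python
# Illustrative sketch -- schema and column names are assumed, not the PR's.
import sqlite3
import time
from pathlib import Path

DB_PATH = Path.home() / ".hermes" / "evolution_progress.db"

def _conn() -> sqlite3.Connection:
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS runs "
                 "(id INTEGER PRIMARY KEY, skill TEXT, status TEXT, started REAL, ended REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS events "
                 "(run_id INTEGER, step TEXT, detail TEXT, ts REAL)")
    return conn

def start_run(skill: str) -> int:
    with _conn() as c:
        return c.execute("INSERT INTO runs (skill, status, started) VALUES (?, 'running', ?)",
                         (skill, time.time())).lastrowid

def log_event(run_id: int, step: str, detail: str = "") -> None:
    with _conn() as c:
        c.execute("INSERT INTO events VALUES (?, ?, ?, ?)", (run_id, step, detail, time.time()))

def _finish(run_id: int, status: str) -> None:
    with _conn() as c:
        c.execute("UPDATE runs SET status = ?, ended = ? WHERE id = ?",
                  (status, time.time(), run_id))

def complete_run(run_id: int) -> None:
    _finish(run_id, "completed")

def fail_run(run_id: int) -> None:
    _finish(run_id, "failed")

def get_active_run():
    with _conn() as c:
        return c.execute("SELECT * FROM runs WHERE status = 'running' "
                         "ORDER BY started DESC LIMIT 1").fetchone()

def get_run_events(run_id: int):
    with _conn() as c:
        return c.execute("SELECT step, detail, ts FROM events WHERE run_id = ? ORDER BY ts",
                         (run_id,)).fetchall()
```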

3. Usage-weighted skill picker (pick_skill.py)

  • Reads skill_usage.db (from Hermes agent instrumentation) to prioritize most-used skills
  • Never-evolved skills get priority; previously evolved sorted by staleness
  • 24h cooldown on recently failed skills
  • Supports batch selection: pick_skill.py -n 3 (the selection policy is sketched below)
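
An illustrative sketch of that policy, assuming a hypothetical usage table in skill_usage.db and an in-memory evolution history; the actual table layout and history source are not shown in this PR:

```python
# Illustrative policy sketch -- the usage table and history shape are assumed.
import sqlite3
import time

COOLDOWN_S = 24 * 3600

def pick_skills(usage_db: str, history: dict, n: int = 1) -> list[str]:
    """history maps skill -> {'last_evolved': ts or None, 'last_failed': ts or None}."""
    rows = sqlite3.connect(usage_db).execute(
        "SELECT skill, COUNT(*) FROM usage GROUP BY skill").fetchall()
    now = time.time()
    candidates = []
    for skill, uses in rows:
        h = history.get(skill, {})
        if h.get("last_failed") and now - h["last_failed"] < COOLDOWN_S:
            continue  # 24h cooldown on recently failed skills
        evolved_before = h.get("last_evolved") is not None
        staleness = now - (h.get("last_evolved") or 0.0)
        # Never-evolved skills sort first; then stalest; usage count breaks ties.
        candidates.append((evolved_before, -staleness, -uses, skill))
    return [skill for *_, skill in sorted(candidates)[:n]]
```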

Testing

  • LLMJudge verified: a terse output scored 0.18 composite (completeness 0.10) versus 0.72 for a detailed output
  • Full pipeline test on the arxiv skill: the optimizer produced a compressed variant that LLMJudge correctly rejected (-0.060 composite), whereas keyword overlap had previously reported +31.9%

Files changed

  • evolution/core/fitness.py — LLMJudge with completeness dimension, init_fitness_metric, coverage-penalized fallback
  • evolution/monitor/progress.py — new SQLite progress tracker
  • evolution/monitor/__init__.py — exports for progress module
  • evolution/skills/evolve_skill.py — wires in judge init, progress events at every step, LLMJudge holdout eval, dimension breakdowns
  • pick_skill.py — usage-weighted skill selector

errusch added 2 commits April 14, 2026 09:09
Three fixes for bugs that prevented the pipeline from running:

1. dataset_builder: LLM returns Python-style dicts (single quotes), not
   valid JSON. Added ast.literal_eval fallback + trailing comma fix so
   synthetic dataset generation doesn't crash on parse.

2. evolve_skill: GEPA API changed in DSPy 3.1.3 — max_steps is now
   max_metric_calls. Fixed the call and added auto='light'.

3. constraints: _check_skill_structure was checking the skill BODY for
   YAML frontmatter, which it never has after splitting. Rewrote to
   validate body structure (headings, procedural content, substance).

One architectural improvement:

4. skill_module: Skill text was passed as an input field, so the
   optimizer could never mutate it. Restructured to embed skill text
   in the instruction template via with_instructions(), allowing
   MIPROv2/GEPA to propose improved skill bodies. Updated extraction
   logic in evolve_skill.py to pull evolved text from the compiled
   predictor's instruction.
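
A rough sketch of item 4, assuming DSPy's Signature.with_instructions() and a string-form signature; the actual field names in skill_module are guesses:

```python
# Rough sketch -- field names and module layout are assumptions.
import dspy

def build_skill_module(skill_text: str) -> dspy.Predict:
    # Embed the skill body in the instructions rather than an input field,
    # so MIPROv2/GEPA instruction proposals can rewrite the skill itself.
    sig = dspy.Signature("task -> response").with_instructions(skill_text)
    return dspy.Predict(sig)

def extract_evolved_skill(compiled: dspy.Predict) -> str:
    # After compilation, the (possibly mutated) skill text lives in the
    # compiled predictor's instructions.
    return compiled.signature.instructions
```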
… picker

Three improvements to the self-evolution pipeline:

1. Replace keyword-overlap fitness with LLM-as-judge scoring
   - Conciseness dimension replaced with completeness (40% correctness,
     30% procedure following, 30% completeness)
   - Judge explicitly penalizes omissions of API refs, examples, edge cases
   - Fallback heuristic also adds coverage penalty for terse outputs
   - init_fitness_metric() wires the judge into the optimizer's metric fn
   - Holdout eval uses LLMJudge with full dimension breakdowns

2. SQLite-backed progress tracking
   - evolution/monitor/progress.py: start_run, log_event, complete_run, fail_run
   - Every pipeline step now logs events (loading, dataset gen, optimization,
     validation, holdout eval with progress, reporting)
   - DB at ~/.hermes/evolution_progress.db, queryable via get_active_run()
   - Integrates with Hermes dashboard Skills tab for live run visibility

3. Usage-weighted skill picker (pick_skill.py)
   - Reads skill_usage.db (from Hermes agent instrumentation) to prioritize
     most-used skills
   - Never-evolved skills get priority; previously evolved sorted by staleness
   - 24h cooldown on recently failed skills
   - Supports batch selection: pick_skill.py -n 3

Tested: arxiv skill evolution, where LLMJudge correctly rejected a compressed
variant that keyword overlap had previously rewarded (+31.9% fake improvement).
innoscoutpro added a commit to innoscoutpro/hermes-agent-self-evolution that referenced this pull request Apr 27, 2026
Addresses upstream issue NousResearch#33 (C1: keyword-only metric) and forward-ports
the more polished pieces of upstream PR NousResearch#25 and parts of PR NousResearch#39.

evolution/core/fitness.py
- Replace conciseness dimension with completeness — judges should
  penalise omissions, not reward brevity. Composite weight now
  0.4 correctness + 0.3 procedure + 0.3 completeness.
- New init_fitness_metric(config, skill_text, use_llm_judge=True) /
  reset_fitness_metric() pair. When use_llm_judge=True, an LLMJudge
  with the completeness rubric is the primary scorer; the deterministic
  multi-signal scorer becomes the fallback. When False (default), the
  metric stays purely deterministic and zero-cost — appropriate for
  fast iteration and for runs the user doesn't want to send to a judge.
- skill_fitness_metric accepts the 5-arg GEPA signature
  (gold, pred, trace, pred_name, pred_trace) so it works with both
  GEPA and the legacy 3-arg metric API (see the sketch after this list).
- Judge failures fall through to deterministic with a "[judge
  unavailable: <ExceptionClass>]" prefix in feedback so users can see
  why scores look heuristic mid-run.
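
A sketch of that dual-signature metric and the judge fallthrough; _judge and _deterministic_score are hypothetical stand-ins for the module's LLMJudge instance and multi-signal scorer:

```python
# Sketch -- _judge and _deterministic_score stand in for the real scorers.
_judge = None  # set by init_fitness_metric(...) when use_llm_judge=True

def _deterministic_score(gold, pred, prefix: str = "") -> float:
    # Placeholder for the zero-cost multi-signal scorer.
    return 0.0

def skill_fitness_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Works with GEPA's 5-arg call and the legacy 3-arg metric API."""
    if _judge is None:
        return _deterministic_score(gold, pred)
    try:
        return _judge.score(gold, pred)
    except Exception as exc:
        # Judge failure: fall through to deterministic, with a visible marker
        # so users can see why scores look heuristic mid-run.
        return _deterministic_score(
            gold, pred, prefix=f"[judge unavailable: {type(exc).__name__}] ")
```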

evolution/core/dataset_builder.py
- Replace inline 3-strategy JSON recovery with a 6-strategy
  _try_parse_json_list helper: direct json, ast.literal_eval (safer
  than eval, but parses Python-literal single-quoted dicts),
  array-extraction-then-parse, ast.literal_eval on extracted candidate,
  trailing-comma-and-quote-fix, markdown-fence stripping, and a
  last-resort per-block scan. Returns None instead of raising so the
  caller can produce a useful error.
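
A condensed sketch of a helper in this spirit, covering fence stripping plus three of the strategies (direct JSON, ast.literal_eval, trailing-comma fix); the real helper's ordering and remaining strategies are not reproduced here:

```python
# Condensed sketch -- only a subset of the strategies; names are assumptions.
import ast
import json
import re

def _try_parse_json_list(text: str):
    """Return a list parsed from an LLM reply, or None if every strategy fails."""
    text = text.strip().strip("`")         # drop markdown fences
    text = re.sub(r"^json\s*", "", text)   # drop a leading fence language tag
    for parse in (json.loads, ast.literal_eval):  # strict JSON, then Python literals
        try:
            result = parse(text)
            if isinstance(result, list):
                return result
        except (ValueError, SyntaxError):
            pass
    try:  # remove trailing commas, then retry strict JSON
        result = json.loads(re.sub(r",\s*([\]}])", r"\1", text))
        return result if isinstance(result, list) else None
    except ValueError:
        return None
```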

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>