
feat: LLMJudge metric replaces keyword overlap + progress observability + usage-weighted picker #25

Open
errusch wants to merge 2 commits into NousResearch:main from errusch:feat/llmjudge-metric-observability

Conversation


errusch commented Apr 15, 2026

Problem

The keyword-overlap fitness metric (skill_fitness_metric) was being gamed toward brevity. The optimizer would strip out API reference tables, curl examples, and detailed instructions because shorter outputs had higher keyword-density ratios. The arxiv skill "improved" +31.9% while actually losing all of its useful content.

Additionally, evolution runs were completely opaque — no way to observe progress during a long-running optimization.

Changes

1. LLM-as-judge replaces keyword overlap (evolution/core/fitness.py)

  • Conciseness → Completeness: The third scoring dimension now rewards thoroughness instead of brevity (40% correctness, 30% procedure following, 30% completeness)
  • Judge prompt explicitly instructs: "Do NOT reward brevity over thoroughness. A longer, detailed response that covers all necessary information is better than a terse one that skips important details."
  • init_fitness_metric(config, skill_text) wires the judge into the optimizer metric function
  • Holdout evaluation uses LLMJudge with full per-dimension breakdowns in the results table and metrics.json
  • Fallback heuristic adds a coverage penalty (outputs under 50% of expected length are scaled down); both the composite weighting and this fallback are sketched below
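
A minimal sketch of the composite and fallback. The 40/30/30 weights and the 50% threshold come from the list above; the function names and the linear shape of the penalty are assumptions, not the PR's actual fitness.py code:

```python
# Illustrative only -- names and penalty shape are assumed.
WEIGHTS = {"correctness": 0.40, "procedure": 0.30, "completeness": 0.30}

def composite_score(dims: dict[str, float]) -> float:
    """Combine per-dimension judge scores (each in [0, 1]) into one fitness value."""
    return sum(WEIGHTS[name] * dims[name] for name in WEIGHTS)

def fallback_heuristic(output: str, expected: str, base_score: float) -> float:
    """Deterministic fallback: scale down outputs shorter than 50% of expected."""
    coverage = len(output) / max(len(expected), 1)
    if coverage < 0.5:
        base_score *= coverage / 0.5  # linear penalty below the threshold
    return base_score
```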

2. Progress observability (evolution/monitor/progress.py)

  • SQLite-backed run tracker: start_run, log_event, complete_run, fail_run
  • Every pipeline step logs events: loading, dataset generation, optimization, validation, holdout eval (with progress counter), reporting
  • DB at ~/.hermes/evolution_progress.db, queryable via get_active_run() and get_run_events()
  • Integrates with the Hermes dashboard Skills tab for live run visibility (a minimal tracker sketch follows this list)
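
A minimal sketch of such a tracker using only the standard library. The function names and DB path come from the list above; the schema and column names are assumptions:

```python
# Illustrative sketch -- schema and column names are assumed, not the PR's.
import sqlite3
import time
from pathlib import Path

DB_PATH = Path.home() / ".hermes" / "evolution_progress.db"

def _conn() -> sqlite3.Connection:
    DB_PATH.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(DB_PATH)
    conn.execute("CREATE TABLE IF NOT EXISTS runs "
                 "(id INTEGER PRIMARY KEY, skill TEXT, status TEXT, started REAL, ended REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS events "
                 "(run_id INTEGER, step TEXT, detail TEXT, ts REAL)")
    return conn

def start_run(skill: str) -> int:
    with _conn() as c:
        return c.execute("INSERT INTO runs (skill, status, started) VALUES (?, 'running', ?)",
                         (skill, time.time())).lastrowid

def log_event(run_id: int, step: str, detail: str = "") -> None:
    with _conn() as c:
        c.execute("INSERT INTO events VALUES (?, ?, ?, ?)", (run_id, step, detail, time.time()))

def _finish(run_id: int, status: str) -> None:
    with _conn() as c:
        c.execute("UPDATE runs SET status = ?, ended = ? WHERE id = ?",
                  (status, time.time(), run_id))

def complete_run(run_id: int) -> None:
    _finish(run_id, "completed")

def fail_run(run_id: int) -> None:
    _finish(run_id, "failed")

def get_active_run():
    with _conn() as c:
        return c.execute("SELECT * FROM runs WHERE status = 'running' "
                         "ORDER BY started DESC LIMIT 1").fetchone()

def get_run_events(run_id: int):
    with _conn() as c:
        return c.execute("SELECT step, detail, ts FROM events WHERE run_id = ? ORDER BY ts",
                         (run_id,)).fetchall()
```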

3. Usage-weighted skill picker (pick_skill.py)

  • Reads skill_usage.db (from Hermes agent instrumentation) to prioritize most-used skills
  • Never-evolved skills get priority; previously evolved sorted by staleness
  • 24h cooldown on recently failed skills
  • Supports batch selection: pick_skill.py -n 3 (the selection policy is sketched below)
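
An illustrative sketch of that policy, assuming a hypothetical usage table in skill_usage.db and an in-memory evolution history; the actual table layout and history source are not shown in this PR:

```python
# Illustrative policy sketch -- the usage table and history shape are assumed.
import sqlite3
import time

COOLDOWN_S = 24 * 3600

def pick_skills(usage_db: str, history: dict, n: int = 1) -> list[str]:
    """history maps skill -> {'last_evolved': ts or None, 'last_failed': ts or None}."""
    rows = sqlite3.connect(usage_db).execute(
        "SELECT skill, COUNT(*) FROM usage GROUP BY skill").fetchall()
    now = time.time()
    candidates = []
    for skill, uses in rows:
        h = history.get(skill, {})
        if h.get("last_failed") and now - h["last_failed"] < COOLDOWN_S:
            continue  # 24h cooldown on recently failed skills
        evolved_before = h.get("last_evolved") is not None
        staleness = now - (h.get("last_evolved") or 0.0)
        # Never-evolved skills sort first; then stalest; usage count breaks ties.
        candidates.append((evolved_before, -staleness, -uses, skill))
    return [skill for *_, skill in sorted(candidates)[:n]]
```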

Testing

  • LLMJudge verified: a terse output scored 0.18 composite (completeness 0.10) versus 0.72 for a detailed output
  • Full pipeline test on the arxiv skill: the optimizer produced a compressed variant that LLMJudge correctly rejected (-0.060 composite), whereas keyword overlap had previously reported +31.9%

Files changed

  • evolution/core/fitness.py — LLMJudge with completeness dimension, init_fitness_metric, coverage-penalized fallback
  • evolution/monitor/progress.py — new SQLite progress tracker
  • evolution/monitor/__init__.py — exports for progress module
  • evolution/skills/evolve_skill.py — wires in judge init, progress events at every step, LLMJudge holdout eval, dimension breakdowns
  • pick_skill.py — usage-weighted skill selector

errusch added 2 commits April 14, 2026 09:09
Three fixes for bugs that prevented the pipeline from running:

1. dataset_builder: LLM returns Python-style dicts (single quotes), not
   valid JSON. Added ast.literal_eval fallback + trailing comma fix so
   synthetic dataset generation doesn't crash on parse.

2. evolve_skill: GEPA API changed in DSPy 3.1.3 — max_steps is now
   max_metric_calls. Fixed the call and added auto='light'.

3. constraints: _check_skill_structure was checking the skill BODY for
   YAML frontmatter, which it never has after splitting. Rewrote to
   validate body structure (headings, procedural content, substance).

One architectural improvement:

4. skill_module: Skill text was passed as an input field, so the
   optimizer could never mutate it. Restructured to embed skill text
   in the instruction template via with_instructions(), allowing
   MIPROv2/GEPA to propose improved skill bodies. Updated extraction
   logic in evolve_skill.py to pull evolved text from the compiled
   predictor's instruction.
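
A rough sketch of item 4, assuming DSPy's Signature.with_instructions() and a string-form signature; the actual field names in skill_module are guesses:

```python
# Rough sketch -- field names and module layout are assumptions.
import dspy

def build_skill_module(skill_text: str) -> dspy.Predict:
    # Embed the skill body in the instructions rather than an input field,
    # so MIPROv2/GEPA instruction proposals can rewrite the skill itself.
    sig = dspy.Signature("task -> response").with_instructions(skill_text)
    return dspy.Predict(sig)

def extract_evolved_skill(compiled: dspy.Predict) -> str:
    # After compilation, the (possibly mutated) skill text lives in the
    # compiled predictor's instructions.
    return compiled.signature.instructions
```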
… picker

Three improvements to the self-evolution pipeline:

1. Replace keyword-overlap fitness with LLM-as-judge scoring
   - Conciseness dimension replaced with completeness (40% correctness,
     30% procedure following, 30% completeness)
   - Judge explicitly penalizes omissions of API refs, examples, edge cases
   - Fallback heuristic also adds coverage penalty for terse outputs
   - init_fitness_metric() wires the judge into the optimizer's metric fn
   - Holdout eval uses LLMJudge with full dimension breakdowns

2. SQLite-backed progress tracking
   - evolution/monitor/progress.py: start_run, log_event, complete_run, fail_run
   - Every pipeline step now logs events (loading, dataset gen, optimization,
     validation, holdout eval with progress, reporting)
   - DB at ~/.hermes/evolution_progress.db, queryable via get_active_run()
   - Integrates with Hermes dashboard Skills tab for live run visibility

3. Usage-weighted skill picker (pick_skill.py)
   - Reads skill_usage.db (from Hermes agent instrumentation) to prioritize
     most-used skills
   - Never-evolved skills get priority; previously evolved sorted by staleness
   - 24h cooldown on recently failed skills
   - Supports batch selection: pick_skill.py -n 3

Tested: arxiv skill evolution, where LLMJudge correctly rejected a compressed
variant that keyword overlap had previously rewarded (+31.9% fake improvement).
innoscoutpro added a commit to innoscoutpro/hermes-agent-self-evolution that referenced this pull request Apr 27, 2026
Addresses upstream issue NousResearch#33 (C1: keyword-only metric) and forward-ports
the more polished pieces of upstream PR NousResearch#25 and parts of PR NousResearch#39.

evolution/core/fitness.py
- Replace conciseness dimension with completeness — judges should
  penalise omissions, not reward brevity. Composite weight now
  0.4 correctness + 0.3 procedure + 0.3 completeness.
- New init_fitness_metric(config, skill_text, use_llm_judge=True) /
  reset_fitness_metric() pair. When use_llm_judge=True, an LLMJudge
  with the completeness rubric is the primary scorer; the deterministic
  multi-signal scorer becomes the fallback. When False (default), the
  metric stays purely deterministic and zero-cost — appropriate for
  fast iteration and for runs the user doesn't want to send to a judge.
- skill_fitness_metric accepts the 5-arg GEPA signature
  (gold, pred, trace, pred_name, pred_trace) so it works with both
  GEPA and the legacy 3-arg metric API (see the sketch after this list).
- Judge failures fall through to deterministic with a "[judge
  unavailable: <ExceptionClass>]" prefix in feedback so users can see
  why scores look heuristic mid-run.
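
A sketch of that dual-signature metric and the judge fallthrough; _judge and _deterministic_score are hypothetical stand-ins for the module's LLMJudge instance and multi-signal scorer:

```python
# Sketch -- _judge and _deterministic_score stand in for the real scorers.
_judge = None  # set by init_fitness_metric(...) when use_llm_judge=True

def _deterministic_score(gold, pred, prefix: str = "") -> float:
    # Placeholder for the zero-cost multi-signal scorer.
    return 0.0

def skill_fitness_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Works with GEPA's 5-arg call and the legacy 3-arg metric API."""
    if _judge is None:
        return _deterministic_score(gold, pred)
    try:
        return _judge.score(gold, pred)
    except Exception as exc:
        # Judge failure: fall through to deterministic, with a visible marker
        # so users can see why scores look heuristic mid-run.
        return _deterministic_score(
            gold, pred, prefix=f"[judge unavailable: {type(exc).__name__}] ")
```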

evolution/core/dataset_builder.py
- Replace inline 3-strategy JSON recovery with a 6-strategy
  _try_parse_json_list helper: direct json, ast.literal_eval (safer
  than eval, but parses Python-literal single-quoted dicts),
  array-extraction-then-parse, ast.literal_eval on extracted candidate,
  trailing-comma-and-quote-fix, markdown-fence stripping, and a
  last-resort per-block scan. Returns None instead of raising so the
  caller can produce a useful error.
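
A condensed sketch of a helper in this spirit, covering fence stripping plus three of the strategies (direct JSON, ast.literal_eval, trailing-comma fix); the real helper's ordering and remaining strategies are not reproduced here:

```python
# Condensed sketch -- only a subset of the strategies; names are assumptions.
import ast
import json
import re

def _try_parse_json_list(text: str):
    """Return a list parsed from an LLM reply, or None if every strategy fails."""
    text = text.strip().strip("`")         # drop markdown fences
    text = re.sub(r"^json\s*", "", text)   # drop a leading fence language tag
    for parse in (json.loads, ast.literal_eval):  # strict JSON, then Python literals
        try:
            result = parse(text)
            if isinstance(result, list):
                return result
        except (ValueError, SyntaxError):
            pass
    try:  # remove trailing commas, then retry strict JSON
        result = json.loads(re.sub(r",\s*([\]}])", r"\1", text))
        return result if isinstance(result, list) else None
    except ValueError:
        return None
```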

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>