feat: improve skill_fitness_metric with multi-dimensional scoring by vominh1919 · Pull Request #28 · NousResearch/hermes-agent-self-evolution

vominh1919 · 2026-04-17T16:38:07Z

Summary

Replaces the single keyword-overlap scorer in skill_fitness_metric() with a weighted composite of five independent signals that spread scores across a much wider range.

Fixes #12

Problem

The original metric used only keyword overlap (len(expected & output) / len(expected)), producing scores in a narrow 37-49% band regardless of actual output quality.
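For reference, the formula quoted above can be sketched as follows. This is a minimal reconstruction, not the repository's code: the function name `keyword_overlap` and the whitespace tokenization are assumptions.

```python
def keyword_overlap(expected: str, output: str) -> float:
    """Fraction of expected tokens that also appear in the output (recall only)."""
    # Assumption: lowercase whitespace tokenization; the original tokenizer is not shown in the PR.
    expected_tokens = set(expected.lower().split())
    output_tokens = set(output.lower().split())
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & output_tokens) / len(expected_tokens)
```

Because the denominator is only `len(expected)`, this is a recall-style measure: extra or irrelevant tokens in the output do not lower the score.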

Solution

Five scoring components, each providing independent signal:

Component	Weight	Purpose
Keyword overlap	25%	Stop-word filtered, F1-style recall/precision blend
Char 3-gram similarity	25%	Jaccard on character shingles — captures partial/substring matches
Structural pattern match	20%	Checks for code blocks, lists, headers, bold, URLs
Length quality	15%	Penalizes outputs too short or too long vs. expected
Content density	15%	Unique token ratio, avg token length, sentence variety
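The table can be read as a weighted sum of per-component scores in [0, 1]. The sketch below illustrates two of the components and the final combination. Only the weights come from the table; the helper names, the stop-word list, the tokenizer, and the choice of a plain F1 for the "F1-style" blend are assumptions, and the PR's actual implementation may differ.

```python
import re

# Weights from the table above; they sum to 1.0.
WEIGHTS = {
    "keyword": 0.25,
    "char_ngram": 0.25,
    "structure": 0.20,
    "length": 0.15,
    "density": 0.15,
}

# Assumption: a tiny illustrative stop-word list, not the one used in the PR.
STOP_WORDS = frozenset({"a", "an", "the", "is", "of", "to", "and", "in"})


def tokens(text):
    """Lowercased alphanumeric tokens with stop words removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS}


def keyword_f1(expected, output):
    """Harmonic mean of precision and recall over the filtered token sets."""
    exp, out = tokens(expected), tokens(output)
    common = len(exp & out)
    if common == 0:
        return 0.0
    precision = common / len(out)
    recall = common / len(exp)
    return 2 * precision * recall / (precision + recall)


def shingles(text, n=3):
    """Set of overlapping character n-grams; empty if the text is shorter than n."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def char_ngram_jaccard(expected, output, n=3):
    """Jaccard similarity |A & B| / |A | B| on character shingles."""
    a, b = shingles(expected, n), shingles(output, n)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def composite(component_scores):
    """Weighted sum of component scores, each expected to lie in [0, 1]."""
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS)
```

With these weights, an output that scores perfectly on keyword overlap but zero on everything else gets a composite of 0.25, so no single component can dominate the result.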

Additional changes

Returns dspy.Prediction(score=float, feedback=str) for GEPA reflective mutation compatibility
Feedback string highlights specific weaknesses for the optimizer
All scoring is deterministic (no LLM calls) — fast for optimization loops
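One way the score and feedback could be assembled is sketched below. To keep the example free of dependencies, `types.SimpleNamespace` stands in for `dspy.Prediction`; in the PR the return value is `dspy.Prediction(score=float, feedback=str)`. The function name `build_result`, the `weak_threshold` parameter, and the feedback wording are all hypothetical.

```python
from types import SimpleNamespace


def build_result(component_scores, weights, weak_threshold=0.5):
    """Combine component scores and describe the weakest ones in a feedback string."""
    score = sum(weights[name] * component_scores[name] for name in weights)

    # List components below the (hypothetical) threshold, weakest first.
    weak = sorted(
        (name for name in weights if component_scores[name] < weak_threshold),
        key=lambda name: component_scores[name],
    )
    if weak:
        feedback = "Weak components: " + ", ".join(
            f"{name} ({component_scores[name]:.2f})" for name in weak
        )
    else:
        feedback = "No component fell below the threshold."

    # Stand-in for dspy.Prediction(score=..., feedback=...).
    return SimpleNamespace(score=float(score), feedback=feedback)
```

Everything here is plain arithmetic and string formatting, so the same inputs always yield the same score and feedback, in line with the deterministic, no-LLM design described above.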

Testing

The new metric produces varied scores across different output qualities rather than clustering in a narrow band.

Replaces the single keyword-overlap scorer with a weighted composite of five independent signals that spread scores across a much wider range:

1. Keyword overlap (25%) - stop-word filtered, F1-style blend
2. Character 3-gram similarity (25%) - Jaccard on char shingles
3. Structural pattern matching (20%) - code blocks, lists, headers
4. Length quality (15%) - proportional to expected output length
5. Content density (15%) - unique token ratio, avg token length, variety

Also:
- Returns dspy.Prediction(score=float, feedback=str) for GEPA reflective mutation compatibility
- Feedback string highlights specific weaknesses for optimizer use
- All scoring is deterministic (no LLM calls) for speed during optimization loops

Fixes NousResearch#12

seilk mentioned this pull request Apr 26, 2026

fix: install missing optuna dep and add LLM request timeout #41

Open

innoscoutpro mentioned this pull request Apr 27, 2026

fix: integrate critical self-evolution pipeline fixes #42

Open

vominh1919 wants to merge 1 commit into NousResearch:main from vominh1919:fix/improve-fitness-metric
