feat: improve skill_fitness_metric with multi-dimensional scoring#28

Open
vominh1919 wants to merge 1 commit intoNousResearch:mainfrom
vominh1919:fix/improve-fitness-metric

@vominh1919

Summary

Replaces the single keyword-overlap scorer in skill_fitness_metric() with a weighted composite of five independent signals that spread scores across a much wider range.

Fixes #12

Problem

The original metric used only keyword overlap (len(expected & output) / len(expected)), producing scores in a narrow 37-49% band regardless of actual output quality.
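For reference, the formula quoted above can be sketched as follows. This is a reconstruction from the expression in this description, not the repository's code; the function name `keyword_overlap` and the whitespace tokenization are assumptions.

```python
def keyword_overlap(expected: str, output: str) -> float:
    """Recall-only overlap: share of expected tokens that appear in the output.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    expected_tokens = set(expected.lower().split())
    output_tokens = set(output.lower().split())
    if not expected_tokens:
        return 0.0
    return len(expected_tokens & output_tokens) / len(expected_tokens)


# Common words alone already earn a sizeable score for a different sentence.
score = keyword_overlap("the cat sat on the mat", "the dog sat on the rug")
```

Because the score ignores word order, structure, and anything in the output that is not in the expected text, very different outputs can land close together.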

Solution

Five scoring components, each providing independent signal:

| Component | Weight | Purpose |
| --- | --- | --- |
| Keyword overlap | 25% | Stop-word filtered, F1-style recall/precision blend |
| Char 3-gram similarity | 25% | Jaccard on character shingles; captures partial/substring matches |
| Structural pattern match | 20% | Checks for code blocks, lists, headers, bold, URLs |
| Length quality | 15% | Penalizes outputs too short or too long vs. expected |
| Content density | 15% | Unique token ratio, avg token length, sentence variety |
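A minimal sketch of how the five components could combine, for reviewers who want the shape without reading the diff. Only the weights and the component descriptions come from the table; the stop-word list, regex patterns, normalization constants, and all function names are illustrative assumptions, and sentence variety is left out of the density term for brevity.

```python
import re

# Weights from the table above; everything else here is an assumption.
WEIGHTS = {"keyword": 0.25, "ngram": 0.25, "structure": 0.20,
           "length": 0.15, "density": 0.15}
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "is", "it", "for"}
STRUCTURE_PATTERNS = {
    "code_block": re.compile(r"```"),
    "list": re.compile(r"^\s*(?:[-*]|\d+\.)\s+", re.MULTILINE),
    "header": re.compile(r"^#{1,6}\s+", re.MULTILINE),
    "bold": re.compile(r"\*\*[^*]+\*\*"),
    "url": re.compile(r"https?://\S+"),
}


def _words(text):
    return re.findall(r"[a-z0-9_]+", text.lower())


def keyword_f1(expected, output):
    """Harmonic mean of precision and recall over stop-word-filtered tokens."""
    exp = {t for t in _words(expected) if t not in STOP_WORDS}
    out = {t for t in _words(output) if t not in STOP_WORDS}
    common = exp & out
    if not common:
        return 0.0
    precision, recall = len(common) / len(out), len(common) / len(exp)
    return 2 * precision * recall / (precision + recall)


def char_ngram_jaccard(expected, output, n=3):
    """Jaccard similarity of character n-gram sets (shingles)."""
    def shingles(text):
        text = " ".join(text.lower().split())  # collapse whitespace
        return {text[i:i + n] for i in range(len(text) - n + 1)}
    a, b = shingles(expected), shingles(output)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def structure_match(expected, output):
    """Share of structural features in the expected text that the output also has."""
    wanted = [k for k, p in STRUCTURE_PATTERNS.items() if p.search(expected)]
    if not wanted:
        return 1.0  # nothing to match, so no penalty
    return sum(1 for k in wanted if STRUCTURE_PATTERNS[k].search(output)) / len(wanted)


def length_quality(expected, output):
    """Ratio of shorter to longer length; 1.0 when lengths are equal."""
    if not expected or not output:
        return 0.0
    return min(len(expected), len(output)) / max(len(expected), len(output))


def content_density(output):
    """Blend of unique-token ratio and average token length (capped at 6 chars)."""
    tokens = _words(output)
    if not tokens:
        return 0.0
    unique_ratio = len(set(tokens)) / len(tokens)
    avg_len = min(sum(len(t) for t in tokens) / len(tokens) / 6.0, 1.0)
    return 0.5 * unique_ratio + 0.5 * avg_len


def composite_score(expected, output):
    """Weighted sum of the five components, plus the per-component breakdown."""
    parts = {
        "keyword": keyword_f1(expected, output),
        "ngram": char_ngram_jaccard(expected, output),
        "structure": structure_match(expected, output),
        "length": length_quality(expected, output),
        "density": content_density(output),
    }
    return sum(WEIGHTS[k] * v for k, v in parts.items()), parts
```

Each component is bounded in [0, 1] and the weights sum to 1, so the composite stays in [0, 1]. Returning the breakdown alongside the total keeps the per-component values available for the feedback string.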

Additional changes

  • Returns dspy.Prediction(score=float, feedback=str) for GEPA reflective mutation compatibility
  • Feedback string highlights specific weaknesses for the optimizer
  • All scoring is deterministic (no LLM calls) — fast for optimization loops
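One way the feedback string could be derived from per-component scores is sketched below. The helper name `build_feedback`, the 0.5 threshold, and the message wording are hypothetical; the PR only states that the result is returned as `dspy.Prediction(score=float, feedback=str)`.

```python
def build_feedback(parts: dict, threshold: float = 0.5) -> str:
    """List components scoring below the threshold, weakest first.

    `parts` maps a component name to its score in [0, 1].
    """
    weak = sorted((score, name) for name, score in parts.items() if score < threshold)
    if not weak:
        return "All components at or above threshold."
    return "Weak components: " + ", ".join(
        f"{name} ({score:.2f})" for score, name in weak
    )


# The metric would then wrap both values, as described in this PR:
#   return dspy.Prediction(score=total, feedback=build_feedback(parts))
message = build_feedback({"keyword": 0.2, "ngram": 0.9, "length": 0.4})
```

Since the string is built from the same deterministic scores, it adds no LLM calls to the optimization loop.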

Testing

The new metric produces varied scores across different output qualities rather than clustering in a narrow band.


Development

Successfully merging this pull request may close these issues.

Fitness metric uses keyword overlap only — insufficient signal for optimization