Fitness metric uses keyword overlap only — insufficient signal for optimization #12

@adamkrawczyk

Issue

skill_fitness_metric() in fitness.py uses keyword overlap between expected and actual output as its only scoring mechanism. Scores land in a narrow band of roughly 37-49% regardless of output quality, so the optimizer gets almost no signal to tell good variants from bad ones.

Current behavior

# fitness.py — keyword overlap scoring
expected_words = set(expected_lower.split())
output_words = set(output_lower.split())
overlap = len(expected_words & output_words) / len(expected_words)
score = 0.3 + (0.7 * overlap)

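Observed scores across 10 trials: 37.3%, 41.7%, 46.2%, 37.3%, 41.6%, 36.9%, 46.2%, 41.6%, 41.7%, 41.6%. All ten fall between 36.9% and 46.2%, a spread of under 10 points, which is too narrow for meaningful optimization. Because of the constant 0.3 floor in the formula, these scores correspond to raw keyword overlaps of only about 10-23%.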
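To illustrate why overlap alone is a weak signal, here is a minimal standalone sketch of the formula quoted above. The function name keyword_overlap_score, the empty-input guard, and the sample strings are assumptions made for illustration; they are not taken from fitness.py.

```python
def keyword_overlap_score(expected: str, output: str) -> float:
    """Reproduce the overlap formula quoted above (function name is hypothetical)."""
    expected_words = set(expected.lower().split())
    output_words = set(output.lower().split())
    if not expected_words:
        # Guard added for this sketch only; the quoted snippet would divide by zero.
        return 0.3
    overlap = len(expected_words & output_words) / len(expected_words)
    return 0.3 + 0.7 * overlap


# Illustrative inputs (made up for this example).
expected = "restart the service after editing the config file"
good = "edit the configuration, then restart the daemon"       # correct paraphrase
bad = "delete the config file and never restart the service"   # wrong advice

good_score = keyword_overlap_score(expected, good)
bad_score = keyword_overlap_score(expected, bad)
empty_score = keyword_overlap_score(expected, "")

print(f"good={good_score:.3f} bad={bad_score:.3f} empty={empty_score:.3f}")
```

With these inputs the wrong answer outscores the correct paraphrase (about 0.80 versus 0.50) because it happens to reuse more of the expected words, and an empty output still receives the 0.3 floor.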

Additional issue: GEPA compatibility

The metric returns float, but GEPA's GEPAFeedbackMetric protocol expects the metric to optionally return dspy.Prediction(score=float, feedback=str) for trace-aware reflective mutation. Without feedback, GEPA falls back to generic score-only feedback ("This trajectory got a score of {score}.").
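A rough sketch of the return shape described above, under some assumptions: the five-argument signature follows GEPA's metric protocol as I understand it, the field names expected and output are hypothetical, and the function name gepa_style_metric is made up for this example. The small stand-in Prediction class exists only so the sketch runs when dspy is not installed.

```python
from types import SimpleNamespace

try:
    import dspy
    Prediction = dspy.Prediction
except ImportError:
    class Prediction:
        """Minimal stand-in for dspy.Prediction so the sketch runs without dspy."""

        def __init__(self, score: float, feedback: str):
            self.score = score
            self.feedback = feedback

        def __float__(self) -> float:
            return float(self.score)


def gepa_style_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Hypothetical metric: same overlap score as before, plus textual feedback.

    Assumes gold.expected and pred.output are strings (field names are made up).
    """
    expected_words = set(gold.expected.lower().split())
    output_words = set(pred.output.lower().split())
    missing = sorted(expected_words - output_words)
    overlap = 1 - len(missing) / len(expected_words) if expected_words else 0.0
    score = 0.3 + 0.7 * overlap
    if missing:
        feedback = "Missing expected keywords: " + ", ".join(missing)
    else:
        feedback = "All expected keywords are present."
    return Prediction(score=score, feedback=feedback)


# Usage with plain stand-in objects instead of real dspy.Example instances.
result = gepa_style_metric(
    SimpleNamespace(expected="alpha beta gamma"),
    SimpleNamespace(output="alpha beta"),
)
print(float(result), result.feedback)
```

Optimizers that only need a number can call float() on the result, while GEPA can read the feedback string. Keyword feedback is used here only to keep the example small; the point of the issue is that a judge would supply richer feedback.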

Note

The codebase already has a full LLMJudge class with rubric-based evaluation that is never wired into the main metric. This issue is about connecting that capability.

Suggested fix

  1. Use LLM-as-judge (via the configured dspy.LM) as the primary scoring mechanism
  2. Return dspy.Prediction(score=float, feedback=str) which is compatible with both:
    • GEPA (reads feedback for reflective mutation)
    • MIPROv2 (extracts float via __float__())
  3. Keep keyword overlap as a fallback when LLM calls fail (offline/rate-limited scenarios)
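The judge-first, overlap-as-fallback flow in the list above could look roughly like the sketch below. The Judge callable interface, the function names, and the stub judges are all hypothetical; the real LLMJudge class is not shown in this issue, so it is deliberately not called here. To stay dependency-free the sketch returns a (score, feedback) tuple, which a real metric would wrap in dspy.Prediction.

```python
from typing import Callable, Tuple

# Hypothetical judge interface: (expected, output) -> (score in [0, 1], feedback).
Judge = Callable[[str, str], Tuple[float, str]]


def keyword_overlap(expected: str, output: str) -> float:
    """Fraction of expected words that also appear in the output."""
    expected_words = set(expected.lower().split())
    if not expected_words:
        return 0.0
    return len(expected_words & set(output.lower().split())) / len(expected_words)


def score_with_fallback(expected: str, output: str, judge: Judge) -> Tuple[float, str]:
    """Try the judge first; fall back to keyword overlap if the call raises."""
    try:
        score, feedback = judge(expected, output)
        # Clamp in case the judge returns a value outside [0, 1].
        return min(max(score, 0.0), 1.0), feedback
    except Exception as exc:
        # Broad catch for brevity (offline, rate-limited, ...); real code
        # would likely catch narrower exception types.
        score = 0.3 + 0.7 * keyword_overlap(expected, output)
        return score, f"Judge unavailable ({type(exc).__name__}); fell back to keyword overlap."


def stub_judge(expected: str, output: str) -> Tuple[float, str]:
    """Stand-in for an LLM judge that succeeds."""
    return 0.9, "Covers the key steps."


def offline_judge(expected: str, output: str) -> Tuple[float, str]:
    """Stand-in for an LLM judge that fails, e.g. when offline."""
    raise ConnectionError("simulated outage")


judged = score_with_fallback("alpha beta", "alpha", stub_judge)
fallback = score_with_fallback("alpha beta", "alpha", offline_judge)
print(judged)
print(fallback)
```

Mentioning the fallback in the feedback string makes it visible in optimizer traces when a score came from keyword overlap rather than the judge.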

Environment

  • dspy 3.1.3
  • Tested with Anthropic models (claude-sonnet-4-6, claude-haiku-4-5); should work with any LiteLLM-supported model
