Issue
skill_fitness_metric() in fitness.py uses keyword overlap between expected and actual output as its only scoring mechanism. This produces scores in a narrow 37-49% band regardless of output quality, giving the optimizer almost no signal to differentiate good from bad variants.
Current behavior
# fitness.py — current keyword overlap scoring (excerpt)
expected_words = set(expected_lower.split())
output_words = set(output_lower.split())
# Fraction of expected words appearing anywhere in the output;
# note this raises ZeroDivisionError if the expected text is empty.
overlap = len(expected_words & output_words) / len(expected_words)
# The 0.3 floor and 0.7 weight compress the usable range:
# even zero overlap scores 30%.
score = 0.3 + (0.7 * overlap)
Observed scores across 10 trials: 37.3%, 41.7%, 46.2%, 37.3%, 41.6%, 36.9%, 46.2%, 41.6%, 41.7%, 41.6% — too narrow for meaningful optimization. Given the 0.3 floor and 0.7 weight, this band corresponds to raw overlaps of only ~10-23% (e.g. 0.3 + 0.7 × 0.166 ≈ 41.6%), so the difference between good and bad outputs is squeezed into a few percentage points.
Additional issue: GEPA compatibility
The metric returns a bare float, but GEPA's GEPAFeedbackMetric protocol also accepts dspy.Prediction(score=float, feedback=str), which enables trace-aware reflective mutation. Without a feedback field, GEPA falls back to generic score-only feedback ("This trajectory got a score of {score}.").
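For illustration, a return value of this shape carries textual feedback for GEPA while still reducing to a plain score (the values below are invented):

# Illustrative values only — not from the codebase.
import dspy

result = dspy.Prediction(
    score=0.85,
    feedback="Covers the required steps but omits the error-handling case.",
)
float(result)  # 0.85 — the __float__() path MIPROv2 uses (per this issue)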
Note
The codebase already has a full LLMJudge class with rubric-based evaluation that is never wired into the main metric. This issue is about connecting that capability.
Suggested fix
- Use LLM-as-judge (via the configured dspy.LM) as the primary scoring mechanism
- Return dspy.Prediction(score=float, feedback=str), which is compatible with both:
  - GEPA (reads feedback for reflective mutation)
  - MIPROv2 (extracts the float score via __float__())
- Keep keyword overlap as a fallback when LLM calls fail (offline/rate-limited scenarios); a sketch combining all three points follows below
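A minimal sketch of the combined metric, assuming the judge is exposed as a dspy signature and that examples/predictions carry expected_output and output fields — every name below other than skill_fitness_metric is an assumption, and the real fix would delegate to the existing LLMJudge class:

# fitness.py — sketch only; SkillJudge, the field names, and the fallback
# logic are assumptions, not the codebase's actual interfaces.
import dspy

class SkillJudge(dspy.Signature):
    """Rate how well the actual output satisfies the expected output."""
    expected: str = dspy.InputField()
    actual: str = dspy.InputField()
    score: float = dspy.OutputField(desc="0.0 (wrong) to 1.0 (fully correct)")
    feedback: str = dspy.OutputField(desc="what the output got right and what to change")

_judge = dspy.Predict(SkillJudge)  # runs on the configured dspy.LM

def skill_fitness_metric(example, prediction, trace=None, pred_name=None, pred_trace=None):
    expected = example.expected_output   # field name assumed
    actual = prediction.output           # field name assumed
    try:
        verdict = _judge(expected=expected, actual=actual)
        score = max(0.0, min(1.0, float(verdict.score)))
        return dspy.Prediction(score=score, feedback=verdict.feedback)
    except Exception:
        # Offline / rate-limited: fall back to the existing keyword overlap,
        # guarding against empty expected text.
        expected_words = set(expected.lower().split())
        output_words = set(actual.lower().split())
        overlap = len(expected_words & output_words) / max(len(expected_words), 1)
        return dspy.Prediction(
            score=0.3 + 0.7 * overlap,
            feedback="LLM judge unavailable; keyword-overlap fallback used.",
        )

Because pred_name and pred_trace default to None, the same callable should satisfy both GEPA's five-argument GEPAFeedbackMetric protocol and MIPROv2's three-argument calls.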
Environment
- dspy 3.1.3
- Tested with Anthropic (claude-sonnet-4-6, claude-haiku-4-5) and compatible with any LiteLLM-supported model
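For reproduction, the judge can be pointed at either model via the usual dspy configuration (the anthropic/ prefix is an assumed LiteLLM model id):

# Assumed LiteLLM-style model id — swap in any LiteLLM-supported model.
import dspy
dspy.configure(lm=dspy.LM("anthropic/claude-haiku-4-5"))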