Issue
skill_fitness_metric() in fitness.py uses keyword overlap between expected and actual output as its only scoring mechanism. This produces scores in a narrow 37-49% band regardless of output quality, giving the optimizer almost no signal to differentiate good from bad variants.
Current behavior
# fitness.py — current keyword overlap scoring (excerpt)
expected_words = set(expected_lower.split())
output_words = set(output_lower.split())
# Fraction of expected words appearing anywhere in the output;
# note this raises ZeroDivisionError if the expected text is empty.
overlap = len(expected_words & output_words) / len(expected_words)
# The 0.3 floor and 0.7 weight compress the usable range:
# even zero overlap scores 30%.
score = 0.3 + (0.7 * overlap)
Observed scores across 10 trials: 37.3%, 41.7%, 46.2%, 37.3%, 41.6%, 36.9%, 46.2%, 41.6%, 41.7%, 41.6% — too narrow for meaningful optimization. Given the 0.3 floor and 0.7 weight, this band corresponds to raw overlaps of only ~10-23% (e.g. 0.3 + 0.7 × 0.166 ≈ 41.6%), so the difference between good and bad outputs is squeezed into a few percentage points.
Additional issue: GEPA compatibility
The metric returns a bare float, but GEPA's GEPAFeedbackMetric protocol also accepts dspy.Prediction(score=float, feedback=str), which enables trace-aware reflective mutation. Without a feedback field, GEPA falls back to generic score-only feedback ("This trajectory got a score of {score}.").
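For illustration, a return value of this shape carries textual feedback for GEPA while still reducing to a plain score (the values below are invented):

# Illustrative values only — not from the codebase.
import dspy

result = dspy.Prediction(
    score=0.85,
    feedback="Covers the required steps but omits the error-handling case.",
)
float(result)  # 0.85 — the __float__() path MIPROv2 uses (per this issue)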
Note
The codebase already has a full LLMJudge class with rubric-based evaluation that is never wired into the main metric. This issue is about connecting that capability.
Suggested fix
- Use LLM-as-judge (via the configured dspy.LM) as the primary scoring mechanism
- Return dspy.Prediction(score=float, feedback=str), which is compatible with both:
  - GEPA (reads feedback for reflective mutation)
  - MIPROv2 (extracts the float score via __float__())
- Keep keyword overlap as a fallback when LLM calls fail (offline/rate-limited scenarios); a sketch combining all three points follows below
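A minimal sketch of the combined metric, assuming the judge is exposed as a dspy signature and that examples/predictions carry expected_output and output fields — every name below other than skill_fitness_metric is an assumption, and the real fix would delegate to the existing LLMJudge class:

# fitness.py — sketch only; SkillJudge, the field names, and the fallback
# logic are assumptions, not the codebase's actual interfaces.
import dspy

class SkillJudge(dspy.Signature):
    """Rate how well the actual output satisfies the expected output."""
    expected: str = dspy.InputField()
    actual: str = dspy.InputField()
    score: float = dspy.OutputField(desc="0.0 (wrong) to 1.0 (fully correct)")
    feedback: str = dspy.OutputField(desc="what the output got right and what to change")

_judge = dspy.Predict(SkillJudge)  # runs on the configured dspy.LM

def skill_fitness_metric(example, prediction, trace=None, pred_name=None, pred_trace=None):
    expected = example.expected_output   # field name assumed
    actual = prediction.output           # field name assumed
    try:
        verdict = _judge(expected=expected, actual=actual)
        score = max(0.0, min(1.0, float(verdict.score)))
        return dspy.Prediction(score=score, feedback=verdict.feedback)
    except Exception:
        # Offline / rate-limited: fall back to the existing keyword overlap,
        # guarding against empty expected text.
        expected_words = set(expected.lower().split())
        output_words = set(actual.lower().split())
        overlap = len(expected_words & output_words) / max(len(expected_words), 1)
        return dspy.Prediction(
            score=0.3 + 0.7 * overlap,
            feedback="LLM judge unavailable; keyword-overlap fallback used.",
        )

Because pred_name and pred_trace default to None, the same callable should satisfy both GEPA's five-argument GEPAFeedbackMetric protocol and MIPROv2's three-argument calls.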
Environment
- dspy 3.1.3
- Tested with Anthropic (claude-sonnet-4-6, claude-haiku-4-5) and compatible with any LiteLLM-supported model
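For reproduction, the judge can be pointed at either model via the usual dspy configuration (the anthropic/ prefix is an assumed LiteLLM model id):

# Assumed LiteLLM-style model id — swap in any LiteLLM-supported model.
import dspy
dspy.configure(lm=dspy.LM("anthropic/claude-haiku-4-5"))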