Commit 54c8098
feat: LLMJudge with completeness rubric + 6-strategy JSON parser
Addresses upstream issue NousResearch#33 (C1: keyword-only metric) and forward-ports
the more polished pieces of upstream PR NousResearch#25 and, in part, PR NousResearch#39.
evolution/core/fitness.py
- Replace conciseness dimension with completeness — judges should
penalise omissions, not reward brevity. Composite weight now
0.4 correctness + 0.3 procedure + 0.3 completeness.
- New init_fitness_metric(config, skill_text, use_llm_judge) /
reset_fitness_metric() pair. With use_llm_judge=True, an LLMJudge
with the completeness rubric is the primary scorer and the deterministic
multi-signal scorer becomes the fallback. With use_llm_judge=False (the
default), the metric stays purely deterministic and zero-cost, which suits
fast iteration and runs the user doesn't want to send to a judge.
- skill_fitness_metric accepts the 5-arg GEPA signature
(gold, pred, trace, pred_name, pred_trace) so it works with both
GEPA and the legacy 3-arg metric API.
- Judge failures fall through to deterministic with a "[judge
unavailable: <ExceptionClass>]" prefix in feedback so users can see
why scores look heuristic mid-run.
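The weighting and fallback behaviour described above can be sketched roughly as follows. This is an illustration only, not the code in fitness.py: `composite_score`, `make_metric`, and the assumption that both scorers return a `(score, feedback)` tuple are hypothetical. Only the three weights, the 5-argument signature, and the feedback prefix come from the commit message.

```python
# Weights quoted in the commit message; all function names below are hypothetical.
WEIGHTS = {"correctness": 0.4, "procedure": 0.3, "completeness": 0.3}


def composite_score(scores):
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())


def make_metric(judge, deterministic):
    """Build a metric that prefers the judge and falls back on failure.

    `judge` and `deterministic` are assumed to be callables taking
    (gold, pred) and returning (score, feedback). Pass judge=None for a
    purely deterministic metric.
    """

    def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
        # The last two parameters default to None, so a legacy 3-arg call
        # and a 5-arg GEPA-style call both reach the same code path.
        if judge is None:
            return deterministic(gold, pred)
        try:
            return judge(gold, pred)
        except Exception as exc:
            # Surface why the score is heuristic instead of failing the run.
            score, feedback = deterministic(gold, pred)
            prefix = f"[judge unavailable: {type(exc).__name__}]"
            return score, f"{prefix} {feedback}"

    return metric
```

Catching a broad `Exception` here is a deliberate choice in the sketch: a scoring run should degrade to the heuristic rather than abort when the judge times out or returns something unusable.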
evolution/core/dataset_builder.py
- Replace inline 3-strategy JSON recovery with a 6-strategy
_try_parse_json_list helper: direct json, ast.literal_eval (safer
than eval, yet still parses Python-literal single-quoted dicts),
array-extraction-then-parse, ast.literal_eval on extracted candidate,
trailing-comma-and-quote-fix, markdown-fence stripping, and a
last-resort per-block scan. Returns None instead of raising so the
caller can produce a useful error.
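A layered recovery parser of this kind might look like the sketch below. It is not the _try_parse_json_list helper itself: the name `parse_list_sketch` is hypothetical, it covers only some of the strategies listed above (direct parse, ast.literal_eval, fence stripping, array extraction, trailing-comma removal), and their order here is an assumption.

```python
import ast
import json
import re

FENCE = "`" * 3  # markdown code-fence marker, built indirectly for readability
FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)
TRAILING_COMMA_RE = re.compile(r",\s*([\]}])")


def parse_list_sketch(text):
    """Return a list recovered from `text`, or None if every strategy fails."""
    candidates = [text.strip()]

    # Strategy: take the body of the first markdown fence, if there is one.
    fenced = FENCE_RE.search(text)
    if fenced:
        candidates.append(fenced.group(1).strip())

    # Strategy: take everything from the first "[" to the last "]".
    start, end = text.find("["), text.rfind("]")
    if 0 <= start < end:
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        # Also try each candidate with trailing commas removed.
        repaired = TRAILING_COMMA_RE.sub(r"\1", candidate)
        for attempt in (candidate, repaired):
            # json.loads first; ast.literal_eval handles single-quoted
            # Python literals without executing arbitrary code.
            for loader in (json.loads, ast.literal_eval):
                try:
                    result = loader(attempt)
                except (ValueError, SyntaxError, TypeError, RecursionError):
                    continue
                if isinstance(result, list):
                    return result
    return None  # the caller decides how to report the failure
```

Returning None rather than raising keeps the error message in the caller's hands, which matches the rationale given in the bullet above.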
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 359ccca commit 54c8098
2 files changed
Lines changed: 214 additions & 148 deletions