Fix nightly evolution reliability and enforce real skill mutation #17
Open
kirniy wants to merge 1 commit into NousResearch:main from
## What
- add reliable nightly runner scripts with key-pool rotation
- reduce nightly default iterations to 1 to avoid long optimizer overruns
- add anti-gaming mutation gate and fallback body rewrite when DSPy only mutates prompt internals
- evaluate evolved body via a dedicated SkillModule and persist mutation metadata

## Why
Nightly runs were repeatedly failing or producing non-promotable no-op candidates (). This patch makes the pipeline produce real, auditable skill mutations and stable nightly artifacts.

## Validation
- reset pool + run

Running nightly skill evolution: github-code-review
Repo: /Users/kirniy/dev/hermes-agent-self-evolution
Hermes repo: /Users/kirniy/.hermes/hermes-agent
Pool size: 4
Command: python -m evolution.skills.evolve_skill --skill github-code-review --iterations 1 --eval-source synthetic --optimizer-model gemini/gemini-3.1-pro-preview --eval-model gemini/gemini-3.1-pro-preview
Attempt 1/4 with key GOOGLE_API_KEY_3
🧬 Hermes Agent Self-Evolution — Evolving skill: github-code-review
Loaded: skills/github/github-code-review/SKILL.md
Name: github-code-review
Size: 13,555 chars
Description: Review code changes by analyzing git diffs, leaving inline
comments on PRs, and ...
Building evaluation dataset (source: synthetic)
Generated 20 synthetic examples
Saved to datasets/skills/github-code-review/
Split: 10 train / 5 val / 5 holdout
Validating baseline constraints
✓ size_limit: Size OK: 13555/15000 chars
✓ non_empty: Artifact is non-empty
✓ skill_structure: Skill has valid frontmatter (name + description)
Configuring optimizer
Optimizer: GEPA (1 iterations)
Optimizer model: gemini/gemini-3.1-pro-preview
Eval model: gemini/gemini-3.1-pro-preview
Running GEPA optimization (1 iterations)...
GEPA not available (GEPA.__init__() got an unexpected keyword argument 'max_steps'), falling back to MIPROv2
Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6
Bootstrapped 2 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 4/6
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 5/6
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/6
Bootstrapped 2 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.
Error getting data summary: litellm.BadRequestError: GeminiException BadRequestError - {
"error": {
"code": 400,
"message": "API key expired. Please renew the API key.",
"status": "INVALID_ARGUMENT",
"details": [
{
"@type": "type.googleapis.com/google.rpc.ErrorInfo",
"reason": "API_KEY_INVALID",
"domain": "googleapis.com",
"metadata": {
"service": "generativelanguage.googleapis.com"
}
},
{
"@type": "type.googleapis.com/google.rpc.LocalizedMessage",
"locale": "en-US",
"message": "API key expired. Please renew the API key."
}
]
}
}
.
Running without data aware proposer.
Attempt 1 failed with exit 1
Attempt 2/4 with key GOOGLE_API_KEY_2
🧬 Hermes Agent Self-Evolution — Evolving skill: github-code-review
Loaded: skills/github/github-code-review/SKILL.md
Name: github-code-review
Size: 13,555 chars
Description: Review code changes by analyzing git diffs, leaving inline
comments on PRs, and ...
Building evaluation dataset (source: synthetic)
Generated 20 synthetic examples
Saved to datasets/skills/github-code-review/
Split: 10 train / 5 val / 5 holdout
Validating baseline constraints
✓ size_limit: Size OK: 13555/15000 chars
✓ non_empty: Artifact is non-empty
✓ skill_structure: Skill has valid frontmatter (name + description)
Configuring optimizer
Optimizer: GEPA (1 iterations)
Optimizer model: gemini/gemini-3.1-pro-preview
Eval model: gemini/gemini-3.1-pro-preview
Running GEPA optimization (1 iterations)...
GEPA not available (GEPA.__init__() got an unexpected keyword argument 'max_steps'), falling back to MIPROv2
Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6
Bootstrapped 2 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 4/6
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 5/6
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/6
Bootstrapped 2 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.
Average Metric: 4.66 / 8 (58.2%): 100%|██████████| 8/8 [00:00<00:00, 341.14it/s]
Average Metric: 4.93 / 8 (61.7%): 100%|██████████| 8/8 [00:18<00:00, 2.25s/it]
Average Metric: 4.72 / 8 (59.0%): 100%|██████████| 8/8 [00:27<00:00, 3.41s/it]
Average Metric: 4.83 / 8 (60.4%): 100%|██████████| 8/8 [00:17<00:00, 2.13s/it]
Average Metric: 3.61 / 6 (60.1%): 75%|███████▌ | 6/8 [00:11<00:02, 1.07s/it]

- latest run produced fresh metrics with , , 
- => for latest artifact
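
The key-pool rotation visible in the log above (attempt 1/4 with `GOOGLE_API_KEY_3` fails on an expired key, then attempt 2/4 retries with `GOOGLE_API_KEY_2`) could be sketched roughly as follows. This is a minimal illustration, not the runner script from this PR: the function name `run_with_key_pool`, the `target_var` parameter, and the injectable `runner` hook are all hypothetical, and the assumption that the child process reads its key from `GOOGLE_API_KEY` is mine.

```python
import subprocess
from typing import Callable, Mapping, Optional, Sequence


def run_with_key_pool(
    command: Sequence[str],
    key_names: Sequence[str],
    environ: Mapping[str, str],
    target_var: str = "GOOGLE_API_KEY",  # assumption: the variable the child reads
    runner: Optional[Callable[[Sequence[str], dict], int]] = None,
) -> Optional[str]:
    """Run `command` once per pooled key until one attempt exits with 0.

    Returns the name of the key that succeeded, or None if every key failed.
    """
    if runner is None:
        # Default: run the real command and report its exit code.
        def runner(cmd: Sequence[str], env: dict) -> int:
            return subprocess.run(list(cmd), env=env).returncode

    total = len(key_names)
    for attempt, name in enumerate(key_names, start=1):
        value = environ.get(name)
        if not value:
            continue  # key slot is empty or unset, skip it
        print(f"Attempt {attempt}/{total} with key {name}")
        env = dict(environ)        # copy so the caller's mapping is untouched
        env[target_var] = value    # expose the pooled key under the target name
        code = runner(command, env)
        if code == 0:
            return name
        print(f"Attempt {attempt} failed with exit {code}")
    return None
```

In real use one would pass `os.environ` as `environ` and the evolve command shown in the log as `command`; the `runner` hook exists only so the rotation logic can be exercised without calling any API.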