Fix nightly evolution reliability and enforce real skill mutation #17

Open
kirniy wants to merge 1 commit into NousResearch:main from kirniy:main

@kirniy kirniy commented Apr 12, 2026

## What
- add reliable nightly runner scripts with key-pool rotation
- reduce nightly default iterations to 1 to avoid long optimizer overruns
- add anti-gaming mutation gate and fallback body rewrite when DSPy only mutates prompt internals
- evaluate evolved body via a dedicated SkillModule and persist mutation metadata

## Why
Nightly runs were repeatedly failing or producing non-promotable no-op candidates (). This patch makes the pipeline produce real, auditable skill mutations and stable nightly artifacts.

## Validation
- reset pool + run

Running nightly skill evolution: github-code-review

Repo: /Users/kirniy/dev/hermes-agent-self-evolution
Hermes repo: /Users/kirniy/.hermes/hermes-agent
Pool size: 4
Command: python -m evolution.skills.evolve_skill --skill github-code-review --iterations 1 --eval-source synthetic --optimizer-model gemini/gemini-3.1-pro-preview --eval-model gemini/gemini-3.1-pro-preview

Attempt 1/4 with key GOOGLE_API_KEY_3

🧬 Hermes Agent Self-Evolution — Evolving skill: github-code-review

Loaded: skills/github/github-code-review/SKILL.md
Name: github-code-review
Size: 13,555 chars
Description: Review code changes by analyzing git diffs, leaving inline
comments on PRs, and ...

Building evaluation dataset (source: synthetic)
Generated 20 synthetic examples
Saved to datasets/skills/github-code-review/
Split: 10 train / 5 val / 5 holdout

Validating baseline constraints
✓ size_limit: Size OK: 13555/15000 chars
✓ non_empty: Artifact is non-empty
✓ skill_structure: Skill has valid frontmatter (name + description)

Configuring optimizer
Optimizer: GEPA (1 iterations)
Optimizer model: gemini/gemini-3.1-pro-preview
Eval model: gemini/gemini-3.1-pro-preview

Running GEPA optimization (1 iterations)...

GEPA not available (GEPA.__init__() got an unexpected keyword argument
'max_steps'), falling back to MIPROv2
Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6
Bootstrapped 2 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 4/6
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 5/6
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/6
Bootstrapped 2 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.
Error getting data summary: litellm.BadRequestError: GeminiException BadRequestError - {
"error": {
"code": 400,
"message": "API key expired. Please renew the API key.",
"status": "INVALID_ARGUMENT",
"details": [
{
"@type": "type.googleapis.com/google.rpc.ErrorInfo",
"reason": "API_KEY_INVALID",
"domain": "googleapis.com",
"metadata": {
"service": "generativelanguage.googleapis.com"
}
},
{
"@type": "type.googleapis.com/google.rpc.LocalizedMessage",
"locale": "en-US",
"message": "API key expired. Please renew the API key."
}
]
}
}
.

Running without data aware proposer.

Attempt 1 failed with exit 1
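
The retry behaviour visible in this log (an expired key fails, then the next key in the pool is tried) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the runner script from this patch: the function name `run_with_key_pool`, the `target_var` parameter, and the injectable `runner` are all hypothetical.

```python
import subprocess
from typing import Callable, Mapping, Sequence


def run_with_key_pool(
    command: Sequence[str],
    key_names: Sequence[str],
    env: Mapping[str, str],
    runner: Callable[..., "subprocess.CompletedProcess"] = subprocess.run,
    target_var: str = "GOOGLE_API_KEY",  # assumed name of the variable the child reads
) -> int:
    """Run `command` once per pooled key and stop at the first zero exit code."""
    total = len(key_names)
    exit_code = 1  # reported if every key fails or none is set
    for attempt, key_name in enumerate(key_names, start=1):
        if key_name not in env:
            print(f"Skipping {key_name}: not set")
            continue
        print(f"Attempt {attempt}/{total} with key {key_name}")
        # Copy the environment so one attempt never leaks its key into the next.
        child_env = dict(env)
        child_env[target_var] = env[key_name]
        exit_code = runner(list(command), env=child_env).returncode
        if exit_code == 0:
            return 0
        print(f"Attempt {attempt} failed with exit {exit_code}")
    return exit_code
```

Passing the runner in as a parameter keeps the rotation logic testable without spawning a real optimizer process.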

Attempt 2/4 with key GOOGLE_API_KEY_2

🧬 Hermes Agent Self-Evolution — Evolving skill: github-code-review

Loaded: skills/github/github-code-review/SKILL.md
Name: github-code-review
Size: 13,555 chars
Description: Review code changes by analyzing git diffs, leaving inline
comments on PRs, and ...

Building evaluation dataset (source: synthetic)
Generated 20 synthetic examples
Saved to datasets/skills/github-code-review/
Split: 10 train / 5 val / 5 holdout

Validating baseline constraints
✓ size_limit: Size OK: 13555/15000 chars
✓ non_empty: Artifact is non-empty
✓ skill_structure: Skill has valid frontmatter (name + description)

Configuring optimizer
Optimizer: GEPA (1 iterations)
Optimizer model: gemini/gemini-3.1-pro-preview
Eval model: gemini/gemini-3.1-pro-preview

Running GEPA optimization (1 iterations)...

GEPA not available (GEPA.__init__() got an unexpected keyword argument
'max_steps'), falling back to MIPROv2
Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6
Bootstrapped 2 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 4/6
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 5/6
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/6
Bootstrapped 2 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.

Average Metric: 4.66 / 8 (58.2%): 100%|██████████| 8/8 [00:00<00:00, 341.14it/s]

Average Metric: 4.93 / 8 (61.7%): 100%|██████████| 8/8 [00:18<00:00, 2.25s/it]

Average Metric: 4.72 / 8 (59.0%): 100%|██████████| 8/8 [00:27<00:00, 3.41s/it]

Average Metric: 4.83 / 8 (60.4%): 100%|██████████| 8/8 [00:17<00:00, 2.13s/it]

Average Metric: 3.61 / 6 (60.1%): 75%|███████▌ | 6/8 [00:11<00:02, 1.07s/it]

- latest run produced fresh metrics with , ,
- => for latest artifact
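
For reviewers, the idea behind the anti-gaming mutation gate listed under "What" can be illustrated with a small sketch. It assumes the gate compares the baseline skill body with the evolved body and rejects candidates whose body is unchanged after whitespace normalization. The names `GateResult` and `mutation_gate` are hypothetical, and the actual check in this patch may differ.

```python
import difflib
from dataclasses import dataclass


@dataclass
class GateResult:
    promotable: bool
    similarity: float
    reason: str


def _normalize(text: str) -> str:
    # Collapse all whitespace so pure reflow or trailing newlines do not count as a mutation.
    return " ".join(text.split())


def mutation_gate(baseline_body: str, evolved_body: str) -> GateResult:
    """Reject candidates whose skill body is empty or effectively unchanged."""
    base = _normalize(baseline_body)
    cand = _normalize(evolved_body)
    if not cand:
        return GateResult(False, 0.0, "evolved body is empty")
    if base == cand:
        return GateResult(False, 1.0, "no-op: body unchanged")
    # Similarity is recorded as metadata only; this sketch applies no threshold to it.
    similarity = difflib.SequenceMatcher(None, base, cand).ratio()
    return GateResult(True, similarity, "body changed")
```

A candidate that only alters optimizer-internal prompt text leaves the body identical, so it would be flagged as a no-op here rather than promoted.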
