Fix nightly evolution reliability and enforce real skill mutation by kirniy · Pull Request #17 · NousResearch/hermes-agent-self-evolution

kirniy · 2026-04-12T02:51:07Z

## What
- add reliable nightly runner scripts with key-pool rotation
- reduce nightly default iterations to 1 to avoid long optimizer overruns
- add anti-gaming mutation gate and fallback body rewrite when DSPy only mutates prompt internals
- evaluate evolved body via a dedicated SkillModule and persist mutation metadata

## Why
Nightly runs were repeatedly failing or producing non-promotable no-op candidates (). This patch makes the pipeline produce real, auditable skill mutations and stable nightly artifacts.

## Validation
- reset pool + run

Running nightly skill evolution: github-code-review

Repo: /Users/kirniy/dev/hermes-agent-self-evolution
Hermes repo: /Users/kirniy/.hermes/hermes-agent
Pool size: 4
Command: python -m evolution.skills.evolve_skill --skill github-code-review --iterations 1 --eval-source synthetic --optimizer-model gemini/gemini-3.1-pro-preview --eval-model gemini/gemini-3.1-pro-preview

Attempt 1/4 with key GOOGLE_API_KEY_3

🧬 Hermes Agent Self-Evolution — Evolving skill: github-code-review

Loaded: skills/github/github-code-review/SKILL.md
Name: github-code-review
Size: 13,555 chars
Description: Review code changes by analyzing git diffs, leaving inline
comments on PRs, and ...

Building evaluation dataset (source: synthetic)
Generated 20 synthetic examples
Saved to datasets/skills/github-code-review/
Split: 10 train / 5 val / 5 holdout

Validating baseline constraints
✓ size_limit: Size OK: 13555/15000 chars
✓ non_empty: Artifact is non-empty
✓ skill_structure: Skill has valid frontmatter (name + description)

Configuring optimizer
Optimizer: GEPA (1 iterations)
Optimizer model: gemini/gemini-3.1-pro-preview
Eval model: gemini/gemini-3.1-pro-preview

Running GEPA optimization (1 iterations)...

GEPA not available (GEPA.init() got an unexpected keyword argument
'max_steps'), falling back to MIPROv2
Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6
Bootstrapped 2 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 4/6
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 5/6
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/6
Bootstrapped 2 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.
Error getting data summary: litellm.BadRequestError: GeminiException BadRequestError - {
"error": {
"code": 400,
"message": "API key expired. Please renew the API key.",
"status": "INVALID_ARGUMENT",
"details": [
{
"@type": "type.googleapis.com/google.rpc.ErrorInfo",
"reason": "API_KEY_INVALID",
"domain": "googleapis.com",
"metadata": {
"service": "generativelanguage.googleapis.com"
}
},
{
"@type": "type.googleapis.com/google.rpc.LocalizedMessage",
"locale": "en-US",
"message": "API key expired. Please renew the API key."
}
]
}
}
.

Running without data aware proposer.

Attempt 1 failed with exit 1

Attempt 2/4 with key GOOGLE_API_KEY_2

🧬 Hermes Agent Self-Evolution — Evolving skill: github-code-review

Loaded: skills/github/github-code-review/SKILL.md
Name: github-code-review
Size: 13,555 chars
Description: Review code changes by analyzing git diffs, leaving inline
comments on PRs, and ...

Building evaluation dataset (source: synthetic)
Generated 20 synthetic examples
Saved to datasets/skills/github-code-review/
Split: 10 train / 5 val / 5 holdout

Validating baseline constraints
✓ size_limit: Size OK: 13555/15000 chars
✓ non_empty: Artifact is non-empty
✓ skill_structure: Skill has valid frontmatter (name + description)

Configuring optimizer
Optimizer: GEPA (1 iterations)
Optimizer model: gemini/gemini-3.1-pro-preview
Eval model: gemini/gemini-3.1-pro-preview

Running GEPA optimization (1 iterations)...

GEPA not available (GEPA.init() got an unexpected keyword argument
'max_steps'), falling back to MIPROv2
Bootstrapping set 1/6
Bootstrapping set 2/6
Bootstrapping set 3/6
Bootstrapped 2 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.
Bootstrapping set 4/6
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 5/6
Bootstrapped 1 full traces after 1 examples for up to 1 rounds, amounting to 1 attempts.
Bootstrapping set 6/6
Bootstrapped 2 full traces after 1 examples for up to 1 rounds, amounting to 2 attempts.

Average Metric: 4.66 / 8 (58.2%): 100%|██████████| 8/8 [00:00<00:00, 341.14it/s]

Average Metric: 4.93 / 8 (61.7%): 100%|██████████| 8/8 [00:18<00:00, 2.25s/it]

Average Metric: 4.72 / 8 (59.0%): 100%|██████████| 8/8 [00:27<00:00, 3.41s/it]

Average Metric: 4.83 / 8 (60.4%): 100%|██████████| 8/8 [00:17<00:00, 2.13s/it]

Average Metric: 3.61 / 6 (60.1%): 75%|███████▌ | 6/8 [00:11<00:02, 1.07s/it]

- latest run produced fresh metrics with , , 
- => for latest artifact
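
For reviewers who want a picture of the key-pool rotation visible in the log (attempt 1 fails on an expired key, attempt 2 proceeds with the next one), here is a minimal sketch. It is not the runner script from this PR: the function name `run_with_key_pool`, the `target_var` default of `GOOGLE_API_KEY`, and the skip-if-unset behaviour are all assumptions made for illustration.

```python
import os
import subprocess
import sys
from typing import Mapping, Optional, Sequence


def run_with_key_pool(
    command: Sequence[str],
    key_names: Sequence[str],
    env: Optional[Mapping[str, str]] = None,
    target_var: str = "GOOGLE_API_KEY",  # assumption: the variable the child process reads
) -> Optional[str]:
    """Run `command` once per pooled key until an attempt exits with status 0.

    Returns the name of the key that succeeded, or None if every key failed.
    """
    base_env = dict(os.environ if env is None else env)
    total = len(key_names)
    for attempt, key_name in enumerate(key_names, start=1):
        key_value = base_env.get(key_name)
        if not key_value:
            # Hypothetical policy: an unset pool entry is skipped, not fatal.
            print(f"Attempt {attempt}/{total}: {key_name} is not set, skipping")
            continue
        print(f"Attempt {attempt}/{total} with key {key_name}")
        # Expose the selected pool key under the name the child expects.
        child_env = {**base_env, target_var: key_value}
        result = subprocess.run(list(command), env=child_env)
        if result.returncode == 0:
            return key_name
        print(f"Attempt {attempt} failed with exit {result.returncode}")
    return None
```

A nightly wrapper could call this with the `python -m evolution.skills.evolve_skill ...` command from the log and the pool's key names, then exit non-zero when it returns `None`, so that a single expired key does not fail the whole run.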

Fix nightly evolution reliability and enforce real skill mutation

1656eff
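
The anti-gaming mutation gate named in the PR summary is not shown on this page, so the following is only a sketch of the idea under stated assumptions: a candidate counts as a real mutation when the skill body (everything after the frontmatter) differs from the baseline by more than a small fraction. The names `skill_body` and `is_real_mutation`, the `min_change` threshold, and the use of `difflib` are hypothetical choices, not the PR's implementation.

```python
import difflib
import re

# Assumption: SKILL.md starts with YAML frontmatter delimited by '---' lines.
FRONTMATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)


def skill_body(text: str) -> str:
    """Drop the frontmatter and trailing whitespace so cosmetic edits compare equal."""
    body = FRONTMATTER.sub("", text, count=1)
    return "\n".join(line.rstrip() for line in body.strip().splitlines())


def is_real_mutation(baseline: str, candidate: str, min_change: float = 0.01) -> bool:
    """Return True only if the candidate's body meaningfully differs from the baseline.

    `min_change` is a hypothetical threshold on (1 - similarity ratio).
    """
    old, new = skill_body(baseline), skill_body(candidate)
    if old == new:
        return False  # no-op candidate: nothing to promote
    similarity = difflib.SequenceMatcher(None, old, new).ratio()
    return (1.0 - similarity) >= min_change
```

A gate like this would reject candidates where only whitespace or frontmatter changed, which is one way to read "non-promotable no-op candidates"; a candidate that fails it could then be routed to the fallback body rewrite the summary mentions.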

kirniy wants to merge 1 commit into NousResearch:main from kirniy:main
