
feat: MAD Confidence Scoring + HermesJudge [polished] #21

Open
AIandI0x1 wants to merge 1 commit into NousResearch:main from AIandI0x1:feat/mad-confidence-scoring-polished

Conversation

@AIandI0x1

Statistically rigorous quality gates for skill evolution.

What

When you evolve a skill, the system now tells you whether the improvement is real or noise — with proof.

Math: confidence = |mean_delta| / MAD, computed over per-example deltas on a holdout set.

  • >= 2.0x → "likely real" — keep
  • >= 1.0x → "marginal" — borderline
  • < 1.0x → "within noise" — discard
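A minimal sketch of that math in plain Python (function names are illustrative; the real implementation lives in evolution/core/mad_scoring.py):

```python
# Sketch of MAD-normalized confidence scoring over holdout deltas.
# Illustrative only; not the actual mad_scoring.py API.
import statistics

def mad(xs):
    """Median absolute deviation from the median."""
    med = statistics.median(xs)
    return statistics.median(abs(x - med) for x in xs)

def confidence(deltas):
    """|mean delta| in units of MAD; inf when the deltas never vary."""
    spread = mad(deltas)
    mean = statistics.mean(deltas)
    return abs(mean) / spread if spread > 0 else float("inf")

def label(conf):
    """Map a confidence ratio to the thresholds listed above."""
    if conf >= 2.0:
        return "likely real"
    if conf >= 1.0:
        return "marginal"
    return "within noise"
```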

New files

| File | Lines | Purpose |
| --- | --- | --- |
| evolution/core/mad_scoring.py | 387 | Pure MAD math (no dspy dependency for basic functions) |
| evolution/core/hermes_judge.py | 146 | LLM-as-judge via hermes chat subprocess |
| docs/MAD_CONFIDENCE.md | 70 | Technical documentation |
| tests/core/test_mad_scoring.py | 179 | Test coverage |

Modified files

| File | Change |
| --- | --- |
| evolution/skills/evolve_skill.py | Holdout MAD evaluation, proof.json, --mad-trials/--judge-model CLI flags |
| evolution/core/config.py | GPT-5.4 defaults, Nous Research API support |
| evolution/core/fitness.py | temperature=1.0 on the judge LM, MAD re-exports |
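The proof.json artifact could look roughly like the payload below. Every field name here is a hypothetical guess, not the schema evolve_skill.py actually writes; it only illustrates the kind of record described in this PR (run statistics plus the MIPROv2-optimized instructions and demos).

```python
# Illustrative proof.json payload; field names are hypothetical, not the
# real schema written by evolve_skill.py.
import json

proof = {
    "skill": "arxiv",
    "mad_trials": 3,
    "baseline_mean": 0.739,
    "evolved_mean": 0.656,
    "confidence": 3.32,
    "label": "likely real",
    "decision": "discard",
    # MIPROv2 outputs the PR says are captured:
    "optimized_instruction": "...",
    "demos": [{"input": "...", "output": "..."}],
}
text = json.dumps(proof, indent=2)
```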

Proven results (see PR #20 for full evidence)

| Run | Baseline | Evolved | Confidence | Decision |
| --- | --- | --- | --- | --- |
| Arxiv regression | 0.739 | 0.656 | 3.32x (likely real) | discard ✓ |
| Self-evolution | 0.772 | 0.875 | 1.03x (marginal) | keep |
| Near-optimal | 0.968 | 0.948 | 1.00x (within noise) | discard ✓ |
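Note that confidence and direction are separate questions: the 3.32x run is the most statistically confident result here, yet it is discarded because its delta is negative. A toy decision rule capturing that logic (illustrative only, not the evolve_skill.py code):

```python
# Toy keep/discard rule: discard anything within noise, and discard any
# change whose mean delta is negative regardless of confidence.
# Illustrative only; thresholds mirror the labels described above.
def decision(mean_delta: float, confidence: float) -> str:
    if confidence < 1.0:
        return "discard"   # within noise: indistinguishable from variance
    if mean_delta < 0:
        return "discard"   # confidently (or marginally) worse
    return "keep"
```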

Usage

```
python -m evolution.skills.evolve_skill --skill arxiv --mad-trials 3 --judge-model "gpt-5.4"
```

Dependencies

  • hermes CLI (for HermesJudge subprocess)
  • No new pip dependencies
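A rough sketch of what the HermesJudge flow might look like: grade an answer by piping a prompt through the hermes CLI instead of calling a provider API directly (hence no API key management). The subcommand, flags, and prompt format below are assumptions; check hermes_judge.py for the real interface.

```python
# Hypothetical LLM-as-judge via a hermes chat subprocess.
# The CLI invocation shape is an assumption, not the real hermes_judge.py.
import re
import subprocess

def parse_score(text: str) -> float:
    """Extract the judge's numeric score and clamp it to [0, 1]."""
    nums = re.findall(r"\d*\.?\d+", text)
    if not nums:
        raise ValueError(f"no numeric score in judge output: {text!r}")
    return max(0.0, min(1.0, float(nums[-1])))

def judge(task: str, answer: str, model: str = "gpt-5.4",
          run=subprocess.run) -> float:
    """Score `answer` for `task`; `run` is injectable so tests can fake the CLI."""
    prompt = (f"Rate this answer from 0.0 to 1.0.\n"
              f"Task: {task}\nAnswer: {answer}\n"
              f"Reply with only the number.")
    proc = run(["hermes", "chat", "--model", model],  # assumed invocation
               input=prompt, capture_output=True, text=True, check=True)
    return parse_score(proc.stdout)
```

Injecting `run` keeps the subprocess boundary testable without the hermes binary installed.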

Companion PR

This is the clean code PR. Full raw work with results, proofs, and test scripts is in PR #20.

Key features

  • HermesJudge: no API key management; scoring runs through a hermes chat subprocess
  • proof.json: captures the MIPROv2-optimized instructions and demos
  • Fix: the constraint validator now checks the full skill (including frontmatter)

Measured deltas

  • Arxiv regression caught: -11.2% (confidence 3.32x, likely real)
  • Self-evolution: +13.4% (confidence 1.03x, marginal)
  • Near-optimal baseline: -2.0% (confidence 1.00x, within noise)
