
feat: MAD Confidence Scoring + HermesJudge [polished] #21

Open
AIandI0x1 wants to merge 1 commit into NousResearch:main from AIandI0x1:feat/mad-confidence-scoring-polished

Conversation

@AIandI0x1

Statistically rigorous quality gates for skill evolution.

What

When you evolve a skill, the system now tells you whether the improvement is real or noise — with proof.

Math: confidence = |mean_delta| / MAD, computed over per-example deltas on a holdout set.

  • >= 2.0x → "likely real" — keep
  • >= 1.0x → "marginal" — borderline
  • < 1.0x → "within noise" — discard
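A minimal sketch of that math in plain Python (function names are illustrative; the real implementation lives in evolution/core/mad_scoring.py):

```python
# Sketch of MAD-normalized confidence scoring over holdout deltas.
# Illustrative only; not the actual mad_scoring.py API.
import statistics

def mad(xs):
    """Median absolute deviation from the median."""
    med = statistics.median(xs)
    return statistics.median(abs(x - med) for x in xs)

def confidence(deltas):
    """|mean delta| in units of MAD; inf when the deltas never vary."""
    spread = mad(deltas)
    mean = statistics.mean(deltas)
    return abs(mean) / spread if spread > 0 else float("inf")

def label(conf):
    """Map a confidence ratio to the thresholds listed above."""
    if conf >= 2.0:
        return "likely real"
    if conf >= 1.0:
        return "marginal"
    return "within noise"
```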

New files

| File | Lines | Purpose |
| --- | --- | --- |
| evolution/core/mad_scoring.py | 387 | Pure MAD math (no dspy dependency for basic functions) |
| evolution/core/hermes_judge.py | 146 | LLM-as-judge via hermes chat subprocess |
| docs/MAD_CONFIDENCE.md | 70 | Technical documentation |
| tests/core/test_mad_scoring.py | 179 | Test coverage |

Modified files

| File | Change |
| --- | --- |
| evolution/skills/evolve_skill.py | Holdout MAD evaluation, proof.json, --mad-trials/--judge-model CLI flags |
| evolution/core/config.py | GPT-5.4 defaults, Nous Research API support |
| evolution/core/fitness.py | temperature=1.0 on the judge LM, MAD re-exports |
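The proof.json artifact could look roughly like the payload below. Every field name here is a hypothetical guess, not the schema evolve_skill.py actually writes; it only illustrates the kind of record described in this PR (run statistics plus the MIPROv2-optimized instructions and demos).

```python
# Illustrative proof.json payload; field names are hypothetical, not the
# real schema written by evolve_skill.py.
import json

proof = {
    "skill": "arxiv",
    "mad_trials": 3,
    "baseline_mean": 0.739,
    "evolved_mean": 0.656,
    "confidence": 3.32,
    "label": "likely real",
    "decision": "discard",
    # MIPROv2 outputs the PR says are captured:
    "optimized_instruction": "...",
    "demos": [{"input": "...", "output": "..."}],
}
text = json.dumps(proof, indent=2)
```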

Proven results (see PR #20 for full evidence)

| Run | Baseline | Evolved | Confidence | Decision |
| --- | --- | --- | --- | --- |
| Arxiv regression | 0.739 | 0.656 | 3.32x (likely real) | discard ✓ |
| Self-evolution | 0.772 | 0.875 | 1.03x (marginal) | keep |
| Near-optimal | 0.968 | 0.948 | 1.00x (within noise) | discard ✓ |
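Note that confidence and direction are separate questions: the 3.32x run is the most statistically confident result here, yet it is discarded because its delta is negative. A toy decision rule capturing that logic (illustrative only, not the evolve_skill.py code):

```python
# Toy keep/discard rule: discard anything within noise, and discard any
# change whose mean delta is negative regardless of confidence.
# Illustrative only; thresholds mirror the labels described above.
def decision(mean_delta: float, confidence: float) -> str:
    if confidence < 1.0:
        return "discard"   # within noise: indistinguishable from variance
    if mean_delta < 0:
        return "discard"   # confidently (or marginally) worse
    return "keep"
```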

Usage

```
python -m evolution.skills.evolve_skill --skill arxiv --mad-trials 3 --judge-model "gpt-5.4"
```

Dependencies

  • hermes CLI (for HermesJudge subprocess)
  • No new pip dependencies
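A rough sketch of what the HermesJudge flow might look like: grade an answer by piping a prompt through the hermes CLI instead of calling a provider API directly (hence no API key management). The subcommand, flags, and prompt format below are assumptions; check hermes_judge.py for the real interface.

```python
# Hypothetical LLM-as-judge via a hermes chat subprocess.
# The CLI invocation shape is an assumption, not the real hermes_judge.py.
import re
import subprocess

def parse_score(text: str) -> float:
    """Extract the judge's numeric score and clamp it to [0, 1]."""
    nums = re.findall(r"\d*\.?\d+", text)
    if not nums:
        raise ValueError(f"no numeric score in judge output: {text!r}")
    return max(0.0, min(1.0, float(nums[-1])))

def judge(task: str, answer: str, model: str = "gpt-5.4",
          run=subprocess.run) -> float:
    """Score `answer` for `task`; `run` is injectable so tests can fake the CLI."""
    prompt = (f"Rate this answer from 0.0 to 1.0.\n"
              f"Task: {task}\nAnswer: {answer}\n"
              f"Reply with only the number.")
    proc = run(["hermes", "chat", "--model", model],  # assumed invocation
               input=prompt, capture_output=True, text=True, check=True)
    return parse_score(proc.stdout)
```

Injecting `run` keeps the subprocess boundary testable without the hermes binary installed.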

Companion PR

This is the clean code PR. Full raw work with results, proofs, and test scripts is in PR #20.

Key features

  • HermesJudge: no API key management; scoring runs through a hermes chat subprocess
  • proof.json: captures the MIPROv2-optimized instructions and demos
  • Fix: the constraint validator now checks the full skill (including frontmatter)

Measured deltas

  • Arxiv regression caught: -11.2% (confidence 3.32x, likely real)
  • Self-evolution: +13.4% (confidence 1.03x, marginal)
  • Near-optimal baseline: -2.0% (confidence 1.00x, within noise)
