Phase 1: auto-merge gate, proposal system, reviewer CLI, nightly pipeline, docs by mkisontop · Pull Request #30 · NousResearch/hermes-agent-self-evolution

mkisontop · 2026-04-18T14:49:11Z

Summary

Phase 1 complete — end-to-end self-evolution pipeline for Hermes Agent skills.

10 commits from 4693c8f → 891329d:

AutoMergeGate with regression detection + bug fixes
Proposal artifact writer (15 tests)
Safe write-back with timestamped backups
T4 propose-mode E2E smoke tier + T5 zero-token tier
ProposalReviewer CLI (list/show/diff/approve/reject)
Nightly digest module with filters & renderers
nightly.sh pipeline (smoke → evolve → digest)
ARCHITECTURE.md + OPERATIONS.md + refreshed README

Verification

Tests: 218/218 passing (1.55s)
Smoke: default t1+t5 = 10.3s, zero tokens
Real propose-mode E2E: completed exit 0 — 3 iterations, 192.5s, MIPROv2 optimized, regression gate correctly caught -0.047 drop, proposal bundle written to proposals/github-code-review/20260418_181611/ with all 6 artifacts (baseline_skill.md, evolved_skill.md, constraints.json, decision.json, diff.patch, review.md)
Reviewer CLI: list and show both verified against real proposal

Behavior

Nightly defaults to propose-mode (no auto-merge)
Auto-merge requires: improvement ≥ threshold AND all constraints pass AND non-regression
Regressions always land as pending proposals for human review
All overwrites backed up with timestamp before write
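The three auto-merge conditions above can be sketched as a small pure function. This is an illustrative sketch only, not the PR's `AutoMergeGate`: the names `GateDecision` and `evaluate_gate` and the default thresholds are hypothetical, chosen to mirror the `--min-improvement` and `--regression-tolerance` flags.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    auto_merge: bool   # True only if every condition holds
    regression: bool   # True if the evolved score dropped beyond tolerance
    reason: str


def evaluate_gate(
    baseline: float,
    evolved: float,
    constraints_pass: bool,
    min_improvement: float = 0.02,       # hypothetical default
    regression_tolerance: float = 0.0,   # hypothetical default
) -> GateDecision:
    """Decide whether an evolved skill may be merged without human review."""
    delta = evolved - baseline
    if delta < -regression_tolerance:
        # Regressions never auto-merge; they become pending proposals.
        return GateDecision(False, True, f"regression: delta={delta:+.3f}")
    if not constraints_pass:
        return GateDecision(False, False, "constraints failed")
    if delta < min_improvement:
        return GateDecision(False, False, f"below threshold: delta={delta:+.3f}")
    return GateDecision(True, False, f"improved: delta={delta:+.3f}")
```

With these assumed defaults, a drop like the -0.047 seen in the verification run would be flagged as a regression and routed to human review.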

Ready for review.

Bug 1 (no-op optimization): SkillModule previously passed the skill text as a runtime InputField, which GEPA/MIPROv2 could not mutate. The skill body now lives in the Predictor's signature.instructions, which is the parameter both optimizers actually rewrite. The evolved body is read from predictor.predict.signature.instructions.

Bug 2 (frontmatter validation): baseline and evolved were both validated body-only, but the skill_structure check wanted YAML frontmatter and failed. Now validate the full reassembled doc against the baseline_text body for the growth delta.

LM hang mitigation: dspy.LM(..., timeout=120, num_retries=2) forwarded to litellm. Prevents a single hung call from wedging the whole eval.

Parallel holdout eval: dspy.evaluate.Evaluate with num_threads=4, display_progress=True, and a max_errors tolerance. Replaces the silent serial loop that hung on the 2026-04-18 smoke run.

AutoMergeGate wired: evaluate(baseline, evolved, evolved_pass), persist the decision to metrics.json, exit 2 on regression so cron catches it.

Adds --mode {propose,auto}, --min-improvement, --regression-tolerance, and --proposals-dir CLI flags (propose write-path still TODO).

All 143 tests pass.

- evolution/core/proposals.py: ProposalWriter, ProposalRecord, build_proposal_record, ConstraintRecord
- Writes 7 artifacts per proposal: baseline_skill.md, evolved_skill.md, diff.patch (unified), decision.json, constraints.json, review.md, STATUS
- Layout: {root}/{skill_name}/{YYYYMMDD_HHMMSS}/
- evolve_skill.py: wires propose-mode output (§9c) + auto-mode write-back (§9d)
- 15 tests cover record construction, artifact shape, diff content, regression flagging, multi-proposal layout
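A minimal sketch of the directory layout and the unified diff, assuming plain-text skills. It is not the PR's `ProposalWriter`: the function name `write_proposal_sketch` is hypothetical, only four of the seven artifacts are written, and the `PENDING` content of `STATUS` is an assumption about the file format.

```python
import difflib
from datetime import datetime
from pathlib import Path
from typing import Optional


def write_proposal_sketch(
    root: Path,
    skill_name: str,
    baseline: str,
    evolved: str,
    now: Optional[datetime] = None,
) -> Path:
    """Write a partial proposal bundle under {root}/{skill_name}/{YYYYMMDD_HHMMSS}/."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    out = root / skill_name / stamp
    # exist_ok=False: refuse to overwrite a proposal created in the same second.
    out.mkdir(parents=True, exist_ok=False)

    (out / "baseline_skill.md").write_text(baseline, encoding="utf-8")
    (out / "evolved_skill.md").write_text(evolved, encoding="utf-8")

    diff = difflib.unified_diff(
        baseline.splitlines(keepends=True),
        evolved.splitlines(keepends=True),
        fromfile="baseline_skill.md",
        tofile="evolved_skill.md",
    )
    (out / "diff.patch").write_text("".join(diff), encoding="utf-8")

    # Assumed format: a single state word on the first line.
    (out / "STATUS").write_text("PENDING\n", encoding="utf-8")
    return out
```

Because the timestamp sorts lexicographically, the latest proposal for a skill is simply the last entry in a sorted directory listing.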

- Extract §9d into evolution/core/write_back.py (testable helper)
- Atomic overwrite via .tmp + rename
- Timestamped backup in .backups/ before every merge
- 12 new tests (guard rails, happy path, rollback safety)
- Total: 170 tests passing
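The backup-then-rename pattern listed above might look roughly like this. It is a sketch under assumptions, not the contents of write_back.py: the function name `write_back_sketch`, the backup file naming, and the placement of `.backups/` next to the live file are all hypothetical.

```python
import os
import shutil
from datetime import datetime
from pathlib import Path


def write_back_sketch(live_path: Path, new_text: str) -> Path:
    """Back up live_path, then atomically replace its contents. Returns the backup path."""
    if not live_path.is_file():
        raise FileNotFoundError(live_path)

    # 1. Timestamped backup before touching the live file.
    backup_dir = live_path.parent / ".backups"
    backup_dir.mkdir(exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    backup = backup_dir / f"{live_path.name}.{stamp}"
    shutil.copy2(live_path, backup)

    # 2. Write the new content to a sibling .tmp file, then rename over the
    #    live file. os.replace is atomic when both paths are on the same
    #    filesystem, so readers see either the old or the new file, never a mix.
    tmp = live_path.with_name(live_path.name + ".tmp")
    try:
        tmp.write_text(new_text, encoding="utf-8")
        os.replace(tmp, live_path)
    finally:
        # If the write or rename failed, do not leave a stray .tmp behind.
        tmp.unlink(missing_ok=True)
    return backup
```

Keeping the temporary file in the same directory as the target is what makes the rename atomic; a temp file on another filesystem would turn the rename into a copy.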

Validates the full propose-mode contract:
- propose-mode run completes with exit 0
- live bundled SKILL.md mtime+sha256 unchanged
- proposal dir created with all 6 required artifacts
- decision.json schema valid (mode=propose, auto_merge=False)
- no .bak files created in live skill dir

Keeps t1 baseline + t4 propose-mode separate so nightly crons can fail fast on either tier.
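One way to check the "mtime+sha256 unchanged" item is to fingerprint the live file before and after the run. This is a hedged sketch; `Fingerprint` and `fingerprint` are hypothetical names, not helpers from the smoke test.

```python
import hashlib
from pathlib import Path
from typing import NamedTuple


class Fingerprint(NamedTuple):
    mtime_ns: int   # modification time in nanoseconds
    sha256: str     # hex digest of the file contents


def fingerprint(path: Path) -> Fingerprint:
    """Capture enough state to detect any rewrite of the file."""
    return Fingerprint(
        mtime_ns=path.stat().st_mtime_ns,
        sha256=hashlib.sha256(path.read_bytes()).hexdigest(),
    )
```

A test would call `fingerprint()` on the live SKILL.md before the propose-mode run and compare it for equality afterwards. Checking the hash as well as the mtime catches a rewrite with identical content only through the mtime, and a content change with a preserved mtime only through the hash.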

CLI for walking the proposals/ tree, inspecting proposals, and flipping their STATUS tombstones:

list [--status pending|approved|rejected]
    table of all proposals with icon + Δ score + merge-mode
show <skill> [timestamp]
    print review.md (latest if timestamp omitted)
diff <skill> [timestamp]
    print diff.patch
approve <skill> [timestamp] [--no-merge] [--force] [--approved-by]
    flip STATUS → APPROVED (with approver metadata), then unless --no-merge, call write_back_skill() to atomically overwrite the live bundled SKILL.md with a timestamped backup. Writes merge_receipt.json into the proposal dir for auditability.
reject <skill> [timestamp] [--reason] [--force] [--rejected-by]
    flip STATUS → REJECTED with reason + rejecter metadata

Safety:
- re-uses existing write_back_skill() safety path (atomic + backup)
- --force required to approve-rejected or reject-approved
- STATUS flipped before write-back so disk is authoritative even if merge fails partway
- approve with missing live skill returns exit 4 (STATUS still flipped)

24 tests covering discovery, listing, filtering, show/diff, all approve paths (no-merge / write-back / backup creation / idempotence / --force guard / missing-live-skill), and reject paths. Full suite now 194/194 passing.
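The `--force` guard and idempotent status flip could be sketched as below. This is illustrative only: `flip_status` and `StatusConflict` are hypothetical names, the `STATUS` file is assumed to hold the state as its first word, and the approver/rejecter metadata the real CLI records is omitted.

```python
from pathlib import Path

VALID = {"PENDING", "APPROVED", "REJECTED"}


class StatusConflict(Exception):
    """Raised when a decided proposal would be reversed without force."""


def flip_status(proposal_dir: Path, new_status: str, force: bool = False) -> str:
    """Set the proposal's status and return the previous one."""
    if new_status not in VALID:
        raise ValueError(f"unknown status: {new_status}")

    status_file = proposal_dir / "STATUS"
    words = status_file.read_text(encoding="utf-8").split() if status_file.exists() else []
    current = words[0] if words else "PENDING"  # assumed default for a fresh proposal

    if current == new_status:
        return current  # idempotent: approving twice is a no-op

    decided = current in {"APPROVED", "REJECTED"}
    if decided and not force:
        # approve-rejected and reject-approved both need an explicit override
        raise StatusConflict(f"{current} -> {new_status} requires force")

    status_file.write_text(new_status + "\n", encoding="utf-8")
    return current
```

In this sketch the status is written before any merge would happen, matching the ordering described above where the on-disk STATUS stays authoritative even if a later write-back fails.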

- T4 (full E2E optimization) runs real GEPA + evals = ~4min, real tokens. Unsuitable for nightly cron; keep for manual weekly verification.
- NEW T5: propose-mode structural dry-run. ~2s, zero tokens. Validates CLI flags, skill discovery, frontmatter, proposals-dir writability.
- nightly.sh phase 1 now runs t1+t5 (both zero-token, ~10s total).
- smoke_test.sh default tier is now t1+t5. Use 'full' for the expensive T2/T3/T4 tiers.
- Fix evolve_skill.py: propose-mode no longer exits non-zero on regression, because writing a proposal for review IS the success path in propose mode. Only auto-mode should exit non-zero on regression.
- Tests: 218/218 passing.

Nightly token cost: ~4min of LLM spend → 0 tokens in preflight. All actual evolution still happens in phase 2.
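The exit-code rule from the evolve_skill.py fix can be stated as a tiny function. This is a sketch, assuming that auto-mode keeps the exit code 2 on regression mentioned in the earlier gate commit; the function name `exit_code_for` is hypothetical.

```python
def exit_code_for(mode: str, regression: bool) -> int:
    """Map a run outcome to a process exit code for cron."""
    if mode not in ("propose", "auto"):
        raise ValueError(f"unknown mode: {mode}")
    if mode == "auto" and regression:
        return 2  # assumed: auto-mode signals regression so cron can alert
    # In propose mode a regression still yields a written proposal,
    # which counts as success.
    return 0
```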

…h header

- docs/ARCHITECTURE.md: module map, data flow, extension points (16KB)
- docs/OPERATIONS.md: daily workflow, cron, proposal review CLI, troubleshooting (12KB)
- README.md: rewritten around nightly workflow + reviewer CLI
- nightly.sh: header comment updated to reflect t1+t5 (was stale 't1+t4')

mkisontop added 10 commits April 17, 2026 23:47

feat(evolution): AutoMergeGate with regression detection

881c2f0

feat(review): nightly digest module (filters, renderers, CLI)

3938e13

feat(ops): nightly.sh pipeline (smoke → evolve → digest)

d9d0566

mkisontop wants to merge 10 commits into NousResearch:main from mkisontop:main
