
Phase 1: auto-merge gate, proposal system, reviewer CLI, nightly pipeline, docs #30

Open
mkisontop wants to merge 10 commits into NousResearch:main from mkisontop:main


Summary

Phase 1 complete — end-to-end self-evolution pipeline for Hermes Agent skills.

10 commits from 4693c8f to 891329d:

  • AutoMergeGate with regression detection + bug fixes
  • Proposal artifact writer (15 tests)
  • Safe write-back with timestamped backups
  • T4 propose-mode E2E smoke tier + T5 zero-token tier
  • ProposalReviewer CLI (list/show/diff/approve/reject)
  • Nightly digest module with filters & renderers
  • nightly.sh pipeline (smoke → evolve → digest)
  • ARCHITECTURE.md + OPERATIONS.md + refreshed README

Verification

  • Tests: 218/218 passing (1.55s)
  • Smoke: default t1+t5 = 10.3s, zero tokens
  • Real propose-mode E2E: completed with exit 0 (3 iterations, 192.5s, MIPROv2-optimized). The regression gate correctly caught a -0.047 drop, and the proposal bundle was written to proposals/github-code-review/20260418_181611/ with all 6 artifacts (baseline_skill.md, evolved_skill.md, constraints.json, decision.json, diff.patch, review.md)
  • Reviewer CLI: list and show both verified against real proposal

Behavior

  • Nightly defaults to propose-mode (no auto-merge)
  • Auto-merge requires: improvement ≥ threshold AND all constraints pass AND non-regression
  • Regressions always land as pending proposals for human review
  • All overwrites backed up with timestamp before write
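
The merge rule in the list above can be sketched as a small pure function. This is an illustration only, not the actual AutoMergeGate code: the `GateDecision` type, the `decide` name, and both default thresholds are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    """Hypothetical result type; the real decision.json schema may differ."""
    auto_merge: bool
    regression: bool
    reason: str


def decide(baseline: float, evolved: float, constraints_pass: bool,
           min_improvement: float = 0.02,
           regression_tolerance: float = 0.01) -> GateDecision:
    """Allow auto-merge only if all three conditions from the PR hold.

    The default thresholds are placeholders, not the project's values.
    """
    delta = evolved - baseline
    # A drop larger than the tolerance is a regression and never merges.
    if delta < -regression_tolerance:
        return GateDecision(False, True, f"regression: delta={delta:+.3f}")
    if not constraints_pass:
        return GateDecision(False, False, "constraint failure")
    if delta < min_improvement:
        return GateDecision(False, False, f"below threshold: delta={delta:+.3f}")
    return GateDecision(True, False, f"improved: delta={delta:+.3f}")
```

With these placeholder thresholds, a -0.047 drop like the one in the verification run is flagged as a regression and is left for human review.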

Ready for review.

Bug 1 (no-op optimization): SkillModule previously passed skill text as
a runtime InputField — GEPA/MIPROv2 could not mutate that. Now the
skill body lives as the Predictor's signature.instructions, which IS
the parameter both optimizers actually rewrite. Read evolved body from
predictor.predict.signature.instructions.

Bug 2 (frontmatter validation): baseline was validated on body-only,
evolved on body-only — skill_structure check wanted YAML frontmatter
and failed. Now validate the full reassembled doc against baseline_text
body for the growth delta.
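
A minimal sketch of the Bug 2 fix, assuming the skill file opens with a `---`-fenced YAML block and that growth is measured in characters. All function names here are hypothetical and are not taken from the repository.

```python
def split_frontmatter(text: str) -> tuple:
    """Return (frontmatter, body); frontmatter keeps both '---' fences."""
    if not text.startswith("---\n"):
        return "", text
    end = text.find("\n---\n", 3)  # closing fence; index 3 ends the opening fence
    if end == -1:
        return "", text
    cut = end + len("\n---\n")
    return text[:cut], text[cut:]


def reassemble(frontmatter: str, body: str) -> str:
    """Rebuild the full document that the structure check should see."""
    return frontmatter + body


def has_frontmatter(doc: str) -> bool:
    return split_frontmatter(doc)[0] != ""


def growth_delta(baseline_body: str, evolved_body: str) -> float:
    """Relative size change of the body (character count is an assumption)."""
    if not baseline_body:
        return 0.0
    return (len(evolved_body) - len(baseline_body)) / len(baseline_body)
```

The point of the split is that the optimizer only ever sees the body, while the structure check runs on `reassemble(frontmatter, evolved_body)` and the growth delta compares body against body.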

LM hang mitigation: dspy.LM(..., timeout=120, num_retries=2) forwarded
to litellm. Prevents a single hung call from wedging the whole eval.

Parallel holdout eval: dspy.evaluate.Evaluate with num_threads=4,
display_progress=True, max_errors tolerance — replaces silent serial
loop that hung on the 2026-04-18 smoke.
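
For readers unfamiliar with the pattern, the shape of a threaded eval with progress output and an error budget can be sketched with the standard library alone. This is a stand-in built on `concurrent.futures`, not the `dspy.evaluate.Evaluate` implementation; `evaluate_parallel`, `score_fn`, and scoring a failed example as 0.0 are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Iterable


def evaluate_parallel(examples: Iterable, score_fn: Callable[[object], float],
                      num_threads: int = 4, max_errors: int = 2) -> float:
    """Score examples on a thread pool and return the mean score.

    Failed examples count as 0.0 (an assumption); once more than
    max_errors examples have failed, the whole evaluation aborts.
    """
    scores = []
    errors = 0
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(score_fn, ex) for ex in examples]
        for done, fut in enumerate(as_completed(futures), start=1):
            try:
                scores.append(fut.result())
            except Exception:
                errors += 1
                if errors > max_errors:
                    for f in futures:
                        f.cancel()  # drop work that has not started yet
                    raise RuntimeError(f"aborted after {errors} errors")
                scores.append(0.0)
            # Visible progress, so a stall is noticed instead of silent.
            print(f"evaluated {done}/{len(futures)}", flush=True)
    return sum(scores) / len(scores) if scores else 0.0
```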

AutoMergeGate wired: evaluate(baseline, evolved, evolved_pass),
persist decision to metrics.json, exit 2 on regression so cron
catches it. Adds --mode {propose,auto}, --min-improvement,
--regression-tolerance, --proposals-dir CLI flags (propose
write-path still TODO).
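
The new flags could be declared roughly as follows. The flag names and the `--mode` choices come from this commit; the `prog` name, every default value, and the help strings are placeholders for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the four flags named in the commit message.

    All defaults below are assumptions, not the project's values.
    """
    p = argparse.ArgumentParser(prog="evolve_skill")
    p.add_argument("--mode", choices=["propose", "auto"], default="propose",
                   help="propose: write a proposal; auto: allow gated merge")
    p.add_argument("--min-improvement", type=float, default=0.02,
                   help="minimum score gain required for auto-merge")
    p.add_argument("--regression-tolerance", type=float, default=0.01,
                   help="largest score drop not treated as a regression")
    p.add_argument("--proposals-dir", default="proposals",
                   help="root directory for proposal bundles")
    return p
```

Usage: `build_parser().parse_args(["--mode", "auto", "--min-improvement", "0.05"])` yields a namespace with `mode`, `min_improvement`, `regression_tolerance`, and `proposals_dir` attributes.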

All 143 tests pass.

- evolution/core/proposals.py: ProposalWriter, ProposalRecord, build_proposal_record, ConstraintRecord
- Writes 7 artifacts per proposal: baseline_skill.md, evolved_skill.md, diff.patch (unified), decision.json, constraints.json, review.md, STATUS
- Layout: {root}/{skill_name}/{YYYYMMDD_HHMMSS}/
- evolve_skill.py: wires propose-mode output (§9c) + auto-mode write-back (§9d)
- 15 tests cover record construction, artifact shape, diff content, regression flagging, multi-proposal layout
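
A rough sketch of the bundle layout described above, using `difflib` for the unified diff. The `write_proposal` function, the review.md body, and the `PENDING` status string are assumptions; the real ProposalWriter may structure these differently.

```python
import difflib
import json
from datetime import datetime
from pathlib import Path
from typing import Optional


def write_proposal(root: Path, skill_name: str, baseline: str, evolved: str,
                   decision: dict, constraints: list,
                   now: Optional[datetime] = None) -> Path:
    """Write the 7 artifacts into {root}/{skill_name}/{YYYYMMDD_HHMMSS}/."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    out = root / skill_name / stamp
    out.mkdir(parents=True, exist_ok=False)  # never overwrite a bundle

    diff = "".join(difflib.unified_diff(
        baseline.splitlines(keepends=True),
        evolved.splitlines(keepends=True),
        fromfile="baseline_skill.md", tofile="evolved_skill.md"))

    (out / "baseline_skill.md").write_text(baseline, encoding="utf-8")
    (out / "evolved_skill.md").write_text(evolved, encoding="utf-8")
    (out / "diff.patch").write_text(diff, encoding="utf-8")
    (out / "decision.json").write_text(json.dumps(decision, indent=2),
                                       encoding="utf-8")
    (out / "constraints.json").write_text(json.dumps(constraints, indent=2),
                                          encoding="utf-8")
    # Placeholder review text; the real review.md is richer.
    (out / "review.md").write_text(f"# Proposal for {skill_name}\n",
                                   encoding="utf-8")
    (out / "STATUS").write_text("PENDING\n", encoding="utf-8")  # assumed value
    return out
```

Timestamped leaf directories keep multiple proposals for the same skill side by side, which is what the multi-proposal layout tests exercise.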
- Extract §9d into evolution/core/write_back.py (testable helper)
- Atomic overwrite via .tmp + rename
- Timestamped backup in .backups/ before every merge
- 12 new tests (guard rails, happy path, rollback safety)
- Total: 170 tests passing
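
The backup-then-atomic-rename pattern from this commit looks roughly like the sketch below. It is not the contents of write_back.py: the `write_back` name, the backup file naming, and the guard on a missing live file are assumptions.

```python
import os
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def write_back(live_path: Path, new_text: str,
               now: Optional[datetime] = None) -> Path:
    """Back up live_path, then replace it atomically. Returns the backup path."""
    if not live_path.is_file():
        raise FileNotFoundError(live_path)  # guard rail: never create new files

    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    backup_dir = live_path.parent / ".backups"
    backup_dir.mkdir(exist_ok=True)
    backup = backup_dir / f"{live_path.name}.{stamp}"  # naming is an assumption
    shutil.copy2(live_path, backup)  # backup is taken before any write

    tmp = live_path.with_name(live_path.name + ".tmp")
    try:
        tmp.write_text(new_text, encoding="utf-8")
        # os.replace is atomic when tmp and target share a filesystem.
        os.replace(tmp, live_path)
    finally:
        tmp.unlink(missing_ok=True)  # clean up if the write or rename failed
    return backup
```

Because the new text is fully written to the `.tmp` file before the rename, a crash mid-write leaves the live file untouched and the backup already in place.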
Validates the full propose-mode contract:
- propose-mode run completes with exit 0
- live bundled SKILL.md mtime+sha256 unchanged
- proposal dir created with all 6 required artifacts
- decision.json schema valid (mode=propose, auto_merge=False)
- no .bak files created in live skill dir

Keeps t1 baseline + t4 propose-mode separate so nightly crons can
fail fast on either tier.
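
The contract checks above could be expressed with a few helpers like these. This is a sketch, assuming the six artifact names listed in the PR summary; `fingerprint`, `missing_artifacts`, and `decision_ok` are hypothetical names and the actual smoke tier may check more.

```python
import hashlib
import json
from pathlib import Path

# The six artifact names listed in the PR's verification section.
REQUIRED_ARTIFACTS = (
    "baseline_skill.md", "evolved_skill.md", "constraints.json",
    "decision.json", "diff.patch", "review.md",
)


def fingerprint(path: Path) -> tuple:
    """(mtime in ns, sha256 hex): compare before and after the run."""
    return path.stat().st_mtime_ns, hashlib.sha256(path.read_bytes()).hexdigest()


def missing_artifacts(proposal_dir: Path) -> list:
    """Names of required artifacts that are absent from the bundle."""
    return [n for n in REQUIRED_ARTIFACTS if not (proposal_dir / n).is_file()]


def decision_ok(proposal_dir: Path) -> bool:
    """True if decision.json records a propose-mode, non-merging run."""
    data = json.loads((proposal_dir / "decision.json").read_text(encoding="utf-8"))
    return data.get("mode") == "propose" and data.get("auto_merge") is False
```

A smoke tier would take `fingerprint(live_skill)` before the run, repeat it afterwards, and fail if the two tuples differ.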
CLI for walking the proposals/ tree, inspecting proposals, and
flipping their STATUS tombstones:

  list [--status pending|approved|rejected]
    table of all proposals with icon + Δ score + merge-mode

  show <skill> [timestamp]
    print review.md (latest if timestamp omitted)

  diff <skill> [timestamp]
    print diff.patch

  approve <skill> [timestamp] [--no-merge] [--force] [--approved-by]
    flip STATUS → APPROVED (with approver metadata), then unless
    --no-merge, call write_back_skill() to atomically overwrite the
    live bundled SKILL.md with a timestamped backup. Writes
    merge_receipt.json into the proposal dir for auditability.

  reject <skill> [timestamp] [--reason] [--force] [--rejected-by]
    flip STATUS → REJECTED with reason + rejecter metadata

Safety:
- re-uses existing write_back_skill() safety path (atomic + backup)
- --force required to approve-rejected or reject-approved
- STATUS flipped before write-back so disk is authoritative even if
  merge fails partway
- approve with missing live skill returns exit 4 (STATUS still flipped)
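
The ordering in the safety notes (guard, flip STATUS, then merge) can be sketched as below. Only exit code 4 comes from this commit; the `approve` signature, the other exit codes, and the STATUS file contents are assumptions. The plain `write_text` at the end is a deliberate simplification standing in for write_back_skill(), which does the atomic overwrite and backup.

```python
from pathlib import Path

EXIT_OK = 0
EXIT_NEEDS_FORCE = 3   # hypothetical code for the --force guard
EXIT_MISSING_LIVE = 4  # from the commit: live skill missing


def approve(proposal_dir: Path, live_skill: Path,
            force: bool = False, no_merge: bool = False) -> int:
    """Flip STATUS to APPROVED, then optionally merge the evolved skill."""
    status_file = proposal_dir / "STATUS"
    current = (status_file.read_text(encoding="utf-8").strip()
               if status_file.is_file() else "PENDING")

    # Approving a rejected proposal requires an explicit --force.
    if current == "REJECTED" and not force:
        return EXIT_NEEDS_FORCE

    # STATUS is flipped before the merge, so the on-disk record stays
    # authoritative even if the merge below fails partway.
    status_file.write_text("APPROVED\n", encoding="utf-8")
    if no_merge:
        return EXIT_OK
    if not live_skill.is_file():
        return EXIT_MISSING_LIVE  # STATUS has already been flipped

    evolved = (proposal_dir / "evolved_skill.md").read_text(encoding="utf-8")
    live_skill.write_text(evolved, encoding="utf-8")  # simplified merge step
    return EXIT_OK
```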

24 tests covering discovery, listing, filtering, show/diff, all
approve paths (no-merge / write-back / backup creation / idempotence
/ --force guard / missing-live-skill), and reject paths. Full suite
now 194/194 passing.
- T4 (full E2E optimization) runs real GEPA + evals = ~4min, real tokens.
  Unsuitable for nightly cron; keep for manual weekly verification.
- NEW T5: propose-mode structural dry-run. ~2s, zero tokens. Validates
  CLI flags, skill discovery, frontmatter, proposals-dir writability.
- nightly.sh phase 1 now runs t1+t5 (both zero-token, ~10s total).
- smoke_test.sh default tier is now t1+t5. Use 'full' for the expensive
  T2/T3/T4 tiers.
- Fix evolve_skill.py: propose-mode no longer exits non-zero on
  regression — writing a proposal for review IS the success path in
  propose mode. Only auto-mode should exit non-zero on regression.
- Tests: 218/218 passing.
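
The exit-code fix amounts to a small policy function like the one below. Exit code 2 for a regression comes from the earlier gate commit; the function name and the `ValueError` on an unknown mode are illustrative assumptions.

```python
EXIT_OK = 0
EXIT_REGRESSION = 2  # cron-visible failure code from the gate commit


def exit_code(mode: str, regression: bool) -> int:
    """Map (mode, regression) to the process exit code.

    In propose mode, writing a proposal is the success path, so a
    regression still exits 0. Only auto mode fails on a regression.
    """
    if mode not in ("propose", "auto"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "auto" and regression:
        return EXIT_REGRESSION
    return EXIT_OK
```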

Nightly token cost: ~4min of LLM spend → 0 tokens in preflight. All
actual evolution still happens in phase 2.
…h header

- docs/ARCHITECTURE.md: module map, data flow, extension points (16KB)
- docs/OPERATIONS.md: daily workflow, cron, proposal review CLI, troubleshooting (12KB)
- README.md: rewritten around nightly workflow + reviewer CLI
- nightly.sh: header comment updated to reflect t1+t5 (was stale 't1+t4')