Phase 1: auto-merge gate, proposal system, reviewer CLI, nightly pipeline, docs #30
Open
mkisontop wants to merge 10 commits into NousResearch:main from
Bug 1 (no-op optimization): SkillModule previously passed skill text as
a runtime InputField — GEPA/MIPROv2 could not mutate that. Now the
skill body lives as the Predictor's signature.instructions, which IS
the parameter both optimizers actually rewrite. Read evolved body from
predictor.predict.signature.instructions.
Bug 2 (frontmatter validation): baseline was validated on body-only,
evolved on body-only — skill_structure check wanted YAML frontmatter
and failed. Now validate the full reassembled doc against baseline_text
body for the growth delta.
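The frontmatter fix can be illustrated with a minimal sketch. It assumes the skill file opens with a `---`-fenced YAML block and that growth is measured as a relative change in body length. The helper names `split_frontmatter`, `reassemble`, and `growth_ratio` are hypothetical and are not taken from this PR.

```python
from typing import Tuple

FENCE = "---"


def split_frontmatter(doc: str) -> Tuple[str, str]:
    """Split a skill document into (frontmatter, body).

    Returns ("", doc) when the document does not open with a '---' fence
    or when the fence is never closed.
    """
    lines = doc.splitlines(keepends=True)
    if not lines or lines[0].strip() != FENCE:
        return "", doc
    for i in range(1, len(lines)):
        if lines[i].strip() == FENCE:
            return "".join(lines[: i + 1]), "".join(lines[i + 1:])
    return "", doc


def reassemble(frontmatter: str, evolved_body: str) -> str:
    """Reattach the untouched frontmatter so structure checks see a full doc."""
    return frontmatter + evolved_body


def growth_ratio(baseline_body: str, evolved_body: str) -> float:
    """Relative size change of the body only; frontmatter is excluded."""
    if not baseline_body:
        return 0.0
    return (len(evolved_body) - len(baseline_body)) / len(baseline_body)
```

Structure checks would then run on `reassemble(...)`, while the growth check compares bodies only, so the frontmatter neither fails validation nor inflates the delta.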
LM hang mitigation: dspy.LM(..., timeout=120, num_retries=2) forwarded
to litellm. Prevents a single hung call from wedging the whole eval.
Parallel holdout eval: dspy.evaluate.Evaluate with num_threads=4,
display_progress=True, max_errors tolerance — replaces silent serial
loop that hung on the 2026-04-18 smoke.
AutoMergeGate wired: evaluate(baseline, evolved, evolved_pass),
persist decision to metrics.json, exit 2 on regression so cron
catches it. Adds --mode {propose,auto}, --min-improvement,
--regression-tolerance, --proposals-dir CLI flags (propose
write-path still TODO).
All 143 tests pass.
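A rough sketch of the gate logic described above, under stated assumptions: scores are floats where higher is better, and the default thresholds are placeholders. `gate_decision` and `GateDecision` are hypothetical names; the real `AutoMergeGate` API in this PR may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class GateDecision:
    auto_merge: bool
    regression: bool
    delta: float


def gate_decision(
    baseline: float,
    evolved: float,
    evolved_pass: bool,
    min_improvement: float = 0.02,       # assumed default, not from the PR
    regression_tolerance: float = 0.0,   # assumed default, not from the PR
) -> GateDecision:
    """Decide whether an evolved skill may be merged automatically."""
    delta = evolved - baseline
    # A drop larger than the tolerance counts as a regression.
    regression = delta < -regression_tolerance
    # Merge only if constraints passed and the gain clears the threshold.
    auto_merge = evolved_pass and not regression and delta >= min_improvement
    return GateDecision(auto_merge=auto_merge, regression=regression, delta=delta)


def to_metrics_json(decision: GateDecision) -> str:
    """Serialize the decision, e.g. for a metrics.json file."""
    return json.dumps(asdict(decision), sort_keys=True)
```

Keeping the decision a pure function of three inputs and two thresholds makes it easy to unit-test independently of any optimizer run.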
- evolution/core/proposals.py: ProposalWriter, ProposalRecord, build_proposal_record, ConstraintRecord
- Writes 7 artifacts per proposal: baseline_skill.md, evolved_skill.md, diff.patch (unified), decision.json, constraints.json, review.md, STATUS
- Layout: {root}/{skill_name}/{YYYYMMDD_HHMMSS}/
- evolve_skill.py: wires propose-mode output (§9c) + auto-mode write-back (§9d)
- 15 tests cover record construction, artifact shape, diff content, regression flagging, multi-proposal layout
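The directory layout and the unified diff artifact could be produced along these lines. This is a sketch using only the standard library; `proposal_dir` and `unified_patch` are hypothetical helpers, not the actual `ProposalWriter` methods.

```python
import difflib
from datetime import datetime
from pathlib import Path


def proposal_dir(root: Path, skill_name: str, now: datetime) -> Path:
    """Build {root}/{skill_name}/{YYYYMMDD_HHMMSS}/ for one proposal."""
    return root / skill_name / now.strftime("%Y%m%d_%H%M%S")


def unified_patch(baseline: str, evolved: str) -> str:
    """Return a unified diff between baseline and evolved skill text.

    The result is an empty string when the two texts are identical.
    """
    return "".join(
        difflib.unified_diff(
            baseline.splitlines(keepends=True),
            evolved.splitlines(keepends=True),
            fromfile="baseline_skill.md",
            tofile="evolved_skill.md",
        )
    )
```

Passing the timestamp in explicitly, rather than calling `datetime.now()` inside the helper, keeps the layout deterministic in tests.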
- Extract §9d into evolution/core/write_back.py (testable helper)
- Atomic overwrite via .tmp + rename
- Timestamped backup in .backups/ before every merge
- 12 new tests (guard rails, happy path, rollback safety)
- Total: 170 tests passing
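The backup-then-atomic-rename pattern named above might look like the following sketch. It assumes the backup lives in a `.backups/` directory next to the live file and that the backup filename carries the timestamp as a suffix; the function name `write_back` and that naming scheme are assumptions, not the contents of write_back.py.

```python
import os
import shutil
from datetime import datetime
from pathlib import Path


def write_back(live_path: Path, evolved_text: str, now: datetime) -> Path:
    """Back up live_path, then atomically replace it. Returns the backup path."""
    if not live_path.is_file():
        raise FileNotFoundError(live_path)

    # 1. Timestamped backup before anything is modified.
    backup_dir = live_path.parent / ".backups"
    backup_dir.mkdir(exist_ok=True)
    stamp = now.strftime("%Y%m%d_%H%M%S")
    backup = backup_dir / f"{live_path.name}.{stamp}"
    shutil.copy2(live_path, backup)

    # 2. Write the new content to a sibling .tmp file on the same filesystem.
    tmp = live_path.with_name(live_path.name + ".tmp")
    tmp.write_text(evolved_text, encoding="utf-8")

    # 3. os.replace swaps the file in one step, so readers see either the
    #    old content or the new content, never a partial write.
    os.replace(tmp, live_path)
    return backup
```

Writing the temporary file in the same directory matters because a rename is only atomic within a single filesystem.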
Validates the full propose-mode contract:
- propose-mode run completes with exit 0
- live bundled SKILL.md mtime+sha256 unchanged
- proposal dir created with all 6 required artifacts
- decision.json schema valid (mode=propose, auto_merge=False)
- no .bak files created in live skill dir
Keeps t1 baseline + t4 propose-mode separate so nightly crons can fail fast on either tier.
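The "mtime+sha256 unchanged" check can be sketched as a fingerprint taken before and after the run. The `fingerprint` helper is a hypothetical illustration, not the test code from this PR.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def fingerprint(path: Path) -> Tuple[int, str]:
    """Return (mtime in nanoseconds, sha256 hex digest) for a file."""
    mtime_ns = path.stat().st_mtime_ns
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    return mtime_ns, digest
```

A test would record `fingerprint(live_skill)` before invoking propose mode and assert that the value is identical afterwards. Checking the hash as well as the mtime catches a rewrite that happens to restore the timestamp.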
CLI for walking the proposals/ tree, inspecting proposals, and
flipping their STATUS tombstones:
  list [--status pending|approved|rejected]
      table of all proposals with icon + Δ score + merge-mode
  show <skill> [timestamp]
      print review.md (latest if timestamp omitted)
  diff <skill> [timestamp]
      print diff.patch
  approve <skill> [timestamp] [--no-merge] [--force] [--approved-by]
      flip STATUS → APPROVED (with approver metadata), then unless
      --no-merge, call write_back_skill() to atomically overwrite the
      live bundled SKILL.md with a timestamped backup. Writes
      merge_receipt.json into the proposal dir for auditability.
  reject <skill> [timestamp] [--reason] [--force] [--rejected-by]
      flip STATUS → REJECTED with reason + rejecter metadata
Safety:
- re-uses existing write_back_skill() safety path (atomic + backup)
- --force required to approve-rejected or reject-approved
- STATUS flipped before write-back so disk is authoritative even if
merge fails partway
- approve with missing live skill returns exit 4 (STATUS still flipped)
24 tests covering discovery, listing, filtering, show/diff, all
approve paths (no-merge / write-back / backup creation / idempotence
/ --force guard / missing-live-skill), and reject paths. Full suite
now 194/194 passing.
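The ordering guarantee in the Safety list (STATUS first, merge second) can be sketched as follows. Several details are assumptions made for illustration: STATUS is treated as a one-word text file, a missing STATUS is read as PENDING, the exit code for a refused transition is invented, and a plain file write stands in for `write_back_skill()`. Only exit code 4 comes from the text above.

```python
from pathlib import Path

EXIT_OK = 0
EXIT_NEEDS_FORCE = 1         # hypothetical; the PR does not state this code
EXIT_MISSING_LIVE_SKILL = 4  # stated in the commit message


def read_status(proposal: Path) -> str:
    """Read the STATUS tombstone; a missing file is treated as PENDING."""
    status_file = proposal / "STATUS"
    if not status_file.is_file():
        return "PENDING"
    return status_file.read_text(encoding="utf-8").strip().upper()


def approve(proposal: Path, live_skill: Path, *,
            force: bool = False, merge: bool = True) -> int:
    """Approve a proposal, flipping STATUS before attempting the merge."""
    if read_status(proposal) == "REJECTED" and not force:
        return EXIT_NEEDS_FORCE

    # Flip first, so the on-disk STATUS is authoritative even if the
    # merge below fails partway.
    (proposal / "STATUS").write_text("APPROVED\n", encoding="utf-8")

    if not merge:
        return EXIT_OK
    if not live_skill.is_file():
        return EXIT_MISSING_LIVE_SKILL

    evolved = (proposal / "evolved_skill.md").read_text(encoding="utf-8")
    # Stand-in for write_back_skill(); the real helper is atomic with a backup.
    live_skill.write_text(evolved, encoding="utf-8")
    return EXIT_OK
```

With this ordering, a proposal approved while the live skill is missing still records the reviewer's decision, which matches the exit-4 behavior described above.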
- T4 (full E2E optimization) runs real GEPA + evals: ~4min, real tokens. Unsuitable for nightly cron; keep for manual weekly verification.
- NEW T5: propose-mode structural dry-run. ~2s, zero tokens. Validates CLI flags, skill discovery, frontmatter, proposals-dir writability.
- nightly.sh phase 1 now runs t1+t5 (both zero-token, ~10s total).
- smoke_test.sh default tier is now t1+t5. Use 'full' for the expensive T2/T3/T4 tiers.
- Fix evolve_skill.py: propose-mode no longer exits non-zero on regression, because writing a proposal for review IS the success path in propose mode. Only auto-mode should exit non-zero on regression.
- Tests: 218/218 passing.
Nightly token cost: ~4min of LLM spend → 0 tokens in preflight. All actual evolution still happens in phase 2.
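The exit-code policy after this fix can be summarized in a small sketch. The value 2 for a regression comes from the earlier commit message; the function name `exit_code` is hypothetical.

```python
EXIT_OK = 0
EXIT_REGRESSION = 2  # "exit 2 on regression" per the earlier commit


def exit_code(mode: str, regression: bool) -> int:
    """Map (mode, regression) to a process exit code."""
    if mode == "propose":
        # Writing a proposal for human review is the success path,
        # regardless of whether the evolved skill regressed.
        return EXIT_OK
    if mode == "auto":
        return EXIT_REGRESSION if regression else EXIT_OK
    raise ValueError(f"unknown mode: {mode!r}")
```

Under this policy a nightly cron running in propose mode fails only on real errors, while auto mode still surfaces regressions through a non-zero exit.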
…h header
- docs/ARCHITECTURE.md: module map, data flow, extension points (16KB)
- docs/OPERATIONS.md: daily workflow, cron, proposal review CLI, troubleshooting (12KB)
- README.md: rewritten around nightly workflow + reviewer CLI
- nightly.sh: header comment updated to reflect t1+t5 (was stale 't1+t4')
Summary
Phase 1 complete — end-to-end self-evolution pipeline for Hermes Agent skills.
10 commits from 4693c8f → 891329d:
Verification
- proposals/github-code-review/20260418_181611/ with all 6 artifacts (baseline_skill.md, evolved_skill.md, constraints.json, decision.json, diff.patch, review.md)
- list and show both verified against real proposal
Behavior
Ready for review.