Skip to content

perf(verification): shrink and speed up SMT verification conditions#408

Closed
alcides wants to merge 1 commit into
masterfrom
claude/synthesis-vc-speedups
Closed

perf(verification): shrink and speed up SMT verification conditions#408
alcides wants to merge 1 commit into
masterfrom
claude/synthesis-vc-speedups

Conversation

@alcides

@alcides alcides commented Jun 21, 2026

Copy link
Copy Markdown
Owner

Summary

Three improvements to the liquid-type verification path (the SMT bottleneck that dominates synthesis time), all behaviour-preserving on the standard (GIL) build:

  1. flatten: O(depth²) → O(depth) alpha-renaming. The eager per-binder rename rebuilt the entire remaining constraint at every , making substitution_in_liquid the single hottest function in the system. Renamings are now deferred and applied in one pass per premise/conclusion. ~2.2× faster flatten on deep constraints; verified structurally identical to the old output over 300+ real obligations.

  2. VC simplification before Z3 (_simplify_vc): equality elimination + relevance slicing.

    • Equality elimination (exact): a premise x == Y with x a universally-quantified binder and Y free of x pins x, so substitute x := Y and drop both. Collapses the ANF h_i == e_i chains.
    • Relevance slicing: keep only premises/functions/variables transitively connected to the goal — removes the irrelevant open-library declarations that bloat VCs.
    • 3.48× faster validation on deep-constraint programs; 0 verdict changes over 4,460 obligations across 44 programs.
  3. Thread-local Z3 context (_WS) + lock-atomic fresh_counter. Makes the SMT layer safe for parallel candidate validation under free-threaded CPython (python3.14t). The main thread keeps the default Z3 context, so external importers and GIL-build behaviour are byte-identical. Enables ~1.4–2× parallel validation of synthesis populations on 3.14t. (The parallel driver itself is not yet wired into the GP loop — this lands the infrastructure + correctness.)

Motivating example: supermario

supermario's typecheck is dominated by one obligation — the synthesis-target VC, which accumulates the entire open Array library + inductive context into a single 3.5M-char formula. Dissected:

flatten   484 ms   (15%)   ← the O(depth²) renaming (fix #1)
translate 2333 ms  (73%)   ← building Z3 ASTs for a mostly-irrelevant formula (fix #2)
z3 solve     1 ms  (~0%)   ← it was never a hard solve

After #1+#2 that obligation drops 3604 ms → 1112 ms (3.2×).

How fast is synthesis in the different backends

Throughput per backend, standard (GIL) build, seed=1, 15 s budget, with these changes in place:

even_parity.ae (boolean GP benchmark)

backend candidates cand/s
gp 706 43
random_search 655 43
enumerative 1345 89
hc 656 43
1p1 654 43

koza_quartic.ae (symbolic regression, deeper float constraints)

backend candidates cand/s
gp 359 21
random_search 308 20
enumerative 156 10
hc 311 20
1p1 306 20

The backends bottleneck on the same validation path, so the relative ordering is driven by candidate-generation cost (enumerative is cheap-per-candidate on shallow boolean goals but expensive on deep ones). The optimizations accelerate the validation phase shared by all of them — small for shallow goals like even_parity, large (up to ~3.5×) for deep-constraint synthesis (lists, ADTs, supermario-style).

The factory-routed backends (synquid, sygus, afta, cata) couldn't be benchmarked here: this machine's scipy wheel fails to dlopen (macOS __thread_bss), which breaks the CLI's decision_tree import. The five geneticengine backends run via the test harness, which bypasses it.

Soundness & testing

  • 200+ single-threaded SMT/typechecking tests pass. New regression tests: tests/flatten_rename_test.py, tests/alpha_key_test.py.
  • Verdict differential: 0 mismatches over 4,460 obligations / 44 programs (simplification on vs off).
  • Parallel == serial verified for the thread-local path across every measured workload.
  • mypy + ruff clean.

Caveats (please review)

  • Relevance slicing is always sound (dropping premises only weakens the hypothesis — it can never make an invalid program typecheck) but is theoretically incomplete in one narrow case: a premise that is both disconnected from the goal and internally contradictory (a dead-code refinement context) would vacuously discharge the goal, and dropping it could turn that into a false reject. Not observed across the 4,460-obligation differential. A "re-check the full VC on failure" fallback was deliberately omitted (during synthesis most candidates are invalid, so it would erase the speedup). If preferred, slicing can be gated to the synthesis path only, keeping the user-facing typechecker provably complete — happy to do that.
  • The free-threading refactor (User-defined types are not BasicTypes #3) is infrastructure; the parallel GP driver is a follow-up.

🤖 Generated with Claude Code

Three improvements to the liquid-type verification path, all
behaviour-preserving on the standard (GIL) build (200+ SMT/typechecking
tests pass):

1. flatten: O(depth^2) -> O(depth) binder alpha-renaming. The eager
   per-binder rename rebuilt the whole remaining constraint at every
   forall, making substitution_in_liquid the hottest function in the
   system. Renamings are now deferred and applied in one pass per
   premise/conclusion. ~2.2x faster flatten on deep constraints;
   verified structurally identical to the old output over 300+ real
   obligations (tests/flatten_rename_test.py).

2. VC simplification before Z3 (smt._simplify_vc): equality elimination
   + relevance slicing. Equality elimination (exact) substitutes x:=Y
   for premises x==Y and drops the binder, collapsing ANF h_i==e_i
   chains. Relevance slicing keeps only premises/functions/variables
   transitively connected to the goal, removing irrelevant `open`-library
   declarations. 3.48x faster validation on deep-constraint programs
   (supermario's 3.5M-char obligation 3.6s -> 1.1s); 0 verdict changes
   over 4,460 obligations across 44 programs. Slicing is always sound;
   see the docstring for the narrow disconnected-inconsistent-premise
   completeness caveat.

3. Thread-local Z3 context (smt._WS) + lock-atomic fresh_counter: makes
   the SMT layer safe for parallel candidate validation under
   free-threaded CPython (python3.14t). The main thread keeps the default
   Z3 context, so external importers and GIL-build behaviour are
   unchanged. Enables ~1.4-2x parallel validation of synthesis
   populations on 3.14t (the parallel driver itself is not yet wired in).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@alcides alcides closed this Jun 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant