Make kernel crash validation use a real QEMU guest path #139

Merged
peaktwilight merged 6 commits into main from feat/kernel-vm-assets on Apr 13, 2026

@peaktwilight
Collaborator

Summary

  • replace the kernel VM verifier SSH transport with a shared-workdir QEMU execution path
  • add a dedicated GitHub Actions kernel-validator E2E workflow
  • align the kernel VM docs and builder guidance with the new transport

Verification

  • pnpm build
  • local real syzbot ingest --verify run with reproduced=true inside QEMU
  • local execution of scripts/kernel-validator-e2e.sh

Notes

  • the transport is now real end to end, but crash-signature matching still needs refinement on noisy UBSAN-first outputs

Replace the fragile SSH transport with a 9p shared-folder execution path so the QEMU/KASAN verifier can compile and run reproducers without guest network bring-up.

The guest now boots through pwnkit-init, mounts a host-provided share, executes a generated runner script, and writes compile/run artifacts back for host-side collection. The rootfs builder was updated to include the tools this path actually needs, and the docs/build guidance now match the implementation.
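The boot flow described above can be sketched as a QEMU argument builder. This is a minimal illustration, not the runner's actual code: the option names, the `hostshare` mount tag, the `rdinit=/pwnkit-init` path, and the use of `-initrd` for the rootfs are all assumptions. Only the `-virtfs local,...` flag shape is standard QEMU syntax.

```typescript
// Hypothetical options; the real runner's configuration is not shown in this PR.
interface GuestRunOptions {
  kernelImage: string; // path to the exported bzImage
  rootfsImage: string; // path to the rootfs image (assumed to be an initramfs)
  shareDir: string;    // host directory exposed to the guest over 9p
  mountTag?: string;   // 9p mount tag the guest init is assumed to mount
  memoryMb?: number;   // guest RAM in MiB
}

// Build a QEMU command line that shares a host directory with the guest
// over virtio-9p, so no guest networking or SSH is required.
function buildQemuArgs(opts: GuestRunOptions): string[] {
  const tag = opts.mountTag ?? "hostshare";
  return [
    "-kernel", opts.kernelImage,
    "-initrd", opts.rootfsImage,
    "-m", String(opts.memoryMb ?? 2048),
    "-nographic",
    "-no-reboot",
    // Standard QEMU 9p passthrough; inside the guest this is mounted with
    // something like: mount -t 9p -o trans=virtio <tag> /mnt/share
    "-virtfs", `local,path=${opts.shareDir},mount_tag=${tag},security_model=none,id=${tag}`,
    // Assumed kernel command line: serial console plus a custom init.
    "-append", "console=ttyS0 rdinit=/pwnkit-init",
  ];
}
```

The host would write the generated runner script into `shareDir` before boot and read the compile/run artifacts from the same directory after QEMU exits.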

Constraint: The verifier must run on hosts where guest SSH is unreliable or unavailable under KASAN/UBSAN noise
Rejected: Keep debugging guest SSH | transport remained flaky despite successful boot and custom init execution
Rejected: Full kernel rebuild for every rootfs tweak | too slow once existing artifact config already included 9p virtio support
Confidence: medium
Scope-risk: moderate
Reversibility: clean
Directive: Crash-matching logic still needs refinement for noisy UBSAN-first outputs; do not treat transport success as signature-match success
Tested: pnpm build; real QEMU boot with shared-folder guest script; real syzbot ingest --verify compile+run against RDMA invalid-free sample
Not-tested: Additional real syzbot crash families beyond the exercised invalid-free sample

Create a dedicated GitHub Actions workflow that builds the kernel VM artifacts, boots QEMU, and runs ingest --verify against a real syzbot crash/reproducer pair.

The workflow preserves VM artifacts for debugging, and the runner now supports an optional artifact directory so CI can upload serial logs, compile logs, and dmesg instead of deleting them. The shell harness also normalizes the current CLI JSON output and asserts that the transport actually reproduced a real guest run.
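The harness assertion can be sketched as follows. The actual harness is a shell script and the CLI's output schema is not shown here, so this is only an illustration: it assumes the result is a single-line JSON object that may be preceded by log lines, and that it carries the `reproduced` and `verified` fields mentioned in this PR. Function names are hypothetical.

```typescript
// Assumed result shape; only `reproduced` and `verified` are named in this PR.
interface VerifyResult {
  reproduced: boolean;
  verified: boolean;
}

// Take the last stdout line that parses as a JSON object and coerce the
// two fields to strict booleans, so missing fields count as false.
function parseVerifyOutput(stdout: string): VerifyResult {
  const candidates = stdout
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.startsWith("{"));
  for (let i = candidates.length - 1; i >= 0; i--) {
    try {
      const parsed = JSON.parse(candidates[i]);
      if (typeof parsed === "object" && parsed !== null) {
        return {
          reproduced: parsed.reproduced === true,
          verified: parsed.verified === true,
        };
      }
    } catch {
      // Not valid JSON; try the previous candidate line.
    }
  }
  throw new Error("no JSON result found in CLI output");
}

// Gate only on `reproduced`: the transport ran the reproducer in the guest.
// `verified` is deliberately not asserted while signature matching is noisy.
function assertTransportReproduced(stdout: string): VerifyResult {
  const result = parseVerifyOutput(stdout);
  if (!result.reproduced) {
    throw new Error("guest run did not reproduce the crash");
  }
  return result;
}
```

Gating on `reproduced` rather than `verified` mirrors the rejected alternative recorded below: the matcher can still mis-score noisy output even when the VM transport succeeds.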

Constraint: We need a real QEMU-backed CI proof path without making every unrelated push pay for a heavyweight kernel VM build
Rejected: Fold the VM run into the default CI job | too expensive and too broad for unrelated changes
Rejected: Assert verified=true today | current matcher still mis-scores noisy real outputs even when the VM transport succeeds
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Keep this workflow path-scoped or manual unless signature matching becomes stable enough to justify a broader gate
Tested: pnpm build; local execution of scripts/kernel-validator-e2e.sh against the rebuilt VM artifacts; reproduced=true on real syzbot sample
Not-tested: GitHub-hosted runner wall-clock for the full workflow end-to-end

The repository docs still described the kernel validator as SSH-first even after the transport was replaced with a host-shared QEMU execution path.

This pass updates the root README, kernel VM README, builder script comments, and Dockerfile header comments so the documented behavior matches the actual guest boot flow and CI lane.

Constraint: The branch now includes a real QEMU guest runner and a dedicated CI workflow, so stale SSH wording would mislead anyone trying to use or review it
Rejected: Leave the old SSH wording in place until later | would make the new workflow and docs contradict each other immediately
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: If SSH is ever removed from the rootfs entirely, also delete the remaining exported key artifacts rather than leaving dead outputs around
Tested: Prior pnpm build still applies; docs/comment-only pass
Not-tested: Fresh docs site build after this wording-only cleanup

The kernel-validator workflow was spending most of its time rebuilding the VM artifact bundle on every PR run, even when the kernel VM inputs had not changed.

Restore and save the exported bzImage/rootfs bundle via actions/cache keyed on the kernel VM Dockerfile and build script. This keeps the real QEMU E2E gate intact while making repeat PR runs and reruns materially faster.
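The idea behind keying the cache on the build inputs can be sketched as a content hash. In a workflow this is normally expressed with GitHub's `hashFiles` expression function; whether this workflow uses it is not stated here. The function and input names below are hypothetical.

```typescript
import { createHash } from "node:crypto";

// Derive a cache key from the contents of the files that determine the
// VM artifact bundle. Any change to an input produces a different key,
// which forces a rebuild instead of restoring a stale bundle.
function artifactCacheKey(prefix: string, inputs: Record<string, string>): string {
  const hash = createHash("sha256");
  // Sort names so the key does not depend on object insertion order.
  for (const name of Object.keys(inputs).sort()) {
    hash.update(name);
    hash.update("\0"); // separator so name/content boundaries are unambiguous
    hash.update(inputs[name]);
    hash.update("\0");
  }
  return `${prefix}-${hash.digest("hex").slice(0, 16)}`;
}
```

For example, `artifactCacheKey("kernel-vm", { Dockerfile: dockerfileText, "build.sh": buildScriptText })` stays stable across reruns until either input changes.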

Constraint: The workflow still needs to exercise a real guest boot, but repeated full kernel builds on every rerun are too expensive for normal iteration
Rejected: Drop the artifact build step entirely | would remove the guarantee that the tested VM inputs match the branch contents
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: If the artifact cache proves too large or flaky, switch to a dedicated artifact-producing workflow instead of broadening the default CI path
Tested: Workflow YAML inspection; cache key/path wiring review
Not-tested: End-to-end GitHub cache hit on a second run

GitHub rejected the cache-enabled workflow before job creation because runner.temp was referenced from jobs.<job>.env, where the runner context is not available.

Move those paths back to step-local expressions and shell variables so the workflow can instantiate normally while keeping the cache behavior intact.

Constraint: GitHub Actions context availability is narrower in job-level env than in step inputs
Rejected: Keep the job-level env shortcut | workflow never created any jobs on GitHub
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Run actionlint on workflow edits before pushing when using less-common context placements
Tested: actionlint on kernel-validator-e2e.yml
Not-tested: A new GitHub run after this fix

The crash matcher was over-weighting generic reporting frames like print_report and dump_stack, which made real KASAN outputs look less similar than they are.

Filter those generic frames out of the stack-frame comparison and treat invalid-free as an acceptable match for the current double-free bucket. This improves the matcher without changing the transport or VM runner behavior.
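A minimal sketch of this kind of filtering, assuming a simple overlap score over the top few frames. The real matcher in `@pwnkit/core` is not shown in this PR, so the function names, the scoring formula, and the comparison depth are illustrative. Only the two frame names and the invalid-free/double-free alias come from the description above.

```typescript
// Reporting-machinery frames named in this PR; a real list may be longer.
const GENERIC_FRAMES = new Set(["print_report", "dump_stack"]);

// Strip the "+0x10/0x20" offset suffix so frames compare by symbol name.
function normalizeFrame(frame: string): string {
  return frame.trim().replace(/\+0x[0-9a-f]+\/0x[0-9a-f]+.*$/i, "");
}

// Drop generic reporting frames before any comparison happens.
function meaningfulFrames(frames: string[]): string[] {
  return frames
    .map(normalizeFrame)
    .filter((f) => f.length > 0 && !GENERIC_FRAMES.has(f));
}

// Fraction of the expected top frames that also appear in the observed
// top frames, computed after filtering (hypothetical scoring).
function frameOverlap(expected: string[], observed: string[], depth = 5): number {
  const want = meaningfulFrames(expected).slice(0, depth);
  const seen = new Set(meaningfulFrames(observed).slice(0, depth));
  if (want.length === 0) return 0;
  return want.filter((f) => seen.has(f)).length / want.length;
}

// Treat invalid-free as an alias of the double-free bucket.
function bugClassMatches(expected: string, observed: string): boolean {
  const alias: Record<string, string> = { "invalid-free": "double-free" };
  const canon = (c: string) => alias[c] ?? c;
  return canon(expected) === canon(observed);
}
```

With a depth of 2, an observed trace that starts with `dump_stack` and `print_report` would score zero against the faulting frames if compared verbatim; after filtering, the same trace can score a full match.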

Constraint: Real guest logs often include long KASAN/UBSAN reporting prologues before the interesting fault path
Rejected: Keep scoring the first raw stack frames verbatim | biased the oracle toward reporting machinery instead of the faulting path
Confidence: medium
Scope-risk: narrow
Reversibility: clean
Directive: If we later split invalid-free into its own taxonomy bucket, revisit the double-free pattern alias here
Tested: pnpm --filter @pwnkit/core test -- kernel-oracle.test.ts; pnpm build
Not-tested: Improvement to real-world crashMatch rate beyond the exercised invalid-free sample

peaktwilight marked this pull request as ready for review on April 13, 2026 10:07
peaktwilight merged commit 452fa88 into main on Apr 13, 2026
12 checks passed
peaktwilight deleted the feat/kernel-vm-assets branch on April 13, 2026 10:07