What
Tracking issue for the joint research paper between Doruk (pwnkit) and Guanni Qu (VulnBERT, Pebblebed Research Resident). Working title:
Agentic exploitation as a labeling oracle for vulnerability triage models
The thesis
The training data for a high-precision vulnerability triage classifier only exists if you have BOTH:
- an agentic exploit harness running at scale (pwnkit's role) that produces `(finding, attempted-PoC, real-or-fake)` tuples
- a labeled-classifier training pipeline (VulnBERT-class hybrid features-plus-embeddings architecture) that can consume those tuples and produce a small specialized model
Neither half exists meaningfully without the other. The dataset itself — not the model — is the moat. This is the inverse of the standard "dataset is given, model is the asset" framing in security ML.
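To make the `(finding, attempted-PoC, real-or-fake)` tuple concrete, here is a minimal sketch of how one record might be represented and flattened into a training row. The class name, field names, the `exploit_attempt` value for `label_source`, and the 1/0 label encoding are all hypothetical choices for illustration; they are not taken from pwnkit or VulnBERT.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class Verdict(str, Enum):
    REAL = "real"   # the attempted PoC confirmed the finding
    FAKE = "fake"   # the attempted PoC did not confirm the finding


@dataclass(frozen=True)
class LabeledFinding:
    """One (finding, attempted-PoC, real-or-fake) tuple. Field names are hypothetical."""
    finding: str
    attempted_poc: str
    verdict: Verdict
    label_source: str = "exploit_attempt"  # hypothetical default value


def to_training_row(record: LabeledFinding) -> dict:
    """Flatten a record into a JSON-serializable row with a binary label."""
    row = asdict(record)
    row["verdict"] = record.verdict.value
    row["label"] = 1 if record.verdict is Verdict.REAL else 0
    return row


if __name__ == "__main__":
    example = LabeledFinding("SQLi in /login", "' OR 1=1 --", Verdict.REAL)
    print(json.dumps(to_training_row(example)))
```

Keeping `label_source` on every row would let the training pipeline weight or filter rows by how the label was obtained.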
What's already shipped
What's still needed for the paper
Honest negative results to include
- Handcrafted features designed for web-exploit findings (SQL errors, payload reflection, stack traces) are mostly zero on npm supply-chain findings — a publishable insight on domain transferability
- `label_source: package_verdict` is coarser than per-finding labels: legitimate findings in packages judged safe get labeled as false positives. Quantify the resulting noise floor
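Both negative results above reduce to simple measurements. The sketch below assumes a hand-audited subset with per-finding labels is available for comparison; the function names, input shapes, and the 1 = real / 0 = false positive encoding are assumptions for illustration, not part of the existing pipeline.

```python
def zero_fraction(feature_rows):
    """Share of handcrafted feature values that are exactly zero across all rows."""
    values = [v for row in feature_rows for v in row]
    if not values:
        raise ValueError("no feature values supplied")
    return sum(1 for v in values if v == 0) / len(values)


def package_label_noise(audited):
    """Compare package-verdict labels against audited per-finding labels.

    `audited` is a list of (package_verdict_label, per_finding_label) pairs,
    with 1 = real and 0 = false positive (assumed encoding).
    Returns (disagreement_rate, false_fp_rate), where false_fp_rate is the
    share of findings labeled false positive by package verdict that the
    audit judged real.
    """
    pairs = list(audited)
    if not pairs:
        raise ValueError("no audited pairs supplied")
    disagreement = sum(1 for pkg, fin in pairs if pkg != fin) / len(pairs)
    labeled_fp = [fin for pkg, fin in pairs if pkg == 0]
    false_fp = sum(labeled_fp) / len(labeled_fp) if labeled_fp else 0.0
    return disagreement, false_fp


if __name__ == "__main__":
    # Toy inputs, not real measurements.
    print(zero_fraction([[0, 0, 1], [0, 0, 0]]))
    print(package_label_noise([(0, 0), (0, 1), (0, 0), (1, 1), (0, 0)]))
```

Reporting `false_fp_rate` separately from overall disagreement keeps the specific failure mode (real findings hidden inside safe packages) visible in the paper.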
Sequencing
This issue intentionally has no deadline. The work is the moat and the research direction is the long game.
Why this issue exists
So the work is in the open and not in someone's notebook. Discoverable, citable, mergeable.