Releases: PwnKit-Labs/pwnkit

v0.7.10 — three scan agent failure modes fixed

07 Apr 13:57

Three independent bugs in the scan agent, all found by running real scans against https://doruk.ch and https://demo.opensoar.app with the 0.7.9 verbose TUI on. Before this release the TUI clearly showed the agent was underperforming — but the root causes were invisible. Now they're fixed.

What was broken

1. Error summaries were silently clobbered with "reached max turns"

The scan TUI was showing internally inconsistent stage details:

✓ Attack  0 findings, First attempt (10 turns): no findings.
          Retry (5 turns): Agent reached max turns (10) without completing.

The "5 turns" was the real turn count: the retry loop broke on turn 5 because of an Azure API timeout. The "Agent reached max turns (10)" text came from the post-loop code in native-loop.ts, which unconditionally overwrote the real error message. The real error is now preserved, and a regression test forces the error-bail path and asserts that the error survives.
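A minimal sketch of the fixed behavior, assuming a simplified loop. The `TurnResult`, `LoopSummary`, and `runLoop` names are hypothetical and do not reflect the actual shapes in native-loop.ts; the point is only that the max-turns message becomes a fallback rather than an unconditional overwrite.

```typescript
// Hypothetical shapes: the real native-loop.ts types are not shown in these notes.
type TurnResult = { done: boolean; error?: string };

interface LoopSummary {
  turns: number;
  summary: string;
}

function runLoop(maxTurns: number, step: (turn: number) => TurnResult): LoopSummary {
  let turns = 0;
  let error: string | undefined;
  let completed = false;

  for (let turn = 1; turn <= maxTurns; turn++) {
    turns = turn;
    const result = step(turn);
    if (result.error !== undefined) {
      error = result.error; // bail out, but remember why
      break;
    }
    if (result.done) {
      completed = true;
      break;
    }
  }

  // Use the max-turns message only when nothing more specific is known.
  if (error !== undefined) return { turns, summary: error };
  if (completed) return { turns, summary: "completed" };
  return { turns, summary: `Agent reached max turns (${maxTurns}) without completing.` };
}

// Simulates the reported case: an API timeout on turn 5 of a 10-turn budget.
const bailed = runLoop(10, (turn) =>
  turn === 5 ? { done: false, error: "Azure API timeout" } : { done: false },
);
```

A regression test in this style would force the error path, as above, and assert that `bailed.summary` carries the error text rather than the max-turns sentence.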

2. Default mode for http(s) targets was deep, not web

Running pwnkit-cli scan --target https://example.com without --mode was quietly using the LLM/AI-agent attackPrompt — which gives zero web-pentest-specific guidance — against a plain web app. The agent fell back to http_request and bash with no strategy and ended up doing reconnaissance for 20 turns without actually attacking anything.

Now http(s) targets auto-default to mode: "web", which selects the shell-first shellPentestPrompt with phase-by-phase recon/injection/auth/info-disclosure guidance. mcp:// still defaults to mcp mode. --mode still overrides.
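The defaulting rule can be sketched as a small pure function. This is an illustration only: `resolveMode` and `ScanMode` are hypothetical names, and the fallback to "deep" for every other target type is an assumption based on the previous default.

```typescript
type ScanMode = "web" | "mcp" | "deep";

// Hypothetical helper: picks the scan mode from the target URL unless --mode was given.
function resolveMode(target: string, explicitMode?: ScanMode): ScanMode {
  if (explicitMode !== undefined) return explicitMode; // --mode always wins
  const t = target.toLowerCase();
  if (t.startsWith("http://") || t.startsWith("https://")) return "web";
  if (t.startsWith("mcp://")) return "mcp";
  return "deep"; // assumption: other targets keep the earlier default
}
```

With this shape, `resolveMode("https://example.com")` yields "web", while `resolveMode("https://example.com", "deep")` still honors the explicit override.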

3. The attack agent had no guardrails against "bundle paralysis"

Real scans showed the agent spending 6-8 consecutive turns re-downloading and re-grepping a single minified JS bundle while ignoring the POST /api/v1/auth/login endpoint it had already discovered. The prompt said nothing about static-asset analysis budgets or auth-endpoint prioritization.

shellPentestPrompt now includes a new Efficiency discipline (avoid turn waste) section with four explicit rules:

  • Bundle paralysis: at most 2 turns of static-asset analysis per file, then pivot to live endpoints with curl / http_request
  • Passive-only recon: by turn ~4 you must have sent real exploit payloads, not just GETs
  • Auth endpoint neglect: if you discover a login endpoint, within 2 subsequent turns you MUST try (1) default/weak credentials, (2) SQL injection in the login body, (3) JWT none / kid-injection if a JWT is returned, (4) IDOR if a user-id appears in the response
  • Repeat-payload trap: do not re-send a failed payload later "to double-check"

Each rule has a dedicated assertion in prompts.test.ts so a future prompt refactor can't silently strip them out.
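As a rough illustration of that kind of guard, the check below reports which rule names are absent from a prompt string. The `EFFICIENCY_RULES` list and `missingRules` helper are hypothetical; the actual assertions in prompts.test.ts may match on different text.

```typescript
// Rule names as listed in these notes; the real test may match other phrases.
const EFFICIENCY_RULES = [
  "Bundle paralysis",
  "Passive-only recon",
  "Auth endpoint neglect",
  "Repeat-payload trap",
];

// Returns the rule names that do not appear anywhere in the prompt text.
function missingRules(prompt: string): string[] {
  return EFFICIENCY_RULES.filter((rule) => !prompt.includes(rule));
}
```

A test would then assert that `missingRules(shellPentestPrompt)` is empty, so removing any one rule fails the build.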

Tests

  • 360 unit tests passing (up from 358 in 0.7.9)
  • New native-loop regression test for the error-summary clobber
  • New shellPentestPrompt test locking in the four efficiency rules
  • CI install-smoke matrix green on Node 20/22/24 + Bun + Docker

Install

npx pwnkit-cli@latest
# or
bunx pwnkit-cli@latest

v0.7.9 — reveal the full stage-end story in the scan TUI

07 Apr 13:24

Why this exists

After 0.7.8 a hardened-target scan looked like this:

```
✓ Attack 0 findings, First attempt (10 turns): no findings. Retr...
```

…and users reasonably asked "it just stops, what happened?" The verbose toggle revealed all 40+ per-turn tool calls, but the TUI was still clipping the final stage-summary sentence at 55 characters — the one piece of text that actually explains why the scan ended.

Two fixes in 0.7.9

1. Unclipped stage details + verbose-aware rendering

The stage-end detail is now stored at full length in the scan TUI state. A new pure helper formatStageDetail(detail, verbose) in @pwnkit/core clips to 55 chars in compact mode and passes the full text through in verbose mode (v / Ctrl+O). Hit the toggle mid-scan and you can finally read sentences like:

First attempt (10 turns): no findings. Retry (10 turns): Agent reached max turns (10) without completing.
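A minimal sketch of such a helper, assuming the clip simply cuts at 55 characters and appends "..." (which matches the truncated 0.7.8 example above). Only the signature and the 55-character limit come from these notes; the real implementation in @pwnkit/core may differ.

```typescript
const COMPACT_LIMIT = 55; // compact-mode budget from the notes

// Verbose mode passes the text through; compact mode clips long details.
function formatStageDetail(detail: string, verbose: boolean): string {
  if (verbose || detail.length <= COMPACT_LIMIT) return detail;
  return detail.slice(0, COMPACT_LIMIT) + "..."; // ellipsis style is an assumption
}
```

Because the function is pure, the compact-clips-but-verbose-reveals behavior can be tested without rendering the TUI.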

2. Post-scan Outcome: block

A new block renders under the summary bar once the scan finishes, showing each completed stage's terminal summary wrapped to the terminal width regardless of the verbose toggle. So when you get 0 findings on a hardened target, you now see:

```
──────────────────────────────────────
0 critical 0 high 0 medium 0 low 0 info
634.6s

Outcome:
Discover:
Agent reached max turns (12) without completing.
Attack:
0 findings, First attempt (10 turns): no findings. Retry (10 turns):
Agent reached max turns (10) without completing.
Report:
0 findings (0 confirmed)

v / ctrl+o verbose Press Enter, Esc, or q to close.
```

Turns "the scan just stops" into "here is exactly what the agent tried and why it stopped."

Tests

  • 358 passing (up from 347 in 0.7.8)
  • 11 new unit tests for normalizeStageEndDetail and formatStageDetail including the compact-clips-but-verbose-reveals regression guard
  • Verified end-to-end against a live Azure Responses API deployment

Install

```sh
npx pwnkit-cli@latest
# or
bunx pwnkit-cli@latest
```

v0.7.8 — real scan activity visibility in the TUI

07 Apr 12:54

What's new

The 0.7.6 verbose toggle (v or Ctrl+O) finally shows real content. Two bugs on top of 0.7.6 were making it useless; this release fixes both.

The scan TUI now shows what the agent is actually doing

Before, verbose mode revealed an empty actions list even though the agent was churning through turns:

✓ Discover     Discovery turn 3
◌ Attack
◌ Verify
◌ Report
  verbose on

After, you get a live stream of per-turn tool calls with real arguments:

⠇ Discover     using API
    → turn 1: crawl: https://doruk.ch
    → turn 2: http_request: GET https://doruk.ch/robots.txt
    → turn 3: http_request: GET https://doruk.ch/.well-known/security.txt
⠼ Attack       using API
    → turn 1: bash: set -x; for path in / /robots.txt /api/ /openapi.json /.git/config; do curl -sI "$path"; done
    → turn 2: bash: nmap -sV --top-ports 100 doruk.ch
    → turn 3: http_request: POST https://doruk.ch/api/login
  verbose on

Compact mode (default) still only shows the last 3 actions per stage with 60-char truncation. Hit v or Ctrl+O to flip into verbose and reveal up to 40 actions per stage with 120-char rows.
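The two display budgets can be expressed as a small selector. This is a sketch under assumptions: `visibleActions` and `ViewLimits` are hypothetical names, verbose mode is assumed to keep the most recent 40 entries, and the "..." truncation marker is illustrative.

```typescript
interface ViewLimits {
  maxActions: number;
  maxChars: number;
}

const COMPACT: ViewLimits = { maxActions: 3, maxChars: 60 };
const VERBOSE: ViewLimits = { maxActions: 40, maxChars: 120 };

// Picks which action rows to show for one stage and truncates over-long rows.
function visibleActions(actions: string[], verbose: boolean): string[] {
  const { maxActions, maxChars } = verbose ? VERBOSE : COMPACT;
  return actions
    .slice(-maxActions) // keep the most recent entries
    .map((row) => (row.length > maxChars ? row.slice(0, maxChars - 3) + "..." : row));
}
```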

The bugs this fixes

1. Per-turn events were marked as stage:end instead of stage:start

The agentic scanner was emitting stage:end events on every turn of the native-API loop. The TUI correctly interpreted stage:end as "this stage is done", so it marked Discover as ✓ after turn 1, then kept overwriting the detail text with "Discovery turn N" on every subsequent turn. No sub-action was ever stored, so the verbose toggle had nothing to reveal.

Fix: all five onTurn callsites (three native-API + two legacy subprocess runtimes) now emit stage:start sub-actions while the stage is running. stage:end is reserved for actual terminal events.
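The distinction matters because of how a state reducer treats the two event types. The sketch below uses hypothetical `StageEvent` and `StageState` shapes, not the actual ones in @pwnkit/core: stage:start appends a sub-action, while stage:end marks the stage finished and sets its final detail.

```typescript
// Hypothetical event and state shapes for illustration.
type StageEvent =
  | { type: "stage:start"; stage: string; detail: string }
  | { type: "stage:end"; stage: string; detail: string };

interface StageState {
  done: boolean;
  detail: string;
  actions: string[];
}

function reduceStage(state: StageState, event: StageEvent): StageState {
  if (event.type === "stage:end") {
    // Terminal event: the stage is finished and gets its summary text.
    return { ...state, done: true, detail: event.detail };
  }
  // Running event: record one more sub-action and leave the stage open.
  return { ...state, actions: [...state.actions, event.detail] };
}
```

Under this model, emitting stage:end on every turn marks the stage done after turn 1 and never stores a sub-action, which is the symptom described above.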

2. Tool call previews showed only the tool name, not its arguments

Even with the event type fixed, verbose mode showed turn 1: bash, bash — technically accurate but useless. You couldn't tell whether the agent was running curl, nmap, or whoami.

Fix: a new toolCallPreview() formatter in @pwnkit/core that extracts the most identifying argument per tool (command for bash, method url for http_request, url for crawl, [severity] title for save_finding, etc.) and produces a one-line preview. One sub-action is emitted per tool call instead of one per turn, so turns that ran multiple tools show each one individually.
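A simplified sketch of such a formatter, covering only the four tools named above. The argument key names (`command`, `method`, `url`, `severity`, `title`) and the default to GET are assumptions; the real toolCallPreview() in @pwnkit/core handles more tools and may format differently.

```typescript
type ToolArgs = Record<string, unknown>;

// Treats non-string arguments as empty so the preview never throws.
function str(value: unknown): string {
  return typeof value === "string" ? value : "";
}

function toolCallPreview(tool: string, args: ToolArgs): string {
  let detail: string;
  switch (tool) {
    case "bash":
      detail = str(args["command"]);
      break;
    case "http_request":
      detail = `${str(args["method"]) || "GET"} ${str(args["url"])}`; // GET default is an assumption
      break;
    case "crawl":
      detail = str(args["url"]);
      break;
    case "save_finding":
      detail = `[${str(args["severity"])}] ${str(args["title"])}`;
      break;
    default:
      detail = "";
  }
  const oneLine = detail.replace(/\s+/g, " ").trim(); // collapse newlines into one row
  return oneLine.length > 0 ? `${tool}: ${oneLine}` : tool;
}
```

For example, `toolCallPreview("bash", { command: "nmap -sV --top-ports 100 doruk.ch" })` produces a row in the same `tool: detail` shape as the verbose listing shown earlier.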

Installation

npx pwnkit-cli@latest
# or
bunx pwnkit-cli@latest

Tests and validation

  • 347 unit tests passing (up from 327 in 0.7.6), +20 new tool-preview tests
  • Verified end-to-end against a live Azure Responses API deployment at rapidata-hackathon-resource.openai.azure.com
  • gh run install-smoke matrix green on Node 20/22/24 + Bun + Docker

v0.7.6 — Azure multi-turn fix + verbose TUI toggle

07 Apr 12:27

Fixes

Azure Responses API multi-turn scans (output_text bug)

Every multi-turn scan on Azure OpenAI has been silently failing at turn 2 or later with:

Azure OpenAI API error 400: Invalid value: 'input_text'.
Supported values are: 'output_text' and 'refusal'.

The agent loop replays the assistant's prior text reply on every turn to preserve context, and that replay was being serialized with the wrong content-type tag. In the TUI this looked like "Agent reached max turns without completing" with zero findings — the real cause was that every turn after the first was getting rejected by Azure. See 6598cb8.

Verified end-to-end against a live Azure deployment: a quick scan now completes 5 attack turns cleanly where the old code 400'd at turn 2.
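The role-to-type rule behind the fix can be sketched as follows. The `ChatMessage` and `ResponsesInputItem` shapes are simplified assumptions, not the project's actual types; the only behavior taken from the error above is that replayed assistant text must be tagged output_text rather than input_text.

```typescript
type Role = "user" | "assistant";

interface ChatMessage {
  role: Role;
  text: string;
}

// Simplified view of one Responses API input item.
interface ResponsesInputItem {
  role: Role;
  content: { type: "input_text" | "output_text"; text: string }[];
}

function toResponsesInput(messages: ChatMessage[]): ResponsesInputItem[] {
  return messages.map((m): ResponsesInputItem => ({
    role: m.role,
    content: [
      {
        // Replayed assistant replies are outputs; everything else is input.
        type: m.role === "assistant" ? "output_text" : "input_text",
        text: m.text,
      },
    ],
  }));
}
```

A regression test in this spirit walks a multi-turn history and checks that every assistant item carries output_text and every user item carries input_text.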

WAL-mode SQLite auto-migration

Existing pwnkit.db files written by pre-0.7.1 builds were stored in SQLite WAL mode (file format bytes 18/19 = 2). node-sqlite3-wasm's Emscripten VFS can't read WAL-mode files, so upgrading users hit a confusing "unable to open database file" error on first run. The pwnkitDB constructor now detects WAL-mode headers and flips them to legacy rollback mode in place. All user data is preserved. First shipped in 0.7.5 and carried forward.
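The header check itself is small. In the SQLite file format, byte 18 is the write version and byte 19 the read version, with 1 meaning legacy rollback journal and 2 meaning WAL. The sketch below works on an in-memory copy of the header; `isWalHeader` and `toRollbackHeader` are hypothetical names, and any handling of leftover -wal files is out of scope here.

```typescript
const WRITE_VERSION_OFFSET = 18;
const READ_VERSION_OFFSET = 19;
const LEGACY = 1;
const WAL = 2;

// True when either format-version byte in the header marks the file as WAL mode.
function isWalHeader(header: Uint8Array): boolean {
  return (
    header.length >= 20 &&
    (header[WRITE_VERSION_OFFSET] === WAL || header[READ_VERSION_OFFSET] === WAL)
  );
}

// Returns a patched copy; the actual migration rewrites the file in place.
function toRollbackHeader(header: Uint8Array): Uint8Array {
  const patched = Uint8Array.from(header);
  patched[WRITE_VERSION_OFFSET] = LEGACY;
  patched[READ_VERSION_OFFSET] = LEGACY;
  return patched;
}
```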

Stale advisory-lock recovery

node-sqlite3-wasm implements SQLite's lock protocol as mkdirSync("${db}.lock") / rmdirSync, which is atomic on POSIX but has no recovery path when the lock holder is killed with SIGKILL before the rmdir. Pre-0.7.1 this was hidden because better-sqlite3 used kernel byte-range locks. The constructor now detects stale lock directories older than 10 seconds and clears them with a one-line warning. Fresh lock directories are left alone, so legitimate concurrent writers are not clobbered. First shipped in 0.7.5 and carried forward.
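The staleness decision can be sketched as a pure function over the lock directory's timestamp. Using the modification time as the age measure is an assumption, and `decideLock` is a hypothetical name; only the 10-second threshold comes from these notes.

```typescript
const STALE_AFTER_MS = 10000; // "older than 10 seconds" from the notes

type LockDecision = "no-lock" | "leave-alone" | "clear-stale";

// lockMtimeMs is null when no lock directory exists.
function decideLock(lockMtimeMs: number | null, nowMs: number): LockDecision {
  if (lockMtimeMs === null) return "no-lock";
  return nowMs - lockMtimeMs > STALE_AFTER_MS ? "clear-stale" : "leave-alone";
}
```

A caller would feed in the directory's timestamp and the current time, then remove the directory and log the warning only on "clear-stale".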

Features

Verbose TUI toggle (v or Ctrl+O)

Hit v or Ctrl+O during a scan, review, or audit to reveal the full turn-by-turn agent activity. Compact mode stays the default (last 3 actions per stage, 60-char truncation); verbose mode shows up to 40 actions per stage with a wider 120-char budget. Muscle memory matches codex CLI and Claude CLI. Toggle works mid-run, not just after the summary. See de3e85a.

All three commands (scan, review, audit) share a single Ink render path, so the toggle applies automatically to every mode.

Install / upgrade

```sh
npx pwnkit-cli@latest
# or
bunx pwnkit-cli@latest
```

Pure-JS with a WASM SQLite core — no native bindings to rebuild, runs under Node 20/22/24, Bun, and Docker without a recompile dance.

Tests and CI

  • 327 unit tests passing, up from 305 in 0.7.5
  • New regression test for the Azure output_text bug walks a realistic multi-turn message stream and asserts role → type consistency
  • 22 new unit tests for the scan TUI state reducers (pure functions extracted to @pwnkit/core/scan-ui-state)
  • CI install-smoke matrix (Node 20/22/24 + Bun + Docker) now covers --help, doctor, history, mcp-server stdio handshake, and review source pipeline bootstrap — via the shared scripts/smoke-cli.sh harness

v0.7.3 — single source of truth for VERSION

07 Apr 11:46

Patch release (developer ergonomics)

v0.7.2 fixed the immediate `--version` regression with a lockstep approach: bump root `package.json` AND `@pwnkit/shared/constants.ts`, plus a regression test that catches drift after the fact. This release replaces that with a strictly better approach where drift is impossible at the source level.

What changed

  • `packages/shared/src/constants.ts` — `VERSION` is now sourced from root `package.json` via two paths:

    • Bundled mode (esbuild via the release pipeline): the bundler injects the version as a string literal at build time via `define`. Zero runtime fs cost.
    • Source / test mode (tsx, vitest, any unbundled flow): a one-time synchronous fs read of `../../../package.json` resolves the version at module load.
      Either way, root `package.json` is the single source of truth.
  • `scripts/bundle-cli.mjs` — reads root `package.json` once, passes the version into esbuild via `define: { PWNKIT_VERSION: JSON.stringify(version) }`.

  • `packages/core/src/version-sync.test.ts` — kept as a smoke check that catches anyone re-introducing a hardcoded constant. Trivially passes after this fix.

Future releases

Bumping the version is now exactly one edit: change root `package.json` `version`. The CLI `--version`, the published `dist/package.json`, and the GitHub release tag will all match automatically.

```bash
npx pwnkit-cli@latest --version
# 0.7.3
```

v0.7.2 — fix CLI --version reporting

07 Apr 11:41

Patch release

v0.7.1 shipped with the root `package.json` bumped to 0.7.1 but the `@pwnkit/shared` `VERSION` constant accidentally left at 0.7.0. Since the CLI's `--version` reads from the constant, `npx pwnkit-cli@0.7.1 --version` reported "0.7.0". That is a credibility bug, and an already-published 0.7.1 cannot be republished, so this release is the patch.

What's in v0.7.2

  • Fixed: CLI `--version` correctly reports the current version
  • Added: `packages/core/src/version-sync.test.ts` — regression test that fails the build if root `package.json` and the `VERSION` constant ever drift again
  • Documented: lockstep requirement spelled out in a comment above the `VERSION` constant

Everything else from v0.7.1 still applies — pure-WASM SQLite, zero native modules, runs on Node 18 / 20 / 22 / 25 / Bun. See the v0.7.1 release notes for the full changelog.

Upgrade

```bash
npx pwnkit-cli@latest --version
# 0.7.2
```

v0.7.1 — pure-WASM SQLite, no native bindings

07 Apr 11:31

Headline

pwnkit no longer ships any native modules. The SQLite persistence layer was migrated from better-sqlite3 to node-sqlite3-wasm, eliminating the entire NODE_MODULE_VERSION ABI mismatch class. The same binary now works on Node 18 / 20 / 22 / 25 / Bun without prebuilds, postinstall hacks, or a rebuild dance.

npx pwnkit-cli@0.7.1 scan --target https://your-app.example.com
# or, ~10x faster cold start under Bun:
bunx pwnkit-cli@0.7.1 scan --target https://your-app.example.com

What's in this release

Core change — drop better-sqlite3, run on pure-WASM SQLite

  • packages/db/src/wasm-shim.ts (NEW, 240 lines) — wraps node-sqlite3-wasm and exposes the better-sqlite3 API surface that drizzle-orm's BetterSQLiteSession expects
  • packages/db/package.json — drops better-sqlite3 + @types/better-sqlite3, adds node-sqlite3-wasm@^0.8.55
  • packages/core/src/db/database.ts — deleted as dead code
  • scripts/verify-native.mjs — deleted (the postinstall ABI fixer is no longer needed)
  • scripts/bundle-cli.mjs — marks node-sqlite3-wasm external so the sidecar .wasm resolves via node_modules in the shipped tarball

New tools and playbooks

  • wp_fingerprint tool (#80) — autonomous WordPress plugin/theme detection with OSV CVE lookup. Enumerates from four parallel sources (HTML scrape, /wp-json/ namespaces, /wp-json/wp/v2/posts, /wp-content/{plugins,themes}/), reads each plugin's readme.txt and style.css for version, queries api.osv.dev for CVEs. Same-origin SSRF guard. Activated via --features wp_fingerprint. 10 unit tests, zero network in tests.
  • Anti-honeypot flag-shape validator (#82) — when the agent calls done with a flag-shaped string, the validator scores its inner content. Hex/UUID/UUID4/sha256 score high; sentences and decoy markers (script_kiddie, decoy, fake, honeypot) score below 0.5 and are rejected once with a hint to keep exploring. Catches the XBEN-079 honeypot decoy FLAG{I'm_a_Script_Kiddie} at confidence 0.05; accepts the real FLAG{7dda4e738d29a996f343a610} at 0.9. 26 new tests. Default ON, disable with --no-decoy-detection.
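To make the flag-shape idea concrete, here is a toy scorer. It is not the shipped validator: the regexes, the 0.4 middle score, and the `scoreFlag` / `acceptFlag` names are assumptions, and the 0.05 and 0.9 constants are simply set to mirror the two examples quoted above.

```typescript
const DECOY_MARKERS = ["script_kiddie", "decoy", "fake", "honeypot"];
const HEX_RE = /^[0-9a-f]{16,}$/;
const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/;

// Scores the text inside NAME{...}; higher means more likely a real flag.
function scoreFlag(flag: string): number {
  const match = /^[A-Za-z0-9_]+\{(.*)\}$/.exec(flag);
  if (match === null) return 0; // not flag-shaped at all
  const inner = (match[1] ?? "").toLowerCase();
  if (DECOY_MARKERS.some((marker) => inner.includes(marker))) return 0.05;
  if (HEX_RE.test(inner) || UUID_RE.test(inner)) return 0.9;
  return 0.4; // illustrative score for sentence-like or unknown content
}

// Flags scoring below 0.5 are rejected, per the notes.
function acceptFlag(flag: string): boolean {
  return scoreFlag(flag) >= 0.5;
}
```

With these toy rules, `acceptFlag("FLAG{I'm_a_Script_Kiddie}")` is false and `acceptFlag("FLAG{7dda4e738d29a996f343a610}")` is true.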

Statistical evaluation harness

  • --repeat N flag for the XBOW runner (#81) — runs each enabled challenge N independent times and reports per-attempt success rate with a 95% Wilson score confidence interval, mean turns, mean cost, and per-run breakdown. Wilson rather than Wald CI because N is small and rates can be near 0/1. Backed by a per-cell cost ceiling (--repeat-cost-ceiling-usd, default $5) so n=10 across the unsolved 8 stays under $40 per sweep. 15 new Wilson tests covering N=10 with 0/1/3/5/9/10 passes.
  • Methodology documentation (#84) — new docs page at docs.pwnkit.com/methodology explains the difference between best-of-N (what XBOW reports), per-attempt success rate (what `--repeat N` measures), and methodology disclosure as a moat. Walks through the XBEN-061 v1-vs-v2 regression-test story as a worked example.
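For reference, the Wilson score interval that `--repeat N` reports can be computed as below. This is a generic sketch of the standard formula, not the project's implementation; `wilsonInterval` is a hypothetical name and z = 1.96 corresponds to the 95% level.

```typescript
interface Interval {
  rate: number;
  lower: number;
  upper: number;
}

// Wilson score interval for `successes` out of `trials` Bernoulli attempts.
function wilsonInterval(successes: number, trials: number, z: number = 1.96): Interval {
  if (trials <= 0) throw new RangeError("trials must be positive");
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return {
    rate: p,
    lower: Math.max(0, center - half),
    upper: Math.min(1, center + half),
  };
}
```

Unlike the Wald interval, which collapses to zero width at 0/10 or 10/10, this still gives a usable range: 0 passes out of 10 yields an upper bound of roughly 0.28.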

Benchmark substrate disclosure

  • --benchmark-repo flag (#78) — the XBOW runner now accepts any XBOW-compatible benchmark fork as its source. Defaults to 0ca/xbow-validation-benchmarks-patched (community fork that fixes the 43 upstream Docker images that no longer build), but can run against xbow-engineering/validation-benchmarks (strict upstream), KeygraphHQ/xbow-validation-benchmarks, or any other fork for substrate disclosure / comparison. CI workflow gets matching benchmark_repo and benchmark_ref workflow_dispatch inputs.

Operator visibility

  • Azure region logging (#85) — when the runtime initializes the Azure provider, it probes the endpoint once for the x-ms-region response header and logs the resolved region in the startup banner. Cached process-wide, failure-tolerant, honors PWNKIT_REGION_OVERRIDE for tests, no-op for non-Azure providers.

Pre-recon hook

  • WordPress plugin CVE coverage in pre-recon (#83) — when the target probes positive for WordPress, pre-recon now auto-invokes runWpFingerprint() and folds the structured CVE leads into the agent's system prompt before the attack phase starts.

Investigation

  • XBEN-099 root cause (#79) — the persistent failure was misdiagnosed as Docker rot. The actual cause is a 60s timeout race in pwnkit's own xbow-runner.ts against the compose file's 30s mongo healthcheck interval. Documented at docs.pwnkit.com/research/xben-099-investigation. One-line fix is queued for the next release.

Verification

  • pnpm lint clean across all packages
  • pnpm --filter @pwnkit/core test: 303/303 passing across 19 test files
  • ✅ Fresh tarball install via npm install on Node 25 → scan reaches the agentic loop, writes a 250 KB SQLite database file, no native-binding crash
  • ✅ Fresh tarball install via bun add → same scan works under Bun, 250 KB DB written, no crash

Upgrade notes

  • Drop-in upgrade for npx / bunx users — no migration needed, the WASM shim presents the same drizzle API
  • For locally-cached ~/.npm/_npx users on 0.7.0 — recommended one-time rm -rf ~/.npm/_npx/* before the first 0.7.1 run to avoid stale cached prebuilds
  • The --features CLI flag is new — wires up wp_fingerprint, decoy_detection, etc. by setting the corresponding PWNKIT_FEATURE_<NAME> env var
  • Decoy detection is on by default — set PWNKIT_FEATURE_DECOY_DETECTION=0 or pass --no-decoy-detection to disable
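The flag-to-env mapping can be pictured like this. It is a sketch under assumptions: a comma-separated value for --features and "1" as the enabling value are both guesses, and `featureEnv` is a hypothetical name; only the PWNKIT_FEATURE_<NAME> naming comes from these notes.

```typescript
// Maps "wp_fingerprint,decoy_detection" to PWNKIT_FEATURE_* variables.
function featureEnv(featuresFlag: string): Record<string, string> {
  const env: Record<string, string> = {};
  for (const raw of featuresFlag.split(",")) {
    const name = raw.trim();
    if (name.length === 0) continue; // ignore empty segments
    env[`PWNKIT_FEATURE_${name.toUpperCase()}`] = "1"; // "1" is an assumed value
  }
  return env;
}
```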

Stats

  • 10 commits since v0.7.0
  • 1,800+ lines added across the agent, benchmark, runtime, audit, docs, and bundler
  • 8 GitHub issues closed (#78, #80, #81, #82, #83, #84, #85, plus the closed-at-merge cluster)
  • 38 new unit tests across wp-fingerprint, flag-validator, wilson, xbow-runner, llm-api, pre-recon-cve

v0.6.0 — FP Reduction Moat

06 Apr 08:52

pwnkit v0.6.0 — The FP Reduction Moat Release

114 commits since v0.5.0. Major shipping round across the agent, the benchmarks, the docs, and the website.

Headline numbers

  • XBOW: 87.5% (91/104) — up from 86.5% after cracking XBEN-042
  • Cybench: 80% (8/10) — first non-XBOW benchmark score, includes 1 Medium difficulty
  • npm-bench: F1 0.444 — first published score on the new 81-package benchmark
  • 188/188 core tests passing

The 11-layer FP reduction moat

Built the open-source equivalent of Endor Labs' commercial pipeline. Every finding now walks through up to 11 independent verification layers:

  1. Holding-it-wrong filter — rejects findings whose "vulnerability" is the function's documented purpose (packages/core/src/triage/holding-it-wrong.ts)
  2. 45-feature handcrafted extractor — for future XGBoost training (triage/feature-extractor.ts)
  3. Per-class oracles — SQLi, XSS, SSRF, RCE, path traversal, IDOR with concrete artifact extraction (triage/oracles.ts)
  4. Self-consistency voting — N=3 majority vote across verify runs (triage/verify-pipeline.ts)
  5. Reachability gate — Endor Labs-style "is this sink actually called?" check (triage/reachability.ts)
  6. Multi-modal cross-validation — pwnkit × foxguard agreement (triage/multi-modal.ts)
  7. PoV generation gate — empirical "if you can't build a working PoC in N turns, downgrade" (triage/pov-gate.ts)
  8. Assistant memories — Semgrep-style 2.8x FP reduction via per-target FP context (triage/memories.ts)
  9. Structured 4-step verify pipeline — reachability → payload → impact → exploit confirmation (triage/verify-pipeline.ts)
  10. Adversarial debate — prosecutor vs defender with skeptical judge (triage/adversarial.ts)
  11. EGATS evidence-gated tree search — MAPTA-style branch expansion (agent/egats.ts)
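As one concrete example from the list, layer 4 (self-consistency voting) reduces to a majority vote over repeated verify runs. The sketch below is illustrative only: `Verdict`, `majorityVote`, and `verifyWithVoting` are hypothetical names, and treating ties as rejections is an assumption.

```typescript
type Verdict = "confirmed" | "rejected";

// A finding survives only with a strict majority of confirmations.
function majorityVote(verdicts: Verdict[]): Verdict {
  const confirmed = verdicts.filter((v) => v === "confirmed").length;
  return confirmed * 2 > verdicts.length ? "confirmed" : "rejected";
}

// Runs the verifier N times (3 by default, as in the notes) and votes.
function verifyWithVoting(verifyOnce: () => Verdict, runs: number = 3): Verdict {
  const verdicts: Verdict[] = [];
  for (let i = 0; i < runs; i++) verdicts.push(verifyOnce());
  return majorityVote(verdicts);
}
```

With N=3, a single noisy "confirmed" run no longer carries a finding through; at least two of the three runs must agree.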

New benchmarks

  • Cybench runner — 40 CTF challenges from CSAW/picoCTF/HackTheBox via the Cybench academic benchmark (packages/benchmark/src/cybench-runner.ts)
  • AutoPenBench runner — 33 network/CVE pentesting tasks (CI workflow + Dockerfile patches for upstream EOL bases)
  • npm-bench expansion — 30 → 81 packages (17 new malicious from Socket.dev/Phylum, 17 new CVEs from NVD/GHSA, 17 new safe top-50 packages)

New features

  • Authenticated scanning — --auth flag with bearer/cookie/basic/header (#27)
  • OpenAPI/Swagger import — --api-spec flag pre-loads endpoint knowledge (#28)
  • Remediation guidance — auto-generated fix suggestions for 18+ vuln categories (#29)
  • PDF pentest reports — --format pdf (#30)
  • GitHub Issues export — --export github:owner/repo (#31)
  • Kali Docker executor — PWNKIT_FEATURE_DOCKER_EXECUTOR (#34)
  • PTY session management — interactive shells for reverse shells, DB clients (#37)
  • OpenRouter multi-model runtime — ensemble + model rotation across 6 default models (#42)
  • Best-of-N strategy racing — --race flag with 5 attack strategies (#25)
  • Triage CLI — pwnkit triage memory add/list/remove/mark-fp for human-in-the-loop FP labeling
  • Per-scan cost tracking — USD estimates from token usage (#36)
  • Web search tool — DuckDuckGo with anti-cheat blocklist (#35)
  • Progress handoff — failed retry attempts inject prior findings into next attempt (#32)
  • 5 new playbooks — blind exploitation, CVE lookup, deserialization, request smuggling, creative IDOR

Pre-built Docker image

docker run ghcr.io/peaktwilight/pwnkit:latest scan --target https://example.com

Multi-stage Dockerfile (builder + runtime), Kali base, full pentest toolset, Playwright + Chromium, ~710MB. Auto-builds on tag push (#50).

Triage system architecture

The structured verify pipeline can now be wired into the scan flow via feature flags. Early benchmarks show the 11-layer stack approaches commercial parity (Endor 95%, Semgrep 96% — pwnkit's pipeline is now structurally equivalent, training data accumulating).

Documentation

  • New research page: docs/src/content/docs/research/fp-reduction-moat.md — full stack ordering, FP reduction per layer, citations to FalseCrashReducer, MAPTA, Anthropic Debate, IBM D2A, VulnBERT
  • New page: docs/src/content/docs/triage.md — 11-layer pipeline with branch points
  • New page: docs/src/content/docs/features.md — comprehensive feature catalog
  • New page: docs/src/content/docs/recipes.md — real-world scenarios
  • Mermaid diagrams across architecture, agent-loop, fp-reduction-moat
  • Doc accuracy audit fixed 8 categories of stale CLI/flag/file references

Website redesign

  • 13 sections culled to 8 essential ones
  • Apple-style positioning (no defensive competitor mentions in copy)
  • Ghostty warm gray palette (no more blue tint)
  • Hero stat: massive 87.5% in JetBrains Mono
  • TrustBar: only the 3 benchmarks we have published scores for
  • Inline SVG hero ghost (replaces heavy WebGL React Three Fiber)
  • shadcn/ui installed with radix-luma base + migration plan
  • 14 named section components instead of 800-line monolith

Bug fixes

  • Azure OpenAI detection in audit/review (was silently skipping AI when only AZURE_OPENAI_API_KEY was set)
  • OpenRouter priority over Azure in CI (caused all benchmarks to route to OpenRouter and 403 once free tier exhausted)
  • XBOW runner hang after benchmark completion (Playwright browser processes kept event loop alive — process.exit(0) defensive fix)
  • AutoPenBench: docker-compose v1 → v2 shim, eclipse-temurin Dockerfile patches
  • Mermaid diagram rendering enabled in docs

Stats

  • ~25,000 lines of new code
  • 188/188 core tests passing (up from 62 at session start)
  • 17 GitHub issues closed in this release
  • 17 GitHub issues opened for the next round (FP improvements, benchmark expansions, design polish)

Install

npx pwnkit-cli@latest scan --target https://example.com

Or pull the Docker image:

docker pull ghcr.io/peaktwilight/pwnkit:0.6.0

Full commit list

See git log v0.5.0..v0.6.0 for the complete history (114 commits).

v0.5.0 — Playwright, White-Box, 5 Benchmarks

04 Apr 20:43

Highlights

Playwright browser tool — XSS testing via headless Chromium. Dialog capture detects alert/confirm/prompt for confirmed XSS. Optional dependency, auto-detected at runtime. Unlocks 23 XBOW XSS challenges.

White-box mode — --repo <path> gives the agent source code access. Cracked 3 "impossible" challenges that no black-box approach could solve (XBEN-042, 034, 054). Like Shannon at 96%.

5 benchmark suites covering every domain pwnkit supports:

  • AI/LLM security: 100% (10/10)
  • XBOW web pentesting: 35 flags (73% local)
  • AutoPenBench: 33 network/CVE tasks (runner built)
  • HarmBench: 510 LLM safety behaviors (runner built)
  • npm audit: 30 packages — first npm security benchmark

Shell-first validated — 3 tools (bash, save_finding, done) outperform 10 structured tools. A/B tested and documented.

Install

npx pwnkit-cli

# With Playwright for XSS
npm i -g pwnkit-cli
npm i playwright && npx playwright install chromium
pwnkit scan --target https://example.com --mode web

Full changelog

73 commits. See docs at https://docs.pwnkit.com

v0.4.2 — Shell-First Web Pentesting, 70% on XBOW

04 Apr 00:40

Shell-first pentesting

pwnkit now uses a shell-first approach for web pentesting. Instead of structured tools (crawl, submit_form, http_request), the agent gets shell_exec — run any bash command. The agent writes curl commands, Python exploit scripts, and chains operations like a real pentester.

Why: Structured tools failed on XBOW challenges (0/10 flag extractions in 20+ turns). Shell-first cracked 7/10 in an average of 10 turns. The model knows curl from training data — no need to learn a custom tool API.

XBOW benchmark results

7/10 buildable challenges cracked (70%):

  • IDOR (10 turns), SSTI (5 turns), deserialization (4 turns), file upload (12 turns), blind SQLi (20 turns), privilege escalation (9 turns), markdown injection (10 turns)

For context: KinoSec scores 92.3%, XBOW scores 85%.

Other changes

  • Philosophy documented at docs.pwnkit.com/philosophy
  • Shell-first wired as default web pentesting mode in the scanner
  • Blog post: "why we gave our agent a terminal instead of tools"

Install

npx pwnkit-cli