feat(browser): compositor-level coordinate click with persistent-WS dispatch (23x faster) #19189

Open
kshitijk4poor wants to merge 5 commits into main from feat/browser-coordinate-click

@kshitijk4poor kshitijk4poor commented May 3, 2026

What this PR does

Adds optional x/y parameters to browser_click for viewport-coordinate clicking via CDP Input.dispatchMouseEvent. The browser's own compositor handles hit-testing, so clicks bypass all DOM selector machinery — they pass through iframes, shadow DOM, canvas elements, and anything else the accessibility tree can't reach.

This PR also introduces a 4-tier dispatch architecture that, on the common path (after browser_navigate), brings coordinate click latency close to ref-based clicking (about 1.5×, or 0.07ms of overhead, in the benchmark below).


Problem

Ref-based clicking (browser_click(ref="@e5")) fails or is unreliable for:

| Scenario | Ref-based | Coordinate (x, y) |
| --- | --- | --- |
| Cross-origin iframes (OOPIFs) | ✗ refs don't reach in | ✓ compositor routes to correct renderer |
| Closed shadow DOM | ✗ Playwright can't pierce | ✓ bypasses DOM tree entirely |
| Canvas / WebGL elements | ✗ no DOM nodes = no refs | ✓ screenshot → coordinate |
| Dynamic overlays / popups | May be stale by click time | ✓ hits whatever is visually on top |
| Custom web components | Framework event handling varies | ✓ compositor-level input, fires all handlers |

API

browser_click(x=150, y=300)        # compositor-level click at viewport coords
browser_click(ref="@e5")            # existing ref click — completely unchanged

ref and x+y are mutually exclusive. Providing both, or only one of x and y, returns a clear error. Providing neither is also an error (previously ref was required).
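The validation rules above can be sketched as a small pure function. This is an illustration only: `validate_click_args` and its error strings are hypothetical and are not taken from tools/browser_tool.py.

```python
from typing import Optional, Union

Number = Union[int, float]


def validate_click_args(
    ref: Optional[str] = None,
    x: Optional[Number] = None,
    y: Optional[Number] = None,
) -> Optional[str]:
    """Return an error message, or None when the arguments are valid.

    Hypothetical helper: the real browser_click() may word errors differently.
    """
    has_ref = ref is not None
    has_x, has_y = x is not None, y is not None

    if has_ref and (has_x or has_y):
        return "ref and x/y are mutually exclusive"
    if has_x != has_y:
        return "x and y must be provided together"
    if not has_ref and not has_x:
        return "provide either ref or both x and y"
    for name, value in (("x", x), ("y", y)):
        # bool is a subclass of int, so reject it explicitly.
        if value is not None and (
            isinstance(value, bool) or not isinstance(value, (int, float))
        ):
            return f"{name} must be a number"
    return None
```

Returning a message instead of raising keeps the check easy to reuse in a tool handler that reports errors as structured results.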


Architecture: 4-tier dispatch

Researched Playwright, Puppeteer, browser-harness (browser-use), and Vercel's agent-browser source to inform every decision here.

browser_click(x, y)
    │
    ├─ 1. Supervisor path ──────── CDPSupervisor already running for this task_id?
    │      (fastest)                Yes → dispatch_mouse_click() on its live WS
    │                               Zero WS connection cost; supervisor started by browser_navigate
    │
    ├─ 2. Warm-cache path ───────── No supervisor, but session cached from prior click?
    │      (fast)                   Yes → open 1 WS, send press+release (skip getTargets)
    │
    ├─ 3. Cold-cache path ───────── First click on this endpoint
    │      (moderate)               Open 1 WS, resolve session, send press+release
    │
    └─ 4. agent-browser fallback ── No CDP endpoint configured
           (always available)       3× subprocess mouse move/down/up IPC calls
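The tier selection in the diagram can be read as a simple priority check. The sketch below assumes the three inputs are already known to the caller; `ClickTier` and `select_click_tier` are hypothetical names, not functions from this PR.

```python
from enum import Enum
from typing import Dict, Optional


class ClickTier(Enum):
    SUPERVISOR = 1      # live supervisor WebSocket for this task_id
    WARM_CACHE = 2      # one new WebSocket, cached session id
    COLD_CACHE = 3      # one new WebSocket, session resolved first
    AGENT_BROWSER = 4   # subprocess fallback, no CDP endpoint


def select_click_tier(
    supervisor_running: bool,
    cdp_endpoint: Optional[str],
    session_cache: Dict[str, str],
) -> ClickTier:
    """Pick the fastest available tier, in the order shown in the diagram."""
    if supervisor_running:
        return ClickTier.SUPERVISOR
    if not cdp_endpoint:
        return ClickTier.AGENT_BROWSER
    if cdp_endpoint in session_cache:
        return ClickTier.WARM_CACHE
    return ClickTier.COLD_CACHE
```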

Tier 1 — Supervisor path (new)

CDPSupervisor (already in browser_supervisor.py) maintains a persistent, self-healing WebSocket per task_id. It exists for dialog detection and frame tracking — this PR extends it with dispatch_mouse_click().

Dispatch model: asyncio.run_coroutine_threadsafe onto the supervisor's background loop. mousePressed + mouseReleased are submitted as concurrent futures via asyncio.gather — both sent before either is awaited (Playwright Promise.all pattern). No serial round-trips.

# Added to CDPSupervisor:
def dispatch_mouse_click(self, x, y, button="left", timeout=10.0) -> None: ...
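A minimal sketch of the sync-to-async bridge described above, assuming a background thread that owns the event loop. `FakeSupervisor` and `_send` are hypothetical stand-ins; the real CDPSupervisor writes CDP frames to its WebSocket rather than recording them.

```python
import asyncio
import threading


class FakeSupervisor:
    """Stand-in for a supervisor that owns an asyncio loop on a background thread."""

    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        self.sent = []
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()

    async def _send(self, method: str, params: dict) -> dict:
        # Placeholder for "write a CDP frame and await its response".
        self.sent.append((method, params["type"]))
        await asyncio.sleep(0)
        return {}

    async def _click(self, x: float, y: float, button: str) -> None:
        base = {"x": x, "y": y, "button": button, "clickCount": 1}
        # gather schedules both sends before either response is awaited.
        await asyncio.gather(
            self._send("Input.dispatchMouseEvent", {**base, "type": "mousePressed"}),
            self._send("Input.dispatchMouseEvent", {**base, "type": "mouseReleased"}),
        )

    def dispatch_mouse_click(self, x, y, button="left", timeout=10.0) -> None:
        # Called from a synchronous thread; blocks until the loop finishes the click.
        future = asyncio.run_coroutine_threadsafe(self._click(x, y, button), self.loop)
        future.result(timeout=timeout)

    def close(self) -> None:
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
        self.loop.close()
```

The caller never touches the loop directly, which is why the only added cost over a direct call is one cross-thread future.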

Tier 2+3 — Per-click WS (session-cached)

When no supervisor is running (e.g. raw CDP without browser_navigate), opens a single WebSocket for the entire click sequence and reuses it for all CDP messages. Three optimizations derived from Playwright/Puppeteer/browser-harness research:

  1. Single connection: one TCP+WS handshake for all messages (getTargets + attachToTarget + mousePressed + mouseReleased). Eliminates 2 of the 3 connection setups vs the naive approach.
  2. Session ID caching: Target.getTargets + Target.attachToTarget results are cached in _CDP_SESSION_CACHE (keyed by endpoint URL). Second and later clicks skip session negotiation entirely. Self-heals on stale-session errors after navigation.
  3. Skip mousePressed ack: fires both mouse events before awaiting either response. Since the browser processes CDP messages sequentially within a session, an acknowledged mouseReleased implies mousePressed has already been applied. Saves one RTT. Same pattern as Playwright's Mouse.click() and Puppeteer's concurrent down+up dispatch.

Benchmark

Real Lightpanda WebSocket at ws://127.0.0.1:63372/, 300 iterations:

| Approach | Mean | Median | Min | p95 |
| --- | --- | --- | --- | --- |
| Baseline (original PR: 3 separate connections) | 4.86ms | 3.69ms | 2.69ms | 10.49ms |
| Warm cache (1 conn, session cached) | 1.30ms | 1.11ms | 0.96ms | 2.30ms |
| Supervisor (persistent WS) | 0.20ms | 0.19ms | 0.16ms | 0.33ms |
| Ref-click IPC baseline | 0.14ms | 0.13ms | 0.10ms | 0.19ms |

23.75× speedup over the original 3-connection approach. The supervisor path (the common case after browser_navigate) is 1.5× ref-click — 0.07ms overhead, which is one cross-thread future dispatch.

Benchmark script: scripts/benchmark_click_paths.py (runnable against any live CDP endpoint).
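For readers who want to reproduce the table's columns, a timing harness along these lines would produce mean, median, min, and p95. This is not the contents of scripts/benchmark_click_paths.py; `summarize` and `benchmark` are hypothetical helpers, and p95 here uses the nearest-rank definition, which may differ from the script's.

```python
import statistics
import time
from typing import Callable, Dict, List


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Reduce latency samples (milliseconds) to the four table columns."""
    ordered = sorted(samples_ms)
    # Nearest-rank p95, computed with integer arithmetic: ceil(0.95 * n).
    rank = (95 * len(ordered) + 99) // 100
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "min": ordered[0],
        "p95": ordered[rank - 1],
    }


def benchmark(fn: Callable[[], None], iterations: int = 300) -> Dict[str, float]:
    """Time fn() repeatedly with a monotonic high-resolution clock."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return summarize(samples)
```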


Optimizations sourced from research

Studied: Playwright (crConnection.ts, input.ts), Puppeteer (Connection.ts, Input.ts, TargetManager.ts), browser-harness (daemon.py, helpers.py, _ipc.py), Vercel agent-browser (cdp/client.rs, interaction.rs, browser.rs).

| Optimization | Sourced from | Applied |
| --- | --- | --- |
| Single WS per click | All harnesses | ✓ |
| Skip mousePressed ack | Playwright Promise.all, Puppeteer concurrent down+up | ✓ |
| Session ID caching with stale-session self-heal | browser-harness daemon session_id + retry once | ✓ |
| compression=None on WS | Puppeteer perMessageDeflate: false | ✓ |
| Reuse existing persistent WS (supervisor) | browser-harness daemon architecture | ✓ (via CDPSupervisor) |
| Skip mouseMoved before click | Vercel agent-browser sends 3 events; we send 2 | ✓ already (we never sent it) |
| TCP_NODELAY | Researched | ✗ tested at +0.01ms on pipelined path; not worth adding |

Files changed

| File | What changed |
| --- | --- |
| tools/browser_tool.py | _CDP_SESSION_CACHE, _cdp_resolve_session(), _cdp_coordinate_click_async() (all 4 optimizations), _cdp_coordinate_click() (3-tier dispatch), updated browser_click() signature + validation, schema update (x/y params, ref no longer required) |
| tools/browser_supervisor.py | CDPSupervisor.dispatch_mouse_click(): sync bridge with asyncio.gather pipelining |
| tests/tools/test_browser_coordinate_click.py | 27 tests: input validation (6), CDP path via mock server (4), agent-browser fallback (4), ref preservation (2), schema (3), registry (2), session caching (3), supervisor path (3) |
| scripts/benchmark_click_paths.py | 4-tier real-browser benchmark |

Commit history

ff8c6f2  feat: add compositor-level coordinate click to browser_click
0bfab1d  perf: batch CDP click into single WS connection (2.4x speedup)
451c55b  perf: session ID caching + skip mousePressed ack (browser-harness/Playwright patterns)
aef97da  perf: reuse supervisor's persistent WS for coordinate clicks (23x speedup)

kshitijk4poor added the type/feature (New feature or request) and comp/tools (Tool registry, model_tools, toolsets) labels May 3, 2026
kshitijk4poor force-pushed the feat/browser-coordinate-click branch from 61263bc to 15c75b1 May 3, 2026 11:30
alt-glitch added the P3 (Low: cosmetic, nice to have) and tool/browser (Browser automation: CDP, Playwright) labels May 3, 2026
Add optional x/y parameters to browser_click for viewport-coordinate
clicking via CDP Input.dispatchMouseEvent. When coordinates are provided,
clicks are dispatched at the browser compositor level — Chrome does its own
hit-testing, bypassing DOM selectors entirely.

Use cases where ref-based click fails but coordinate click works:
- Cross-origin iframes (OOPIFs)
- Closed shadow DOM
- Canvas/WebGL elements
- Dynamic overlays where the snapshot may be stale

Implementation:
- CDP path (preferred): Input.dispatchMouseEvent via WebSocket
  (Target.getTargets + mousePressed + mouseReleased)
- agent-browser fallback: mouse move/down/up when no CDP endpoint available
- ref is no longer required — either ref OR x+y must be provided
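For reference, the two CDP messages behind a click can be built as below. `Input.dispatchMouseEvent` and its `type`, `x`, `y`, `button`, and `clickCount` parameters are part of the Chrome DevTools Protocol; the helper name `mouse_click_messages` and the id counter are hypothetical.

```python
import json
from itertools import count
from typing import List, Optional

_ids = count(1)  # CDP requires a unique integer id per request on a connection


def mouse_click_messages(
    x: float,
    y: float,
    button: str = "left",
    session_id: Optional[str] = None,
) -> List[str]:
    """Build the mousePressed + mouseReleased requests as JSON strings."""
    messages = []
    for event_type in ("mousePressed", "mouseReleased"):
        msg = {
            "id": next(_ids),
            "method": "Input.dispatchMouseEvent",
            "params": {
                "type": event_type,
                "x": x,
                "y": y,
                "button": button,
                "clickCount": 1,
            },
        }
        if session_id is not None:
            # Flattened sessions carry the session id at the top level.
            msg["sessionId"] = session_id
        messages.append(json.dumps(msg))
    return messages
```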

Benchmark (real Lightpanda WS at ws://127.0.0.1:63372, 200 iterations):
  CDP coord click:         3.71ms mean (2.97ms median, 2.61ms min, 7.01ms p95)
  Single WS conn baseline: 1.57ms mean (cost per connection open+call)
  agent-browser IPC:       0.20ms mean per HTTP call

The 3.71ms per CDP click comes from 3 sequential fresh WS connections
(pre-existing architecture in browser_cdp_tool.py). A persistent WS
connection pool would bring this to ~3.1ms (just the 2 mouse events).
Both paths are well under the 100ms human perception threshold.

Files:
- tools/browser_tool.py: schema update (x/y, ref no longer required),
  _cdp_coordinate_click(), _coordinate_click_via_agent_browser(),
  updated browser_click() with validation and dispatch
- tests/tools/test_browser_coordinate_click.py: 21 tests covering
  validation, CDP path, fallback path, ref preservation, schema, registry
- scripts/benchmark_click_paths.py: real-browser latency benchmark
kshitijk4poor force-pushed the feat/browser-coordinate-click branch from 15c75b1 to ff8c6f2 May 7, 2026 04:27
github-actions Bot commented May 7, 2026

🔎 Lint report: feat/browser-coordinate-click vs origin/main

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7467 on HEAD, 7447 on base (🆕 +20)

🆕 New issues (19):

| Rule | Count |
| --- | --- |
| invalid-argument-type | 11 |
| invalid-assignment | 4 |
| unresolved-import | 3 |
| unresolved-attribute | 1 |

First entries:
scripts/benchmark_click_paths.py:151: [invalid-assignment] invalid-assignment: Object of type `() -> Literal[False]` is not assignable to attribute `_is_camofox_mode` of type `def is_camofox_mode() -> bool`
tools/browser_tool.py:2398: [unresolved-attribute] unresolved-attribute: Attribute `connect` is not defined on `None` in union `Unknown | None`
tests/tools/test_browser_coordinate_click.py:18: [unresolved-import] unresolved-import: Cannot resolve imported module `websockets.asyncio.server`
tests/tools/test_browser_coordinate_click.py:428: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> LiteralString, (key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> str]` cannot be called with key of type `Literal["required"]` on object of type `str`
tests/tools/test_browser_coordinate_click.py:421: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> LiteralString, (key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> str]` cannot be called with key of type `Literal["y"]` on object of type `str`
scripts/benchmark_click_paths.py:197: [invalid-assignment] invalid-assignment: Object of type `bound method _SupervisorRegistry.get(task_id: str) -> CDPSupervisor | None` is not assignable to attribute `get` of type `def get(self, task_id: str) -> CDPSupervisor | None`
tools/browser_tool.py:3720: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `bound method dict[str, str | dict[str, dict[str, str]]].__getitem__(key: str, /) -> str | dict[str, dict[str, str]]` cannot be called with key of type `slice[None, Literal[60], None]` on object of type `dict[str, str | dict[str, dict[str, str]]]`
tests/tools/test_browser_coordinate_click.py:421: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(i: SupportsIndex, /) -> str, (s: slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> list[str]]` cannot be called with key of type `Literal["y"]` on object of type `list[str]`
tests/tools/test_browser_coordinate_click.py:17: [unresolved-import] unresolved-import: Cannot resolve imported module `websockets`
tests/tools/test_browser_coordinate_click.py:420: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(i: SupportsIndex, /) -> str, (s: slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> list[str]]` cannot be called with key of type `Literal["x"]` on object of type `list[str]`
tests/tools/test_browser_coordinate_click.py:177: [invalid-argument-type] invalid-argument-type: Argument to function `browser_click` is incorrect: Expected `int | float | None`, found `Literal["abc"]`
tests/tools/test_browser_coordinate_click.py:434: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> LiteralString, (key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> str]` cannot be called with key of type `Literal["properties"]` on object of type `str`
tests/tools/test_browser_coordinate_click.py:420: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(i: SupportsIndex, /) -> Unknown, (s: slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> list[Unknown]]` cannot be called with key of type `Literal["x"]` on object of type `list[Unknown]`
scripts/benchmark_click_paths.py:177: [invalid-assignment] invalid-assignment: Object of type `() -> Literal["ws://127.0.0.1:63372/"]` is not assignable to attribute `_resolve_cdp_endpoint` of type `def _resolve_cdp_endpoint() -> str`
tests/tools/test_browser_coordinate_click.py:177: [invalid-argument-type] invalid-argument-type: Argument to function `browser_click` is incorrect: Expected `int | float | None`, found `Literal["def"]`
tests/tools/test_browser_coordinate_click.py:420: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> LiteralString, (key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> str]` cannot be called with key of type `Literal["x"]` on object of type `str`
tests/tools/test_browser_coordinate_click.py:15: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/tools/test_browser_coordinate_click.py:421: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(i: SupportsIndex, /) -> Unknown, (s: slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> list[Unknown]]` cannot be called with key of type `Literal["y"]` on object of type `list[Unknown]`
scripts/benchmark_click_paths.py:181: [invalid-assignment] invalid-assignment: Object of type `(tid) -> None` is not assignable to attribute `get` of type `def get(self, task_id: str) -> CDPSupervisor | None`

✅ Fixed issues: none

Unchanged: 3913 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Replace the 3-separate-_cdp_call() approach (one WS connection per
message) with a single _cdp_coordinate_click_async() coroutine that
opens the WebSocket once and sequences all CDP messages on it:

  1. Target.getTargets
  2. Target.attachToTarget (if page target found)
  3. Input.dispatchMouseEvent (mousePressed)  } pipelined — both sent
  4. Input.dispatchMouseEvent (mouseReleased) } before awaiting either
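Steps 1 and 2 amount to picking a page target from the `Target.getTargets` result and attaching to it. A rough sketch, assuming the standard CDP response shape (`targetInfos` entries with `targetId` and `type`); the helper names are hypothetical and not taken from this commit.

```python
from typing import Optional


def pick_page_target(get_targets_result: dict) -> Optional[str]:
    """Return the targetId of the first page target, or None if there is none."""
    for info in get_targets_result.get("targetInfos", []):
        if info.get("type") == "page":
            return info.get("targetId")
    return None


def attach_request(msg_id: int, target_id: str) -> dict:
    """Build a Target.attachToTarget request using flattened sessions."""
    return {
        "id": msg_id,
        "method": "Target.attachToTarget",
        "params": {"targetId": target_id, "flatten": True},
    }
```

When `pick_page_target` returns None, step 2 is skipped and the mouse events would be sent without a session id.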

Benchmark vs real Lightpanda WS at ws://127.0.0.1:63372/ (300 iters):

  Baseline  (current main, 3 connections): 3.14ms mean, 2.97ms median
  Optimized (this commit, 1 connection):   1.30ms mean, 1.11ms median
  Speedup: 2.42x mean, 2.68x median, 1.62x p95

The savings come entirely from eliminating 2 TCP+WS handshakes.
mousePressed + mouseReleased are pipelined on the same connection,
so they travel in the same network burst.

21/21 tests pass.
…ywright patterns)

Three additional optimizations from researching Playwright, Puppeteer, and
browser-harness source:

SESSION ID CACHING (browser-harness daemon pattern)
  Target.getTargets + Target.attachToTarget are stable across clicks on the
  same page. Cache the resolved session_id keyed by CDP endpoint URL.
  Subsequent clicks skip straight to mousePressed+mouseReleased with no
  session negotiation overhead.

  Self-healing: on 'Session with given id not found' (stale after navigation),
  the cache is invalidated and session resolution runs once before retrying.
  This matches the exact retry pattern from browser-harness's daemon.handle().

SKIP mousePressed ACK (Playwright Promise.all pattern)
  Browser processes CDP messages sequentially within a session. If
  mouseReleased is acknowledged, mousePressed was already processed.
  We skip waiting for the press ack entirely, saving one RTT. This is
  the same pattern as Playwright's Mouse.click() using Promise.all and
  Puppeteer's concurrent down+up dispatch.
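The skip-ack idea can be sketched against an in-memory socket: send both events, then read frames until the release id arrives. `FakeCdpSocket` and `click_skip_press_ack` are hypothetical; a real WebSocket would replace the fake, and the press reply is still drained from the stream, just never waited on as its own round-trip.

```python
import asyncio
import json
from collections import deque


class FakeCdpSocket:
    """In-memory stand-in that acknowledges every request in order."""

    def __init__(self) -> None:
        self.sent_ids = []
        self._replies = deque()

    async def send(self, raw: str) -> None:
        msg = json.loads(raw)
        self.sent_ids.append(msg["id"])
        self._replies.append(json.dumps({"id": msg["id"], "result": {}}))

    async def recv(self) -> str:
        return self._replies.popleft()


async def click_skip_press_ack(ws, x: float, y: float, first_id: int = 1) -> int:
    """Send press + release back to back; return how many frames were read."""
    press_id, release_id = first_id, first_id + 1
    for msg_id, event_type in ((press_id, "mousePressed"), (release_id, "mouseReleased")):
        await ws.send(json.dumps({
            "id": msg_id,
            "method": "Input.dispatchMouseEvent",
            "params": {"type": event_type, "x": x, "y": y,
                       "button": "left", "clickCount": 1},
        }))
    frames_read = 0
    while True:
        reply = json.loads(await ws.recv())
        frames_read += 1
        if reply.get("id") != release_id:
            continue  # press ack or unrelated event: skipped, not awaited separately
        if "error" in reply:
            raise RuntimeError(reply["error"])
        return frames_read
```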

COMPRESSION=NONE (Puppeteer NodeWebSocketTransport pattern)
  Small CDP messages (Input.dispatchMouseEvent payloads are ~80 bytes)
  don't benefit from per-message compression. Disable it explicitly.
  Puppeteer uses perMessageDeflate: false for the same reason.

Benchmark vs real Lightpanda WS (300 iterations):
  Baseline (3 connections):       3.28ms mean
  Optimized cold cache (1 conn):  1.17ms mean  (2.79x speedup)
  Optimized warm cache (1 conn):  1.17ms mean  (2.82x speedup)

The cold/warm delta is <0.01ms because getTargets+attachToTarget on an
already-open socket costs almost nothing on localhost — the dominant cost
is WS connection setup, which we eliminated in the previous commit.
The session cache still removes real work (2 CDP round-trips) and prevents
accumulating latency on remote/higher-latency CDP endpoints.

Tests: 24 passed (21 existing + 3 new session caching tests)
…edup)

The CDPSupervisor (browser_supervisor.py) already maintains a persistent
WebSocket connection per task_id for dialog detection and frame tracking.
After browser_navigate(), a supervisor is always running with an open WS.
Instead of opening a new connection per click, dispatch directly on it.

Changes:
- browser_supervisor.py: add CDPSupervisor.dispatch_mouse_click() — sync
  bridge onto the supervisor's asyncio loop via run_coroutine_threadsafe.
  Pipelines mousePressed + mouseReleased via asyncio.gather (Playwright
  Promise.all pattern), no serial round-trips.
- browser_tool.py: _cdp_coordinate_click() now checks
  SUPERVISOR_REGISTRY.get(task_id) first; falls back to per-click WS
  connect if no supervisor is running (e.g. raw CDP without navigate).

Dispatch priority (fastest first):
  1. Supervisor path  — zero WS connection cost (supervisor WS already open)
  2. Warm-cache path  — 1 WS open + 2 mouse events (session cached)
  3. Cold-cache path  — 1 WS open + getTargets + attachToTarget + 2 events
  4. agent-browser    — 3 subprocess IPC calls (no CDP endpoint configured)

Benchmark vs real Lightpanda WS at ws://127.0.0.1:63372/ (300 iterations):
  Baseline   (3 connections):          4.86ms mean
  Warm cache (1 conn + cache):         1.30ms mean   (3.74x)
  Supervisor (persistent WS):          0.20ms mean  (23.75x)
  Ref-click IPC baseline:              0.14ms mean  (parity)

The supervisor path is 1.5x ref-click (0.07ms overhead) — essentially
the cost of one cross-thread future dispatch.

27/27 tests pass (+3 new TestSupervisorPath tests).
kshitijk4poor changed the title from "feat: add compositor-level coordinate click to browser_click" to "feat(browser): compositor-level coordinate click with persistent-WS dispatch (23x faster)" May 7, 2026
…enchmark path

Self-review findings addressed:

- browser_tool.py: log swallowed supervisor error at DEBUG instead of bare
  'pass' (was silent, triggered F841 for unused 'exc' variable). Renamed
  to '_exc' to signal intentional discard.
- browser_tool.py: rename unused 'press_id' to '_press_id' in both normal
  and retry paths (mouseReleased-only wait is intentional; press_id is never
  used after send).
- browser_tool.py: get_event_loop() → get_running_loop() in 3 locations
  inside _cdp_resolve_session and _cdp_coordinate_click_async. Both are
  async functions; get_event_loop() is not deprecated inside a running
  coroutine, but the asyncio docs recommend get_running_loop() there.
- browser_supervisor.py: ensure_future → create_task in dispatch_mouse_click.
  The asyncio docs name create_task as the preferred way to schedule a
  coroutine from inside a running coroutine; ensure_future still works
  there but is meant for arbitrary awaitables. Also consistent with the
  rest of browser_supervisor.py, which uses create_task everywhere else.
- scripts/benchmark_click_paths.py: replace hardcoded /private/tmp/hermes-
  coord-click sys.path hack with __file__-relative repo root detection so
  the script works from any checkout location.
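The `__file__`-relative detection in the last item could be done along these lines. A sketch only: `add_repo_root_to_path` is a hypothetical helper, and it assumes the script sits one directory below the repo root, as scripts/benchmark_click_paths.py does.

```python
import sys
from pathlib import Path


def add_repo_root_to_path(script_file: str, levels_up: int = 1) -> Path:
    """Put the repo root (levels_up directories above the script's folder) on sys.path."""
    # parents[0] is the script's directory, parents[1] the directory above it.
    repo_root = Path(script_file).resolve().parents[levels_up]
    root_str = str(repo_root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return repo_root


# Typical use at the top of a script under scripts/:
#     add_repo_root_to_path(__file__)
```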

27/27 tests pass.
Labels

comp/tools (Tool registry, model_tools, toolsets), P3 (Low: cosmetic, nice to have), tool/browser (Browser automation: CDP, Playwright), type/feature (New feature or request)

2 participants