feat(browser): compositor-level coordinate click with persistent-WS dispatch (23x faster) by kshitijk4poor · Pull Request #19189 · NousResearch/hermes-agent

kshitijk4poor · 2026-05-03T11:25:30Z

What this PR does

Adds optional x/y parameters to browser_click for viewport-coordinate clicking via CDP Input.dispatchMouseEvent. The browser's own compositor handles hit-testing, so clicks bypass all DOM selector machinery — they pass through iframes, shadow DOM, canvas elements, and anything else the accessibility tree can't reach.

This PR also introduces a 4-tier dispatch architecture that, on the common path (after browser_navigate), brings coordinate click latency to parity with ref-based clicking.

Problem

Ref-based clicking (browser_click(ref="@e5")) fails or is unreliable for:

Scenario	Ref-based	Coordinate (x, y)
Cross-origin iframes (OOPIFs)	✗ refs don't reach in	✓ compositor routes to correct renderer
Closed shadow DOM	✗ Playwright can't pierce	✓ bypasses DOM tree entirely
Canvas / WebGL elements	✗ no DOM nodes = no refs	✓ screenshot → coordinate
Dynamic overlays / popups	May be stale by click time	✓ hits whatever is visually on top
Custom web components	Framework event handling varies	✓ OS-level input, fires all handlers

API

browser_click(x=150, y=300)        # compositor-level click at viewport coords
browser_click(ref="@e5")            # existing ref click — completely unchanged

ref and x+y are mutually exclusive. Providing both, or partial coordinates, returns a clear error. Neither is now an error too (previously ref was required).

Architecture: 4-tier dispatch

Researched Playwright, Puppeteer, browser-harness (browser-use), and Vercel's agent-browser source to inform every decision here.

browser_click(x, y)
    │
    ├─ 1. Supervisor path ──────── CDPSupervisor already running for this task_id?
    │      (fastest)                Yes → dispatch_mouse_click() on its live WS
    │                               Zero WS connection cost; supervisor started by browser_navigate
    │
    ├─ 2. Warm-cache path ───────── No supervisor, but session cached from prior click?
    │      (fast)                   Yes → open 1 WS, send press+release (skip getTargets)
    │
    ├─ 3. Cold-cache path ───────── First click on this endpoint
    │      (moderate)               Open 1 WS, resolve session, send press+release
    │
    └─ 4. agent-browser fallback ── No CDP endpoint configured
           (always available)       3× subprocess mouse move/down/up IPC calls

Tier 1 — Supervisor path (new)

CDPSupervisor (already in browser_supervisor.py) maintains a persistent, self-healing WebSocket per task_id. It exists for dialog detection and frame tracking — this PR extends it with dispatch_mouse_click().

Dispatch model: asyncio.run_coroutine_threadsafe onto the supervisor's background loop. mousePressed + mouseReleased are submitted as concurrent futures via asyncio.gather — both sent before either is awaited (Playwright Promise.all pattern). No serial round-trips.

# Added to CDPSupervisor:
def dispatch_mouse_click(self, x, y, button="left", timeout=10.0) -> None

Tier 2+3 — Per-click WS (session-cached)

When no supervisor is running (e.g. raw CDP without browser_navigate), opens a single WebSocket for the entire click sequence and reuses it for all CDP messages. Three optimizations derived from Playwright/Puppeteer/browser-harness research:

Single connection — one TCP+WS handshake for all messages (getTargets + attachToTarget + mousePressed + mouseReleased). Eliminates 2 of the 3 connection setups vs the naïve approach.
Session ID caching — Target.getTargets + Target.attachToTarget results cached in _CDP_SESSION_CACHE (keyed by endpoint URL). Second+ clicks skip session negotiation entirely. Self-heals on stale-session errors after navigation.
Skip mousePressed ack — fires both mouse events before awaiting either response. Since the browser processes CDP messages sequentially within a session, if mouseReleased is acknowledged then mousePressed has already been applied. Saves one RTT. Same pattern as Playwright's Mouse.click() and Puppeteer's concurrent down+up dispatch.

Benchmark

Real Lightpanda WebSocket at ws://127.0.0.1:63372/, 300 iterations:

Approach	Mean	Median	Min	p95
Baseline (original PR: 3 separate connections)	4.86ms	3.69ms	2.69ms	10.49ms
Warm cache (1 conn, session cached)	1.30ms	1.11ms	0.96ms	2.30ms
Supervisor (persistent WS)	0.20ms	0.19ms	0.16ms	0.33ms
Ref-click IPC baseline	0.14ms	0.13ms	0.10ms	0.19ms

23.75× speedup over the original 3-connection approach. The supervisor path (the common case after browser_navigate) is 1.5× ref-click — 0.07ms overhead, which is one cross-thread future dispatch.

Benchmark script: scripts/benchmark_click_paths.py (runnable against any live CDP endpoint).

Optimizations sourced from research

Studied: Playwright (crConnection.ts, input.ts), Puppeteer (Connection.ts, Input.ts, TargetManager.ts), browser-harness (daemon.py, helpers.py, _ipc.py), Vercel agent-browser (cdp/client.rs, interaction.rs, browser.rs).

#	Optimization	Sourced from
Single WS per click	All harnesses	✓
Skip `mousePressed` ack	Playwright `Promise.all`, Puppeteer concurrent down+up	✓
Session ID caching with stale-session self-heal	browser-harness daemon `session_id` + retry once	✓
`compression=None` on WS	Puppeteer `perMessageDeflate: false`	✓
Reuse existing persistent WS (supervisor)	browser-harness daemon architecture	✓ (via CDPSupervisor)
Skip `mouseMoved` before click	Vercel agent-browser sends 3 events; we send 2	✓ already — we never sent it
TCP_NODELAY	Researched; tested +0.01ms on pipelined path — not worth adding	✗

Files changed

File	What changed
`tools/browser_tool.py`	`_CDP_SESSION_CACHE`, `_cdp_resolve_session()`, `_cdp_coordinate_click_async()` (all 4 optimizations), `_cdp_coordinate_click()` (3-tier dispatch), updated `browser_click()` signature + validation, schema update (x/y params, ref no longer required)
`tools/browser_supervisor.py`	`CDPSupervisor.dispatch_mouse_click()` — sync bridge with `asyncio.gather` pipelining
`tests/tools/test_browser_coordinate_click.py`	27 tests: input validation (6), CDP path via mock server (4), agent-browser fallback (4), ref preservation (2), schema (3), registry (2), session caching (3), supervisor path (3)
`scripts/benchmark_click_paths.py`	4-tier real-browser benchmark

Commit history

ff8c6f2  feat: add compositor-level coordinate click to browser_click
0bfab1d  perf: batch CDP click into single WS connection (2.4x speedup)
451c55b  perf: session ID caching + skip mousePressed ack (browser-harness/Playwright patterns)
aef97da  perf: reuse supervisor's persistent WS for coordinate clicks (23x speedup)

Add optional x/y parameters to browser_click for viewport-coordinate clicking via CDP Input.dispatchMouseEvent. When coordinates are provided, clicks are dispatched at the browser compositor level — Chrome does its own hit-testing, bypassing DOM selectors entirely. Use cases where ref-based click fails but coordinate click works: - Cross-origin iframes (OOPIFs) - Closed shadow DOM - Canvas/WebGL elements - Dynamic overlays where the snapshot may be stale Implementation: - CDP path (preferred): Input.dispatchMouseEvent via WebSocket (Target.getTargets + mousePressed + mouseReleased) - agent-browser fallback: mouse move/down/up when no CDP endpoint available - ref is no longer required — either ref OR x+y must be provided Benchmark (real Lightpanda WS at ws://127.0.0.1:63372, 200 iterations): CDP coord click: 3.71ms mean (2.97ms median, 2.61ms min, 7.01ms p95) Single WS conn baseline: 1.57ms mean (cost per connection open+call) agent-browser IPC: 0.20ms mean per HTTP call The 3.71ms per CDP click comes from 3 sequential fresh WS connections (pre-existing architecture in browser_cdp_tool.py). A persistent WS connection pool would bring this to ~3.1ms (just the 2 mouse events). Both paths are well under the 100ms human perception threshold. Files: - tools/browser_tool.py: schema update (x/y, ref no longer required), _cdp_coordinate_click(), _coordinate_click_via_agent_browser(), updated browser_click() with validation and dispatch - tests/tools/test_browser_coordinate_click.py: 21 tests covering validation, CDP path, fallback path, ref preservation, schema, registry - scripts/benchmark_click_paths.py: real-browser latency benchmark

github-actions · 2026-05-07T04:29:04Z

🔎 Lint report: `feat/browser-coordinate-click` vs `origin/main`

ruff

Total: 0 on HEAD, 0 on base (➖ 0)

🆕 New issues: none

✅ Fixed issues: none

Unchanged: 0 pre-existing issues carried over.

ty (type checker)

Total: 7467 on HEAD, 7447 on base (🆕 +20)

🆕 New issues (19):

Rule	Count
`invalid-argument-type`	11
`invalid-assignment`	4
`unresolved-import`	3
`unresolved-attribute`	1

First entries

scripts/benchmark_click_paths.py:151: [invalid-assignment] invalid-assignment: Object of type `() -> Literal[False]` is not assignable to attribute `_is_camofox_mode` of type `def is_camofox_mode() -> bool`
tools/browser_tool.py:2398: [unresolved-attribute] unresolved-attribute: Attribute `connect` is not defined on `None` in union `Unknown | None`
tests/tools/test_browser_coordinate_click.py:18: [unresolved-import] unresolved-import: Cannot resolve imported module `websockets.asyncio.server`
tests/tools/test_browser_coordinate_click.py:428: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> LiteralString, (key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> str]` cannot be called with key of type `Literal["required"]` on object of type `str`
tests/tools/test_browser_coordinate_click.py:421: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> LiteralString, (key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> str]` cannot be called with key of type `Literal["y"]` on object of type `str`
scripts/benchmark_click_paths.py:197: [invalid-assignment] invalid-assignment: Object of type `bound method _SupervisorRegistry.get(task_id: str) -> CDPSupervisor | None` is not assignable to attribute `get` of type `def get(self, task_id: str) -> CDPSupervisor | None`
tools/browser_tool.py:3720: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `bound method dict[str, str | dict[str, dict[str, str]]].__getitem__(key: str, /) -> str | dict[str, dict[str, str]]` cannot be called with key of type `slice[None, Literal[60], None]` on object of type `dict[str, str | dict[str, dict[str, str]]]`
tests/tools/test_browser_coordinate_click.py:421: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(i: SupportsIndex, /) -> str, (s: slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> list[str]]` cannot be called with key of type `Literal["y"]` on object of type `list[str]`
tests/tools/test_browser_coordinate_click.py:17: [unresolved-import] unresolved-import: Cannot resolve imported module `websockets`
tests/tools/test_browser_coordinate_click.py:420: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(i: SupportsIndex, /) -> str, (s: slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> list[str]]` cannot be called with key of type `Literal["x"]` on object of type `list[str]`
tests/tools/test_browser_coordinate_click.py:177: [invalid-argument-type] invalid-argument-type: Argument to function `browser_click` is incorrect: Expected `int | float | None`, found `Literal["abc"]`
tests/tools/test_browser_coordinate_click.py:434: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> LiteralString, (key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> str]` cannot be called with key of type `Literal["properties"]` on object of type `str`
tests/tools/test_browser_coordinate_click.py:420: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(i: SupportsIndex, /) -> Unknown, (s: slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> list[Unknown]]` cannot be called with key of type `Literal["x"]` on object of type `list[Unknown]`
scripts/benchmark_click_paths.py:177: [invalid-assignment] invalid-assignment: Object of type `() -> Literal["ws://127.0.0.1:63372/"]` is not assignable to attribute `_resolve_cdp_endpoint` of type `def _resolve_cdp_endpoint() -> str`
tests/tools/test_browser_coordinate_click.py:177: [invalid-argument-type] invalid-argument-type: Argument to function `browser_click` is incorrect: Expected `int | float | None`, found `Literal["def"]`
tests/tools/test_browser_coordinate_click.py:420: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> LiteralString, (key: SupportsIndex | slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> str]` cannot be called with key of type `Literal["x"]` on object of type `str`
tests/tools/test_browser_coordinate_click.py:15: [unresolved-import] unresolved-import: Cannot resolve imported module `pytest`
tests/tools/test_browser_coordinate_click.py:421: [invalid-argument-type] invalid-argument-type: Method `__getitem__` of type `Overload[(i: SupportsIndex, /) -> Unknown, (s: slice[SupportsIndex | None, SupportsIndex | None, SupportsIndex | None], /) -> list[Unknown]]` cannot be called with key of type `Literal["y"]` on object of type `list[Unknown]`
scripts/benchmark_click_paths.py:181: [invalid-assignment] invalid-assignment: Object of type `(tid) -> None` is not assignable to attribute `get` of type `def get(self, task_id: str) -> CDPSupervisor | None`

✅ Fixed issues: none

Unchanged: 3913 pre-existing issues carried over.

Diagnostics are surfaced as warnings — this check never fails the build.

Replace the 3-separate-_cdp_call() approach (one WS connection per message) with a single _cdp_coordinate_click_async() coroutine that opens the WebSocket once and sequences all CDP messages on it: 1. Target.getTargets 2. Target.attachToTarget (if page target found) 3. Input.dispatchMouseEvent (mousePressed) } pipelined — both sent 4. Input.dispatchMouseEvent (mouseReleased) } before awaiting either Benchmark vs real Lightpanda WS at ws://127.0.0.1:63372/ (300 iters): Baseline (current main, 3 connections): 3.14ms mean, 2.97ms median Optimized (this commit, 1 connection): 1.30ms mean, 1.11ms median Speedup: 2.42x mean, 2.68x median, 1.62x p95 The savings come entirely from eliminating 2 TCP+WS handshakes. mousePressed + mouseReleased are pipelined on the same connection, so they travel in the same network burst. 21/21 tests pass.

…ywright patterns) Two additional optimizations from researching Playwright, Puppeteer, and browser-harness source: SESSION ID CACHING (browser-harness daemon pattern) Target.getTargets + Target.attachToTarget are stable across clicks on the same page. Cache the resolved session_id keyed by CDP endpoint URL. Subsequent clicks skip straight to mousePressed+mouseReleased with no session negotiation overhead. Self-healing: on 'Session with given id not found' (stale after navigation), the cache is invalidated and session resolution runs once before retrying. This matches the exact retry pattern from browser-harness's daemon.handle(). SKIP mousePressed ACK (Playwright Promise.all pattern) Browser processes CDP messages sequentially within a session. If mouseReleased is acknowledged, mousePressed was already processed. We skip waiting for the press ack entirely, saving one RTT. This is the same pattern as Playwright's Mouse.click() using Promise.all and Puppeteer's concurrent down+up dispatch. COMPRESSION=NONE (Puppeteer NodeWebSocketTransport pattern) Small CDP messages (Input.dispatchMouseEvent payloads are ~80 bytes) don't benefit from per-message compression. Disable it explicitly. Puppeteer uses perMessageDeflate: false for the same reason. Benchmark vs real Lightpanda WS (300 iterations): Baseline (3 connections): 3.28ms mean Optimized cold cache (1 conn): 1.17ms mean (2.79x speedup) Optimized warm cache (1 conn): 1.17ms mean (2.82x speedup) The cold/warm delta is <0.01ms because getTargets+attachToTarget on an already-open socket costs almost nothing on localhost — the dominant cost is WS connection setup, which we eliminated in the previous commit. The session cache still removes real work (2 CDP round-trips) and prevents accumulating latency on remote/higher-latency CDP endpoints. Tests: 24 passed (21 existing + 3 new session caching tests)

…edup) The CDPSupervisor (browser_supervisor.py) already maintains a persistent WebSocket connection per task_id for dialog detection and frame tracking. After browser_navigate(), a supervisor is always running with an open WS. Instead of opening a new connection per click, dispatch directly on it. Changes: - browser_supervisor.py: add CDPSupervisor.dispatch_mouse_click() — sync bridge onto the supervisor's asyncio loop via run_coroutine_threadsafe. Pipelines mousePressed + mouseReleased via asyncio.gather (Playwright Promise.all pattern), no serial round-trips. - browser_tool.py: _cdp_coordinate_click() now checks SUPERVISOR_REGISTRY.get(task_id) first; falls back to per-click WS connect if no supervisor is running (e.g. raw CDP without navigate). Dispatch priority (fastest first): 1. Supervisor path — zero WS connection cost (supervisor WS already open) 2. Warm-cache path — 1 WS open + 2 mouse events (session cached) 3. Cold-cache path — 1 WS open + getTargets + attachToTarget + 2 events 4. agent-browser — 3 subprocess IPC calls (no CDP endpoint configured) Benchmark vs real Lightpanda WS at ws://127.0.0.1:63372/ (300 iterations): Baseline (3 connections): 4.86ms mean Warm cache (1 conn + cache): 1.30ms mean (3.74x) Supervisor (persistent WS): 0.20ms mean (23.75x) Ref-click IPC baseline: 0.14ms mean (parity) The supervisor path is 1.5x ref-click (0.07ms overhead) — essentially the cost of one cross-thread future dispatch. 27/27 tests pass (+3 new TestSupervisorPath tests).

…enchmark path Self-review findings addressed: - browser_tool.py: log swallowed supervisor error at DEBUG instead of bare 'pass' (was silent, triggered F841 for unused 'exc' variable). Renamed to '_exc' to signal intentional discard. - browser_tool.py: rename unused 'press_id' to '_press_id' in both normal and retry paths (mouseReleased-only wait is intentional; press_id is never used after send). - browser_tool.py: get_event_loop() → get_running_loop() in 3 locations inside _cdp_resolve_session and _cdp_coordinate_click_async. Both are async functions and get_event_loop() is deprecated in async context in Python 3.10+. - browser_supervisor.py: ensure_future → create_task in dispatch_mouse_click. create_task is the correct modern API when already inside a running coroutine; ensure_future is deprecated for coroutines in Python 3.10+. Also consistent with the rest of browser_supervisor.py which uses create_task exclusively everywhere else. - scripts/benchmark_click_paths.py: replace hardcoded /private/tmp/hermes- coord-click sys.path hack with __file__-relative repo root detection so the script works from any checkout location. 27/27 tests pass.

kshitijk4poor added type/feature New feature or request comp/tools Tool registry, model_tools, toolsets labels May 3, 2026

kshitijk4poor force-pushed the feat/browser-coordinate-click branch from 61263bc to 15c75b1 Compare May 3, 2026 11:30

alt-glitch added P3 Low — cosmetic, nice to have tool/browser Browser automation (CDP, Playwright) labels May 3, 2026

kshitijk4poor mentioned this pull request May 3, 2026

refactor: capability-based provider architecture for web tools (search/extract/browse split) #19198

Closed

kshitijk4poor force-pushed the feat/browser-coordinate-click branch from 15c75b1 to ff8c6f2 Compare May 7, 2026 04:27

kshitijk4poor added 3 commits May 7, 2026 10:04

kshitijk4poor changed the title ~~feat: add compositor-level coordinate click to browser_click~~ feat(browser): compositor-level coordinate click with persistent-WS dispatch (23x faster) May 7, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(browser): compositor-level coordinate click with persistent-WS dispatch (23x faster)#19189

feat(browser): compositor-level coordinate click with persistent-WS dispatch (23x faster)#19189
kshitijk4poor wants to merge 5 commits intomainfrom
feat/browser-coordinate-click

kshitijk4poor commented May 3, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 7, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

kshitijk4poor commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What this PR does

Problem

API

Architecture: 4-tier dispatch

Tier 1 — Supervisor path (new)

Tier 2+3 — Per-click WS (session-cached)

Benchmark

Optimizations sourced from research

Files changed

Commit history

Uh oh!

github-actions Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🔎 Lint report: feat/browser-coordinate-click vs origin/main

ruff

ty (type checker)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kshitijk4poor commented May 3, 2026 •

edited

Loading

github-actions Bot commented May 7, 2026 •

edited

Loading

🔎 Lint report: `feat/browser-coordinate-click` vs `origin/main`