Commit 3d55b2b

feat(playwright): add JsRenderingDetector parse filter & bolt (#1898)
* feat(playwright): add JsRenderingDetector parse filter

  Heuristically flags URLs whose content looks JavaScript-rendered by inspecting SPA framework fingerprints, noscript blocks, empty hydration roots, and a thin-content fallback. Sets a routing metadata key so DelegatorProtocol can dispatch subsequent fetches to Playwright while the bulk of the crawl stays on a cheap HTTP client.

* feat(playwright): add JsRenderingRedirectionBolt and free-form match list

  Pairs JsRenderingDetector with a bolt that reads the routing flag and emits to StatusStreamName for an immediate refetch through Playwright, suppressing the cheap fetch's stub from the index. Adds a requiredMessages parameter to the detector for free-form JS-required / loader / cookie prompts that don't fit the noscript pattern.

* style(playwright): apply google-java-format to detector and bolt
1 parent 7a17307 commit 3d55b2b

6 files changed

Lines changed: 920 additions & 0 deletions

docs/src/main/asciidoc/configuration.adoc

Lines changed: 58 additions & 0 deletions
@@ -494,6 +494,64 @@ See the link:https://github.com/apache/stormcrawler/tree/main/external/playwrigh
| playwright.load.event | - | Page load event to wait for (e.g., "domcontentloaded", "networkidle").
|===

===== JS rendering detection

Browser fetching is much more expensive than a plain HTTP fetch, so most operators only want
Playwright on URLs that actually need it. The `JsRenderingDetector` parse filter inspects the
parsed page from a cheap fetch and sets a metadata flag (default `fetch.with=playwright`) on URLs
that look JavaScript-rendered. Pair it with link:https://github.com/apache/stormcrawler/blob/main/core/src/main/java/org/apache/stormcrawler/protocol/DelegatorProtocol.java[DelegatorProtocol]
to route subsequent fetches of those URLs to the Playwright protocol while leaving everything else
on a fast HTTP client.

Detection signals (cheapest first, short-circuiting):

* SPA framework fingerprints in raw HTML — `data-reactroot`, `ng-version=`, `__NEXT_DATA__`,
  `window.__NUXT__`, `data-svelte-h=`, `data-vue-app`, `data-astro-cid`, `<router-outlet`.
* `<noscript>` blocks containing language like _"enable JavaScript"_.
* Empty SPA hydration roots: `<div id="root"></div>` / `#app` / `#__next` / `#__nuxt`.
* Outcome-based fallback: at least one `<script>` is present and both `text.length` and the
  outlink count are below configurable thresholds.

Detection is skipped when `playwright.protocol.end` is already on the URL (i.e. it was just
fetched by Playwright) or when the routing key is already set, so the filter is idempotent.

Register the filter in `parsefilters.json`:

[source,json]
----
{
  "class": "org.apache.stormcrawler.protocol.playwright.parsefilter.JsRenderingDetector",
  "name": "js-rendering-detector",
  "params": { "minTextLength": 200, "minOutlinks": 2 }
}
----

And route on the metadata key it sets:

[source,yaml]
----
http.protocol.implementation: "org.apache.stormcrawler.protocol.DelegatorProtocol"
https.protocol.implementation: "org.apache.stormcrawler.protocol.DelegatorProtocol"
protocol.delegator.config:
  - className: "org.apache.stormcrawler.protocol.playwright.HttpProtocol"
    filters:
      "fetch.with": "playwright"
  - className: "org.apache.stormcrawler.protocol.okhttp.HttpProtocol"
----

The dotted metadata key is quoted in the YAML above for readability; SnakeYAML accepts the
unquoted form too. Note that `DelegatorProtocol` requires the *last* entry in
`protocol.delegator.config` to have no `filters:` — it acts as the fallback, so keep the cheap
protocol at the bottom of the list.

The parse filter alone does **not** trigger an immediate refetch — it only sets the metadata flag
on the current fetch, and `DefaultScheduler` reschedules the URL according to the FETCHED interval
(`fetchInterval.default`, 24h by default). For faster turnaround, either add a per-metadata-key
fetch interval (`fetchInterval.fetch.with=playwright: 5`) or drop `JsRenderingRedirectionBolt`
between the parser and indexer. The bolt reads the routing flag and, on hit, emits only to the
status stream with `Status.FETCHED` so the stub document never reaches the index. The full
parameter list and tuning notes are in the link:https://github.com/apache/stormcrawler/tree/main/external/playwright[playwright module README].

==== Language ID

Language identification for crawled documents using the lang-detect library.

external/playwright/README.md

Lines changed: 109 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -47,3 +47,112 @@ Per-URL metadata triggers:
|---|---|
| `playwright.trace` | If present on the input metadata, a Playwright trace zip is recorded for the navigation and its path is returned in the response metadata under the same key. |

## JS rendering detection

Browser-based fetching is expensive — typically 10–50× slower than a plain HTTP fetch and limited by how many browsers a host can run concurrently. Most operators only want Playwright on the URLs that actually need it. The `JsRenderingDetector` parse filter solves the routing question without adding new infrastructure: it inspects the parsed page from a cheap fetch and, when the content looks JS-rendered, sets a metadata flag that `DelegatorProtocol` (already part of `core`) routes on.

### How detection works

The filter applies four heuristics, cheapest-first, and short-circuits on the first hit:

1. **SPA framework fingerprints** in raw HTML — `data-reactroot`, `ng-version=`, `__NEXT_DATA__`, `window.__NUXT__`, `data-svelte-h=`, `data-vue-app`, `data-astro-cid`, `<router-outlet`. Defaults are overridable via the `fingerprints` parameter.
2. **`<noscript>` blocks** that explicitly request JavaScript — match patterns like _"enable JavaScript"_, _"requires JavaScript"_, _"JavaScript is disabled"_.
3. **Empty SPA hydration roots**: `<div id="root"></div>` / `#app` / `#__next` / `#__nuxt` with no children. IDs overridable via `emptyRootIds`.
4. **Outcome-based fallback** — when at least one `<script>` is present and both `text.length < minTextLength` (default 200) and `outlinks.size() < minOutlinks` (default 2), the URL is flagged as a thin SPA. The `<script>` gate keeps the filter from flagging static error stubs.
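
The cascade above can be modelled in dependency-free Java. This is a minimal sketch for illustration, not the `JsRenderingDetector` source: the class name `JsRenderingHeuristics`, the `detect` signature, and the use of regular expressions in place of real DOM inspection are all assumptions. The returned labels follow the `fetch.with.reason` formats documented in this README.

```java
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical model of the four-step cascade; not the actual parse filter. */
final class JsRenderingHeuristics {

    // Defaults as listed in this README.
    static final List<String> FINGERPRINTS = List.of(
            "data-reactroot", "ng-version=", "__NEXT_DATA__", "window.__NUXT__",
            "data-svelte-h=", "data-vue-app", "data-astro-cid", "<router-outlet");
    static final List<String> EMPTY_ROOT_IDS = List.of("root", "app", "__next", "__nuxt");

    // Assumption: a regex stands in for proper HTML parsing.
    private static final Pattern NOSCRIPT = Pattern.compile(
            "<noscript[^>]*>(.*?)</noscript>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final List<String> NOSCRIPT_PHRASES = List.of(
            "enable javascript", "requires javascript", "javascript is disabled");

    /** Returns the reason label of the first signal that fires, or empty on a miss. */
    static Optional<String> detect(
            String html, String text, int outlinks, int minTextLength, int minOutlinks) {
        // 1. Framework fingerprints: plain substring search over the raw HTML.
        for (String fp : FINGERPRINTS) {
            if (html.contains(fp)) {
                return Optional.of("fingerprint:" + fp);
            }
        }
        // 2. <noscript> blocks that ask the visitor to turn JavaScript on.
        Matcher m = NOSCRIPT.matcher(html);
        while (m.find()) {
            String body = m.group(1).toLowerCase(Locale.ROOT);
            for (String phrase : NOSCRIPT_PHRASES) {
                if (body.contains(phrase)) {
                    return Optional.of("noscript-js-required");
                }
            }
        }
        // 3. Hydration roots with no children, e.g. <div id="root"></div>.
        for (String id : EMPTY_ROOT_IDS) {
            Pattern emptyRoot = Pattern.compile(
                    "<div[^>]*\\bid=[\"']" + Pattern.quote(id) + "[\"'][^>]*>\\s*</div>",
                    Pattern.CASE_INSENSITIVE);
            if (emptyRoot.matcher(html).find()) {
                return Optional.of("empty-root:" + id);
            }
        }
        // 4. Thin-content fallback, gated on the presence of at least one <script>.
        boolean hasScript = html.toLowerCase(Locale.ROOT).contains("<script");
        if (hasScript && text.length() < minTextLength && outlinks < minOutlinks) {
            return Optional.of(
                    "thin-content:text=" + text.length() + ",outlinks=" + outlinks);
        }
        return Optional.empty();
    }
}
```

Ordering the checks from substring search to regex matching to the outcome-based test keeps the common case cheap, and returning the label of the first hit mirrors the short-circuit behaviour described above.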

### What the filter sets

| Metadata key | Value | Notes |
|---|---|---|
| `fetch.with` | `playwright` | Routing key, overridable via `metadataKey` / `metadataValue`. |
| `fetch.with.reason` | e.g. `fingerprint:data-reactroot`, `noscript-js-required`, `empty-root:root`, `thin-content:text=12,outlinks=0` | Diagnostic — set unless `recordReason: false`. |

### Loop guards

- Detection is skipped when `playwright.protocol.end` is already present on the URL — i.e. the URL was just fetched by Playwright; reapplying the heuristic would just reflag it. Override the watch key via `skipIfMetadataPresent`.
- Detection is also skipped when the routing key is already set, so the filter is idempotent and safe to leave permanently in `parsefilters.json`.
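
The two guards can be expressed as a single predicate. The sketch below is an assumption-laden model rather than the filter's code: `DetectorLoopGuard` and `shouldSkip` are hypothetical names, and a plain `Map<String, String>` stands in for StormCrawler's multi-valued `Metadata`.

```java
import java.util.Map;

/** Hypothetical model of the detector's loop guards. */
final class DetectorLoopGuard {

    /**
     * Returns true when detection should be skipped.
     *
     * @param metadata stand-in for the URL's metadata (assumption: single-valued)
     * @param skipIfMetadataPresent watch key; null or empty disables the first guard
     * @param routingKey key the detector would set, e.g. "fetch.with"
     */
    static boolean shouldSkip(
            Map<String, String> metadata, String skipIfMetadataPresent, String routingKey) {
        // Guard 1: the page already came back from Playwright.
        if (skipIfMetadataPresent != null
                && !skipIfMetadataPresent.isEmpty()
                && metadata.containsKey(skipIfMetadataPresent)) {
            return true;
        }
        // Guard 2: the routing key is already set, so flagging again changes nothing.
        return metadata.containsKey(routingKey);
    }
}
```

Because the second guard makes a repeat run a no-op, the filter can stay in `parsefilters.json` permanently without reflagging URLs on every parse.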

### Parameters

| Name | Type | Default | Notes |
|---|---|---|---|
| `metadataKey` | string | `fetch.with` | Routing key set on a hit. |
| `metadataValue` | string | `playwright` | Value to set. |
| `minTextLength` | int | `200` | Outcome-based threshold for visible text. |
| `minOutlinks` | int | `2` | Outcome-based threshold for extracted outlinks. |
| `fingerprints` | string array | _see above_ | Substrings searched in raw HTML; replaces defaults when set. |
| `emptyRootIds` | string array | `["root","app","__next","__nuxt"]` | Element IDs treated as empty SPA hydration roots. |
| `requiredMessages` | string array | _empty_ | Additional substrings that, when found anywhere in the HTML, flag the URL. Use for site-specific JS-required prompts and loader text that don't fit the noscript pattern (e.g. `"Loading..."`, `"[object Object]"`, `"Please enable cookies"`). |
| `skipIfMetadataPresent` | string | `playwright.protocol.end` | Short-circuit when this metadata key is set. Empty string disables. |
| `recordReason` | bool | `true` | Also set `metadataKey + ".reason"` describing which signal fired. |
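
To make the defaults and override rules in the table concrete, here is a small sketch that resolves a `params` map into typed settings. It is illustrative only: `DetectorParams` is a hypothetical class, and the real filter presumably reads its parameters from the JSON node it is configured with rather than from a `Map`.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical holder for the detector parameters and their documented defaults. */
final class DetectorParams {

    static final List<String> DEFAULT_FINGERPRINTS = List.of(
            "data-reactroot", "ng-version=", "__NEXT_DATA__", "window.__NUXT__",
            "data-svelte-h=", "data-vue-app", "data-astro-cid", "<router-outlet");

    final String metadataKey;
    final String metadataValue;
    final int minTextLength;
    final int minOutlinks;
    final List<String> fingerprints;
    final List<String> emptyRootIds;
    final List<String> requiredMessages;
    final String skipIfMetadataPresent;
    final boolean recordReason;

    DetectorParams(Map<String, ?> p) {
        metadataKey = str(p, "metadataKey", "fetch.with");
        metadataValue = str(p, "metadataValue", "playwright");
        minTextLength = num(p, "minTextLength", 200);
        minOutlinks = num(p, "minOutlinks", 2);
        // A configured list replaces the defaults; it is not merged with them.
        fingerprints = list(p, "fingerprints", DEFAULT_FINGERPRINTS);
        emptyRootIds = list(p, "emptyRootIds", List.of("root", "app", "__next", "__nuxt"));
        requiredMessages = list(p, "requiredMessages", List.of());
        skipIfMetadataPresent = str(p, "skipIfMetadataPresent", "playwright.protocol.end");
        Object rr = p.get("recordReason");
        recordReason = rr instanceof Boolean ? (Boolean) rr : true;
    }

    /** Key for the diagnostic entry, or null when recordReason is false. */
    String reasonKey() {
        return recordReason ? metadataKey + ".reason" : null;
    }

    private static String str(Map<String, ?> p, String key, String def) {
        Object v = p.get(key);
        return v == null ? def : v.toString();
    }

    private static int num(Map<String, ?> p, String key, int def) {
        Object v = p.get(key);
        return v instanceof Number ? ((Number) v).intValue() : def;
    }

    private static List<String> list(Map<String, ?> p, String key, List<String> def) {
        Object v = p.get(key);
        if (!(v instanceof List)) {
            return def;
        }
        return ((List<?>) v).stream().map(String::valueOf).collect(Collectors.toList());
    }
}
```

Note that the diagnostic key is derived from `metadataKey`, so renaming the routing key also renames its `.reason` companion.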

### Wiring

Add the filter to your `parsefilters.json`:

```json
{
  "class": "org.apache.stormcrawler.protocol.playwright.parsefilter.JsRenderingDetector",
  "name": "js-rendering-detector",
  "params": { "minTextLength": 200, "minOutlinks": 2 }
}
```

Route on the metadata key it sets via `DelegatorProtocol`:

```yaml
http.protocol.implementation: "org.apache.stormcrawler.protocol.DelegatorProtocol"
https.protocol.implementation: "org.apache.stormcrawler.protocol.DelegatorProtocol"
protocol.delegator.config:
  - className: "org.apache.stormcrawler.protocol.playwright.HttpProtocol"
    filters:
      "fetch.with": "playwright"
  - className: "org.apache.stormcrawler.protocol.okhttp.HttpProtocol"
```

A few wiring notes:

- The dotted metadata key (`fetch.with`) is quoted in the YAML above to make it unambiguous to a human reader; SnakeYAML treats unquoted `fetch.with: "playwright"` as the same single-key scalar, so either parses correctly.
- `DelegatorProtocol` requires the **last** entry in `protocol.delegator.config` to have no `filters:` — it acts as the fallback. Keep OkHttp (or whichever cheap protocol you pick) at the bottom of the list.
- The filter alone does **not** trigger an immediate refetch. It only sets the metadata; the URL is rescheduled by `DefaultScheduler` according to the FETCHED interval (`fetchInterval.default`, 24h by default), and `DelegatorProtocol` picks Playwright on the next scheduled fetch. To get faster turnaround, either drop in the `JsRenderingRedirectionBolt` described below, or add a per-metadata-key fetch interval to your YAML: `fetchInterval.fetch.with=playwright: 5` (refetch flagged URLs in 5 minutes instead of 24 hours).
- Sibling URLs on the same host don't inherit the flag — that requires a host-keyed metadata transfer scheme and is intentionally out of scope.
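
The ordering rule for `protocol.delegator.config` is easier to see in code. The following is a simplified model, not the `DelegatorProtocol` implementation: `DelegatorSketch`, `Entry`, and `select` are hypothetical names, the metadata is a plain single-valued map, and the sketch assumes every filter on an entry must match for that entry to be chosen.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical first-match-wins model of delegator routing. */
final class DelegatorSketch {

    /** One entry of protocol.delegator.config; empty filters mean "matches anything". */
    static final class Entry {
        final String className;
        final Map<String, String> filters;

        Entry(String className, Map<String, String> filters) {
            this.className = className;
            this.filters = filters;
        }
    }

    /** Returns the class name of the first entry whose filters all match the metadata. */
    static String select(List<Entry> config, Map<String, String> metadata) {
        // The last entry is the fallback, so it must not carry filters.
        if (config.isEmpty() || !config.get(config.size() - 1).filters.isEmpty()) {
            throw new IllegalArgumentException("last entry must have no filters");
        }
        for (Entry e : config) {
            boolean allMatch = e.filters.entrySet().stream()
                    .allMatch(f -> f.getValue().equals(metadata.get(f.getKey())));
            if (allMatch) {
                return e.className;
            }
        }
        // Unreachable: the filterless fallback always matches.
        throw new IllegalStateException("no protocol selected");
    }
}
```

With the configuration shown above, a URL carrying `fetch.with=playwright` resolves to the Playwright protocol and every other URL falls through to OkHttp, which is why the cheap protocol belongs at the bottom of the list.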

### Forcing an immediate refetch — `JsRenderingRedirectionBolt`

The detector flags URLs but doesn't, on its own, prevent the cheap fetch's stub document from flowing downstream from the parser into the indexer and outlink emission. For most crawls that's fine — the next scheduled fetch replaces the stub with the rendered version. If you want the stub to be discarded and the URL refetched immediately through Playwright, drop `JsRenderingRedirectionBolt` between the parser and the indexer. The bolt:

- reads the routing flag set by the detector (or any other upstream component),
- on hit, emits **only** to `StatusStreamName` with `Status.FETCHED` so the URL is rescheduled and the stub never reaches the index,
- on miss, passes the tuple through unchanged,
- short-circuits when `playwright.protocol.end` is already on the URL — the loop guard.

The bolt has no detection logic of its own; it just acts on the metadata flag. That keeps the heuristics in one place (the parse filter) and lets you swap or extend the bolt independently.

Topology fragment:

```text
... -> JSoupParserBolt -> JsRenderingRedirectionBolt -> IndexerBolt -> ...
                                                    \-> StatusStream
```

Pair the bolt with a per-metadata-key fetch interval in your YAML; without it, `Status.FETCHED` reschedules the URL at `fetchInterval.default` (24h):

```yaml
# refetch flagged URLs in 5 minutes rather than 24 hours
fetchInterval.fetch.with=playwright: 5
```
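
As a rough mental model of how a per-metadata-key interval takes precedence over the default, consider the sketch below. It is not `DefaultScheduler` code: `FetchIntervalSketch` is a hypothetical class, the key parsing only handles the `fetchInterval.<key>=<value>` form used in this README (the real scheduler may accept other forms), and intervals are assumed to be in minutes.

```java
import java.util.Map;

/** Hypothetical model of per-metadata-key fetch interval selection. */
final class FetchIntervalSketch {

    static final String PREFIX = "fetchInterval.";

    /**
     * Returns the interval in minutes for a successfully fetched URL.
     *
     * @param conf configuration entries such as "fetchInterval.fetch.with=playwright" -> 5
     * @param metadata stand-in for the URL's metadata (assumption: single-valued)
     * @param defaultMinutes fallback, e.g. 1440 for the documented 24h default
     */
    static int intervalMinutes(
            Map<String, ?> conf, Map<String, String> metadata, int defaultMinutes) {
        for (Map.Entry<String, ?> e : conf.entrySet()) {
            String confKey = e.getKey();
            int eq = confKey.indexOf('=');
            // Only keys shaped like fetchInterval.<mdKey>=<mdValue> with a numeric value apply.
            if (!confKey.startsWith(PREFIX) || eq < 0 || !(e.getValue() instanceof Number)) {
                continue;
            }
            String mdKey = confKey.substring(PREFIX.length(), eq);
            String mdValue = confKey.substring(eq + 1);
            if (mdValue.equals(metadata.get(mdKey))) {
                return ((Number) e.getValue()).intValue();
            }
        }
        return defaultMinutes;
    }
}
```

Under these assumptions a URL flagged with `fetch.with=playwright` is due again after 5 minutes, while unflagged URLs keep the default interval.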

Configuration keys:

| Key | Default | Notes |
|---|---|---|
| `playwright.redirect.metadata.key` | `fetch.with` | Routing key the bolt watches for. |
| `playwright.redirect.metadata.value` | `playwright` | Value the bolt watches for. |
| `playwright.redirect.skip.if.metadata.present` | `playwright.protocol.end` | Loop guard — pass through unchanged when this key is set on the URL. Empty string disables. |

### When _not_ to use it

- **Operator allowlist suffices.** If you already know which hosts need a browser, add them as a `urlPatterns` rule on the Playwright leg of `DelegatorProtocol` and skip the filter.
- **Anti-bot / WAF challenge pages.** Cloudflare, DataDome, and Akamai challenge fingerprints aren't covered here; those usually need a stealth-mode browser, not just rendering.
- **Aggressively first-fetch-sensitive crawls.** The first fetch on an unknown SPA host is always wasted (you get a stub document) before the filter learns about the host. If that's unacceptable, prefer the operator allowlist.

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to you under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License. You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.stormcrawler.protocol.playwright.bolt;

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.stormcrawler.Constants;
import org.apache.stormcrawler.Metadata;
import org.apache.stormcrawler.persistence.Status;
import org.apache.stormcrawler.protocol.playwright.HttpProtocol;
import org.apache.stormcrawler.util.ConfUtils;
import org.slf4j.LoggerFactory;

/**
 * Bolt that consumes the routing flag set by {@link
 * org.apache.stormcrawler.protocol.playwright.parsefilter.JsRenderingDetector} (or any other
 * upstream component) and forces an immediate refetch through Playwright instead of letting the
 * cheap fetch's stub document propagate downstream.
 *
 * <p>Pipeline placement: between the parser bolt (which produces tuples of {@code (url, content,
 * metadata, text)}) and the indexer / persistence bolts. On hit, the bolt emits only to the {@link
 * Constants#StatusStreamName} with status {@link Status#FETCHED}, so the URL is rescheduled and the
 * stub never reaches the index. On miss, the tuple passes through to the default stream unchanged.
 *
 * <p>Pair this with a per-metadata-key fetch interval to control how soon the refetch happens — by
 * default {@code Status.FETCHED} reschedules at {@code fetchInterval.default} (24h):
 *
 * <pre>{@code
 * # refetch flagged URLs in 5 minutes rather than 24 hours
 * fetchInterval.fetch.with=playwright: 5
 * }</pre>
 *
 * <h3>Configuration</h3>
 *
 * <ul>
 *   <li>{@code playwright.redirect.metadata.key} (default {@code fetch.with})
 *   <li>{@code playwright.redirect.metadata.value} (default {@code playwright})
 *   <li>{@code playwright.redirect.skip.if.metadata.present} (default {@link
 *       HttpProtocol#MD_KEY_END}) — passes the tuple through unchanged when this metadata key is
 *       already set, preventing loops with content that came back from Playwright. Set to empty to
 *       disable the loop guard.
 * </ul>
 */
public class JsRenderingRedirectionBolt extends BaseRichBolt {

    private static final org.slf4j.Logger LOG =
            LoggerFactory.getLogger(JsRenderingRedirectionBolt.class);

    public static final String CONF_METADATA_KEY = "playwright.redirect.metadata.key";
    public static final String CONF_METADATA_VALUE = "playwright.redirect.metadata.value";
    public static final String CONF_SKIP_IF_METADATA_PRESENT =
            "playwright.redirect.skip.if.metadata.present";

    public static final String DEFAULT_METADATA_KEY = "fetch.with";
    public static final String DEFAULT_METADATA_VALUE = "playwright";

    private OutputCollector collector;
    private String routingKey;
    private String routingValue;
    private String skipIfMetadataPresent;

    @Override
    public void prepare(
            final Map<String, Object> conf,
            final TopologyContext context,
            final OutputCollector collector) {
        this.collector = collector;
        this.routingKey = ConfUtils.getString(conf, CONF_METADATA_KEY, DEFAULT_METADATA_KEY);
        this.routingValue = ConfUtils.getString(conf, CONF_METADATA_VALUE, DEFAULT_METADATA_VALUE);
        this.skipIfMetadataPresent =
                ConfUtils.getString(conf, CONF_SKIP_IF_METADATA_PRESENT, HttpProtocol.MD_KEY_END);
    }

    @Override
    public void execute(final Tuple tuple) {
        final String url = tuple.getStringByField("url");
        final byte[] content = tuple.getBinaryByField("content");
        final Metadata metadata = (Metadata) tuple.getValueByField("metadata");
        final String text = tuple.getStringByField("text");

        if (shouldRedirect(metadata)) {
            LOG.debug("Redirecting {} to Playwright (status stream)", url);
            collector.emit(
                    Constants.StatusStreamName, tuple, new Values(url, metadata, Status.FETCHED));
        } else {
            collector.emit(tuple, new Values(url, content, metadata, text));
        }
        collector.ack(tuple);
    }

    private boolean shouldRedirect(final Metadata metadata) {
        if (metadata == null) {
            return false;
        }
        if (skipIfMetadataPresent != null
                && !skipIfMetadataPresent.isEmpty()
                && metadata.containsKey(skipIfMetadataPresent)) {
            // already came back from Playwright — don't loop
            return false;
        }
        return metadata.containsKeyWithValue(routingKey, routingValue);
    }

    @Override
    public void declareOutputFields(final OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("url", "content", "metadata", "text"));
        declarer.declareStream(Constants.StatusStreamName, new Fields("url", "metadata", "status"));
    }
}
