feat(dom-rules): Add 300+ dom rules by cubewhy · Pull Request #931 · mengxi-ream/read-frog

cubewhy · 2026-02-01T12:13:01Z

Type of Changes

Description

Introduce 300+ dom rules from @TianmuTNT's fork
Refactor the original dom-rules.ts to load rules from JSON (I have merged the existing rules into the JSON file)

Related Issue

None

How Has This Been Tested?

Verified via a manual smoke test

Added unit tests
Verified through manual testing

Screenshots

None

Checklist

I have tested these changes locally
I have updated the documentation accordingly if necessary
My code follows the code style of this project
My changes do not break existing functionality
If my code was generated by AI, I have proofread and improved it as necessary.

I don't think I have access to the source of the page https://www.readfrog.app/zh/tutorial/code-contribution/custom-dom-rules,
so I could not update the docs.

Additional Information

Summary by cubic

Added 360+ website-specific DOM exclusion and block-translation rules, loaded from a JSON config with wildcard URL matching. This improves translation accuracy across popular sites and makes rules easier to maintain.

New Features
- Added JSON-based DOM rules (dontWalkIntoSelectors, forceBlockTranslationSelectors).
- Implemented wildcard URL pattern matching (*, **) with protocol-optional support.
Refactors
- dom-rules.ts now loads rules from JSON and exposes findMatchingSelectors.
- Updated filter.ts to use dynamic selector lookup.
- Enabled resolveJsonModule in tsconfig.

Written for commit 635f45a. Summary will update on new commits.

changeset-bot · 2026-02-01T12:13:05Z

🦋 Changeset detected

Latest commit: 635f45a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@read-frog/extension	Minor


dosubot · 2026-02-01T12:13:15Z

Related Documentation

Checked 6 published document(s) in 1 knowledge base(s). No updates required.


cubic-dev-ai

1 issue found across 5 files

Prompt for AI agents (all issues)


Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/utils/constants/dom-rules.ts">

<violation number="1" location="src/utils/constants/dom-rules.ts:152">
P2: findMatchingSelectors returns empty results while domRules is still loading asynchronously, and call sites use it synchronously. Early DOM scans can skip domain-specific blocking rules until the JSON import finishes, changing behavior from the previous synchronous constants.</violation>
</file>


src/utils/constants/dom-rules.ts

claude · 2026-02-01T15:11:21Z

PR Review: feat(dom-rules): Add 300+ dom rules

Overall Assessment

🟢 Good addition - Bringing in 360+ website-specific DOM rules from a fork is valuable for improving translation accuracy across popular sites. The refactor to JSON-based configuration makes rules easier to maintain.

Key Insights

Data structure: JSON config with two rule types (dontWalkIntoSelectors, forceBlockTranslationSelectors) mapped by URL patterns. Clean separation.

Complexity: The glob-to-regex pattern matching is the main logic addition. Implementation is straightforward but has edge cases.

Risk points:

Existing tests may break due to changed imports/exports
Performance impact of pattern matching on every DOM check
Some rules in JSON duplicate the same selector arrays (e.g., Twitter variants)

Issues to Address

1. Potential Breaking Change: YouTube rules removed

The original CUSTOM_DONT_WALK_INTO_ELEMENT_SELECTOR_MAP had YouTube-specific rules using imported constants:

`.${SUBTITLES_VIEW_CLASS}`,
`.${STATE_MESSAGE_CLASS}`,
`.${TRANSLATE_BUTTON_CLASS}`,

These dynamic selectors based on JS constants are NOT present in the JSON file. The YouTube entry in dom-rules.json only has:

"www.youtube.com": [".ytp-caption-segment"]

This is a regression - the extension's own subtitle-related elements won't be excluded from translation.

Suggestion: Either add the actual class values to the JSON, or create a mechanism to merge runtime constants with JSON rules for YouTube.
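One way the "merge runtime constants with JSON rules" option could look is sketched below. This is only an illustration: the three class-name values are hypothetical placeholders (the real constants live in the extension's subtitles module and their values are not shown in this thread), and `mergeSelectorMaps` is a made-up helper name, not something from the PR.

```typescript
// Hypothetical stand-ins for the constants the original dom-rules.ts imported.
// The real values are not shown in this thread.
const SUBTITLES_VIEW_CLASS = 'example-subtitles-view'
const STATE_MESSAGE_CLASS = 'example-state-message'
const TRANSLATE_BUTTON_CLASS = 'example-translate-button'

type SelectorMap = Record<string, string[]>

// Stand-in for the parsed JSON section (entry quoted in the review above).
const jsonRules: SelectorMap = {
  'www.youtube.com': ['.ytp-caption-segment'],
}

// Rules that depend on runtime constants stay in TypeScript.
const runtimeRules: SelectorMap = {
  'www.youtube.com': [
    `.${SUBTITLES_VIEW_CLASS}`,
    `.${STATE_MESSAGE_CLASS}`,
    `.${TRANSLATE_BUTTON_CLASS}`,
  ],
}

/** Merge two selector maps key by key, dropping duplicate selectors. */
function mergeSelectorMaps(base: SelectorMap, extra: SelectorMap): SelectorMap {
  const merged: SelectorMap = {}
  for (const [pattern, selectors] of Object.entries(base))
    merged[pattern] = [...selectors]
  for (const [pattern, selectors] of Object.entries(extra))
    merged[pattern] = Array.from(new Set([...(merged[pattern] ?? []), ...selectors]))
  return merged
}

const dontWalkIntoSelectors = mergeSelectorMaps(jsonRules, runtimeRules)
```

With this shape the JSON file stays static, and only the handful of selectors that depend on code constants remain in TypeScript.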

2. Performance: Pattern matching on every element

findMatchingSelectors() is called for every element check via isCustomDontWalkIntoElement() and isCustomForceBlockTranslation(). Each call:

Creates a new URL object
Iterates through all 300+ patterns when no exact match

Suggestion: Cache the matched selectors per URL at the start of page translation rather than re-computing per element.
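A minimal sketch of the caching idea, assuming the uncached matcher has the shape `(ruleName, url) => string[]`. The wrapper name `createCachedLookup` and the `slowLookup` stub are hypothetical and exist only for this example.

```typescript
type RuleName = 'dontWalkIntoSelectors' | 'forceBlockTranslationSelectors'
type Lookup = (ruleName: RuleName, url: string) => string[]

/** Wrap a lookup so each (ruleName, url) pair is resolved at most once. */
function createCachedLookup(lookup: Lookup): Lookup {
  const cache = new Map<string, string[]>()
  return (ruleName, url) => {
    const key = `${ruleName}\n${url}`
    let selectors = cache.get(key)
    if (selectors === undefined) {
      selectors = lookup(ruleName, url)
      cache.set(key, selectors)
    }
    return selectors
  }
}

// Stub standing in for the pattern-matching lookup; it counts its own calls.
let lookupCalls = 0
const slowLookup: Lookup = (_ruleName, url) => {
  lookupCalls++
  return url.indexOf('youtube.com') !== -1 ? ['.ytp-caption-segment'] : []
}

const cachedLookup = createCachedLookup(slowLookup)
```

Note that the cache grows by one entry per distinct URL, so on single-page apps that change the URL often it may be worth keying on hostname or clearing the cache on navigation.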

3. Massive duplication in JSON

Twitter/X rules are duplicated 10+ times for different subdomains. Same for Stack Exchange sites.

Suggestion: Consider using wildcard patterns more aggressively:

"*.twitter.com": [...],
"*.x.com": [...],

Or introduce rule inheritance/aliases to reduce the 2200-line JSON file.

4. Missing tests for new pattern matching logic

globToRegex() and matchUrlPattern() are new untested functions with edge cases:

Protocol handling (https:// vs http:// vs none)
Wildcard at different positions (*.domain.com, domain.com/*, **)
URL with query strings and fragments

The existing tests in custom-dont-walk.test.ts and custom-force-block.test.ts use setHost() which only tests hostname matching, not the new URL pattern features.

5. Edge case bug in matchUrlPattern

if (!pattern.startsWith('http://') && !pattern.startsWith('https://')) {
  return regex.test(url.replace(/^https?:\/\//, ''))
}

If pattern doesn't start with protocol but URL does, it strips protocol from URL. But if pattern contains path like github.com/user/*, the generated regex expects the pattern format, not the stripped URL format. This could cause false negatives.

Minor Issues

Comment inconsistency (line 2340-2341):

if (!domRules)
  // Not initialized yet
  return []

Comment on wrong line - should be above or inline.

Empty selector handling: If dontWalkIntoElementSelectorList is empty, joining produces "" which when passed to element.matches("") throws an error. The check if (!dontWalkSelector) handles this, but worth noting.
Some JSON patterns use full URLs with https://:

"https://www.reddit.com/r/*/comments/*/*": [...]

While others use just hostnames. This inconsistency works due to the pattern matching but reduces clarity.

What's Good

Clean interface design (DomRulesConfig)
Proper TypeScript typing with the JSON module
Sensible fallback behavior (empty array when no rules match)
Good changeset documentation

Recommendation

Worth merging with fixes:

Add YouTube constants to JSON (or implement runtime merge)
Add caching for pattern matching results
Add unit tests for globToRegex and matchUrlPattern

The core value (300+ rules) is solid. The implementation needs minor hardening before merge.

ananaBMaster · 2026-02-02T07:27:51Z

can I ask where are these rules originally from?

cubewhy · 2026-02-03T12:27:41Z

can I ask where are these rules originally from?

Sorry for the late reply

@TianmuTNT reversed some rules from the "Immersive Translate" plugin and wrote a script to convert to the read-frog format.

claude · 2026-02-04T07:07:39Z

Code Review: PR #931 - feat(dom-rules): Add 300+ dom rules

Summary

This PR adds 360+ website-specific DOM exclusion rules from @TianmuTNT's fork and refactors the rule loading system to use a JSON configuration file with wildcard URL pattern matching.

🟢 Good taste overall

The approach of externalizing rules to JSON is sound - it makes rules more maintainable and easier to contribute to without modifying TypeScript code.

✅ What Works Well

JSON-based configuration: Moving rules to a JSON file is a good separation of concerns
Wildcard pattern matching: The * and ** glob support is useful for matching subdomains and paths
Protocol-agnostic matching: Stripping protocols for comparison is sensible
Backward compatibility: Existing rules appear preserved, filter.ts changes are minimal

🔴 Issues to Address

1. Critical: Massive selector array duplication

The Twitter/X rules are duplicated across 10+ domains with identical selectors:

twitter.com
mobile.twitter.com
tweetdeck.twitter.com
pro.twitter.com
platform.twitter.com/embed*
x.com
mobile.x.com
tweetdeck.x.com
pro.x.com
platform.x.com/embed*

Recommendation: Use the wildcard support to consolidate:

"*.twitter.com": [...],
"*.x.com": [...],
"twitter.com": [...],
"x.com": [...]

Same issue with Stack Exchange sites (stackoverflow, superuser, askubuntu, serverfault, *.stackexchange.com), Quora domains, and Telegram domains.

DRY violation - ~500+ lines could be reduced to ~100.
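As an illustration of how the duplication might be reduced, the sketch below groups patterns that share one selector list and expands them into the flat map at load time. The `RuleGroup` format and the placeholder selector are assumptions for this example; the PR's JSON does not use this shape.

```typescript
// Hypothetical grouped format: several URL patterns share one selector list.
interface RuleGroup {
  patterns: string[]
  selectors: string[]
}

const groups: RuleGroup[] = [
  {
    patterns: ['twitter.com', '*.twitter.com', 'x.com', '*.x.com'],
    selectors: ['[data-testid="placeholder"]'], // placeholder, not a real rule
  },
]

/** Expand groups into the flat pattern -> selectors map the matcher expects. */
function expandGroups(ruleGroups: RuleGroup[]): Record<string, string[]> {
  const flat: Record<string, string[]> = {}
  for (const group of ruleGroups) {
    for (const pattern of group.patterns)
      flat[pattern] = group.selectors // shared reference, no copy per pattern
  }
  return flat
}

const flatRules = expandGroups(groups)
```

One caveat: a hostname wildcard such as `*.twitter.com` is broader than a path-restricted entry like `platform.twitter.com/embed*`, so consolidating may change which pages a rule applies to.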

2. Performance concern: Linear pattern matching

findMatchingSelectors iterates through all patterns on every call:

for (const [pattern, selectors] of Object.entries(ruleset || {})) {
  if (matchUrlPattern(url, pattern) || matchUrlPattern(hostname, pattern)) {
    return selectors
  }
}

With 300+ patterns, this could be slow. The function correctly checks exact hostname match first (fast path), but consider:

Caching the matched selectors per URL/hostname
Or building a Map at initialization time for non-wildcard patterns
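The second option could look roughly like the sketch below: exact patterns go into a `Map`, and wildcard patterns are compiled to regexes once. The names `RuleIndex`, `buildRuleIndex`, `compileGlob`, and `lookupSelectors` are hypothetical, and the hostname extraction is deliberately simple (it does not special-case ports or credentials).

```typescript
type SelectorMap = Record<string, string[]>

interface RuleIndex {
  exact: Map<string, string[]>
  globs: { regex: RegExp, selectors: string[] }[]
}

const stripProtocol = (value: string): string => value.replace(/^https?:\/\//, '')

/** `*` matches within one path segment, `**` matches across segments. */
function compileGlob(pattern: string): RegExp {
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&')
  const source = escaped.replace(/\*\*|\*/g, token => (token === '**' ? '.*' : '[^/]*'))
  return new RegExp(`^${source}$`, 'i')
}

/** Done once at initialization: split exact keys from wildcard patterns. */
function buildRuleIndex(rules: SelectorMap): RuleIndex {
  const index: RuleIndex = { exact: new Map(), globs: [] }
  for (const [rawPattern, selectors] of Object.entries(rules)) {
    const pattern = stripProtocol(rawPattern)
    if (pattern.indexOf('*') !== -1)
      index.globs.push({ regex: compileGlob(pattern), selectors })
    else
      index.exact.set(pattern, selectors)
  }
  return index
}

function lookupSelectors(index: RuleIndex, url: string): string[] {
  const cleanUrl = stripProtocol(url)
  const hostname = cleanUrl.split(/[\/?#]/)[0]
  const exact = index.exact.get(hostname) ?? index.exact.get(cleanUrl)
  if (exact)
    return exact
  // Only wildcard patterns are scanned linearly, with precompiled regexes.
  for (const { regex, selectors } of index.globs) {
    if (regex.test(cleanUrl) || regex.test(hostname))
      return selectors
  }
  return []
}
```

The linear scan still exists, but it only covers wildcard patterns and no regex is compiled during lookup.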

3. Fragile CSS class selectors

Many selectors use generated class names that may change:

"[class='css-901oao r-14j79pv r-37j5jr r-n6v787 r-16dba41 r-1cwl3u0 r-bcqeeo r-qvutc0']"
"[class='css-175oi2r r-1habvwh r-vqp9x9 r-1q9bdsx r-1loqt21 r-9njtsq r-1wtj0ep r-nsbfu8 r-xbdcod r-13c7hvr']"

These are brittle and will break when sites update their CSS bundler config. Consider using data-testid or semantic attributes where available.

4. Missing tests for new functionality

The matchUrlPattern and globToRegex functions lack unit tests. The existing test (custom-dont-walk.test.ts) tests chatgpt.com which exists in the old hardcoded rules but I don't see it in the new JSON.

Important: Add tests for:

Wildcard * matching (single segment)
Wildcard ** matching (multi-segment)
Protocol stripping behavior
Edge cases (empty patterns, malformed URLs)

🟡 Suggestions for Improvement

Add schema validation: Consider adding a JSON schema or Zod validation for the rules file to catch typos/invalid selectors early
Document the pattern format: Add a comment or README explaining the URL matching syntax (*, **, protocol handling)
Consider lazy loading: The 2200-line JSON file is imported synchronously. For extension startup time, consider lazy loading if this grows further
Inconsistent URL patterns: Some entries use full URLs with protocol (https://www.reddit.com/), others just hostnames (stackoverflow.com). Standardize the format
Missing YouTube subtitles exclusions: The old code had these constants:

`${YOUTUBE_NATIVE_SUBTITLES_CLASS}`,
`.${SUBTITLES_VIEW_CLASS}`,
`.${STATE_MESSAGE_CLASS}`,
`.${TRANSLATE_BUTTON_CLASS}`,

These dynamic class references seem lost in the migration. Verify these are handled elsewhere.
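For the schema-validation suggestion, a dependency-free check is sketched below as a stand-in for Zod or a JSON schema. It only verifies the shape described by `DomRulesConfig`; it does not check that selectors are valid CSS, which would need a DOM. The function names are hypothetical.

```typescript
interface DomRulesConfig {
  dontWalkIntoSelectors?: Record<string, string[]>
  forceBlockTranslationSelectors?: Record<string, string[]>
}

const RULE_SECTIONS = ['dontWalkIntoSelectors', 'forceBlockTranslationSelectors'] as const

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value)
}

/** Return a list of human-readable problems; an empty list means the shape is valid. */
function validateDomRules(raw: unknown): string[] {
  if (!isPlainObject(raw))
    return ['config must be an object']
  const errors: string[] = []
  for (const section of RULE_SECTIONS) {
    const rules = raw[section]
    if (rules === undefined)
      continue
    if (!isPlainObject(rules)) {
      errors.push(`${section} must be an object`)
      continue
    }
    for (const [pattern, selectors] of Object.entries(rules)) {
      if (pattern.trim() === '')
        errors.push(`${section}: empty URL pattern`)
      const valid = Array.isArray(selectors)
        && selectors.every(s => typeof s === 'string' && s.trim() !== '')
      if (!valid)
        errors.push(`${section}["${pattern}"] must be an array of non-empty strings`)
    }
  }
  return errors
}

/** Type guard built on the validator above. */
function isDomRulesConfig(raw: unknown): raw is DomRulesConfig {
  return validateDomRules(raw).length === 0
}
```

A check like this could run in a unit test against the rules file, so a typo in the JSON fails CI instead of silently disabling a rule.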

Test Coverage

New matchUrlPattern function needs unit tests
New globToRegex function needs unit tests
Wildcard matching edge cases need coverage

Security

No security concerns identified. The rules are static JSON data with CSS selectors.

Final Verdict

Approve with changes requested. The core refactoring is good, but:

The massive duplication should be fixed before merge
Unit tests for the new URL matching logic are important
Verify the YouTube subtitle class constants are handled correctly

The bulk of the work (collecting 300+ rules) is valuable for the community. Just needs some polish before merging.

mengxi-ream

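The biggest problem is that the rules seem to have a bug: I suspect some rules were meant to be "translate only" but were turned into "do not translate". For example, on GitHub the result looks like this:

While Immersive Translate gives:

What gets translated and what does not are swapped between the two. So something may have gone wrong when the rules were extracted from Immersive Translate.

Could you also share the script (or whatever was used) to extract the rules so we can take a look?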

mengxi-ream · 2026-02-06T04:10:09Z

.changeset/wise-doors-chew.md

+- News and media sites (Reuters, CNBC, New York Times)
+- And many more
+
+Also implemented wildcard pattern matching (`*`, `**`) and JSON-based configuration to support flexible URL patterns.


make this changeset shorter

mengxi-ream · 2026-02-06T04:48:47Z

src/utils/host/dom/filter.ts


 export function isCustomDontWalkIntoElement(element: HTMLElement): boolean {
-  const dontWalkIntoElementSelectorList = CUSTOM_DONT_WALK_INTO_ELEMENT_SELECTOR_MAP[window.location.hostname] ?? []
+  const dontWalkIntoElementSelectorList = findMatchingSelectors('dontWalkIntoSelectors', window.location.href)


findMatchingSelectors is not O(1), so this will cause serious performance problems.

mengxi-ream · 2026-02-06T04:48:57Z

src/utils/host/dom/filter.ts


 export function isCustomForceBlockTranslation(element: HTMLElement): boolean {
-  const forceBlockSelectorList = CUSTOM_FORCE_BLOCK_TRANSLATION_SELECTOR_MAP[window.location.hostname] ?? []
+  const forceBlockSelectorList = findMatchingSelectors('forceBlockTranslationSelectors', window.location.href)


findMatchingSelectors is not O(1), so this will cause serious performance problems; the result should be passed in as a parameter.
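A rough sketch of the "pass it as a parameter" approach: resolve the selectors once per page, then hand the result to the per-element checks. `SiteRules`, `resolveSiteRules`, and the `ElementLike` interface are hypothetical names for this example; `ElementLike` just mirrors the `matches` method of a DOM element so the sketch runs without a browser.

```typescript
type RuleName = 'dontWalkIntoSelectors' | 'forceBlockTranslationSelectors'
type FindSelectors = (ruleName: RuleName, url: string) => string[]

// Minimal slice of the DOM Element interface used by the checks.
interface ElementLike {
  matches: (selector: string) => boolean
}

interface SiteRules {
  dontWalkSelector: string
  forceBlockSelector: string
}

/** Run the (possibly slow) lookup once and keep the joined selectors. */
function resolveSiteRules(url: string, find: FindSelectors): SiteRules {
  return {
    dontWalkSelector: find('dontWalkIntoSelectors', url).join(','),
    forceBlockSelector: find('forceBlockTranslationSelectors', url).join(','),
  }
}

// Per-element checks receive the precomputed rules instead of looking them up.
function isCustomDontWalkIntoElement(element: ElementLike, rules: SiteRules): boolean {
  return rules.dontWalkSelector !== '' && element.matches(rules.dontWalkSelector)
}

function isCustomForceBlockTranslation(element: ElementLike, rules: SiteRules): boolean {
  return rules.forceBlockSelector !== '' && element.matches(rules.forceBlockSelector)
}

// Stub lookup and fake element for demonstration only.
let findCalls = 0
const stubFind: FindSelectors = (ruleName) => {
  findCalls++
  return ruleName === 'dontWalkIntoSelectors' ? ['header *', '.sidebar'] : []
}
const fakeElement: ElementLike = { matches: selector => selector === 'header *,.sidebar' }
const siteRules = resolveSiteRules('https://github.com/', stubFind)
```

The empty-string guard matters because `element.matches('')` throws in browsers.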

mengxi-ream · 2026-02-06T04:49:19Z

.changeset/wise-doors-chew.md

+- News and media sites (Reuters, CNBC, New York Times)
+- And many more
+
+Also implemented wildcard pattern matching (`*`, `**`) and JSON-based configuration to support flexible URL patterns.


Keep the changeset as short as possible: one or two sentences, not a long list.

TianmuTNT · 2026-02-06T04:51:39Z

rules.json
@mengxi-ream

cubewhy · 2026-02-06T04:52:59Z

wrote a script

In fact this is not generated by a script

from @TianmuTNT

ananaBMaster · 2026-02-06T04:54:23Z

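Could you grant push access to the upstream maintainers:

git push
remote: Permission to cubewhy/read-frog.git denied to mengxi-ream.
fatal: unable to access 'https://github.com/cubewhy/read-frog.git/': The requested URL returned error: 403

As for your dom-rule.ts, the implementation is too complicated: many unnecessary if/else branches and too much code, so it is not concise enough. After I finished my changes I found that I could not push.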

import domRulesModule from '@/assets/dom-rules.json'

export interface DomRulesConfig {
  dontWalkIntoSelectors?: Record<string, string[]>
  forceBlockTranslationSelectors?: Record<string, string[]>
}

const domRules: DomRulesConfig = domRulesModule as DomRulesConfig

export const FORCE_BLOCK_TAGS = new Set([
  'BODY',
  'H1',
  'H2',
  'H3',
  'H4',
  'H5',
  'H6',
  'BR',
  'FORM',
  'SELECT',
  'BUTTON',
  'LABEL',
  'UL',
  'OL',
  'LI',
  'BLOCKQUOTE',
  'PRE',
  'ARTICLE',
  'SECTION',
  'FIGURE',
  'FIGCAPTION',
  'HEADER',
  'FOOTER',
  'MAIN',
  'NAV',
])

export const MATH_TAGS = new Set([
  'math',
  'maction',
  'annotation',
  'annotation-xml',
  'menclose',
  'merror',
  'mfenced',
  'mfrac',
  'mi',
  'mmultiscripts',
  'mn',
  'mo',
  'mover',
  'mpadded',
  'mphantom',
  'mprescripts',
  'mroot',
  'mrow',
  'ms',
  'mspace',
  'msqrt',
  'mstyle',
  'msub',
  'msubsup',
  'msup',
  'mtable',
  'mtd',
  'mtext',
  'mtr',
  'munder',
  'munderover',
  'semantics',
])

// Don't walk into these tags
export const DONT_WALK_AND_TRANSLATE_TAGS = new Set([
  'HEAD',
  'TITLE',
  'HR',
  'INPUT',
  'TEXTAREA',
  'IMG',
  'VIDEO',
  'AUDIO',
  'CANVAS',
  'SOURCE',
  'TRACK',
  'META',
  'SCRIPT',
  'NOSCRIPT',
  'STYLE',
  'LINK',
  'PRE',
  'svg',
  ...MATH_TAGS,
])

export const DONT_WALK_BUT_TRANSLATE_TAGS = new Set([
  'CODE',
  'TIME',
])

export const FORCE_INLINE_TRANSLATION_TAGS = new Set([
  'A',
  'BUTTON',
  'SELECT',
  'OPTION',
  'SPAN',
])

export const MAIN_CONTENT_IGNORE_TAGS = new Set(['HEADER', 'FOOTER', 'NAV', 'NOSCRIPT'])

/**
 * Convert glob pattern to RegExp for URL matching
 * Supports: * (single segment) and ** (any depth)
 */
function globToRegex(pattern: string): RegExp {
  let regexStr = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&')

  regexStr = regexStr.replace(/\*\*/g, '§DBL§')
  regexStr = regexStr.replace(/\*/g, '[^/]*')
  regexStr = regexStr.replace(/§DBL§/g, '.*')

  return new RegExp(`^${regexStr}$`, 'i')
}

/** Protocol-agnostic URL matching with glob support */
export function matchUrlPattern(url: string, pattern: string): boolean {
  const cleanUrl = url.replace(/^https?:\/\//, '')
  const cleanPattern = pattern.replace(/^https?:\/\//, '')

  if (!pattern.includes('*'))
    return cleanUrl === cleanPattern

  return globToRegex(cleanPattern).test(cleanUrl)
}

export function findMatchingSelectors(
  ruleName: 'dontWalkIntoSelectors' | 'forceBlockTranslationSelectors',
  currentUrl?: string,
): string[] {
  const ruleset = domRules?.[ruleName]
  if (!ruleset)
    return []

  const url = currentUrl || window.location.href
  const hostname = new URL(url).hostname

  // Fast path: O(1) exact key lookup before O(n) glob matching
  if (ruleset[hostname])
    return ruleset[hostname]

  for (const [pattern, selectors] of Object.entries(ruleset)) {
    if (matchUrlPattern(url, pattern) || matchUrlPattern(hostname, pattern))
      return selectors
  }

  return []
}

cubewhy · 2026-02-06T04:57:42Z

Could you grant push access to the upstream maintainers:

git push
remote: Permission to cubewhy/read-frog.git denied to mengxi-ream.
fatal: unable to access 'https://github.com/cubewhy/read-frog.git/': The requested URL returned error: 403

🤔

cubewhy · 2026-02-06T05:04:24Z

I used a clumsy workaround; it should work now.

mengxi-ream · 2026-02-06T05:06:10Z

I used a clumsy workaround; it should work now.

Or maybe you could just copy and paste @ananaBMaster's code from above, haha.

mengxi-ream · 2026-02-06T05:08:06Z

rules.json @mengxi-ream

Strange, I see that the GitHub entry does indeed have

  "additionalExcludeSelectors.add": [
            "[data-test-selector='commit-tease-commit-message']",
            "[data-test-selector='create-branch.developmentForm']",
            "div.Box-header.position-relative",

But I also see that they have

        "selectors": [
            "h1",
            "[aria-label=Issues] .markdown-title",
            "[aria-labelledby=discussions-list] .markdown-title",
            "h3 .markdown-title",
            ".markdown-body",
            ".Layout-sidebar p",
            "div > span.search-match",

Is there some mechanism that combines these two?

Judging by the current result, this GitHub example does not look quite right.

cubewhy · 2026-02-06T05:34:00Z

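Is there some mechanism that combines these two?

I think the rules work like this:

Select all elements matched by selectors, then exclude the elements matched by additionalExcludeSelectors.

This PR currently only applies the additionalExcludeSelectors exclusions, which causes some things to be translated incorrectly.

I have not read Immersive Translate's code; this is only a guess.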
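The guess above can be written down as a small predicate. This only models the hypothesis stated in the comment; it is not taken from Immersive Translate's code. `SiteRule`, `shouldTranslate`, and the `ElementLike` stand-in are hypothetical, and the field names follow the rules.json excerpt quoted earlier.

```typescript
// Minimal slice of the DOM Element interface, so the sketch runs without a browser.
interface ElementLike {
  matches: (selector: string) => boolean
}

interface SiteRule {
  selectors: string[] // elements to translate
  additionalExcludeSelectors: string[] // carve-outs from the set above
}

/** Hypothesis: translate if matched by `selectors` and not by any exclude selector. */
function shouldTranslate(element: ElementLike, rule: SiteRule): boolean {
  const included = rule.selectors.some(s => element.matches(s))
  const excluded = rule.additionalExcludeSelectors.some(s => element.matches(s))
  return included && !excluded
}

// Fake element that "matches" exactly the selectors it is given.
const fakeElement = (matching: string[]): ElementLike => ({
  matches: selector => matching.indexOf(selector) !== -1,
})

const githubRule: SiteRule = {
  selectors: ['h1', '.markdown-body'],
  additionalExcludeSelectors: ['div.Box-header.position-relative'],
}
```

Under this reading, importing only the exclude list (as the PR does) drops the include half of the rule, which would be consistent with the swapped behavior reported on GitHub.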

mengxi-ream · 2026-02-06T06:11:13Z

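Is there some mechanism that combines these two?

I think the rules work like this:

Select all elements matched by selectors, then exclude the elements matched by additionalExcludeSelectors.

This PR currently only applies the additionalExcludeSelectors exclusions, which causes some things to be translated incorrectly.

I have not read Immersive Translate's code; this is only a guess.

That is what I think too: they have some mechanism where the sets interact, and one rule may override another.

So before this PR can be merged, that mechanism still needs to be implemented.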

Co-authored-by: MengXi <contact@mengxi.work>

cubewhy · 2026-02-06T07:00:57Z

BTW, it looks like eslint does not respect .gitignore here.
I use NixOS and have devenv files; eslint goes in and lints them, which blocks my commits.

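import path from 'node:path'
import { fileURLToPath } from 'node:url'
import antfu from '@antfu/eslint-config'
import { includeIgnoreFile } from '@eslint/compat'

const __filename = fileURLToPath(import.meta.url)
const __dirname = path.dirname(__filename)
const gitignorePath = path.resolve(__dirname, '.gitignore')
// Path inferred from the label passed to includeIgnoreFile below
const gitExcludePath = path.resolve(__dirname, '.git/info/exclude')

export default antfu({
    /*...*/
}).append(includeIgnoreFile(gitignorePath, 'Imported .gitignore patterns')).append(includeIgnoreFile(gitExcludePath, 'Imported .git/info/exclude patterns'))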

cubewhy · 2026-02-06T07:12:01Z

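Could you grant push access to the upstream maintainers:

git push
remote: Permission to cubewhy/read-frog.git denied to mengxi-ream.
fatal: unable to access 'https://github.com/cubewhy/read-frog.git/': The requested URL returned error: 403

As for your dom-rule.ts, the implementation is too complicated: many unnecessary if/else branches and too much code, so it is not concise enough. After I finished my changes I found that I could not push.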

The code currently in this repo was vibe-coded by @TianmuTNT.
I think a faster approach would be to match URLs with a trie;
the for loop is really slow.

mengxi-ream · 2026-02-06T08:01:15Z

devenv

BTW, it looks like eslint does not respect .gitignore here. I use NixOS and have devenv files; eslint goes in and lints them, which blocks my commits.

import { includeIgnoreFile } from '@eslint/compat'

const __filename = fileURLToPath(import.meta.url)
const __dirname = path.dirname(__filename)
const gitignorePath = path.resolve(__dirname, '.gitignore')

export default antfu({
    /*...*/
}).append(includeIgnoreFile(gitignorePath, 'Imported .gitignore patterns')).append(includeIgnoreFile(gitExcludePath, 'Imported .git/info/exclude patterns'))

Making eslint ignore the corresponding files should be enough.

mengxi-ream · 2026-02-06T08:01:55Z

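Could you grant push access to the upstream maintainers:

git push
remote: Permission to cubewhy/read-frog.git denied to mengxi-ream.
fatal: unable to access 'cubewhy/read-frog.git': The requested URL returned error: 403

As for your dom-rule.ts, the implementation is too complicated: many unnecessary if/else branches and too much code, so it is not concise enough. After I finished my changes I found that I could not push.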

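The code currently in this repo was vibe-coded by @TianmuTNT. I think a faster approach would be to match URLs with a trie; the for loop is really slow.

One option is a trie; another is to run the match only once and pass the result down as a parameter.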
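A hostname trie could look roughly like the sketch below. It is a simplification with stated limits: it indexes hostname-only patterns by reversed labels (`com` → `x` → `mobile`), supports `*` only as the leading label (`*.x.com`), and would need a fallback for path patterns or wildcards elsewhere (for example `www.google.*`). All names are hypothetical.

```typescript
interface TrieNode {
  children: Map<string, TrieNode>
  selectors?: string[] // rule whose hostname ends exactly at this node
  wildcardSelectors?: string[] // rule of the form *.<suffix at this node>
}

function createNode(): TrieNode {
  return { children: new Map() }
}

/** Insert a hostname pattern such as "x.com" or "*.x.com". */
function insertHost(root: TrieNode, pattern: string, selectors: string[]): void {
  const labels = pattern.toLowerCase().split('.').reverse()
  let node = root
  for (const label of labels) {
    if (label === '*') {
      node.wildcardSelectors = selectors
      return
    }
    let child = node.children.get(label)
    if (!child) {
      child = createNode()
      node.children.set(label, child)
    }
    node = child
  }
  node.selectors = selectors
}

/** Lookup cost depends on the number of labels, not on the number of rules. */
function lookupHost(root: TrieNode, hostname: string): string[] {
  const labels = hostname.toLowerCase().split('.').reverse()
  let node = root
  let best: string[] = []
  for (let i = 0; i < labels.length; i++) {
    const child = node.children.get(labels[i])
    if (!child)
      return best
    node = child
    const remaining = labels.length - i - 1
    if (remaining === 0 && node.selectors)
      return node.selectors // exact hostname wins
    if (remaining >= 1 && node.wildcardSelectors)
      best = node.wildcardSelectors // deepest wildcard suffix seen so far
  }
  return best
}

const root = createNode()
insertHost(root, 'x.com', ['.exact'])
insertHost(root, '*.x.com', ['.wildcard'])
insertHost(root, 'stackoverflow.com', ['.so'])
```

Combined with running the lookup once per page, the trie mostly matters if rules are ever matched against many different URLs.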

claude · 2026-02-08T08:33:48Z

Code Review: feat(dom-rules): Add 300+ dom rules

Overall Assessment

The feature concept is valuable -- moving from a small hardcoded map to a comprehensive JSON-based rule system with glob pattern matching is the right direction. However, there are several significant issues that should be addressed before merging.

Critical Issues

Lost YouTube selectors that reference runtime constants:
The original dom-rules.ts had YouTube selectors that use imported constants:
```
YOUTUBE_NATIVE_SUBTITLES_CLASS
SUBTITLES_VIEW_CLASS
STATE_MESSAGE_CLASS
TRANSLATE_BUTTON_CLASS
```
These are dynamic class names imported from ./subtitles. The new JSON-based approach cannot represent these since JSON is static. The YouTube entry in dom-rules.json does not include equivalents for these selectors. This is a regression that could cause the extension's own UI elements (subtitles overlay, translate buttons) to be incorrectly walked into and translated.

Lost existing selectors for www.reddit.com and www.youtube.com:
The original code had more specific selectors (e.g., Reddit's faceplate-screen-reader-content > *, reddit-header-large *; YouTube's #masthead-container *, #guide-inner-content *, #metadata *, etc.) that are not present in the JSON file. The JSON has different Reddit/YouTube selectors from the fork, but the originals were lost. These need to be merged, not replaced.

Lost github.com selectors:
The original had [aria-labelledby="folders-and-files"] *, header *, #repository-container-header *, [class*="OverviewContent-module__Box_1--"] *. The JSON has different GitHub selectors. Again, a merge is needed rather than a replacement.

Performance Concerns

findMatchingSelectors is called on every DOM element check:
isCustomDontWalkIntoElement and isCustomForceBlockTranslation are called during DOM tree walking, potentially thousands of times per page. Each call to findMatchingSelectors iterates over all URL patterns in the JSON (300+ entries) with regex compilation via globToRegex. This regex is recompiled on every call.

Recommendation: Cache the compiled regexes (build them once at module load or first use) and cache the findMatchingSelectors result per URL (memoize). The URL doesn't change during a page session, so this result should be computed once.

Massive JSON duplication inflates bundle size:
The dom-rules.json file is ~2200 lines with enormous duplication. For example, twitter.com, mobile.twitter.com, tweetdeck.twitter.com, pro.twitter.com, platform.twitter.com/embed*, x.com, mobile.x.com, tweetdeck.x.com, pro.x.com, platform.x.com/embed* all have the exact same 17 selectors copy-pasted 10 times. Similarly, Stack Exchange sites (stackoverflow.com, *.stackexchange.com, superuser.com, askubuntu.com, serverfault.com) duplicate the same selectors 5 times despite *.stackexchange.com already covering most of them.

The wildcard matching already supports patterns like *.twitter.com or *.x.com which would cover all subdomains. This JSON could be reduced by ~30-40% by deduplicating.

Code Quality

globToRegex sentinel placeholder is fragile:
Using a special string as a temporary placeholder works but is fragile. If any URL pattern ever contains this literal string, matching would break. A safer approach would be to process ** before escaping, or use a different transformation strategy.

findMatchingSelectors returns only the first match:
The current logic returns early on the first matching pattern. If a URL matches multiple patterns (e.g., both *.google.com and a more specific www.google.*/search*), only one set of selectors is returned. This could be intentional, but the order depends on Object.entries() iteration order which follows JSON key order. This behavior should be documented or selectors should be merged across all matching patterns.

Missing arxiv.org selectors:
The original had arxiv.org: ['.ltx_listing'] which is not in the JSON. The JSON has browse.arxiv.org and arxiv.org/html/* with different selectors, but the base arxiv.org pattern for .ltx_listing is lost.
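For the sentinel concern, one possible alternative is to split the pattern on its wildcards first and escape only the literal pieces, so no placeholder string is needed. This is a sketch of that idea, not the PR's code; `escapeRegex` is a helper introduced for the example.

```typescript
/** Escape regex metacharacters in a literal fragment. */
function escapeRegex(text: string): string {
  return text.replace(/[.+?^${}()|[\]\\]/g, '\\$&')
}

/**
 * Convert a glob to a RegExp without a placeholder:
 * split on `**` / `*` (kept via the capture group), then map each piece.
 */
function globToRegex(pattern: string): RegExp {
  const source = pattern
    .split(/(\*\*|\*)/)
    .map((part) => {
      if (part === '**')
        return '.*' // any depth
      if (part === '*')
        return '[^/]*' // single segment
      return escapeRegex(part)
    })
    .join('')
  return new RegExp(`^${source}$`, 'i')
}
```

Because wildcards are recognized before anything is rewritten, a pattern that happens to contain the old placeholder text is treated as a plain literal.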

Test Coverage

No unit tests for globToRegex or matchUrlPattern functions
No unit tests for findMatchingSelectors
The PR CI is missing the "Test & Build" check -- this should pass before merge
These are pure functions that are easy to test and critical for correctness
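As a starting point for such tests, the sketch below runs a small table of cases against the `matchUrlPattern` and `globToRegex` implementation posted earlier in this thread (copied here so the example is self-contained, with `indexOf` in place of `includes`). It uses a plain loop rather than the project's test runner, which is not assumed here; the cases and the `runCases` helper are illustrative.

```typescript
// Implementation as posted earlier in this thread.
function globToRegex(pattern: string): RegExp {
  let regexStr = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&')
  regexStr = regexStr.replace(/\*\*/g, '§DBL§')
  regexStr = regexStr.replace(/\*/g, '[^/]*')
  regexStr = regexStr.replace(/§DBL§/g, '.*')
  return new RegExp(`^${regexStr}$`, 'i')
}

function matchUrlPattern(url: string, pattern: string): boolean {
  const cleanUrl = url.replace(/^https?:\/\//, '')
  const cleanPattern = pattern.replace(/^https?:\/\//, '')
  if (pattern.indexOf('*') === -1)
    return cleanUrl === cleanPattern
  return globToRegex(cleanPattern).test(cleanUrl)
}

// [url, pattern, expected]
const cases: [string, string, boolean][] = [
  ['https://github.com', 'github.com', true], // protocol is ignored
  ['https://github.com/', 'github.com', false], // trailing slash breaks exact match
  ['mobile.x.com', '*.x.com', true],
  ['x.com', '*.x.com', false], // bare domain is not covered by *.x.com
  ['https://www.reddit.com/r/ts/comments/1/title', 'https://www.reddit.com/r/*/comments/*/*', true],
  ['https://github.com/a/b', 'github.com/*', false], // * stays within one segment
  ['https://github.com/a/b', 'github.com/**', true], // ** spans segments
  ['https://github.com/a?tab=1', 'github.com/*', true], // query string is part of the segment
  ['github.com/A', 'GitHub.com/*', true], // matching is case-insensitive
]

/** Return a description of every case whose result differs from the expectation. */
function runCases(): string[] {
  const failures: string[] = []
  for (const [url, pattern, expected] of cases) {
    if (matchUrlPattern(url, pattern) !== expected)
      failures.push(`${url} vs ${pattern}: expected ${expected}`)
  }
  return failures
}
```

The second case shows why findMatchingSelectors also tries the bare hostname: `window.location.href` for a site root ends with a slash, so an exact comparison against a hostname-only key fails on the full URL.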

Security

No security concerns. The JSON is static data bundled with the extension.

Suggestions

Must fix: Merge existing selectors (especially YouTube dynamic class selectors, Reddit, GitHub) with the new JSON rules rather than replacing them. The YouTube dynamic selectors may need to remain in TypeScript code.
Must fix: Add memoization/caching to findMatchingSelectors -- the URL doesn't change mid-page, so compute once and cache.
Should fix: Deduplicate the JSON by leveraging wildcards (e.g., *.twitter.com instead of 5 separate twitter subdomains).
Should fix: Add unit tests for globToRegex, matchUrlPattern, and findMatchingSelectors.
Nice to have: Pre-compile glob regexes at module load time.

TianmuTNT and others added 2 commits February 1, 2026 18:50

feat: add 360+ website-specific DOM exclusion rules

146f538

refactor: unify dom-rules.json and built-in rules

2a66606

dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Feb 1, 2026

cubewhy changed the title ~~Add 300+ dom rules~~ feat(dom-rules): Add 300+ dom rules Feb 1, 2026

github-actions bot added the feat label Feb 1, 2026

dosubot bot added the app: website Related to website app label Feb 1, 2026

cubic-dev-ai bot reviewed Feb 1, 2026


cubewhy added 2 commits February 1, 2026 20:25

fix: load dom-rules in sync code

f2cbf97

fix: remove the test ruleset

594088d

cubewhy added 2 commits February 2, 2026 11:23

chore: correct comment position

c99864c

fix: remove url protocol before match pattern

01b23f9

mengxi-ream requested changes Feb 6, 2026


refactor: optimize findMathingSelectors

635f45a

Co-authored-by: MengXi <contact@mengxi.work>
