feat(dom-rules): Add 300+ dom rules #931

Open
cubewhy wants to merge 7 commits into mengxi-ream:main from cubewhy:add-dom-rules-refactor

Conversation

@cubewhy

cubewhy commented Feb 1, 2026

Type of Changes

  • ✨ New feature (feat)
  • 🐛 Bug fix (fix)
  • 📝 Documentation change (docs)
  • 💄 UI/style change (style)
  • ♻️ Code refactoring (refactor)
  • ⚡ Performance improvement (perf)
  • ✅ Test related (test)
  • 🔧 Build or dependencies update (build)
  • 🔄 CI/CD related (ci)
  • 🌐 Internationalization (i18n)
  • 🧠 AI model related (ai)
  • 🔄 Revert a previous commit (revert)
  • 📦 Other changes that do not modify src or test files (chore)

Description

  • Introduce 300+ dom rules from @TianmuTNT's fork
  • Refactor the original dom-rules.ts to load rules from JSON (I have merged the existing rules into the JSON file)

Related Issue

None

How Has This Been Tested?

Verified via a manual smoke test

  • Added unit tests
  • Verified through manual testing

Screenshots

None

Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly if necessary
  • My code follows the code style of this project
  • My changes do not break existing functionality
  • If my code was generated by AI, I have proofread and improved it as necessary.

I don't think I have access to the source of the page https://www.readfrog.app/zh/tutorial/code-contribution/custom-dom-rules, so I cannot update the docs.

Additional Information


Summary by cubic

Added 360+ website-specific DOM exclusion and block-translation rules, loaded from a JSON config with wildcard URL matching. This improves translation accuracy across popular sites and makes rules easier to maintain.

  • New Features

    • Added JSON-based DOM rules (dontWalkIntoSelectors, forceBlockTranslationSelectors).
    • Implemented wildcard URL pattern matching (*, **) with protocol-optional support.
  • Refactors

    • dom-rules.ts now loads rules from JSON and exposes findMatchingSelectors.
    • Updated filter.ts to use dynamic selector lookup.
    • Enabled resolveJsonModule in tsconfig.

Written for commit 635f45a. Summary will update on new commits.

@changeset-bot

changeset-bot bot commented Feb 1, 2026

🦋 Changeset detected

Latest commit: 635f45a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@read-frog/extension Minor

@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Feb 1, 2026
@dosubot

dosubot bot commented Feb 1, 2026

Related Documentation

Checked 6 published document(s) in 1 knowledge base(s). No updates required.

@cubewhy cubewhy changed the title Add 300+ dom rules feat(dom-rules): Add 300+ dom rules Feb 1, 2026
@github-actions github-actions bot added the feat label Feb 1, 2026
@dosubot dosubot bot added the app: website Related to website app label Feb 1, 2026
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

1 issue found across 5 files

Prompt for AI agents (all issues)

Check if these issues are valid — if so, understand the root cause of each and fix them.


<file name="src/utils/constants/dom-rules.ts">

<violation number="1" location="src/utils/constants/dom-rules.ts:152">
P2: findMatchingSelectors returns empty results while domRules is still loading asynchronously, and call sites use it synchronously. Early DOM scans can skip domain-specific blocking rules until the JSON import finishes, changing behavior from the previous synchronous constants.</violation>
</file>

@claude

claude bot commented Feb 1, 2026

PR Review: feat(dom-rules): Add 300+ dom rules

Overall Assessment

🟢 Good addition - Bringing in 360+ website-specific DOM rules from a fork is valuable for improving translation accuracy across popular sites. The refactor to JSON-based configuration makes rules easier to maintain.


Key Insights

Data structure: JSON config with two rule types (dontWalkIntoSelectors, forceBlockTranslationSelectors) mapped by URL patterns. Clean separation.

Complexity: The glob-to-regex pattern matching is the main logic addition. Implementation is straightforward but has edge cases.

Risk points:

  • Existing tests may break due to changed imports/exports
  • Performance impact of pattern matching on every DOM check
  • Some rules in JSON duplicate the same selector arrays (e.g., Twitter variants)

Issues to Address

1. Potential Breaking Change: YouTube rules removed

The original CUSTOM_DONT_WALK_INTO_ELEMENT_SELECTOR_MAP had YouTube-specific rules using imported constants:

`.${SUBTITLES_VIEW_CLASS}`,
`.${STATE_MESSAGE_CLASS}`,
`.${TRANSLATE_BUTTON_CLASS}`,

These dynamic selectors based on JS constants are NOT present in the JSON file. The YouTube entry in dom-rules.json only has:

"www.youtube.com": [".ytp-caption-segment"]

This is a regression - the extension's own subtitle-related elements won't be excluded from translation.

Suggestion: Either add the actual class values to the JSON, or create a mechanism to merge runtime constants with JSON rules for YouTube.
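
As a rough illustration of the "merge runtime constants with JSON rules" option, here is a minimal sketch. The class-name values and the `jsonRules` object are placeholders invented for this example; they are not the extension's real constants or the real contents of dom-rules.json, and `mergeRulesets` is a hypothetical helper.

```typescript
// Placeholder values: the real constants live in the extension's subtitles module.
const SUBTITLES_VIEW_CLASS = 'example-subtitles-view'
const STATE_MESSAGE_CLASS = 'example-state-message'
const TRANSLATE_BUTTON_CLASS = 'example-translate-button'

type Ruleset = Record<string, string[]>

// Stand-in for the static rules loaded from dom-rules.json.
const jsonRules: Ruleset = {
  'www.youtube.com': ['.ytp-caption-segment'],
}

// Rules that depend on JS constants and therefore cannot live in JSON.
const runtimeRules: Ruleset = {
  'www.youtube.com': [
    `.${SUBTITLES_VIEW_CLASS}`,
    `.${STATE_MESSAGE_CLASS}`,
    `.${TRANSLATE_BUTTON_CLASS}`,
  ],
}

/** Merge two rulesets key by key, de-duplicating selectors (hypothetical helper). */
function mergeRulesets(base: Ruleset, extra: Ruleset): Ruleset {
  const merged: Ruleset = { ...base }
  for (const [pattern, selectors] of Object.entries(extra)) {
    merged[pattern] = Array.from(new Set([...(merged[pattern] ?? []), ...selectors]))
  }
  return merged
}

const dontWalkIntoSelectors = mergeRulesets(jsonRules, runtimeRules)
```

With this shape the JSON file stays purely static, and only the handful of selectors that depend on code constants remain in TypeScript.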

2. Performance: Pattern matching on every element

findMatchingSelectors() is called for every element check via isCustomDontWalkIntoElement() and isCustomForceBlockTranslation(). Each call:

  • Creates a new URL object
  • Iterates through all 300+ patterns when no exact match

Suggestion: Cache the matched selectors per URL at the start of page translation rather than re-computing per element.
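
A minimal sketch of the caching idea, assuming the lookup result depends only on the rule type and the URL. `findMatchingSelectorsUncached` is a hypothetical stand-in for the real lookup, and the counter exists only to show how often it runs.

```typescript
type RuleName = 'dontWalkIntoSelectors' | 'forceBlockTranslationSelectors'

// Hypothetical stand-in for the expensive pattern-matching lookup.
let lookups = 0
function findMatchingSelectorsUncached(ruleName: RuleName, url: string): string[] {
  lookups++
  return new URL(url).hostname === 'www.youtube.com' && ruleName === 'dontWalkIntoSelectors'
    ? ['.ytp-caption-segment']
    : []
}

const cache = new Map<string, string[]>()

/** Resolve selectors once per (rule, URL) pair and reuse the result afterwards. */
function findMatchingSelectorsCached(ruleName: RuleName, url: string): string[] {
  const key = `${ruleName}\n${url}`
  let selectors = cache.get(key)
  if (selectors === undefined) {
    selectors = findMatchingSelectorsUncached(ruleName, url)
    cache.set(key, selectors)
  }
  return selectors
}

// Simulate a DOM walk that checks many elements on the same page.
for (let i = 0; i < 1000; i++)
  findMatchingSelectorsCached('dontWalkIntoSelectors', 'https://www.youtube.com/watch?v=abc')
```

Because the URL is part of the cache key, a single-page-app navigation to a new URL triggers a fresh lookup instead of reusing stale selectors.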

3. Massive duplication in JSON

Twitter/X rules are duplicated 10+ times for different subdomains. Same for Stack Exchange sites.

Suggestion: Consider using wildcard patterns more aggressively:

"*.twitter.com": [...],
"*.x.com": [...],

Or introduce rule inheritance/aliases to reduce the 2200-line JSON file.

4. Missing tests for new pattern matching logic

globToRegex() and matchUrlPattern() are new untested functions with edge cases:

  • Protocol handling (https:// vs http:// vs none)
  • Wildcard at different positions (*.domain.com, domain.com/*, **)
  • URL with query strings and fragments

The existing tests in custom-dont-walk.test.ts and custom-force-block.test.ts use setHost() which only tests hostname matching, not the new URL pattern features.

5. Edge case bug in matchUrlPattern

if (!pattern.startsWith('http://') && !pattern.startsWith('https://')) {
  return regex.test(url.replace(/^https?:\/\//, ''))
}

If pattern doesn't start with protocol but URL does, it strips protocol from URL. But if pattern contains path like github.com/user/*, the generated regex expects the pattern format, not the stripped URL format. This could cause false negatives.


Minor Issues

  1. Comment inconsistency (line 2340-2341):
if (!domRules)
  // Not initialized yet
  return []

Comment on wrong line - should be above or inline.

  2. Empty selector handling: If dontWalkIntoElementSelectorList is empty, joining produces "" which when passed to element.matches("") throws an error. The check if (!dontWalkSelector) handles this, but worth noting.

  3. Some JSON patterns use full URLs with https://:

"https://www.reddit.com/r/*/comments/*/*": [...]

While others use just hostnames. This inconsistency works due to the pattern matching but reduces clarity.


What's Good

  • Clean interface design (DomRulesConfig)
  • Proper TypeScript typing with the JSON module
  • Sensible fallback behavior (empty array when no rules match)
  • Good changeset documentation

Recommendation

Worth merging with fixes:

  1. Add YouTube constants to JSON (or implement runtime merge)
  2. Add caching for pattern matching results
  3. Add unit tests for globToRegex and matchUrlPattern

The core value (300+ rules) is solid. The implementation needs minor hardening before merge.

@ananaBMaster
Collaborator

can I ask where are these rules originally from?

@cubewhy
Author

cubewhy commented Feb 3, 2026

can I ask where are these rules originally from?

Sorry for the late reply

@TianmuTNT reverse-engineered some rules from the "Immersive Translate" plugin and wrote a script to convert them to the read-frog format.

@claude

claude bot commented Feb 4, 2026

Code Review: PR #931 - feat(dom-rules): Add 300+ dom rules

Summary

This PR adds 360+ website-specific DOM exclusion rules from @TianmuTNT's fork and refactors the rule loading system to use a JSON configuration file with wildcard URL pattern matching.


🟢 Good taste overall

The approach of externalizing rules to JSON is sound - it makes rules more maintainable and easier to contribute to without modifying TypeScript code.


✅ What Works Well

  1. JSON-based configuration: Moving rules to a JSON file is a good separation of concerns
  2. Wildcard pattern matching: The * and ** glob support is useful for matching subdomains and paths
  3. Protocol-agnostic matching: Stripping protocols for comparison is sensible
  4. Backward compatibility: Existing rules appear preserved, filter.ts changes are minimal

🔴 Issues to Address

1. Critical: Massive selector array duplication

The Twitter/X rules are duplicated across 10+ domains with identical selectors:

  • twitter.com
  • mobile.twitter.com
  • tweetdeck.twitter.com
  • pro.twitter.com
  • platform.twitter.com/embed*
  • x.com
  • mobile.x.com
  • tweetdeck.x.com
  • pro.x.com
  • platform.x.com/embed*

Recommendation: Use the wildcard support to consolidate:

"*.twitter.com": [...],
"*.x.com": [...],
"twitter.com": [...],
"x.com": [...]

Same issue with Stack Exchange sites (stackoverflow, superuser, askubuntu, serverfault, *.stackexchange.com), Quora domains, and Telegram domains.

DRY violation - ~500+ lines could be reduced to ~100.

2. Performance concern: Linear pattern matching

findMatchingSelectors iterates through all patterns on every call:

for (const [pattern, selectors] of Object.entries(ruleset || {})) {
  if (matchUrlPattern(url, pattern) || matchUrlPattern(hostname, pattern)) {
    return selectors
  }
}

With 300+ patterns, this could be slow. The function correctly checks exact hostname match first (fast path), but consider:

  • Caching the matched selectors per URL/hostname
  • Or building a Map at initialization time for non-wildcard patterns
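
The second option could be sketched roughly as follows: split the ruleset once into an exact-match Map and a list of precompiled wildcard regexes. `RuleIndex`, `buildIndex`, `lookup`, and the example selectors are hypothetical names for illustration, and `compileGlob` is a simplified compiler that only handles a single-segment `*`.

```typescript
type Ruleset = Record<string, string[]>

interface RuleIndex {
  exact: Map<string, string[]>
  wildcard: Array<{ regex: RegExp, selectors: string[] }>
}

/** Simplified glob compiler for this sketch: `*` stays within one path segment; `**` is not handled. */
function compileGlob(pattern: string): RegExp {
  const source = pattern
    .split('*')
    .map(part => part.replace(/[.+?^${}()|[\]\\]/g, '\\$&'))
    .join('[^/]*')
  return new RegExp(`^${source}$`, 'i')
}

/** Build the index once, e.g. at module load. */
function buildIndex(ruleset: Ruleset): RuleIndex {
  const index: RuleIndex = { exact: new Map(), wildcard: [] }
  for (const [pattern, selectors] of Object.entries(ruleset)) {
    if (pattern.includes('*'))
      index.wildcard.push({ regex: compileGlob(pattern), selectors })
    else
      index.exact.set(pattern, selectors)
  }
  return index
}

/** O(1) for exact hostnames; only wildcard patterns are scanned linearly. */
function lookup(index: RuleIndex, hostname: string): string[] {
  const exact = index.exact.get(hostname)
  if (exact)
    return exact
  for (const { regex, selectors } of index.wildcard) {
    if (regex.test(hostname))
      return selectors
  }
  return []
}

const index = buildIndex({
  'stackoverflow.com': ['.example-sidebar'],
  '*.stackexchange.com': ['.example-sidebar'],
})
```

The linear scan then only covers the wildcard entries, and no regex is compiled during the DOM walk.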

3. Fragile CSS class selectors

Many selectors use generated class names that may change:

"[class='css-901oao r-14j79pv r-37j5jr r-n6v787 r-16dba41 r-1cwl3u0 r-bcqeeo r-qvutc0']"
"[class='css-175oi2r r-1habvwh r-vqp9x9 r-1q9bdsx r-1loqt21 r-9njtsq r-1wtj0ep r-nsbfu8 r-xbdcod r-13c7hvr']"

These are brittle and will break when sites update their CSS bundler config. Consider using data-testid or semantic attributes where available.

4. Missing tests for new functionality

The matchUrlPattern and globToRegex functions lack unit tests. The existing test (custom-dont-walk.test.ts) tests chatgpt.com which exists in the old hardcoded rules but I don't see it in the new JSON.

Important: Add tests for:

  • Wildcard * matching (single segment)
  • Wildcard ** matching (multi-segment)
  • Protocol stripping behavior
  • Edge cases (empty patterns, malformed URLs)

🟡 Suggestions for Improvement

  1. Add schema validation: Consider adding a JSON schema or Zod validation for the rules file to catch typos/invalid selectors early

  2. Document the pattern format: Add a comment or README explaining the URL matching syntax (*, **, protocol handling)

  3. Consider lazy loading: The 2200-line JSON file is imported synchronously. For extension startup time, consider lazy loading if this grows further

  4. Inconsistent URL patterns: Some entries use full URLs with protocol (https://www.reddit.com/), others just hostnames (stackoverflow.com). Standardize the format

  5. Missing YouTube subtitles exclusions: The old code had these constants:

`${YOUTUBE_NATIVE_SUBTITLES_CLASS}`,
`.${SUBTITLES_VIEW_CLASS}`,
`.${STATE_MESSAGE_CLASS}`,
`.${TRANSLATE_BUTTON_CLASS}`,

These dynamic class references seem lost in the migration. Verify these are handled elsewhere.
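
For the schema-validation suggestion, a dependency-free shape check might look like the sketch below. It is only an assumption about what "validation" should cover: it verifies the structure of the config, not whether each selector is valid CSS (that would need a DOM or a CSS parser). `validateDomRules` and `isRuleset` are hypothetical helpers.

```typescript
interface DomRulesConfig {
  dontWalkIntoSelectors?: Record<string, string[]>
  forceBlockTranslationSelectors?: Record<string, string[]>
}

/** True if the value maps strings to arrays of non-empty strings. */
function isRuleset(value: unknown): value is Record<string, string[]> {
  if (typeof value !== 'object' || value === null || Array.isArray(value))
    return false
  return Object.values(value).every(
    selectors => Array.isArray(selectors)
      && selectors.every(s => typeof s === 'string' && s.trim() !== ''),
  )
}

/** Return a list of human-readable problems; an empty list means the shape looks valid. */
function validateDomRules(config: unknown): string[] {
  if (typeof config !== 'object' || config === null)
    return ['config must be an object']
  const errors: string[] = []
  const known = ['dontWalkIntoSelectors', 'forceBlockTranslationSelectors']
  for (const [key, value] of Object.entries(config)) {
    if (!known.includes(key))
      errors.push(`unknown rule type: ${key}`)
    else if (!isRuleset(value))
      errors.push(`${key} must map URL patterns to arrays of non-empty strings`)
  }
  return errors
}

const okConfig: DomRulesConfig = { dontWalkIntoSelectors: { 'www.youtube.com': ['.ytp-caption-segment'] } }
const problems = validateDomRules(okConfig)
```

A check like this could run in a unit test so that a typo in the JSON fails CI rather than silently disabling a rule.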


Test Coverage

  • New matchUrlPattern function needs unit tests
  • New globToRegex function needs unit tests
  • Wildcard matching edge cases need coverage

Security

No security concerns identified. The rules are static JSON data with CSS selectors.


Final Verdict

Approve with changes requested. The core refactoring is good, but:

  1. The massive duplication should be fixed before merge
  2. Unit tests for the new URL matching logic are important
  3. Verify the YouTube subtitle class constants are handled correctly

The bulk of the work (collecting 300+ rules) is valuable for the community. Just needs some polish before merging.

Owner

@mengxi-ream mengxi-ream left a comment

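The biggest problem is that the rules seem to have a bug. My guess is that some rules were meant to be "translate only" but ended up as "don't translate". For example, this is the result on GitHub:

Image

And this is Immersive Translate:

Image

What gets translated and what doesn't is reversed between the two. So something probably went wrong when the rules were extracted from Immersive Translate.

Could you also share the script (or whatever was used) to extract the rules so we can take a look?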

- News and media sites (Reuters, CNBC, New York Times)
- And many more

Also implemented wildcard pattern matching (`*`, `**`) and JSON-based configuration to support flexible URL patterns.
Owner

make this changeset shorter


export function isCustomDontWalkIntoElement(element: HTMLElement): boolean {
-  const dontWalkIntoElementSelectorList = CUSTOM_DONT_WALK_INTO_ELEMENT_SELECTOR_MAP[window.location.hostname] ?? []
+  const dontWalkIntoElementSelectorList = findMatchingSelectors('dontWalkIntoSelectors', window.location.href)
Owner

findMatchingSelectors is not O(1), so this will cause serious performance problems.


export function isCustomForceBlockTranslation(element: HTMLElement): boolean {
-  const forceBlockSelectorList = CUSTOM_FORCE_BLOCK_TRANSLATION_SELECTOR_MAP[window.location.hostname] ?? []
+  const forceBlockSelectorList = findMatchingSelectors('forceBlockTranslationSelectors', window.location.href)
Owner

findMatchingSelectors is not O(1), so this will cause serious performance problems. The result should be passed in as a parameter.
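
One possible shape for passing the result in as a parameter, as a sketch only. `PageRules`, `resolvePageRules`, and `MatchableElement` are hypothetical names; the structural element type is there so the example runs outside a browser, and a real HTMLElement would satisfy it.

```typescript
// Minimal structural type; a real HTMLElement satisfies it.
interface MatchableElement {
  matches: (selector: string) => boolean
}

interface PageRules {
  dontWalkSelector: string
  forceBlockSelector: string
}

// Hypothetical resolver: in the PR, the two arrays would come from calling
// findMatchingSelectors once per rule type at the start of a page translation.
function resolvePageRules(dontWalk: string[], forceBlock: string[]): PageRules {
  return { dontWalkSelector: dontWalk.join(','), forceBlockSelector: forceBlock.join(',') }
}

function isCustomDontWalkIntoElement(element: MatchableElement, rules: PageRules): boolean {
  // An empty selector string would make element.matches throw, so guard first.
  return rules.dontWalkSelector !== '' && element.matches(rules.dontWalkSelector)
}

function isCustomForceBlockTranslation(element: MatchableElement, rules: PageRules): boolean {
  return rules.forceBlockSelector !== '' && element.matches(rules.forceBlockSelector)
}

// Usage: resolve once, then reuse the same object for every element in the walk.
const rules = resolvePageRules(['.ytp-caption-segment'], [])
const fakeCaption: MatchableElement = {
  matches: selector => selector.includes('.ytp-caption-segment'),
}
const isSkipped = isCustomDontWalkIntoElement(fakeCaption, rules)
const isForcedBlock = isCustomForceBlockTranslation(fakeCaption, rules)
```

The per-element checks then do no URL parsing or pattern matching at all; they only call `matches` on a precomputed selector string.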

- News and media sites (Reuters, CNBC, New York Times)
- And many more

Also implemented wildcard pattern matching (`*`, `**`) and JSON-based configuration to support flexible URL patterns.
Owner

Keep the changeset as short as possible; don't write a lot, just one or two sentences.

@TianmuTNT

TianmuTNT commented Feb 6, 2026

rules.json
@mengxi-ream

@cubewhy
Author

cubewhy commented Feb 6, 2026

wrote a script

In fact, these were not generated by a script

from @TianmuTNT

image

@ananaBMaster
Collaborator

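Could you grant the upstream repository's maintainers push access?

git push
remote: Permission to cubewhy/read-frog.git denied to mengxi-ream.
fatal: unable to access 'https://github.com/cubewhy/read-frog.git/': The requested URL returned error: 403

About your dom-rule.ts: the implementation is overly complicated, with many unnecessary if/else branches. There is too much code and it is not concise enough. After I finished revising it, I found that I could not push.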

import domRulesModule from '@/assets/dom-rules.json'

export interface DomRulesConfig {
  dontWalkIntoSelectors?: Record<string, string[]>
  forceBlockTranslationSelectors?: Record<string, string[]>
}

const domRules: DomRulesConfig = domRulesModule as DomRulesConfig

export const FORCE_BLOCK_TAGS = new Set([
  'BODY',
  'H1',
  'H2',
  'H3',
  'H4',
  'H5',
  'H6',
  'BR',
  'FORM',
  'SELECT',
  'BUTTON',
  'LABEL',
  'UL',
  'OL',
  'LI',
  'BLOCKQUOTE',
  'PRE',
  'ARTICLE',
  'SECTION',
  'FIGURE',
  'FIGCAPTION',
  'HEADER',
  'FOOTER',
  'MAIN',
  'NAV',
])

export const MATH_TAGS = new Set([
  'math',
  'maction',
  'annotation',
  'annotation-xml',
  'menclose',
  'merror',
  'mfenced',
  'mfrac',
  'mi',
  'mmultiscripts',
  'mn',
  'mo',
  'mover',
  'mpadded',
  'mphantom',
  'mprescripts',
  'mroot',
  'mrow',
  'ms',
  'mspace',
  'msqrt',
  'mstyle',
  'msub',
  'msubsup',
  'msup',
  'mtable',
  'mtd',
  'mtext',
  'mtr',
  'munder',
  'munderover',
  'semantics',
])

// Don't walk into these tags
export const DONT_WALK_AND_TRANSLATE_TAGS = new Set([
  'HEAD',
  'TITLE',
  'HR',
  'INPUT',
  'TEXTAREA',
  'IMG',
  'VIDEO',
  'AUDIO',
  'CANVAS',
  'SOURCE',
  'TRACK',
  'META',
  'SCRIPT',
  'NOSCRIPT',
  'STYLE',
  'LINK',
  'PRE',
  'svg',
  ...MATH_TAGS,
])

export const DONT_WALK_BUT_TRANSLATE_TAGS = new Set([
  'CODE',
  'TIME',
])

export const FORCE_INLINE_TRANSLATION_TAGS = new Set([
  'A',
  'BUTTON',
  'SELECT',
  'OPTION',
  'SPAN',
])

export const MAIN_CONTENT_IGNORE_TAGS = new Set(['HEADER', 'FOOTER', 'NAV', 'NOSCRIPT'])

/**
 * Convert glob pattern to RegExp for URL matching
 * Supports: * (single segment) and ** (any depth)
 */
function globToRegex(pattern: string): RegExp {
  let regexStr = pattern.replace(/[.+?^${}()|[\]\\]/g, '\\$&')

  regexStr = regexStr.replace(/\*\*/g, '§DBL§')
  regexStr = regexStr.replace(/\*/g, '[^/]*')
  regexStr = regexStr.replace(/§DBL§/g, '.*')

  return new RegExp(`^${regexStr}$`, 'i')
}

/** Protocol-agnostic URL matching with glob support */
export function matchUrlPattern(url: string, pattern: string): boolean {
  const cleanUrl = url.replace(/^https?:\/\//, '')
  const cleanPattern = pattern.replace(/^https?:\/\//, '')

  if (!pattern.includes('*'))
    return cleanUrl === cleanPattern

  return globToRegex(cleanPattern).test(cleanUrl)
}

export function findMatchingSelectors(
  ruleName: 'dontWalkIntoSelectors' | 'forceBlockTranslationSelectors',
  currentUrl?: string,
): string[] {
  const ruleset = domRules?.[ruleName]
  if (!ruleset)
    return []

  const url = currentUrl || window.location.href
  const hostname = new URL(url).hostname

  // Fast path: O(1) exact key lookup before O(n) glob matching
  if (ruleset[hostname])
    return ruleset[hostname]

  for (const [pattern, selectors] of Object.entries(ruleset)) {
    if (matchUrlPattern(url, pattern) || matchUrlPattern(hostname, pattern))
      return selectors
  }

  return []
}

@cubewhy
Author

cubewhy commented Feb 6, 2026

Could you grant the upstream repository's maintainers push access?

git push
remote: Permission to cubewhy/read-frog.git denied to mengxi-ream.
fatal: unable to access 'https://github.com/cubewhy/read-frog.git/': The requested URL returned error: 403

image

🤔

@cubewhy
Author

cubewhy commented Feb 6, 2026

image

I used a clumsy workaround; it should work now.

@mengxi-ream
Owner

image I used a clumsy workaround; it should work now.

Maybe you could just copy and paste @ananaBMaster's code from above, haha.

@mengxi-ream
Owner

mengxi-ream commented Feb 6, 2026

rules.json @mengxi-ream

Strange, the GitHub entry there really is:

  "additionalExcludeSelectors.add": [
            "[data-test-selector='commit-tease-commit-message']",
            "[data-test-selector='create-branch.developmentForm']",
            "div.Box-header.position-relative",

But I also see that they have:

        "selectors": [
            "h1",
            "[aria-label=Issues] .markdown-title",
            "[aria-labelledby=discussions-list] .markdown-title",
            "h3 .markdown-title",
            ".markdown-body",
            ".Layout-sidebar p",
            "div > span.search-match",

Is there some mechanism that combines the two?

Judging by the current result, the GitHub example does not look quite right.

@cubewhy
Copy link
Author

cubewhy commented Feb 6, 2026

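Is there some mechanism that combines the two?

I think the rules work like this:

Select every element matched by selectors, then exclude the elements matched by additionalExcludeSelectors.

This PR currently applies only the additionalExcludeSelectors exclusions, which causes some things to be translated incorrectly.

I haven't read Immersive Translate's code; this is only a guess.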

@mengxi-ream
Owner

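Is there some mechanism that combines the two?

I think the rules work like this:

Select every element matched by selectors, then exclude the elements matched by additionalExcludeSelectors.

This PR currently applies only the additionalExcludeSelectors exclusions, which causes some things to be translated incorrectly.

I haven't read Immersive Translate's code; this is only a guess.

I think so too: they have some mechanism where the rule sets interact, and one rule may override another.

So before this PR can be merged, that mechanism still needs to be implemented.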
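
The include-then-exclude behavior guessed at in this exchange could be sketched as below. This models only the thread's hypothesis; nobody here has confirmed that Immersive Translate actually works this way. `SiteRule`, `shouldTranslate`, and the fake elements are hypothetical, and the selectors are taken from the rules.json excerpt quoted earlier.

```typescript
// Minimal structural type so the sketch runs outside a browser.
interface MatchableElement {
  matches: (selector: string) => boolean
}

// Field meanings are the thread's guess, not a documented format.
interface SiteRule {
  selectors: string[] // elements to translate
  excludeSelectors: string[] // entries from "additionalExcludeSelectors.add"
}

function matchesAny(element: MatchableElement, selectors: string[]): boolean {
  return selectors.some(selector => element.matches(selector))
}

/** Translate an element only if it is included and not explicitly excluded. */
function shouldTranslate(element: MatchableElement, rule: SiteRule): boolean {
  return matchesAny(element, rule.selectors) && !matchesAny(element, rule.excludeSelectors)
}

const githubRule: SiteRule = {
  selectors: ['h1', '.markdown-body'],
  excludeSelectors: ['div.Box-header.position-relative'],
}

// Fake elements that "match" a fixed set of selectors, standing in for real DOM nodes.
const fakeElement = (matching: string[]): MatchableElement => ({
  matches: selector => matching.includes(selector),
})

const readme = fakeElement(['.markdown-body'])
const boxHeader = fakeElement(['div.Box-header.position-relative'])
const navLink = fakeElement(['a.nav-link'])
const results = [readme, boxHeader, navLink].map(el => shouldTranslate(el, githubRule))
```

Under this model, an element that matches neither list (like `navLink`) is not translated, which differs from applying the exclude list alone, where everything not excluded gets translated. A real implementation would probably also need to consider ancestors of the element, which this sketch ignores.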

Co-authored-by: MengXi <contact@mengxi.work>
@cubewhy
Author

cubewhy commented Feb 6, 2026

BTW, it looks like eslint doesn't respect .gitignore.
I'm on NixOS and have devenv files; eslint goes in and lints them, which blocks commits.

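import path from 'node:path'
import { fileURLToPath } from 'node:url'
import antfu from '@antfu/eslint-config'
import { includeIgnoreFile } from '@eslint/compat'

const __filename = fileURLToPath(import.meta.url)
const __dirname = path.dirname(__filename)
const gitignorePath = path.resolve(__dirname, '.gitignore')
// Assumed location: the original snippet used gitExcludePath without defining it.
const gitExcludePath = path.resolve(__dirname, '.git/info/exclude')

export default antfu({
    /*...*/
}).append(includeIgnoreFile(gitignorePath, 'Imported .gitignore patterns')).append(includeIgnoreFile(gitExcludePath, 'Imported .git/info/exclude patterns'))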

@cubewhy
Copy link
Author

cubewhy commented Feb 6, 2026

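Could you grant the upstream repository's maintainers push access?

git push
remote: Permission to cubewhy/read-frog.git denied to mengxi-ream.
fatal: unable to access 'https://github.com/cubewhy/read-frog.git/': The requested URL returned error: 403

About your dom-rule.ts: the implementation is overly complicated, with many unnecessary if/else branches. There is too much code and it is not concise enough. After I finished revising it, I found that I could not push.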

The code currently in this repository was vibe-coded by @TianmuTNT.
I think a faster approach would be to match URLs with a trie.
A for loop is just too slow.

@mengxi-ream
Owner

BTW, it looks like eslint doesn't respect .gitignore. I'm on NixOS and have devenv files; eslint goes in and lints them, which blocks commits.

import { includeIgnoreFile } from '@eslint/compat'

const __filename = fileURLToPath(import.meta.url)
const __dirname = path.dirname(__filename)
const gitignorePath = path.resolve(__dirname, '.gitignore')

export default antfu({
    /*...*/
}).append(includeIgnoreFile(gitignorePath, 'Imported .gitignore patterns')).append(includeIgnoreFile(gitExcludePath, 'Imported .git/info/exclude patterns'))

Having eslint ignore those files should be enough.

@mengxi-ream
Owner

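Could you grant the upstream repository's maintainers push access?

git push
remote: Permission to cubewhy/read-frog.git denied to mengxi-ream.
fatal: unable to access 'https://github.com/cubewhy/read-frog.git/': The requested URL returned error: 403

About your dom-rule.ts: the implementation is overly complicated, with many unnecessary if/else branches. There is too much code and it is not concise enough. After I finished revising it, I found that I could not push.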

The code currently in this repository was vibe-coded by @TianmuTNT. I think a faster approach would be to match URLs with a trie. A for loop is just too slow.

One option is to use a trie; the other is to run the match only once and pass the result down as a parameter.
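
For reference, a hostname trie could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's implementation: `HostnameTrie` is a hypothetical class, it covers hostname patterns only (path patterns such as platform.x.com/embed* would still need a separate check), and it treats a leading `*` label as "one or more subdomain labels".

```typescript
interface TrieNode {
  children: Map<string, TrieNode>
  selectors?: string[]
}

class HostnameTrie {
  private root: TrieNode = { children: new Map() }

  /** Insert a hostname pattern such as "x.com" or "*.x.com". */
  insert(pattern: string, selectors: string[]): void {
    let node = this.root
    // Walk labels right to left: "mobile.x.com" -> com, x, mobile.
    for (const label of pattern.toLowerCase().split('.').reverse()) {
      let next = node.children.get(label)
      if (!next) {
        next = { children: new Map() }
        node.children.set(label, next)
      }
      node = next
    }
    node.selectors = selectors
  }

  /** Exact match wins; otherwise the deepest "*" seen along the path applies. */
  lookup(hostname: string): string[] {
    let node = this.root
    let wildcard: string[] | undefined
    for (const label of hostname.toLowerCase().split('.').reverse()) {
      // A "*" child covers one or more remaining (left-hand) labels.
      const star = node.children.get('*')
      if (star?.selectors)
        wildcard = star.selectors
      const next = node.children.get(label)
      if (!next)
        return wildcard ?? []
      node = next
    }
    return node.selectors ?? wildcard ?? []
  }
}

const trie = new HostnameTrie()
trie.insert('x.com', ['.example-exact'])
trie.insert('*.x.com', ['.example-subdomain'])
```

Lookup cost is proportional to the number of labels in the hostname rather than the number of rules. Combined with running the lookup once per page, the trie mainly matters if the rule list keeps growing.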

@claude

claude bot commented Feb 8, 2026

Code Review: feat(dom-rules): Add 300+ dom rules

Overall Assessment

The feature concept is valuable -- moving from a small hardcoded map to a comprehensive JSON-based rule system with glob pattern matching is the right direction. However, there are several significant issues that should be addressed before merging.

Critical Issues

  1. Lost YouTube selectors that reference runtime constants:
    The original dom-rules.ts had YouTube selectors that use imported constants:

    YOUTUBE_NATIVE_SUBTITLES_CLASS
    SUBTITLES_VIEW_CLASS
    STATE_MESSAGE_CLASS
    TRANSLATE_BUTTON_CLASS

    These are dynamic class names imported from ./subtitles. The new JSON-based approach cannot represent these since JSON is static. The YouTube entry in dom-rules.json does not include equivalents for these selectors. This is a regression that could cause the extension's own UI elements (subtitles overlay, translate buttons) to be incorrectly walked into and translated.

  2. Lost existing selectors for www.reddit.com and www.youtube.com:
    The original code had more specific selectors (e.g., Reddit's faceplate-screen-reader-content > *, reddit-header-large *; YouTube's #masthead-container *, #guide-inner-content *, #metadata *, etc.) that are not present in the JSON file. The JSON has different Reddit/YouTube selectors from the fork, but the originals were lost. These need to be merged, not replaced.

  3. Lost github.com selectors:
    Original had [aria-labelledby="folders-and-files"] *, header *, #repository-container-header *, [class*="OverviewContent-module__Box_1--"] *. The JSON has different GitHub selectors. Again, merge needed rather than replacement.

Performance Concerns

  1. findMatchingSelectors is called on every DOM element check:
    isCustomDontWalkIntoElement and isCustomForceBlockTranslation are called during DOM tree walking, potentially thousands of times per page. Each call to findMatchingSelectors iterates over all URL patterns in the JSON (300+ entries) with regex compilation via globToRegex. This regex is recompiled on every call.

    Recommendation: Cache the compiled regexes (build them once at module load or first use) and cache the findMatchingSelectors result per URL (memoize). The URL doesn't change during a page session, so this result should be computed once.

  2. Massive JSON duplication inflates bundle size:
    The dom-rules.json file is ~2200 lines with enormous duplication. For example, twitter.com, mobile.twitter.com, tweetdeck.twitter.com, pro.twitter.com, platform.twitter.com/embed*, x.com, mobile.x.com, tweetdeck.x.com, pro.x.com, platform.x.com/embed* all have the exact same 17 selectors copy-pasted 10 times. Similarly, Stack Exchange sites (stackoverflow.com, *.stackexchange.com, superuser.com, askubuntu.com, serverfault.com) duplicate the same selectors 5 times despite *.stackexchange.com already covering most of them.

    The wildcard matching already supports patterns like *.twitter.com or *.x.com which would cover all subdomains. This JSON could be reduced by ~30-40% by deduplicating.

Code Quality

  1. globToRegex sentinel placeholder is fragile:
    Using a special string as a temporary placeholder works but is fragile. If any URL pattern ever contains this literal string, matching would break. A safer approach would be to process ** before escaping, or use a different transformation strategy.

  2. findMatchingSelectors returns only the first match:
    The current logic returns early on the first matching pattern. If a URL matches multiple patterns (e.g., both *.google.com and a more specific www.google.*/search*), only one set of selectors is returned. This could be intentional, but the order depends on Object.entries() iteration order which follows JSON key order. This behavior should be documented or selectors should be merged across all matching patterns.

  3. Missing arxiv.org selectors:
    The original had arxiv.org: ['.ltx_listing'] which is not in the JSON. The JSON has browse.arxiv.org and arxiv.org/html/* with different selectors, but the base arxiv.org pattern for .ltx_listing is lost.
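
For the first two code-quality points, here is one possible sketch: a glob converter that splits on the wildcard tokens instead of using a placeholder string, plus a lookup that merges selectors from every matching pattern rather than returning the first hit. `findAllMatchingSelectors` and the example selectors are hypothetical; whether merging is the desired behavior is a design decision the PR would need to make.

```typescript
type Ruleset = Record<string, string[]>

/** Convert a glob to a RegExp by splitting on wildcard tokens, so no sentinel is needed. */
function globToRegex(pattern: string): RegExp {
  const source = pattern
    // The capture group keeps the wildcard tokens in the result; "**" is tried before "*".
    .split(/(\*\*|\*)/)
    .map((token) => {
      if (token === '**')
        return '.*' // any depth
      if (token === '*')
        return '[^/]*' // single segment
      return token.replace(/[.+?^${}()|[\]\\]/g, '\\$&') // literal text
    })
    .join('')
  return new RegExp(`^${source}$`, 'i')
}

/** Protocol-agnostic matching, mirroring matchUrlPattern from earlier in the thread. */
function matchUrlPattern(url: string, pattern: string): boolean {
  const stripProtocol = (value: string) => value.replace(/^https?:\/\//, '')
  return globToRegex(stripProtocol(pattern)).test(stripProtocol(url))
}

/** Merge selectors from all matching patterns, de-duplicated in first-seen order. */
function findAllMatchingSelectors(ruleset: Ruleset, url: string): string[] {
  const hostname = new URL(url).hostname // throws on a malformed URL
  const merged = new Set<string>()
  for (const [pattern, selectors] of Object.entries(ruleset)) {
    if (matchUrlPattern(url, pattern) || matchUrlPattern(hostname, pattern)) {
      for (const selector of selectors) merged.add(selector)
    }
  }
  return Array.from(merged)
}

const merged = findAllMatchingSelectors(
  {
    '*.google.com': ['.example-a', '.example-b'],
    'www.google.com/search*': ['.example-b', '.example-c'],
  },
  'https://www.google.com/search?q=frog',
)
```

Merging removes the dependence on JSON key order, at the cost of always scanning every pattern, so it pairs naturally with computing the result once per page.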

Test Coverage

  • No unit tests for globToRegex or matchUrlPattern functions
  • No unit tests for findMatchingSelectors
  • The PR CI is missing the "Test & Build" check -- this should pass before merge
  • These are pure functions that are easy to test and critical for correctness

Security

  • No security concerns. The JSON is static data bundled with the extension.

Suggestions

  1. Must fix: Merge existing selectors (especially YouTube dynamic class selectors, Reddit, GitHub) with the new JSON rules rather than replacing them. The YouTube dynamic selectors may need to remain in TypeScript code.
  2. Must fix: Add memoization/caching to findMatchingSelectors -- the URL doesn't change mid-page, so compute once and cache.
  3. Should fix: Deduplicate the JSON by leveraging wildcards (e.g., *.twitter.com instead of 5 separate twitter subdomains).
  4. Should fix: Add unit tests for globToRegex, matchUrlPattern, and findMatchingSelectors.
  5. Nice to have: Pre-compile glob regexes at module load time.

Labels

app: website Related to website app feat size:XXL This PR changes 1000+ lines, ignoring generated files.
