feat(agent): add web_fetch local tool with tests#2799
feat(agent): add web_fetch local tool with tests#2799wwwillchen-bot wants to merge 9 commits intomainfrom
Conversation
|
@BugBot run |
Summary of ChangesHello @wwwillchen-bot, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant enhancement to the agent's capabilities by adding a versatile web fetching tool. This allows the agent to programmatically retrieve and process web content, expanding its ability to gather information from external sources. The change also includes necessary refactoring to maintain code quality and ensure the new functionality is well-tested and integrated. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
|
@BugBot run |
Greptile SummaryThis PR adds a new Key Changes:
Security Assessment:
Confidence Score: 4/5
Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
Start([web_fetch called]) --> Validate[validateHttpUrl: Check protocol & private IP]
Validate -->|Invalid| Error1[Throw error]
Validate -->|Valid| DNS[resolveAndValidateHost: DNS lookup]
DNS -->|Resolves to private IP| Error2[Throw error]
DNS -->|Public IP| Fetch[Fetch with timeout & redirect:manual]
Fetch --> CheckStatus{Status code?}
CheckStatus -->|301-308 redirect| CheckRedirCount{Redirect count < 5?}
CheckRedirCount -->|No| Error3[Throw: Too many redirects]
CheckRedirCount -->|Yes| ValidateRedirect[Validate redirect URL & DNS]
ValidateRedirect -->|Private IP| Error4[Throw error]
ValidateRedirect -->|Valid| Fetch
CheckStatus -->|Non-2xx| Error5[Throw: Request failed]
CheckStatus -->|2xx| CheckSize{Content-Length > 5MB?}
CheckSize -->|Yes| Error6[Throw: Response too large]
CheckSize -->|No| Stream[Stream body with size limit]
Stream -->|Exceeds 5MB| Error7[Cancel & throw error]
Stream -->|Success| CheckMime{Binary content?}
CheckMime -->|Image/Binary| ReturnBinary[Return binary summary]
CheckMime -->|Text-like| Convert[Convert HTML if needed]
Convert --> Truncate[Truncate to 60K chars]
Truncate --> Success([Return content])
Last reviewed commit: 6b234fb |
- Move & entity replacement to last position in decodeHtmlEntities to prevent double-unescaping (e.g. &lt; → < → <) - Add \s* before > in closing tag regexes in removeNonContentTags to match tags with whitespace like </script > Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
@BugBot run |
✅ Claude Code completed successfullySummary
DetailsCodeQL Fix 1: Double HTML entity unescaping
CodeQL Fix 2: Bad HTML filtering regexp
|
|
|
||
| function removeNonContentTags(html: string): string { | ||
| return html | ||
| .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, " ") |
Check failure
Code scanning / CodeQL
Bad HTML filtering regexp High
Show autofix suggestion
Hide autofix suggestion
Copilot Autofix
AI about 2 months ago
In general, the problem is that the regexes in removeNonContentTags assume that closing tags like </script> have only optional whitespace before the >. Browsers accept </script ...> and similar forms as end tags, so content between <script> and such a malformed closing tag will not be removed. To fix this, the end-tag patterns should be relaxed to allow any characters except > between the tag name and the closing >, not just whitespace. We should apply the same fix consistently to all the non-content tag removals using the same pattern shape.
The best targeted fix is to change each closing-tag fragment from </tag\s*> to </tag[^>]*>. This preserves the current behavior (non-greedy match between start and end, [\s\S]*?, and case-insensitivity) while broadening the recognized end-tag forms to include attributes, tabs, newlines, and other characters up to the >. Concretely, in src/pro/main/ipc/handlers/local_agent/tools/web_fetch.ts inside removeNonContentTags, update lines 211–216:
- Change
/<script\b[^>]*>[\s\S]*?<\/script\s*>/gito/<script\b[^>]*>[\s\S]*?<\/script[^>]*>/gi - Change
/<style\b[^>]*>[\s\S]*?<\/style\s*>/gito/<style\b[^>]*>[\s\S]*?<\/style[^>]*>/gi - Change
/<noscript\b[^>]*>[\s\S]*?<\/noscript\s*>/gito/<noscript\b[^>]*>[\s\S]*?<\/noscript[^>]*>/gi - Change
/<iframe\b[^>]*>[\s\S]*?<\/iframe\s*>/gito/<iframe\b[^>]*>[\s\S]*?<\/iframe[^>]*>/gi - Change
/<object\b[^>]*>[\s\S]*?<\/object\s*>/gito/<object\b[^>]*>[\s\S]*?<\/object[^>]*>/gi - Change
/<embed\b[^>]*>[\s\S]*?<\/embed\s*>/gito/<embed\b[^>]*>[\s\S]*?<\/embed[^>]*>/gi
No new imports or dependencies are needed; we are only adjusting the regex literals already in use.
| @@ -208,12 +208,12 @@ | ||
| // Handles common tags; approximate conversion is acceptable for this tool's use case. | ||
| function removeNonContentTags(html: string): string { | ||
| return html | ||
| .replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, " ") | ||
| .replace(/<style\b[^>]*>[\s\S]*?<\/style\s*>/gi, " ") | ||
| .replace(/<noscript\b[^>]*>[\s\S]*?<\/noscript\s*>/gi, " ") | ||
| .replace(/<iframe\b[^>]*>[\s\S]*?<\/iframe\s*>/gi, " ") | ||
| .replace(/<object\b[^>]*>[\s\S]*?<\/object\s*>/gi, " ") | ||
| .replace(/<embed\b[^>]*>[\s\S]*?<\/embed\s*>/gi, " "); | ||
| .replace(/<script\b[^>]*>[\s\S]*?<\/script[^>]*>/gi, " ") | ||
| .replace(/<style\b[^>]*>[\s\S]*?<\/style[^>]*>/gi, " ") | ||
| .replace(/<noscript\b[^>]*>[\s\S]*?<\/noscript[^>]*>/gi, " ") | ||
| .replace(/<iframe\b[^>]*>[\s\S]*?<\/iframe[^>]*>/gi, " ") | ||
| .replace(/<object\b[^>]*>[\s\S]*?<\/object[^>]*>/gi, " ") | ||
| .replace(/<embed\b[^>]*>[\s\S]*?<\/embed[^>]*>/gi, " "); | ||
| } | ||
|
|
||
| function extractTextFromHtml(html: string): string { |
|
@BugBot run |
✅ Claude Code completed successfullySummary
Review Comments Addressed
Unresolvable Threads (CodeQL Alerts)
Key Code Changes
E2E Snapshots:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e07088d1ba
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| response = await fetch(currentUrl, { | ||
| signal: abortController.signal, | ||
| redirect: "manual", |
There was a problem hiding this comment.
Bind fetch to validated address to prevent DNS rebinding
The SSRF guard validates hostnames with resolveAndValidateHost(), but the subsequent fetch(currentUrl, ...) still resolves DNS again from the hostname, so an attacker-controlled domain can return a public IP during validation and then rebind to a private/internal IP before the actual request. In that scenario the tool can still reach blocked networks despite the private-IP checks. Please pin requests to the validated address (or enforce a resolver/connection policy that reuses the checked IP) for each hop.
Useful? React with 👍 / 👎.
…omments Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
@BugBot run |
Cast vi.mocked(lookup) to any for the array-returning overload
used with { all: true } option, fixing TS2739 errors in CI.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
@BugBot run |
✅ Claude Code completed successfullySummary
DetailsCI Failure (macOS build): Fix: Changed Untrusted commenters skipped: |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6b234fb5ca
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| response = await fetch(currentUrl, { | ||
| signal: abortController.signal, | ||
| redirect: "manual", | ||
| headers: { |
There was a problem hiding this comment.
Pin outbound request to validated address
The SSRF guard can be bypassed via DNS rebinding because resolveAndValidateHost() validates one DNS answer, but fetch(currentUrl, ...) resolves the hostname again when opening the socket. For attacker-controlled domains that return a public IP during validation and a private/internal IP on the subsequent fetch resolution, this code will still issue the request to the internal target. To make the private-network block effective, the request needs to be bound to the validated IP (or use a custom lookup/connection policy that enforces the checked addresses).
Useful? React with 👍 / 👎.
| function isPrivateIp(hostname: string): boolean { | ||
| // IPv6 loopback | ||
| if (hostname === "[::1]" || hostname === "::1") return true; | ||
|
|
||
| // Strip brackets for IPv6 | ||
| const ip = hostname.replace(/^\[|\]$/g, ""); | ||
| const lowerIp = ip.toLowerCase(); | ||
|
|
||
| // Unspecified address | ||
| if (lowerIp === "::" || /^0(:0){7}$/.test(lowerIp)) return true; | ||
|
|
||
| // Unique local addresses (fc00::/7) | ||
| if (/^f[cd]/i.test(lowerIp)) return true; | ||
|
|
||
| // Link-local addresses (fe80::/10) | ||
| if (/^fe[89ab]/i.test(lowerIp)) return true; | ||
|
|
||
| // IPv4-mapped IPv6 in dotted form (::ffff:x.x.x.x) | ||
| const v4MappedMatch = lowerIp.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/); | ||
| if (v4MappedMatch) { | ||
| return isPrivateIpv4(v4MappedMatch[1]); | ||
| } | ||
|
|
||
| // IPv4-mapped IPv6 in hex form (e.g. ::ffff:7f00:1 = ::ffff:127.0.0.1) | ||
| const v4MappedHexMatch = lowerIp.match( | ||
| /^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/, | ||
| ); | ||
| if (v4MappedHexMatch) { | ||
| const high = parseInt(v4MappedHexMatch[1], 16); | ||
| const low = parseInt(v4MappedHexMatch[2], 16); | ||
| const reconstructed = `${(high >> 8) & 0xff}.${high & 0xff}.${(low >> 8) & 0xff}.${low & 0xff}`; | ||
| return isPrivateIpv4(reconstructed); | ||
| } | ||
|
|
||
| // IPv4 patterns | ||
| return isPrivateIpv4(ip); | ||
| } |
There was a problem hiding this comment.
🔴 IPv6 prefix checks in isPrivateIp falsely block legitimate hostnames
The isPrivateIp function applies IPv6 ULA and link-local prefix regex checks to raw hostnames without first verifying the input is actually an IPv6 address. This causes legitimate domains to be rejected as "private or internal network addresses".
Root Cause and Impact
The function is called with parsed.hostname in validateHttpUrl (web_fetch.ts:146), which is a regular domain name for most URLs. The IPv6 checks at lines 70 and 73 use regexes that match common hostname prefixes:
/^f[cd]/i(line 70, intended for fc00::/7 ULA addresses) matches any hostname starting with "fc" or "fd" — e.g.fcc.gov,fdic.gov,fcbarcelona.com,fca.com/^fe[89ab]/i(line 73, intended for fe80::/10 link-local addresses) matches any hostname starting with "fe" + 8/9/a/b — e.g.feather.io,features.example.com,feast.dev
Trace for https://fcc.gov/:
validateHttpUrl→parsed.hostname = "fcc.gov"isPrivateIp("fcc.gov")→ line 70:/^f[cd]/i.test("fcc.gov")→ true- Error thrown: "URL points to a private or internal network address"
The fix is to only apply IPv6 prefix checks when the input actually contains : (a character present in all IPv6 addresses but not in domain names).
Impact: Users cannot fetch content from any website whose domain starts with fc, fd, fe8, fe9, fea, or feb. This includes US government sites like fcc.gov and fdic.gov.
| function isPrivateIp(hostname: string): boolean { | |
| // IPv6 loopback | |
| if (hostname === "[::1]" || hostname === "::1") return true; | |
| // Strip brackets for IPv6 | |
| const ip = hostname.replace(/^\[|\]$/g, ""); | |
| const lowerIp = ip.toLowerCase(); | |
| // Unspecified address | |
| if (lowerIp === "::" || /^0(:0){7}$/.test(lowerIp)) return true; | |
| // Unique local addresses (fc00::/7) | |
| if (/^f[cd]/i.test(lowerIp)) return true; | |
| // Link-local addresses (fe80::/10) | |
| if (/^fe[89ab]/i.test(lowerIp)) return true; | |
| // IPv4-mapped IPv6 in dotted form (::ffff:x.x.x.x) | |
| const v4MappedMatch = lowerIp.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/); | |
| if (v4MappedMatch) { | |
| return isPrivateIpv4(v4MappedMatch[1]); | |
| } | |
| // IPv4-mapped IPv6 in hex form (e.g. ::ffff:7f00:1 = ::ffff:127.0.0.1) | |
| const v4MappedHexMatch = lowerIp.match( | |
| /^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/, | |
| ); | |
| if (v4MappedHexMatch) { | |
| const high = parseInt(v4MappedHexMatch[1], 16); | |
| const low = parseInt(v4MappedHexMatch[2], 16); | |
| const reconstructed = `${(high >> 8) & 0xff}.${high & 0xff}.${(low >> 8) & 0xff}.${low & 0xff}`; | |
| return isPrivateIpv4(reconstructed); | |
| } | |
| // IPv4 patterns | |
| return isPrivateIpv4(ip); | |
| } | |
| function isPrivateIp(hostname: string): boolean { | |
| // IPv6 loopback | |
| if (hostname === "[::1]" || hostname === "::1") return true; | |
| // Strip brackets for IPv6 | |
| const ip = hostname.replace(/^\[|\]$/g, ""); | |
| const lowerIp = ip.toLowerCase(); | |
| // Only apply IPv6-specific checks if the value looks like an IPv6 address | |
| if (lowerIp.includes(":")) { | |
| // Unspecified address | |
| if (lowerIp === "::" || /^0(:0){7}$/.test(lowerIp)) return true; | |
| // Unique local addresses (fc00::/7) | |
| if (/^f[cd]/i.test(lowerIp)) return true; | |
| // Link-local addresses (fe80::/10) | |
| if (/^fe[89ab]/i.test(lowerIp)) return true; | |
| // IPv4-mapped IPv6 in dotted form (::ffff:x.x.x.x) | |
| const v4MappedMatch = lowerIp.match(/^::ffff:(\d+\.\d+\.\d+\.\d+)$/); | |
| if (v4MappedMatch) { | |
| return isPrivateIpv4(v4MappedMatch[1]); | |
| } | |
| // IPv4-mapped IPv6 in hex form (e.g. ::ffff:7f00:1 = ::ffff:127.0.0.1) | |
| const v4MappedHexMatch = lowerIp.match( | |
| /^::ffff:([0-9a-f]{1,4}):([0-9a-f]{1,4})$/, | |
| ); | |
| if (v4MappedHexMatch) { | |
| const high = parseInt(v4MappedHexMatch[1], 16); | |
| const low = parseInt(v4MappedHexMatch[2], 16); | |
| const reconstructed = `${(high >> 8) & 0xff}.${high & 0xff}.${(low >> 8) & 0xff}.${low & 0xff}`; | |
| return isPrivateIpv4(reconstructed); | |
| } | |
| } | |
| // IPv4 patterns | |
| return isPrivateIpv4(ip); | |
| } |
Was this helpful? React with 👍 or 👎 to provide feedback.
🔍 Dyadbot Code Review SummaryVerdict: ✅ YES - Ready to merge Reviewed by 3 independent agents: Correctness Expert, Code Health Expert, UX Wizard. Note: Previous review's issues have been largely addressed. This review covers the current state of the diff. Issues Summary
🟢 Low Priority Notes (9 items)
🚫 Dropped False Positives (6 items)
Generated by Dyadbot multi-agent code review |
| if (a === 0) return true; // 0.0.0.0/8 | ||
| } | ||
| return false; | ||
| } |
There was a problem hiding this comment.
🟡 MEDIUM | security
isPrivateIpv4 does not block 100.64.0.0/10 (Carrier-Grade NAT) range
The function blocks RFC 1918 ranges, loopback, link-local, and 0.0.0.0/8, but not 100.64.0.0/10 (RFC 6598 Carrier-Grade NAT). Cloud providers like AWS use addresses in this range for internal VPC endpoints. Similarly, 198.18.0.0/15 (benchmark testing) and 240.0.0.0/4 (reserved) are not blocked.
Note: Lower risk in a desktop Electron context than in a server-side environment.
💡 Suggestion:
if (a === 100 && b >= 64 && b <= 127) return true; // 100.64.0.0/10 Carrier-Grade NAT
if (a === 198 && (b === 18 || b === 19)) return true; // 198.18.0.0/15 benchmark
if (a >= 240) return true; // 240.0.0.0/4 reserved| } | ||
|
|
||
| if (!response.ok) { | ||
| throw new Error(`Request failed with status code: ${response.status}`); |
There was a problem hiding this comment.
🟡 MEDIUM | error-messages
HTTP error message only shows status code, not status text
When a fetch returns a non-2xx status (e.g., 403, 429, 500), the error message is Request failed with status code: 404. The agent gets no indication of why the request failed. Common cases like 403 (blocked by site), 429 (rate limited), and 503 (service unavailable) would benefit from human-readable context so the agent can decide whether to retry.
💡 Suggestion: Include response.statusText:
throw new Error(`Request failed: ${response.status} ${response.statusText}`);
Summary
web_fetchlocal agent tool that fetches HTTP(S) content with format conversion, size/time limits, and binary/text handlingweb_fetchin the local agent tool definitions so it is available to the toolsetlocal_agent_handlerimport/type issues so type checks passTest plan
🤖 Generated with Claude Code