feat: add url validation #1829

mfts · 2025-08-23T13:34:45Z

Summary by CodeRabbit

New Features
- Stronger URL and file-path validation to block unsafe inputs (SSRF, traversal, double-encoding).
- Server-side validation for document uploads across relevant endpoints with clear error responses.
- Webhook file URLs now require HTTPS and run enhanced security checks.
- Pre-upload content-type check for webhook files that logs informative warnings on mismatches.
Bug Fixes
- Stricter input validation reduces failed uploads and downstream processing errors.
Chores
- Pinned Trigger.dev CLI scripts to v3 for consistent deployments.

coderabbitai · 2025-08-23T13:34:52Z

Walkthrough

Adds a new URL/path validation module with SSRF/path-security checks and Zod schemas, applies server-side document payload validation across multiple document APIs and webhook ingestion, adds a content-type precheck for fetched webhook files, and pins two npm scripts to Trigger.dev CLI v3.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Security & URL/Path Validation Module `lib/zod/url-validation.ts` | New module: path traversal and double-encoding checks, SSRF protection, combined URL security validator, `filePathSchema`, `documentUploadSchema`, and `webhookFileUrlSchema`. Enforces storage-type/URL consistency and MIME/type checks; recognizes Notion hosts and optional VERCEL_BLOB_HOST. |
| Document API Validation `pages/api/teams/[teamId]/documents/index.ts`, `pages/api/teams/[teamId]/documents/agreement.ts`, `pages/api/teams/[teamId]/documents/[id]/versions/index.ts` | Introduces Zod-based validation using `documentUploadSchema.safeParse`; returns 400 on failure with logged errors; on success consumes `validationResult.data` for downstream processing (no change to core creation/versioning logic). |
| Webhook Ingestion Hardening `pages/api/webhooks/services/[...path]/index.ts` | Replaces generic URL validation with `webhookFileUrlSchema` (HTTPS + SSRF/path checks). Adds non-fatal response content-type precheck after fetching the file (logs warning on mismatch) before buffering/uploading; updates step annotations. |
| Tooling Scripts `package.json` | Pins Trigger.dev CLI to v3 in scripts: `npx trigger.dev@3 dev` and `npx trigger.dev@3 deploy` (replacing `@latest`). |
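
The SSRF protection summarized above is not quoted in this thread. Purely as an illustration of what a check of this kind might involve, here is a minimal sketch. The function name `isUnsafeFileUrl` and the exact blocklist are assumptions and are not taken from `lib/zod/url-validation.ts`.

```typescript
// Hypothetical sketch: returns true when a URL should be rejected before any fetch.
// Assumption: only HTTPS is allowed and literal private/loopback hosts are blocked.
function isUnsafeFileUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return true; // not a parseable absolute URL
  }
  if (url.protocol !== "https:") return true;

  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return true;
  if (host.startsWith("[")) return true; // IPv6 literal: rejected conservatively

  // Dotted-quad IPv4 literal: block loopback, private, and link-local ranges.
  const match = /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/.exec(host);
  if (!match) return false; // ordinary DNS name
  const a = Number(match[1]);
  const b = Number(match[2]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}
```

A check like this inspects only the literal hostname. It does not resolve DNS, so a public name that resolves to a private address would still pass and needs a separate guard at connection time.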

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant API as API: Team Documents
  participant Validator as Zod Schemas
  participant Store as Storage/DB

  Client->>API: POST /teams/:teamId/documents (body)
  API->>Validator: documentUploadSchema.safeParse(body)
  alt Invalid payload
    API-->>Client: 400 Bad Request (validation errors)
  else Valid payload
    API->>Store: Create/Process Document (using validated data)
    Store-->>API: Result
    API-->>Client: 200 OK (document info)
  end
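
The validate-then-create flow in the diagram above can be sketched without depending on Zod itself. `SchemaLike` below is a structural stand-in for the `safeParse` part of a Zod schema, and `validateThenCreate` is a hypothetical helper, not code from this PR.

```typescript
// Structural stand-in for the part of a Zod schema the handlers rely on.
type SafeParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: { message: string } };

interface SchemaLike<T> {
  safeParse(input: unknown): SafeParseResult<T>;
}

interface HandlerResult {
  status: number;
  body: unknown;
}

// Hypothetical helper: validate first, and only pass validated data downstream.
function validateThenCreate<T>(
  schema: SchemaLike<T>,
  rawBody: unknown,
  create: (data: T) => unknown,
): HandlerResult {
  const result = schema.safeParse(rawBody);
  if (result.success === false) {
    // Log details server-side; return a generic message to the client.
    console.error("Validation failed:", result.error.message);
    return { status: 400, body: { error: "Invalid request body" } };
  }
  // From here on, use result.data rather than the raw request body.
  return { status: 200, body: create(result.data) };
}
```

The point of the pattern is that everything after the check reads from `result.data`, so fields the schema does not declare never reach storage code.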

sequenceDiagram
  autonumber
  participant WH as Webhook Source
  participant API as Webhook Handler
  participant V as URL Security (webhookFileUrlSchema)
  participant Net as HTTP Fetch
  participant Storage as Upload Target
  participant DB as DB

  WH->>API: DocumentCreate event (fileUrl, contentType, ...)
  API->>V: Validate fileUrl (HTTPS + SSRF/path checks)
  alt Invalid URL
    API-->>WH: 400/ignore (per handler policy)
  else Valid URL
    API->>Net: Fetch fileUrl (HEAD/GET)
    Net-->>API: Response (headers, stream)
    API->>API: Compare response content-type vs expected (log warn if mismatch)
    API->>API: Convert to buffer
    API->>Storage: Upload buffer
    Storage-->>API: URL/path
    API->>DB: Create document record
    DB-->>API: Record
    API-->>WH: Ack/Success
  end
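
The content-type comparison step in the diagram above could also be done on normalized media types rather than with a raw prefix match. The following is a sketch under that assumption; `bareMediaType` and `contentTypeMatches` are hypothetical names, and the handler in this PR uses `startsWith` instead.

```typescript
// Extract the bare media type ("type/subtype"), lowercased, without parameters
// such as "; charset=utf-8".
function bareMediaType(headerValue: string): string {
  return headerValue.split(";")[0].trim().toLowerCase();
}

// Returns true when there is nothing to warn about.
// A missing header counts as a match, mirroring the non-fatal precheck.
function contentTypeMatches(expected: string, actual: string | null): boolean {
  if (actual === null) return true;
  return bareMediaType(actual) === bareMediaType(expected);
}

// Example: log a warning on mismatch but keep processing.
const expectedType = "application/pdf";
const receivedType = "application/octet-stream";
if (!contentTypeMatches(expectedType, receivedType)) {
  console.warn(`Content type mismatch: expected ${expectedType}, got ${receivedType}`);
}
```

Compared with a prefix match, this treats `Application/PDF; charset=binary` as equal to `application/pdf` and does not accept a longer subtype that merely starts with the expected string.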

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 3103b43 and f626372.

📒 Files selected for processing (3)

lib/zod/url-validation.ts (1 hunks)
pages/api/teams/[teamId]/documents/agreement.ts (4 hunks)
pages/api/webhooks/services/[...path]/index.ts (4 hunks)

🚧 Files skipped from review as they are similar to previous changes (3)

pages/api/teams/[teamId]/documents/agreement.ts
pages/api/webhooks/services/[...path]/index.ts
lib/zod/url-validation.ts

vercel · 2025-08-23T13:34:52Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Preview	Comments	Updated (UTC)
papermark	Ready	Preview	Comment	Aug 23, 2025 1:58pm

coderabbitai

Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)

pages/api/teams/[teamId]/documents/[id]/versions/index.ts (1)
95-114: Remove duplicate code block

Lines 105-114 are an exact duplicate of lines 95-103. This appears to be a copy-paste error.
   // turn off isPrimary flag for all other versions
   await prisma.documentVersion.updateMany({
     where: {
       documentId: documentId,
       id: { not: version.id },
     },
     data: {
       isPrimary: false,
     },
   });

-  // turn off isPrimary flag for all other versions
-  await prisma.documentVersion.updateMany({
-    where: {
-      documentId: documentId,
-      id: { not: version.id },
-    },
-    data: {
-      isPrimary: false,
-    },
-  });
pages/api/webhooks/services/[...path]/index.ts (2)
286-299: Guard against SSRF via redirects, unbounded downloads, and hung connections

Right now we validate only the initial URL and perform an unbounded fetch. Redirects could bypass initial checks, large files can exhaust memory, and a slow origin can hang the request.

Apply the following patch to add a timeout, re‑validate the final URL after redirects, and reject obviously too‑large payloads via Content-Length:
-  // 4. Fetch file from URL
-  const response = await fetch(fileUrl);
+  // 4. Fetch file from URL with timeout and safety checks
+  const controller = new AbortController();
+  const timeout = setTimeout(() => controller.abort(), 30_000); // 30s safety timeout
+  let response: Response;
+  try {
+    response = await fetch(fileUrl, { signal: controller.signal });
+  } finally {
+    clearTimeout(timeout);
+  }
   if (!response.ok) {
     return res.status(400).json({ error: "Failed to fetch file from URL" });
   }
+
+  // 4b. Re-validate the final response URL (guards against SSRF via redirects)
+  try {
+    webhookFileUrlSchema.parse(response.url);
+  } catch {
+    return res
+      .status(400)
+      .json({ error: "File URL redirect violated security policy" });
+  }
+
+  // 4c. Enforce a sane max size using Content-Length when available
+  const MAX_BYTES =
+    Number(process.env.WEBHOOK_MAX_FILE_BYTES ?? 50 * 1024 * 1024); // 50 MB default
+  const contentLength = response.headers.get("content-length");
+  if (contentLength && Number(contentLength) > MAX_BYTES) {
+    return res.status(413).json({ error: "File too large" });
+  }
 
   // 5. Validate response content type matches expected
   const responseContentType = response.headers.get("content-type");
   if (responseContentType && !responseContentType.startsWith(contentType)) {
     console.warn(
       `Content type mismatch: expected ${contentType}, got ${responseContentType}`,
     );
     // Log but don't fail - some services return generic types
   }
Follow-up (optional next step): stream with a byte cap instead of arrayBuffer() to hard-limit memory; if putFileServer can accept streams, we can wire that in a separate change.
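
The streaming follow-up mentioned above could look roughly like this. It is a sketch only: `readWithByteCap` is a hypothetical helper, and it assumes a runtime where `Response.body` is a web `ReadableStream` (for example Node 18+).

```typescript
// Read a response body chunk by chunk and abort as soon as the cap is exceeded,
// so memory use stays bounded even when Content-Length is missing or wrong.
async function readWithByteCap(
  body: ReadableStream<Uint8Array>,
  maxBytes: number,
): Promise<Uint8Array> {
  const reader = body.getReader();
  const chunks: Uint8Array[] = [];
  let total = 0;

  for (;;) {
    const { done, value } = await reader.read();
    if (done || !value) break;
    total += value.byteLength;
    if (total > maxBytes) {
      await reader.cancel(); // stop the download instead of draining it
      throw new Error(`Body exceeded ${maxBytes} bytes`);
    }
    chunks.push(value);
  }

  // Concatenate the collected chunks into one contiguous buffer.
  const out = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    out.set(chunk, offset);
    offset += chunk.byteLength;
  }
  return out;
}
```

In the handler, a call such as `Buffer.from(await readWithByteCap(response.body, MAX_BYTES))` could replace `response.arrayBuffer()`, with the thrown error mapped to a 413 response. That mapping is an assumption about how the handler would be wired, not part of the suggested patch.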

286-299: Harden untrusted external fetch() calls with timeouts and redirect checks

We ran a repo-wide search and found numerous fetch(...) usages—many of which pull from user-provided URLs (e.g. fileUrl, url) without any timeout or redirect validation. To prevent hangs, redirect loops, or unwanted host access, please audit and harden each external fetch by:

- Wrapping in an AbortController and enforcing a sensible timeout (e.g. 30 seconds)
- Setting redirect: 'manual' (or equivalent) and explicitly handling non-OK redirect responses
- Verifying the request URL against an allowlist or validating its hostname
- Applying any necessary size limits or content-type checks

Key locations to address immediately:

pages/api/webhooks/services/[...path]/index.ts (line 286):
const response = await fetch(fileUrl);

pages/api/mupdf/get-pages.ts (line 25):
const response = await fetch(url);

lib/trigger/optimize-video-files.ts (line 43):
const response = await fetch(fileUrl);

lib/files/bulk-download.ts (lines 67–70):
const response = await fetch(url);

lib/trigger/pdf-to-image-route.ts
– line 64: const response = await fetch(.../api/mupdf/get-pages)
– line 118: const response = await fetch(.../api/mupdf/convert-page)

And any similar fetches in lib/mupdf/convert-page.ts, lib/mupdf/annotate-document.ts, or other modules that fetch arbitrary URLs.

Please refactor these to include the above safeguards and review any additional external fetches uncovered in your scan.
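
One way to combine the timeout and redirect safeguards listed above is to follow redirects by hand and validate every hop. The sketch below is illustrative: `fetchWithCheckedRedirects`, `FetchLike`, and the `isAllowed` callback are hypothetical names. It also assumes a Node-style `fetch` that exposes the status and `Location` header of a redirect when `redirect: "manual"` is set; browsers return an opaque response instead.

```typescript
type FetchLike = (url: string, init?: RequestInit) => Promise<Response>;

// Follow redirects manually so that every hop passes the same URL check
// as the original request, with a per-request timeout.
async function fetchWithCheckedRedirects(
  startUrl: string,
  isAllowed: (url: string) => boolean,
  fetchImpl: FetchLike = fetch,
  maxHops = 5,
  timeoutMs = 30_000,
): Promise<Response> {
  let current = startUrl;
  for (let hop = 0; hop <= maxHops; hop++) {
    if (!isAllowed(current)) {
      throw new Error(`Blocked URL: ${current}`);
    }
    const response = await fetchImpl(current, {
      redirect: "manual",
      signal: AbortSignal.timeout(timeoutMs),
    });
    const location = response.headers.get("location");
    const isRedirect = response.status >= 300 && response.status < 400;
    if (!isRedirect || !location) {
      return response; // final response (caller still checks response.ok)
    }
    // Resolve relative Location values against the URL that produced them.
    current = new URL(location, current).toString();
  }
  throw new Error("Too many redirects");
}
```

Passing the validator in as a callback keeps the helper independent of any particular schema; in this codebase the callback could wrap `webhookFileUrlSchema.safeParse(url).success`.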

🧹 Nitpick comments (5)

lib/zod/url-validation.ts (2)
25-27: Consider additional double-encoding patterns

While the current checks cover common double-encoding attacks, consider adding checks for:

- %252E (double-encoded dot)
- %255C (double-encoded backslash)
- Mixed case variants like %2e%2e or %2E%2e
   // Prevent double encoding attacks
-  if (pathOrUrl.includes("%2E%2E") || pathOrUrl.includes("%2F%2F")) {
+  // Compare case-insensitively so mixed-case variants (%2e%2E, %2F%2f) are covered
+  const encodedPatterns = [
+    "%2e%2e", // encoded ..
+    "%2f%2f", // encoded //
+    "%252e",  // double-encoded dot
+    "%255c",  // double-encoded backslash
+  ];
+  const lowered = pathOrUrl.toLowerCase();
+  if (encodedPatterns.some((pattern) => lowered.includes(pattern))) {
     return false;
   }
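
An alternative to enumerating encoded patterns is to decode repeatedly until the string stops changing and then check the result. This is a different technique from the pattern list above, shown as a sketch; `containsTraversalAfterDecoding` is a hypothetical name, and the rule that any `..` or backslash is unsafe is an assumption that may be stricter than the module needs.

```typescript
// Decode percent-encoding layer by layer and flag traversal markers at any layer.
function containsTraversalAfterDecoding(input: string, maxRounds = 3): boolean {
  let current = input;
  for (let round = 0; round <= maxRounds; round++) {
    if (current.includes("..") || current.includes("\\")) return true;

    let decoded: string;
    try {
      decoded = decodeURIComponent(current);
    } catch {
      return true; // malformed percent-encoding: treat as unsafe
    }
    if (decoded === current) return false; // stable: nothing left to decode
    current = decoded;
  }
  return true; // still changing after maxRounds: suspiciously deep encoding
}
```

Because the check runs on each decoded layer, case variants and deeper nesting such as `%25252E` are covered without listing them. The trade-off is that legitimate names containing two consecutive dots are also rejected.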
186-212: Duplicate regex pattern

The S3 path validation regex is duplicated in lines 109 and 208. Consider extracting it to a constant for maintainability.
+// Define the S3 path pattern as a constant at the top of the file
+const S3_PATH_PATTERN = /^[a-zA-Z0-9_-]+\/doc_[a-zA-Z0-9_-]+\/[a-zA-Z0-9_.-]+\.[a-zA-Z0-9]+$/;
+
 // Custom validator for file paths - either Notion URLs or S3 storage paths
 const createFilePathValidator = () => {
   return z
     .string()
     .min(1, "File path is required")
     .refine(
       (path) => {
         // ... existing code ...
         
         // Case 2: file storage paths - must match pattern: <id>/doc_<someId>/<name>.<ext>
-        const s3PathPattern =
-          /^[a-zA-Z0-9_-]+\/doc_[a-zA-Z0-9_-]+\/[a-zA-Z0-9_.-]+\.[a-zA-Z0-9]+$/;
-        return s3PathPattern.test(path);
+        return S3_PATH_PATTERN.test(path);
       },
       // ... rest of the code
     )
     
 // And update line 208:
-        return /^[a-zA-Z0-9_-]+\/doc_[a-zA-Z0-9_-]+\/[a-zA-Z0-9_.-]+\.[a-zA-Z0-9]+$/.test(
+        return S3_PATH_PATTERN.test(
           data.url,
         );
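
To make the extracted pattern concrete, the snippet below runs it against a few paths. The regex is the one quoted above; the sample paths are made up for illustration.

```typescript
const S3_PATH_PATTERN =
  /^[a-zA-Z0-9_-]+\/doc_[a-zA-Z0-9_-]+\/[a-zA-Z0-9_.-]+\.[a-zA-Z0-9]+$/;

// Hypothetical sample paths, chosen to exercise each segment of the pattern.
const samplePaths = [
  "team_123/doc_abc/report.pdf", // matches: <id>/doc_<someId>/<name>.<ext>
  "team_123/docs/report.pdf",    // rejected: second segment lacks the doc_ prefix
  "../doc_abc/report.pdf",       // rejected: dots are not allowed in the first segment
  "team_123/doc_abc/report",     // rejected: no file extension
];

for (const path of samplePaths) {
  console.log(path, S3_PATH_PATTERN.test(path));
}
```

Keeping the pattern in one constant means a change to the allowed path shape is made once and applies to both `filePathSchema` and `documentUploadSchema`.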
pages/api/teams/[teamId]/documents/[id]/versions/index.ts (1)
35-39: Consider using a more descriptive dummy name

The dummy name Version ${new Date().toISOString()} is added solely to satisfy schema validation. Consider using a more meaningful pattern or updating the schema to make the name optional for version creation.
     // Validate request body using Zod schema for security
     const validationResult = documentUploadSchema.safeParse({
       ...req.body,
-      name: `Version ${new Date().toISOString()}`, // Dummy name for validation
+      name: req.body.name || `Version_${documentId}_${Date.now()}`, // Use provided name or generate a meaningful default
     });
pages/api/webhooks/services/[...path]/index.ts (2)
55-64: Switching fileUrl to webhookFileUrlSchema is a solid upgrade

Strong + consistent validation at the schema layer reduces attack surface before any network I/O. One minor improvement you might consider in webhookFileUrlSchema: trim whitespace to avoid subtle failures (.transform((s) => s.trim())) before .url(). Not a blocker here.

300-303: Add a post-read size guard in case Content-Length was missing

If the server doesn’t send Content-Length, we can still enforce the limit after reading.
   // 6. Convert to buffer
   const fileBuffer = Buffer.from(await response.arrayBuffer());
+  if (typeof MAX_BYTES !== "undefined" && fileBuffer.byteLength > MAX_BYTES) {
+    return res.status(413).json({ error: "File too large" });
+  }

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 4338400 and 3103b43.

📒 Files selected for processing (6)

lib/zod/url-validation.ts (1 hunks)
package.json (1 hunks)
pages/api/teams/[teamId]/documents/[id]/versions/index.ts (2 hunks)
pages/api/teams/[teamId]/documents/agreement.ts (3 hunks)
pages/api/teams/[teamId]/documents/index.ts (3 hunks)
pages/api/webhooks/services/[...path]/index.ts (4 hunks)

🧰 Additional context used

🧠 Learnings (7)

📓 Common learnings

Learnt from: CR
PR: mfts/papermark#0
File: .cursor/rules/rule-trigger-typescript.mdc:0-0
Timestamp: 2025-07-19T07:46:44.421Z
Learning: Applies to **/trigger/**/*.ts : When implementing schema-validated tasks, use `schemaTask` from `@trigger.dev/sdk/v3` and provide a schema using Zod or another supported library.