Web Crawling — Security Reconnaissance Methodology Taxonomy
Classification Structure
This taxonomy structures crawling techniques for exhaustive attack surface discovery of web applications. The primary axis is discovery technique (§1–§7), with the type of target each technique uncovers as a cross-cutting axis.
| Discovery Target | Description |
|---|---|
| Paths/Endpoints | URLs, routes, directories, files |
| Parameters | Query, body, header, cookie parameters |
| API Schema | Operations, types, fields, mutations |
| Secrets | API keys, tokens, credentials, internal URLs |
| Tech Stack | Frameworks, versions, middleware, servers |
Fundamental principle of crawling: an application always has more surface area than what it intentionally exposes. Deployment artifacts, legacy endpoints, debug interfaces, paths hardcoded in client code — each discovery technique reveals a different region of this hidden surface.
§1. Active Spidering (Link-Based Active Crawling)
The most fundamental crawling approach: visiting pages like a browser, following links, and recursively exploring the application's structure. Modern crawlers (e.g., Burp Suite) go beyond simple link-following — they also submit forms, execute JavaScript, and interact with clickable elements — but discovery scope remains bounded by reachable application states.
§1-1. Traversal Strategy
| Subtype | Mechanism | Use Case |
|---|---|---|
| Breadth-First | Visits all links at the same depth before descending to the next level; quickly covers top-level pages and surfaces high-importance pages first | Initial surface mapping: rapidly understanding the overall structure of large sites |
| Depth-First | Follows a single path to its end before backtracking; ensures deeply nested functionality (multi-step wizards, nested categories) is not missed | Complete exploration of specific functional areas: payment flows, admin panels, etc. |
| Hybrid (Adaptive) | Starts with BFS to grasp the overall structure, then applies DFS to areas of interest | Common default in mature crawlers (e.g., Burp Suite, OWASP ZAP) |
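The BFS/DFS distinction above comes down to frontier discipline: the same loop becomes either strategy depending on which end of the frontier is popped. A minimal sketch (the link graph and URLs are illustrative):

```python
from collections import deque

def crawl_order(link_graph, start, strategy="bfs"):
    """Return the visit order over a site's link graph.

    The only difference between BFS and DFS here is the frontier
    discipline: FIFO (popleft) explores level by level, while
    LIFO (pop) drills down one path before backtracking.
    """
    frontier = deque([start])
    visited, order = {start}, []
    while frontier:
        url = frontier.popleft() if strategy == "bfs" else frontier.pop()
        order.append(url)
        for link in link_graph.get(url, []):
            if link not in visited:
                visited.add(link)
                frontier.append(link)
    return order

# Toy link graph: "/" links to two sections, each with one child page.
site = {
    "/":      ["/shop", "/admin"],
    "/shop":  ["/shop/cart"],
    "/admin": ["/admin/users"],
}
```

BFS reaches both top-level sections before any nested page; DFS exhausts one branch (here `/admin`) before touching the other, which is why it is preferred for fully walking a single functional area.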
§1-2. Rendering Mode
| Subtype | Mechanism | Key Condition |
|---|---|---|
| HTTP-Only (Static) | Parses HTML source only, extracting URLs from `<a>`, `<form>`, `<link>` tags; fast and lightweight | Server-rendered (SSR) pages, legacy applications |
| Headless Browser (Dynamic) | Uses Puppeteer, Playwright, etc. to execute JavaScript and collect links generated after DOM mutations; essential for SPAs | React, Angular, Vue, and other client-rendered apps |
| Hybrid Rendering | Performs the initial crawl HTTP-only for speed, switching to a headless browser when JS-dependent paths are detected | Balancing speed and coverage on large sites |
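HTTP-only extraction can be sketched with the standard-library HTML parser (the sample page and the tag/attribute choices are illustrative). Note how the `fetch()` URL inside the script tag stays invisible to static parsing, which is exactly the gap headless rendering closes:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Static (HTTP-only) extraction: pull crawlable URLs from
    <a href>, <form action>, and <link href> without executing JS."""
    TARGETS = {"a": "href", "form": "action", "link": "href"}

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        wanted = self.TARGETS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.urls.append(value)

page = """
<a href="/about">About</a>
<form action="/login" method="post"></form>
<script>fetch('/api/v1/users')</script>
"""
parser = LinkExtractor()
parser.feed(page)
# parser.urls now holds /about and /login; the /api/v1/users call
# inside <script> is invisible to static parsing.
```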
§1-3. Scope Control
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Same-Origin Restriction | Follows only links within the same origin (scheme + host + port) | Deep analysis of a single application |
| Same-Domain Extension | Extends the crawl to subdomains (`*.example.com`) | Microservice architectures, environments with functionality split across subdomains |
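A scope check along these lines is a few lines with `urllib.parse`. This is a rough sketch: the registered-domain logic naively takes the last two labels, where production code would consult the Public Suffix List (e.g. `co.uk` breaks the naive version):

```python
from urllib.parse import urlsplit

def in_scope(url, seed="https://app.example.com", mode="same-origin"):
    """Decide whether a discovered URL should be followed.

    same-origin: scheme + host + port must all match the seed.
    same-domain: any subdomain of the seed's registered domain
    (naively, its last two labels) is allowed.
    """
    u, s = urlsplit(url), urlsplit(seed)
    if mode == "same-origin":
        return (u.scheme, u.hostname, u.port) == (s.scheme, s.hostname, s.port)
    base = ".".join(s.hostname.split(".")[-2:])   # "example.com"
    return u.hostname == base or (u.hostname or "").endswith("." + base)
```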
§2. Forced Browsing (Path Bruteforcing)
Discovers hidden, unlinked paths by guessing them from wordlists. Where Active Spidering finds only what is linked, bruteforcing finds what exists but is not linked.
§2-1. Basic Bruteforcing
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Directory Enumeration | Tests common path names like `/admin/`, `/backup/`, `/api/v1/` and checks for 200/301/403 responses | Unlinked paths that exist on the server but are referenced nowhere |
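The probe-and-classify loop reduces to two small functions. This is a simplified sketch: the wordlist and the status-code interpretation are assumptions, and real tools (ffuf, gobuster) also filter on response size and use far larger lists:

```python
from urllib.parse import urljoin

# Illustrative mini-wordlist; real lists run to tens of thousands of entries.
WORDLIST = ["admin/", "backup/", "api/v1/", ".git/HEAD"]

def candidates(base, words):
    """Expand a wordlist into probe URLs relative to the base."""
    return [urljoin(base, w) for w in words]

def classify(status):
    """Interpret a probe's status code: 200/301/302 means the path
    exists; 403 means it exists but is forbidden (often the most
    interesting case); anything else is treated as absent."""
    if status in (200, 301, 302):
        return "exists"
    if status == 403:
        return "forbidden"
    return "absent"
```

In practice `candidates()` feeds a concurrent HTTP client, and `classify()` runs over each response, with a baseline 404 profile subtracted to handle soft-404 pages.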
§3. Parameter Discovery (Hidden Parameter Mining)
After endpoints are found, the next step is discovering the hidden parameters those endpoints process: admin-only parameters removed from client code, debug flags, and undocumented filters.
§3-1. Parameter Bruteforcing
| Subtype | Mechanism | Key Condition |
|---|---|---|
| GET Parameter Fuzzing | Tests parameters en masse (`?debug=1`, `?admin=true`, `?format=json`) and detects response changes | Arjun (25,890-parameter wordlist, tested in ~50 requests), x8 (Rust-based, high-speed) |
| POST Body Fuzzing | Tests JSON/form body parameters like `{"role":"admin"}`, `{"debug":true}` | Arjun `-m POST`, x8 `-X POST` |
| HTTP Header Fuzzing | Tests custom headers like `X-Forwarded-For`, `X-Original-URL`, `X-Rewrite-URL` to discover hidden functionality or access control bypasses | Proxies/middleware that trust internal routing headers |
| Cookie Parameter Fuzzing | Inserts additional parameters into cookie values to check for server-side processing | Applications with cookie-based configuration/feature toggling |
§3-2. Response Change Detection
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Status Code Change | Response code changes when a specific parameter is added (200→403, 200→500), indicating the parameter is being processed | The clearest signal |
| Response Size Change | Body size changes significantly: additional data returned or error messages altered | Requires recording a baseline response size for noise reduction |
| Response Time Change | A specific parameter triggers a DB query or external call, increasing response time | Timing-based blind detection |
| Reflection Detection | The parameter value is reflected in the response, indicating potential XSS, SSTI, or header injection | Tracking where input values are reflected |
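Several of these signals combine naturally into one differ. A sketch, with the response representation `(status, body)` and the 5% size tolerance as illustrative assumptions:

```python
def detect_change(baseline, probe, param, size_tolerance=0.05):
    """Compare a probe response (request sent WITH the candidate
    parameter) against the parameterless baseline. Both responses are
    (status, body) tuples; param is {"name": ..., "value": ...} where
    the value is a unique canary string. Returns the signals that
    suggest the parameter is actually processed server-side."""
    signals = []
    b_status, b_body = baseline
    p_status, p_body = probe
    if p_status != b_status:
        signals.append("status-change")           # e.g. 200 -> 403
    if b_body and abs(len(p_body) - len(b_body)) / len(b_body) > size_tolerance:
        signals.append("size-change")
    if param["value"] in p_body and param["value"] not in b_body:
        signals.append("reflection")              # potential XSS/SSTI sink
    return signals
```

Timing-based detection is omitted here; it needs repeated sampling against jitter rather than a single comparison.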
§3-3. Passive Parameter Mining
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Web Archive Parameter Extraction | ParamSpider: collects the target domain's past URLs from the Wayback Machine and extracts parameter names | Parameters that existed in the past may still be processed by the server |
| HTML Source Parameter Extraction | Collects parameter names from comments, hidden form fields, disabled inputs, and `data-*` attributes | Parameters removed from the client but still processed server-side |
| JS Source Parameter Extraction | Analyzes parameters in fetch/XHR calls, JSON keys, and configuration objects within JavaScript code | Links with §4 JavaScript Analysis |
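Mining parameter names out of archived URLs is a few lines with the standard library (the URL history below is made up):

```python
from urllib.parse import urlsplit, parse_qsl

def mine_params(archived_urls):
    """Extract the set of historical query-parameter names from a list
    of archived URLs (e.g. Wayback Machine output). Names seen in the
    past may still be processed by the server today."""
    names = set()
    for url in archived_urls:
        for name, _ in parse_qsl(urlsplit(url).query, keep_blank_values=True):
            names.add(name)
    return sorted(names)

history = [
    "https://example.com/search?q=shoes&page=2",
    "https://example.com/item?id=1&debug=1",
    "https://example.com/item?id=2",
]
```

The resulting name list is then replayed against live endpoints using the §3-2 change-detection signals.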
§4. JavaScript Analysis (Client Code Analysis)
The core attack surface of modern web applications is exposed in JavaScript source code. API endpoints, auth tokens, internal URLs, routing rules, and debug functionality are bundled and sent to the client.
§4-1. Endpoint Extraction
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Regex-Based URL Extraction | LinkFinder: matches URL/path patterns in JS files using regex | Fast but prone to false positives; suitable for initial scanning |
| AST-Based Precise Extraction | jsluice: parses the AST via go-tree-sitter, extracting only URLs in actual usage contexts (`fetch()`, `XMLHttpRequest`, `window.open()`, `document.location`) | Higher accuracy than regex, fewer false positives |
| Burp Passive Collection | JSpector: passively analyzes JS files passing through the proxy and automatically registers discovered endpoints as Burp issues | Automatic JS endpoint detection during live traffic analysis |
| Bundle Analysis (Source Maps) | If `.js.map` files exist, the original source structure can be restored, clearly identifying per-component API calls, route definitions, etc. | Source maps exposed in production (common) |
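A toy version of the regex-based approach illustrates both its speed and its noise. The pattern below is a deliberately simplified stand-in for LinkFinder's real (far more permissive) regex; the sample bundle is fabricated:

```python
import re

# Simplified LinkFinder-style pattern: quoted strings that look like
# absolute URLs or root-relative paths. Real tooling matches many more
# shapes (protocol-relative, dotted paths, template fragments).
ENDPOINT_RE = re.compile(
    r"""["'](https?://[^"']+|/[A-Za-z0-9_\-./]+(?:\?[^"']*)?)["']"""
)

def extract_endpoints(js_source):
    """Return the deduplicated, sorted endpoint-like strings in a JS file."""
    return sorted({m.group(1) for m in ENDPOINT_RE.finditer(js_source)})

bundle = """
fetch("/api/v1/users");
const cfg = { upload: '/internal/upload', cdn: "https://cdn.example.com/app.js" };
"""
```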
§4-2. Secret Extraction
| Subtype | Mechanism | Key Condition |
|---|---|---|
| API Keys / Tokens | SecretFinder: detects API key patterns (`AIza...`, `sk-...`, `ghp_...`), JWTs, and Bearer tokens via regex | Authentication material hardcoded in client JS |
| Internal URLs / Endpoints | Exposure of non-public URLs: staging servers (`staging.internal.example.com`), internal APIs (`http://10.0.0.x/api`) | Development-environment URLs remaining in production builds |
| Configuration Objects | Feature flags, environment variables, and service URLs exposed in global variables: `window.__CONFIG__`, `window.__INITIAL_STATE__` | SPA initial-state delivery pattern |
| Information in Comments | Developer comments containing TODOs, FIXMEs, internal notes, and references to disabled features | Extracted from pre-minified JS or source maps |
§4-3. Route & Access Control Analysis
| Subtype | Mechanism | Key Condition |
|---|---|---|
| SPA Route Extraction | Extracts the full route map from React Router, Vue Router, or Angular Router configs, including admin-only paths and hidden pages | Client-side routed SPAs |
| Client-Side Access Control Analysis | Identifies admin-only endpoints/features from client-side authorization logic (`if (user.role === 'admin')`) | SPAs with client-side authorization checks |
| Inactive Feature Detection | Code paths disabled by feature flags are still included in the JS; the corresponding server-side endpoints may remain active | Feature-flag-based development, gradual rollouts |
§4-4. Historical JS Analysis
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Archive JS Comparison | Collects past versions of JS files from the Wayback Machine and diffs them against current versions, uncovering removed endpoints, changed API paths, and deleted secrets | Download past JS files with waymore |
| Git History JS Analysis | Tracks JS change history from exposed `.git` directories or GitHub repos, finding API keys and endpoints removed in commits | `.git` directory exposure, or source repo access available |
§5. API Surface Discovery (API Schema Enumeration)
Techniques for discovering the full schema of API interfaces including REST, GraphQL, SOAP, and WebSocket. Web UI crawling alone reveals only a portion of the API; schema enumeration uncovers undocumented operations, fields, and types.
§5-1. REST / OpenAPI Discovery
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Swagger/OpenAPI File Search | Bruteforces known paths: `/swagger.json`, `/openapi.yaml`, `/api-docs`, `/v2/api-docs`, `/swagger-ui.html` | Developers who haven't disabled documentation endpoints |
| API Version Enumeration | `/api/v1/`, `/api/v2/`, `/api/v3/`: older versions may have more lenient authentication/validation | Environments running multiple API versions in parallel |
| HTTP Method Fuzzing | Tests various methods (GET, POST, PUT, DELETE, PATCH, OPTIONS) against the same endpoint to discover undocumented operations | `Allow` header in the OPTIONS response, or 405 vs. 200 response differences |
| Content-Type Switching | Sends requests in various formats (`application/json`, `application/xml`, `application/x-www-form-urlencoded`) to the same endpoint; parser differences may expose additional attack surface | Servers that accept multiple Content-Types |
§5-2. GraphQL Discovery
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Introspection Query | `{__schema{types{name,fields{name}}}}`: bulk extraction of the entire schema (types, fields, mutations, queries) | Introspection enabled (common even in production) |
| Field Suggestion-Based Recovery | Sending incorrect field names triggers error messages suggesting similar ones (e.g., `Did you mean X?`); tools like Clairvoyance repeat this to incrementally reconstruct the schema | Introspection disabled; hidden type and field discovery without it |
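Both GraphQL techniques are easy to sketch: building the introspection request body, and harvesting names from did-you-mean errors. The error-message shape assumed below matches graphql-js (`Cannot query field "x" on type "Query". Did you mean "y"?`); other server implementations may word it differently:

```python
import json
import re

def introspection_request():
    """Minimal introspection query body (type and field names only)."""
    query = "{__schema{types{name fields{name}}}}"
    return json.dumps({"query": query})

def suggested_fields(error_message):
    """Harvest field names leaked by a did-you-mean error message --
    the trick Clairvoyance automates when introspection is disabled."""
    _, _, tail = error_message.partition("Did you mean")
    return re.findall(r'"([A-Za-z_][A-Za-z0-9_]*)"', tail)
```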
§6. Passive Reconnaissance (OSINT Sources)
Techniques for collecting attack surface information without sending requests directly to the target server, or with minimal requests. Advantages: low visibility to the target (though third-party queries to archives and search engines may still be logged or detected) and the ability to recover deleted content.
§6-1. Web Archives
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Wayback Machine URL Collection | waybackurls, gau, waymore: extract the complete list of the target domain's past URLs from the Wayback CDX API | Domains with an archived capture history |
| CommonCrawl Data | Extracts URLs and response data related to the target domain from CommonCrawl indexes | gau includes CommonCrawl search automatically |
| Past Response Download | waymore: downloads not just URLs but the actual past responses (HTML, JS, JSON) from multiple sources (Wayback, CommonCrawl) to restore deleted content | Verifying deleted pages, removed API responses |
| Change History Comparison | Compares temporal snapshots of the same URL to detect added/removed functionality, endpoints, and secrets | Time-series diff analysis |
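What waybackurls/gau fetch reduces to a single CDX API call. The endpoint and parameter names below follow the public Wayback CDX API, but treat them as assumptions to verify before relying on them:

```python
from urllib.parse import urlencode

def cdx_query_url(domain):
    """Build a Wayback CDX API query for every captured URL under a
    domain -- roughly what waybackurls/gau do under the hood."""
    params = {
        "url": f"{domain}/*",   # all paths under the domain
        "output": "text",
        "fl": "original",       # return only the original-URL column
        "collapse": "urlkey",   # de-duplicate repeated captures
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)
```

The response is one URL per line, which feeds directly into §3-3 parameter mining and §4-4 historical JS analysis.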
§6-2. Search Engines & Indexes
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Google Dorking | Advanced operators like `site:example.com filetype:pdf`, `site:example.com inurl:admin`, `site:example.com ext:sql` to discover indexed sensitive content | Content Google has crawled |
| GitHub/GitLab Search | `"example.com" password`, `"example.com" api_key`: searching public repos for target-related code, credentials, and internal URLs | Secrets accidentally committed by developers |
| Shodan / Censys | IP-based service enumeration: open ports, service banners, SSL certificate info, HTTP response headers | Full picture of internet-exposed services |
| URLScan.io | Collects URLs, resources, and redirect chains for the target domain from other users' scan results | Targets that third parties have already scanned |
§6-3. Certificate Transparency
| Subtype | Mechanism | Key Condition |
|---|---|---|
| crt.sh Subdomain Enumeration | Queries CT logs for the target domain's SSL certificate history, discovering subdomains, internal hostnames, and wildcard patterns | Any domain using HTTPS |
| Internal Hostname Exposure | Internal subdomains like `staging.internal.example.com`, `jenkins.corp.example.com` included in certificates | Public certificates issued for internal services |
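Parsing a crt.sh JSON response (`https://crt.sh/?q=%25.example.com&output=json`) is straightforward. The `name_value` field holding newline-separated names reflects crt.sh's current output format, but verify before depending on it; the sample data is fabricated:

```python
import json

def ct_hostnames(crtsh_json):
    """Extract unique hostnames from a crt.sh JSON response. Each
    entry's name_value may hold several newline-separated names;
    wildcard prefixes are stripped to yield the base domain."""
    names = set()
    for entry in json.loads(crtsh_json):
        for name in entry["name_value"].split("\n"):
            names.add(name.strip().lstrip("*."))
    return sorted(names)

sample = json.dumps([
    {"name_value": "www.example.com\nstaging.internal.example.com"},
    {"name_value": "*.example.com"},
])
```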
§6-4. Metafile Analysis
| Subtype | Mechanism | Key Condition |
|---|---|---|
| robots.txt | Extracts the list of paths intended to be hidden from `Disallow` entries: `/admin/`, `/internal/`, `/api/debug/`, etc. | Paths listed for crawler control rather than security |
| sitemap.xml | An XML file listing discoverable URLs (subject to 50,000-URL / 50 MB limits per file; sitemap indexes can chain multiple files); check robots.txt for its referenced location. Coverage is not guaranteed to be complete | A publicly accessible sitemap |
| security.txt | `/.well-known/security.txt`: security contact, policy scope, preferred languages | Bug bounty/VDP program information |
| humans.txt, crossdomain.xml | Development team info, Flash/Silverlight cross-domain policies, and other metadata | Legacy configuration files left in place |
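For recon, `Disallow` entries are crawl targets rather than exclusions: they are paths the operator wanted hidden. A minimal harvester (the sample robots.txt is made up):

```python
def disallowed_paths(robots_txt):
    """Harvest Disallow entries from a robots.txt body."""
    paths = []
    for line in robots_txt.splitlines():
        key, _, value = line.partition(":")
        if key.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

robots = """User-agent: *
Disallow: /admin/
Disallow: /api/debug/
Allow: /public/
"""
```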
§7. Authenticated Crawling
Features hidden behind login walls can only be discovered by crawling in an authenticated state. Since the exposed surface varies by privilege level (regular user, admin, API client), multi-role crawling is key.
§7-1. Session Acquisition & Maintenance
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Form-Based Login | The crawler automatically identifies login forms and submits credentials to obtain session cookies | Standard form-based authentication |
| Token Injection | Injects Bearer tokens, JWTs, or API keys into request headers: API crawling without cookies | Token auto-refresh via ZAP/Burp session-handling rules |
| OAuth Flow Automation | Completes the OAuth flow via a headless browser, extracting the access token for reuse in HTTP clients | OAuth-based authentication (Google, GitHub, etc.) |
| Cookie Transplant | Log in manually via a browser, extract the cookies, and inject them into the crawler | When automatic login is blocked by 2FA, CAPTCHA, etc. |
| Session Expiry Detection & Re-auth | Detects session expiry mid-crawl (302 to login, 401 response) and automatically re-authenticates | Long-running crawls, short session timeouts |
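Expiry detection reduces to two checks on each response. The login-path heuristic below is an assumption that needs tuning per target (some apps redirect to `/signin`, `/auth`, or return 200 with a login form):

```python
from urllib.parse import urlsplit

def session_expired(status, headers):
    """Detect mid-crawl session expiry: a 401, or a redirect whose
    Location header points at a login-like path."""
    if status == 401:
        return True
    if status in (301, 302, 303, 307):
        location = headers.get("Location", "")
        return "login" in urlsplit(location).path.lower()
    return False
```

When this returns True, the crawler pauses the queue, replays its authentication routine, and retries the request that tripped the check.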
§7-2. Multi-Role Crawling
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Unauthenticated → Authenticated Comparison | Crawls the same endpoints both unauthenticated and authenticated and compares the differences, identifying links, features, and parameters visible only after authentication | Foundation data for privilege escalation testing |
| Per-Role Crawling | Crawls as each role (regular user, editor, admin) to map role-specific features and endpoints | Burp Crawler's multiple-login feature |
| Authorization Matrix Generation | Aggregates crawl results across all roles to build an endpoint × role access matrix, the basis for IDOR and horizontal/vertical privilege escalation testing | Integration with Autorize (Burp extension), etc. |
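Building the matrix from per-role crawl results is a small aggregation. The data shape (`{role: {endpoint: status_code}}`) and the sample results are illustrative assumptions:

```python
def access_matrix(crawl_results):
    """Aggregate per-role crawl results into an endpoint x role matrix.
    A cell is True when that role received a 2xx for that endpoint.
    Rows where a low-privilege role matches admin access are candidates
    for broken access control."""
    endpoints = sorted({e for seen in crawl_results.values() for e in seen})
    return {
        e: {role: seen.get(e, 0) // 100 == 2 for role, seen in crawl_results.items()}
        for e in endpoints
    }

results = {
    "anonymous": {"/": 200, "/admin/users": 302},
    "user":      {"/": 200, "/admin/users": 403, "/profile": 200},
    "admin":     {"/": 200, "/admin/users": 200, "/profile": 200},
}
matrix = access_matrix(results)
```

Here `/admin/users` is correctly restricted to admin; a True cell for `user` on that row would flag a vertical privilege escalation candidate.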
§7-3. Mobile API Crawling
| Subtype | Mechanism | Key Condition |
|---|---|---|
| Mobile User-Agent | Changes the crawler's UA to a mobile device to discover mobile-only endpoints, lightweight APIs, and different response formats | Services with separate APIs for mobile apps |
| App Traffic Capture | Proxies mobile app traffic to collect API endpoints, parameters, and authentication flows | Understanding the API surface without binary analysis |
Core Principles
1. Never rely on a single technique. Active Spidering finds only linked content, Bruteforcing finds only guessable paths, and JS Analysis finds only information included in client code. Each technique covers a different area, and maximum coverage comes from the union of all techniques.
2. Start passive. Before sending a single request to the target, first secure all information collectible from archives and public sources. This allows mapping a significant portion of the attack surface without detection, and guides the direction of subsequent active crawling.
3. Adapt to the tech stack. Use /actuator/* wordlists for Spring Boot apps, introspection for GraphQL services, and headless rendering for SPAs. General-purpose crawling is the baseline, but technology-specific crawling reveals unique surface area.
4. Leverage the time axis. Analyze not just the current live state, but also past states (Wayback), source history (Git), and API version history. Removed endpoints and secrets may still be functional on the server.
5. Switch roles. The same application exposes different attack surfaces from the perspectives of unauthenticated users, regular users, admins, API clients, and mobile apps. Multi-role crawling provides the foundation data for IDOR and privilege escalation vulnerabilities.