# Crawlee is a web scraping and browser automation library

It helps you build reliable crawlers. Fast.

Get started:

```bash
npx crawlee create my-crawler
```

## Reliable crawling 🏗️

Crawlee won't fix broken selectors for you (yet), but it helps you build and maintain your crawlers faster. When a website adds JavaScript rendering, you don't have to rewrite everything, only switch to one of the browser crawlers. When you later find a great API to speed up your crawls, flip the switch back.

It keeps your proxies healthy by rotating them smartly with good fingerprints that make your crawlers look human-like. It's not unblockable, but it will save you money in the long run.

Crawlee is built by people who scrape for a living and use it every day to scrape millions of pages. Meet our community on Discord.

## JavaScript & TypeScript

We believe websites are best scraped in the language they're written in. Crawlee runs on Node.js and it's built in TypeScript to improve code completion in your IDE, even if you don't use TypeScript yourself. Crawlee supports crawling in both TypeScript and JavaScript.

## HTTP scraping

Crawlee makes HTTP requests that mimic browser headers and TLS fingerprints. It also rotates them automatically based on data about real-world traffic. The popular HTML parsers Cheerio and JSDOM are included.

## Headless browsers

Switch your crawlers from HTTP to headless browsers in 3 lines of code. Crawlee builds on top of Puppeteer and Playwright and adds its own anti-blocking features and human-like fingerprints. Chrome, Firefox and more.

## Automatic scaling and proxy management

Crawlee automatically manages concurrency based on available system resources and smartly rotates proxies. Proxies that often time out, return network errors or bad HTTP codes like 401 or 403 are discarded.

## Queue and Storage

You can save files, screenshots and JSON results to disk with one line of code, or plug in an adapter for your DB. Your URLs are kept in a queue that ensures their uniqueness and that you don't lose progress when something fails.

## Helpful utils and configurability

Crawlee includes tools for extracting social handles or phone numbers, infinite scrolling, blocking unwanted assets and many more. It works great out of the box, but also provides rich configuration options.

## Try Crawlee out 👾

> Before you start: Crawlee requires Node.js 16 or higher.

The fastest way to try Crawlee out is to use the Crawlee CLI and choose the "Getting started" example. The CLI will install all the necessary dependencies and add boilerplate code for you to play with.

```bash
npx crawlee create my-crawler
```

If you prefer adding Crawlee into your own project, try the example below. Because it uses `PlaywrightCrawler`, we also need to install Playwright. It's not bundled with Crawlee to reduce install size.

```bash
npm install crawlee playwright
```

```js
import { PlaywrightCrawler } from 'crawlee';

// PlaywrightCrawler crawls the web using a headless browser controlled by the Playwright library.
const crawler = new PlaywrightCrawler({
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({ request, page, enqueueLinks, pushData, log }) {
        const title = await page.title();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        // Save results as JSON to the `./storage/datasets/default` directory.
        await pushData({ title, url: request.loadedUrl });

        // Extract links from the current page and add them to the crawling queue.
        await enqueueLinks();
    },

    // Uncomment this option to see the browser window.
    // headless: false,

    // Comment this option out to scrape the full website.
    maxRequestsPerCrawl: 20,
});

// Add the first URL to the queue and start the crawl.
await crawler.run(['https://crawlee.dev']);

// Export the whole dataset to a single file at `./result.csv`.
await crawler.exportData('./result.csv');

// Or work with the data directly.
const data = await crawler.getData();
console.table(data.items);
```
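To make the "flip the switch" idea above concrete, here is a minimal sketch of the same crawl written against the HTTP-based `CheerioCrawler`. The handler mirrors the Playwright example, with Cheerio's `$` standing in for `page`; the `title` selector is the only scraping logic assumed here.

```js
import { CheerioCrawler } from 'crawlee';

// CheerioCrawler downloads pages over plain HTTP and parses them with Cheerio.
// It's much faster than a browser, but it doesn't execute JavaScript.
const crawler = new CheerioCrawler({
    async requestHandler({ request, $, enqueueLinks, pushData, log }) {
        // `$` is the Cheerio handle to the parsed HTML.
        const title = $('title').text();
        log.info(`Title of ${request.loadedUrl} is '${title}'`);

        await pushData({ title, url: request.loadedUrl });
        await enqueueLinks();
    },
    maxRequestsPerCrawl: 20,
});

await crawler.run(['https://crawlee.dev']);
```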
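The proxy rotation described earlier is wired in through the `ProxyConfiguration` class. Below is a hedged sketch using your own proxy list; the proxy URLs are placeholders, and the retiring of misbehaving proxies happens inside Crawlee rather than in this code.

```js
import { CheerioCrawler, ProxyConfiguration } from 'crawlee';

// The proxy URLs below are placeholders - substitute your own servers.
const proxyConfiguration = new ProxyConfiguration({
    proxyUrls: [
        'http://user:pass@proxy-1.example.com:8000',
        'http://user:pass@proxy-2.example.com:8000',
    ],
});

const crawler = new CheerioCrawler({
    // Crawlee rotates these proxies across requests and retires the ones that keep failing.
    proxyConfiguration,
    async requestHandler({ request, body, log }) {
        log.info(`Fetched ${request.url} (${body.length} bytes)`);
    },
});

await crawler.run(['https://crawlee.dev']);
```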
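The "one line of code" storage claim maps to the `Dataset` and `KeyValueStore` classes. As a sketch, saving a screenshot of every visited page from a Playwright handler could look like this; the key-sanitizing regex is an assumption for illustration, not part of the Crawlee API.

```js
import { PlaywrightCrawler, KeyValueStore } from 'crawlee';

const crawler = new PlaywrightCrawler({
    async requestHandler({ request, page }) {
        // Screenshot the page and store it in the default key-value store,
        // which lives on disk under `./storage/key_value_stores/default`.
        const screenshot = await page.screenshot();
        // Derive a store key from the URL (assumed sanitization, for illustration only).
        const key = request.url.replace(/[^a-zA-Z0-9!\-_.'()]/g, '-');
        await KeyValueStore.setValue(key, screenshot, { contentType: 'image/png' });
    },
    maxRequestsPerCrawl: 5,
});

await crawler.run(['https://crawlee.dev']);
```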
## Deploy to the cloud ☁️

Crawlee is developed by Apify, the web scraping and automation platform. You can deploy a Crawlee project wherever you want (see our deployment guides for AWS Lambda and Google Cloud), but using the Apify platform will give you the best experience. With a few simple steps, you can convert your Crawlee project into a so-called Actor. Actors are serverless micro-apps that are easy to develop, run, share, and integrate. The infra, proxies, and storages are ready to go. Learn more about Actors.

1️⃣ First, install the Apify SDK into your project, as well as the Apify CLI. The SDK will help with the Apify integration, while the CLI will help us with the initialization and deployment.

```bash
npm install apify
npm install -g apify-cli
```

2️⃣ The next step is to add `Actor.init()` to the beginning of your main script and `Actor.exit()` to the end of it. This will enable the integration with the Apify platform, so the cloud storages (e.g. `RequestQueue`) will be used. The code should look like this:

```js
import { PlaywrightCrawler } from 'crawlee';
// Import the `Actor` class from the Apify SDK.
import { Actor } from 'apify';

// Set up the integration with Apify.
await Actor.init();

// Crawler setup from the previous example.
const crawler = new PlaywrightCrawler({
    // ...
});
await crawler.run(['https://crawlee.dev']);

// Once finished, clean up the environment.
await Actor.exit();
```

3️⃣ Then you will need to sign up for an Apify account. Once you have it, use the Apify CLI to log in via `apify login`. The last two steps also involve the Apify CLI: call `apify init` first, which will add the Apify config to your project, and finally run `apify push` to deploy it.

```bash
apify login  # so the CLI knows you
apify init   # and the Apify platform understands your project
apify push   # time to ship it!
```
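Before pushing, you can also give the Actor a dry run on your machine. A minimal check, assuming the standard project layout produced by `apify init` (results land in the local storage directory rather than the cloud):

```bash
apify run   # runs the Actor locally
```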