15 Feb 05:13

04d796b

Release v0.4 Latest

The biggest release of Scrapling yet — introducing the Spider framework, proxy rotation, and major parser improvements

This release brings a fully async spider/crawling framework, intelligent proxy management, and significant API changes that make Scrapling more powerful and consistent. Please review the breaking changes section carefully before upgrading.

🕷️ Spider Framework

A new async crawling framework built on top of anyio for structured, large-scale scraping:

from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()

Scrapy-like Spider API: Define spiders with start_urls, async parse callbacks, Request/Response objects, and a priority queue.
Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
Multi-Session Support: Unified interface for HTTP requests and stealthy headless browsers in a single spider; route requests to different sessions by ID. Supports lazy session initialization.
Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C to shut down gracefully, then restart to resume from where you left off.
Streaming Mode: Stream scraped items as they arrive via async for item in spider.stream() with real-time stats; ideal for UIs, pipelines, and long-running crawls.
Blocked Request Detection: Automatic detection and retry of blocked requests with customizable logic.
Built-in Export: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL exporters, result.items.to_json() and result.items.to_jsonl() respectively.
Lifecycle hooks: on_start(), on_close(), on_error(), on_scraped_item(), and more hooks for full control over the crawl lifecycle.
Detailed crawl stats: Track requests, responses, bytes, status codes, proxies, per-domain/session breakdowns, log level counts, and more.
uvloop support: Pass use_uvloop=True to spider.start() for faster async execution when available.
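
To make the concurrency limit and streaming ideas concrete, here is a minimal standard-library sketch. It is not Scrapling's implementation (the framework is built on anyio, while this uses asyncio), and fake_fetch, stream_items, and collect are hypothetical names made up for illustration. It only shows the general pattern of capping in-flight requests and yielding items as soon as each one finishes:

```python
import asyncio
from typing import AsyncIterator, Dict, List


async def fake_fetch(url: str) -> Dict[str, str]:
    """Stand-in for a real download; returns one scraped item."""
    await asyncio.sleep(0.01)  # simulate network latency
    return {"url": url, "title": f"Title of {url}"}


async def stream_items(urls: List[str], concurrency: int = 2) -> AsyncIterator[Dict[str, str]]:
    """Yield items as each fetch finishes, with at most `concurrency` in flight."""
    limit = asyncio.Semaphore(concurrency)

    async def worker(url: str) -> Dict[str, str]:
        async with limit:  # cap the number of simultaneous requests
            return await fake_fetch(url)

    tasks = [asyncio.create_task(worker(u)) for u in urls]
    for finished in asyncio.as_completed(tasks):
        yield await finished  # hand the item to the consumer right away


async def collect(urls: List[str]) -> List[Dict[str, str]]:
    """Consume the stream the same way a caller would: with `async for`."""
    return [item async for item in stream_items(urls)]


urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
items = asyncio.run(collect(urls))
```

Items arrive in completion order, not submission order, so a consumer should not rely on the order of the input URLs.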

A new section with the full details has been added to the documentation website.

🔄 Proxy Rotation

New ProxyRotator class with thread-safe rotation. Works with all fetchers and sessions:

from scrapling import ProxyRotator
from scrapling.fetchers import Fetcher

url = "https://example.com/"  # placeholder target URL
rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
Fetcher.get(url, proxy_rotator=rotator)

Custom rotation strategies: Implement your own proxy rotation logic.
Per-request proxy override: Pass proxy= to any individual get()/post()/fetch() call to override the session proxy for that request.
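
For readers wondering what "thread-safe rotation" means in practice, the following is a rough sketch of a round-robin rotator guarded by a lock. RoundRobinRotator and next_proxy are hypothetical names and this is not the ProxyRotator class itself; it only illustrates the technique using the standard library:

```python
import itertools
import threading
from typing import List


class RoundRobinRotator:
    """Hypothetical rotator: hands out proxies in order, safely across threads."""

    def __init__(self, proxies: List[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._cycle = itertools.cycle(proxies)
        self._lock = threading.Lock()

    def next_proxy(self) -> str:
        # The lock ensures two threads never advance the iterator at the same time.
        with self._lock:
            return next(self._cycle)


rotator = RoundRobinRotator(["http://proxy1:8080", "http://proxy2:8080"])
picked = [rotator.next_proxy() for _ in range(3)]
print(picked)
```

A different strategy (random choice, weighting by recent failures, and so on) would only need to change the body of next_proxy.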

🌐 Browser Fetcher Improvements

Domain blocking: New blocked_domains parameter on DynamicFetcher/StealthyFetcher to block requests to specific domains (subdomains matched automatically).
Automatic retries: Browser fetchers now retry on failure with retries (default: 3) and retry_delay (default: 1s) parameters. Includes proxy-aware error detection.
Response metadata: Response.meta dict automatically stores the proxy used, and merges request metadata.
Response.follow(): Create follow-up Request objects with automatic referer flow, designed for the spider system.
No autoplay: Browser sessions now block autoplay content, which caused issues before.
Speed: Improved stealth and speed by adjusting browser flags.
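
As an illustration of the "subdomains matched automatically" behaviour described for blocked_domains, here is one way such a check could look. This is an assumption about the general matching rule, not the library's code, and is_blocked is a hypothetical helper:

```python
from typing import Set
from urllib.parse import urlsplit


def is_blocked(url: str, blocked_domains: Set[str]) -> bool:
    """Return True if the URL's host equals a blocked domain or is a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    # Require a leading dot for the suffix match so "nottracker.net" does not
    # match "tracker.net".
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


blocked = {"ads.example.com", "tracker.net"}
print(is_blocked("https://cdn.tracker.net/pixel.gif", blocked))
print(is_blocked("https://example.com/", blocked))
```

Matching on the dot-prefixed suffix is what keeps unrelated domains that merely end with the same characters from being blocked.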

🔧 Bug Fixes & Improvements

Parser optimization: Optimized the parser for repeated operations, improving performance.
Errored pages: Fixed a bug that prevented the browser from closing when pages returned errors.
Empty body: Responses with an empty body are now handled.
Playwright loop: Fixed an issue where the Playwright loop was left open when a CDP connection failed.
Type safety: Fixed all mypy errors and added type hints across untyped function bodies. Added mypy and pyright to the CI workflow.

⚠️ Breaking Changes

css_first/xpath_first removed: Use css('.selector').first, css('.selector')[0], or css('.selector').get() instead.
All selection now returns Selectors: css('::text'), xpath('//text()'), css('::attr(href)'), and xpath('//@href') now return Selectors (wrapping text nodes in Selector objects with tag="#text") instead of TextHandlers. This makes the API consistent across all selection methods and the type hints.
Response.body is always bytes: Previously could be str or bytes, now always returns bytes.
get()/getall() behavior: On Selector: get() returns TextHandler (serialized HTML or text value), getall() returns TextHandlers. Aliases: extract_first = get, extract = getall. Old get_all() on Selectors is removed.
Selectors.first/.last: Safe accessors that return Selector | None instead of raising IndexError.
Internal constants renamed: DEFAULT_FLAGS → DEFAULT_ARGS, DEFAULT_STEALTH_FLAGS → STEALTH_ARGS, HARMFUL_DEFAULT_ARGS → HARMFUL_ARGS, DEFAULT_DISABLED_RESOURCES → EXTRA_RESOURCES.
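
To show what "safe accessors" means for .first/.last, here is a small stand-alone sketch. SafeList is a hypothetical class, not the Selectors type; it only mirrors the described semantics of returning None on an empty collection where indexing would raise IndexError:

```python
from typing import Any, Optional


class SafeList(list):
    """Hypothetical list subclass with `.first` / `.last` that never raise IndexError."""

    @property
    def first(self) -> Optional[Any]:
        return self[0] if self else None

    @property
    def last(self) -> Optional[Any]:
        return self[-1] if self else None


empty = SafeList()
found = SafeList(["a", "b", "c"])

print(empty.first)   # None instead of an exception
print(found.last)    # the final element
```

Code that previously relied on a missing match returning None can keep that shape with .first, while [0] remains available when an exception on no match is preferred.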

🔨 Other Changes

Dependency changes: Replaced tldextract with tld, removed internal _html_utils.py in favor of w3lib.html.replace_entities, added typing_extensions as a hard requirement.
Docs overhaul: Full switch from MkDocs to Zensical, new spider documentation section, updated all existing pages, and added new API references.

🙏 Special thanks to our Discord community for all the continuous testing and feedback

Big shoutout to our biggest Sponsors

03 Jan 20:29

github-actions

v0.3.14

de79fe8

Release v0.3.14

A minor maintenance update to fix issues that happened with some devices in v0.3.13

Disabled incognito mode in StealthyFetcher and its session classes, since it made cookies non-persistent across pages on Windows devices. This didn't happen on macOS or Linux (Fixes #123, thanks to @frugality4121 for bringing it up and to @gembleman for pointing out the solution).
Pinned the latest version of browserforge to solve the issue with old header models for users who already had an old browserforge version installed.

🙏 Special thanks to our Discord community for all the continuous testing and feedback

Contributors

gembleman and frugality4121

01 Jan 20:07

github-actions

v0.3.13

6ad2ecb

Release v0.3.13

This is a big update with improvements in many areas, but also several breaking changes, made for good reasons. Please read the notes below before updating.

For many reasons, we have decided to stop using Camoufox entirely from now on; we might switch back to it in the future if its development continues. If you prefer to continue using Camoufox as before this release, there are instructions for that in this section.
Previously, we used patchright for the stealth mode inside DynamicFetcher and its session classes. We have now removed the stealth mode from them and use patchright inside StealthyFetcher and its session classes instead, with A LOT of improvements, as you will see, that raise the overall stealth on top of patchright.

This makes StealthyFetcher and its session classes 101% faster than before, lighter on memory and disk space, and ~400 lines of code shorter. Most importantly, they are more stable than they were with Camoufox.

This will also shorten the installation time of the scrapling install command, reduce the size of the Docker image, improve test smoothness in GitHub's CI, and make scrapling less confusing for new users.

Breaking changes

The stealth argument was removed from the DynamicFetcher class and its session class, while the hide_canvas argument was moved to the StealthyFetcher and its session classes.
The disable_webgl argument has been moved from DynamicFetcher to the StealthyFetcher class and renamed to allow_webgl. The same applies to all session classes.
The StealthyFetcher class is now basically the new stealthy version of DynamicFetcher, so the following arguments are removed: block_images, humanize, addons, os_randomize, disable_ads, and geoip. I tried to replicate them in Chromium, but each had its own problem. This might change with upcoming releases before v0.4.

Now for the good news: we have improved and fixed a lot of stuff :)

Improvements

You already know that the StealthyFetcher class and its session classes are now 101% faster than before, and the DynamicFetcher class and its session class are now 20% faster as well.
Cloudflare's solver algorithm has been improved to finish faster and handle more cases. Also, thanks to the new refactor, expect the solver to solve the captcha twice as fast!
All fetchers now use less memory.
The MCP server now uses fewer tokens to save more money!
The Docker image is now 60% smaller.
The whole documentation website has been updated with the new stuff. At the same time, it was made more explicit, many sections were shortened, more examples were added, missing arguments were included, the API reference section was updated with graphs, and many other improvements were made. The website now loads 130% faster, uses less data, and is better for SEO.

Fixes

Added the arguments that were missing before in the Web Scraping shell shortcuts and made them more accurate.
Fixed the issue where the google_search argument was creating a Google referrer even when the URL pointed to localhost or an IP address.

🙏 Special thanks to our Discord community for all the continuous testing and feedback

18 Dec 00:21

github-actions

v0.3.12

2e5152f

Release v0.3.12

What's Changed

Added a new argument to DynamicSession/AsyncDynamicSession classes called timezone_id, which allows you to set the timezone of the browser so that it matches the timezone of the Proxy/VPN you are using. That way, the websites can't detect that you are using a proxy through the timezone mismatch technique.
Improved the automated conversion of response to JSON.
Renamed the internal function __create__ to start inside fetchers' session classes to make it easier to use them outside the with context.
Updated curl_cffi and other deps to the latest versions.

🙏 Special thanks to our Discord community for all the continuous testing and feedback

03 Dec 01:53

github-actions

v0.3.11

0f9127e

Release v0.3.11

What's Changed

Added better logic for handling timeout errors when the network_idle argument is used on an unstable website (websites with media playing, etc.).
Fixed the autocompletion for the stealthy_fetch shortcut in the Web Scraping Shell.

🙏 Special thanks to our Discord community for all the continuous testing and feedback

26 Nov 17:49

github-actions

v0.3.10

ab0be95

Release v0.3.10

A maintenance update with many significant changes and possible breaking changes

Solved all encoding issues by using a better approach that handles web pages where the encoding is not correctly declared (thanks to @Kemsty2 for pointing that out in #110 and #111).
Solved a logical issue with overriding session-level parameters with request-level parameters in all browser-based fetchers; it had been present since v0.3.
Fixed the signatures of the shortcuts in the interactive web scraping shell, which gives the shortcuts a perfect autocompletion experience in the shell. This issue had been present since v0.3 as well.
Bumped the version of the Maxmind database, which will improve the geoip argument for StealthyFetcher and its session classes.
Updated all used browser versions to the latest available ones.
BREAKING - all fetchers have gone through a big refactor, which resulted in some interesting things that might break your code:
1. The Scrapling codebase is now ~750 lines smaller, with many changes that make future maintenance much easier and use slightly fewer resources.
2. The validation for all fetchers and their session classes became much faster, which is reflected in their overall speed.
3. To achieve this, fetchers no longer accept positional arguments other than the url argument; the rest of the arguments must be passed as keyword arguments, so your code must look like Fetcher.get('https://google.com', stealthy_headers=True), not Fetcher.get('https://google.com', True), if you were doing that for some reason!
4. An annoying difference between browser-based fetchers and their session classes since v0.3 was that the argument used to pass custom parser settings per request was called custom_config, while it was named selector_config in the session classes. This refactor allowed us to unify the naming to selector_config without breaking your code, so the main one is now selector_config, with backward compatibility for the custom_config argument. Autocompletion support is available only for the selector_config argument.
5. Also, to achieve all of this, we had to make the type hints of the fetchers' functions dynamically generated, so if you don't get proper autocompletion in your IDE, make sure you are using a modern version of it. We have tested almost all known IDEs/editors.
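
Points 3 and 4 can be pictured with a plain Python function. The sketch below is not the fetchers' code: fetch is a hypothetical function and example_option is a made-up setting. It only shows how a keyword-only signature rejects extra positional arguments and how an old argument name can be accepted as a fallback:

```python
from typing import Any, Dict, Optional


def fetch(url: str, *, stealthy_headers: bool = False,
          selector_config: Optional[Dict[str, Any]] = None,
          **legacy: Any) -> Dict[str, Any]:
    """Hypothetical fetch function: only `url` may be passed positionally."""
    # Accept the old name as a fallback so existing callers keep working.
    if selector_config is None and "custom_config" in legacy:
        selector_config = legacy.pop("custom_config")
    if legacy:
        raise TypeError(f"unexpected arguments: {sorted(legacy)}")
    return {"url": url, "stealthy_headers": stealthy_headers,
            "selector_config": selector_config or {}}


ok = fetch("https://example.com", stealthy_headers=True)
old_style = fetch("https://example.com", custom_config={"example_option": True})

try:
    fetch("https://example.com", True)  # second positional argument is rejected
except TypeError as exc:
    positional_error = str(exc)
```

The bare * in the signature is what makes everything after url keyword-only, so the failure happens at call time with a TypeError rather than silently binding True to the wrong parameter.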

We have also updated all benchmark tables with the current numbers against the latest versions of all alternative libraries.

🙏 Special thanks to our Discord community for all the continuous testing and feedback

Contributors

Kemsty2

17 Nov 01:38

github-actions

v0.3.9

e6e65ba

Release v0.3.9

A new update with many important changes

🚀 New Stuff and quality of life changes

The impersonate argument in Fetcher and FetcherSession now accepts a list of browsers; the library picks a random browser from the list for each request.

from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate=['chrome', 'firefox', 'safari']) as s:
    s.get('https://github.com/D4Vinci/Scrapling')

Added a new argument to the clean method in TextHandler that removes HTML entities from the current text easily.
Huge improvements to the documentation with more precise explanations of many parts and automatic translations of the main README.md file.

🐛 Bug Fixes

Fixed a big issue with retrieving responses from browser-based fetchers. Now, there is intelligent content type detection that ensures response.body contains the rendered browser content only if the content is HTML; otherwise, it contains the raw content of the last request made. This allows you to download binary files and text-based files without having to find them wrapped in HTML tags, while being able to retrieve the rendered content you want from the website when fetching it.
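
The content type rule described above can be sketched as a small decision function. This is an assumed simplification rather than the library's actual detection logic; choose_body and its parameters are hypothetical, and the set of MIME types treated as HTML is a guess:

```python
def choose_body(content_type: str, rendered_html: str, raw_body: bytes) -> bytes:
    """Return rendered HTML for HTML responses, otherwise the raw bytes untouched."""
    # Strip parameters such as "; charset=utf-8" before comparing.
    mime = content_type.split(";", 1)[0].strip().lower()
    if mime in {"text/html", "application/xhtml+xml"}:
        return rendered_html.encode("utf-8")
    return raw_body


page = choose_body("text/html; charset=utf-8", "<html><body>hi</body></html>", b"ignored")
pdf = choose_body("application/pdf", "<html><embed></html>", b"%PDF-1.7")
```

With a rule like this, a downloaded PDF or JSON file keeps its original bytes, while an HTML page reflects what the browser rendered.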

🔨 Misc

Updated the contributing guide to make it clearer and easier.
Added a new workflow to enforce code quality tools (the same ones used as pre-commit hooks).

🙏 Special thanks to our Discord community for all the continuous testing and feedback

27 Oct 15:08

github-actions

v0.3.8

58784b7

Release v0.3.8

A new update with many important changes

🚀 New Stuff and quality of life changes

For all browser-based fetchers: websites that never finish loading their requests no longer crash the code when network_idle is used with them.
The logic for collecting/checking page content in browser-based fetchers has been changed to make browsers as stable on Windows systems as they are on Linux/macOS (the difference in behaviour comes from Playwright's different implementation on Windows systems).
Refactored all the validation logic, which made all requests from browser-based fetchers 8-15% faster.
A new option called extra_flags has been added to DynamicFetcher and its session to allow users to add custom Chrome flags to the existing ones when launching the browser.
Reverted the route logic for catching responses (changed in the last version) to use the old routing version when page_action is used. This was done to collect the latest version of a page's content in case page_action changes it without making a request. (Thanks to @gembleman for pointing it out in #100 and #102.)

🐛 Bug Fixes

Fixed a typo in load_dom in DynamicSession's async_fetch.
Fixed an issue with the Cloudflare solver that made it wait forever for embedded captchas that don't disappear after solving. Now it waits up to 30 seconds for the captcha to disappear, then assumes it's the type that doesn't disappear (Fixes #100).
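
The "wait up to 30 seconds, then move on" behaviour is a bounded polling loop. Here is a generic sketch of that pattern; wait_until_gone, is_visible, and poll_interval are hypothetical names, and this is not the solver's actual code:

```python
import time
from typing import Callable


def wait_until_gone(is_visible: Callable[[], bool], timeout: float = 30.0,
                    poll_interval: float = 0.5) -> bool:
    """Poll until `is_visible()` is False. Return True if it disappeared, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if not is_visible():
            return True
        time.sleep(poll_interval)
    return False  # still visible: the caller treats it as the persistent kind


# A captcha that never disappears: the wait gives up instead of hanging forever.
gave_up = wait_until_gone(lambda: True, timeout=0.05, poll_interval=0.01)
# A captcha that is already gone: the wait returns immediately.
disappeared = wait_until_gone(lambda: False, timeout=1.0)
```

Using time.monotonic for the deadline keeps the timeout correct even if the system clock is adjusted while waiting.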

🔨 Misc

The Docker image is now automatically pushed to Docker Hub and GitHub's container registry for user convenience.
Added a new documentation page to show how to use Scrapeless browser with Scrapling.

🙏 Special thanks to our Discord community for all the continuous testing and feedback

Contributors

gembleman

12 Oct 04:35

github-actions

v0.3.7

5bd5dc5

Release v0.3.7

A new update with many important changes

🚀 New Stuff and quality of life changes

Reworked solve_cloudflare argument in StealthyFetcher to make it able to solve all kinds of custom implementations of Turnstile.
Refactored the entire codebase to pass Pyright checks, so expect a flawless IDE experience with all editors, along with many bugs solved.
Refactored the requests logic to be cleaner and faster (also solves #97).
Added a new option, user_data_dir, to all browser-based session classes to allow the user to reuse the browser session data (cookies/storage/etc...) from previous sessions. Leaving it unset will cause Playwright to use a random directory on each run, as was happening before.
Added a new customization option, additional_args, to DynamicFetcher and its session class to enable the user to pass extra arguments to Playwright's context, as we had with StealthyFetcher before.
The route logic for collecting the last navigation response for all browsers has been improved, which allows the raw responses to be passed to the parser before being processed by the browsers as before. This will be very helpful with text/JSON responses.

🐛 Bug Fixes

The rework of the route logic solved an issue with retrieving the content of unstable websites on some Windows devices.
All the refactors that happened in this version solved a lot of bugs along the way that were hard to spot before, and weird autocompletion issues with some IDEs.
Many fixes to the documentation website

🙏 Special thanks to our Discord community for all the continuous testing and feedback

01 Oct 03:40

github-actions

v0.3.6

e8b0e72

Release v0.3.6

🚀 New Stuff

Improved the solve_cloudflare argument in StealthyFetcher and its session classes to be able to solve all types of both Turnstile and interstitial Cloudflare challenges 🎉
Now the MCP server has the option to use Streamable HTTP, so you can easily expose the server.
Added Docker support, so now an image is built and pushed to Docker Hub automatically with each release (contains all browsers)

🐛 Bug Fixes

Fixed an encoding issue with the parser that happened in some cases (the famous invalid start byte error)
Restructured multiple parts of the library to fix some memory leaks, so enjoy noticeably lower memory usage, depending on your config (also solves #92).
Improved type annotations in many parts of the code so you can have a better IDE experience (also solves #93).

🙏 Special thanks to our Discord community for all the continuous testing and feedback

Releases: D4Vinci/Scrapling
