Releases: D4Vinci/Scrapling
Release v0.4
The biggest release of Scrapling yet: introducing the Spider framework, proxy rotation, and major parser improvements
This release brings a fully async spider/crawling framework, intelligent proxy management, and significant API changes that make Scrapling more powerful and consistent. Please review the breaking changes section carefully before upgrading.
Spider Framework
A new async crawling framework built on top of anyio for structured, large-scale scraping:
```python
from scrapling.spiders import Spider, Response

class MySpider(Spider):
    name = "demo"
    start_urls = ["https://example.com/"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

MySpider().start()
```

- Scrapy-like Spider API: Define spiders with `start_urls`, async `parse` callbacks, `Request`/`Response` objects, and a priority queue.
- Concurrent Crawling: Configurable concurrency limits, per-domain throttling, and download delays.
- Multi-Session Support: A unified interface for HTTP requests and stealthy headless browsers in a single spider; route requests to different sessions by ID. Supports lazy session initialization.
- Pause & Resume: Checkpoint-based crawl persistence. Press Ctrl+C to shut down gracefully, then restart to resume from where you left off.
- Streaming Mode: Stream scraped items as they arrive via `async for item in spider.stream()` with real-time stats; ideal for UIs, pipelines, and long-running crawls (see the sketch after this list).
- Blocked Request Detection: Automatic detection and retry of blocked requests, with customizable logic.
- Built-in Export: Export results through hooks and your own pipeline, or use the built-in JSON/JSONL export via `result.items.to_json()` / `result.items.to_jsonl()` respectively.
- Lifecycle hooks: `on_start()`, `on_close()`, `on_error()`, `on_scraped_item()`, and more, for full control over the crawl lifecycle.
- Detailed crawl stats: Track requests, responses, bytes, status codes, proxies, per-domain/session breakdowns, log-level counts, and more.
- uvloop support: Pass `use_uvloop=True` to `spider.start()` for faster async execution when available.
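For a feel of the streaming mode, here is a minimal sketch; it assumes `spider.stream()` is an async generator that drives the crawl itself, and all names besides the documented API are illustrative:

```python
# Streaming-mode sketch, assuming stream() is an async generator that runs the
# crawl and yields items as they are scraped; ProductSpider is illustrative.
import asyncio

from scrapling.spiders import Spider, Response

class ProductSpider(Spider):
    name = "products"
    start_urls = ["https://example.com/products"]

    async def parse(self, response: Response):
        for item in response.css('.product'):
            yield {"title": item.css('h2::text').get()}

async def main():
    # Consume items as they arrive instead of waiting for the crawl to finish
    async for item in ProductSpider().stream():
        print(item)

asyncio.run(main())
```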
A new section with the full details has been added to the website.
Proxy Rotation
- New `ProxyRotator` class with thread-safe rotation. Works with all fetchers and sessions:

```python
from scrapling import Fetcher, ProxyRotator

rotator = ProxyRotator(["http://proxy1:8080", "http://proxy2:8080"])
Fetcher.get("https://example.com", proxy_rotator=rotator)
```

- Custom rotation strategies: Write your own proxy-rotation logic (a library-agnostic sketch of the idea follows this list).
- Per-request proxy override: Pass `proxy=` to any individual `get()`/`post()`/`fetch()` call to override the session proxy for that request.
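To illustrate what thread-safe rotation involves, here is a sketch in plain Python; it is not Scrapling's internals or its extension API:

```python
# Illustrative only: a minimal thread-safe round-robin rotator in plain Python.
import threading
from itertools import cycle

class RoundRobinRotator:
    def __init__(self, proxies: list[str]):
        self._pool = cycle(proxies)
        self._lock = threading.Lock()

    def next_proxy(self) -> str:
        # The lock keeps concurrent threads from racing on the shared iterator
        with self._lock:
            return next(self._pool)

rotator = RoundRobinRotator(["http://proxy1:8080", "http://proxy2:8080"])
print(rotator.next_proxy())  # http://proxy1:8080
print(rotator.next_proxy())  # http://proxy2:8080
```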
Browser Fetcher Improvements
- Domain blocking: New `blocked_domains` parameter on `DynamicFetcher`/`StealthyFetcher` to block requests to specific domains (subdomains are matched automatically; a combined sketch follows this list).
- Automatic retries: Browser fetchers now retry on failure, controlled by the `retries` (default: 3) and `retry_delay` (default: 1s) parameters. Includes proxy-aware error detection.
- Response metadata: The `Response.meta` dict automatically stores the proxy used and merges in request metadata.
- `Response.follow()`: Create follow-up `Request` objects with automatic referer flow, designed for the spider system.
- No autoplay: Browser sessions now block autoplay content, which caused issues before.
- Speed: Improved stealth and speed by adjusting browser flags.
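A combined sketch of the new parameters named above; the domains and values are illustrative, and the exact call shape is an assumption:

```python
# Sketch: blocked_domains, retries, and retry_delay as described above.
# The domain list and values are illustrative.
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch(
    "https://example.com",
    blocked_domains=["ads.example.com", "tracker.example.net"],  # subdomains blocked too
    retries=3,       # retry on failure (default: 3)
    retry_delay=1,   # delay between retries in seconds (default: 1)
)
print(page.meta)  # stores the proxy used, merged with request metadata
```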
Bug Fixes & Improvements
- Parser optimization: Optimized the parser for repeated operations, improving performance.
- Errored pages: Fixed a bug that prevented the browser from closing when pages returned errors.
- Empty body: Responses with an empty body are now handled correctly.
- Playwright loop: Fixed an issue where the Playwright loop was left open when the CDP connection fails.
- Type safety: Fixed all mypy errors and added type hints across untyped function bodies. Added mypy and pyright to the CI workflow.
Breaking Changes
- `css_first`/`xpath_first` removed: Use `css('.selector').first`, `css('.selector')[0]`, or `css('.selector').get()` instead (a migration sketch follows this list).
- All selection now returns `Selectors`: `css('::text')`, `xpath('//text()')`, `css('::attr(href)')`, and `xpath('//@href')` now return `Selectors` (wrapping text nodes in `Selector` objects with `tag="#text"`) instead of `TextHandlers`. This makes the API and its type hints consistent across all selection methods.
- `Response.body` is always `bytes`: Previously it could be `str` or `bytes`; now it always returns `bytes`.
- `get()`/`getall()` behavior: On `Selector`, `get()` returns a `TextHandler` (the serialized HTML or text value) and `getall()` returns `TextHandlers`. Aliases: `extract_first = get`, `extract = getall`. The old `get_all()` on `Selectors` is removed.
- `Selectors.first`/`.last`: Safe accessors that return `Selector | None` instead of raising `IndexError`.
- Internal constants renamed: `DEFAULT_FLAGS` → `DEFAULT_ARGS`, `DEFAULT_STEALTH_FLAGS` → `STEALTH_ARGS`, `HARMFUL_DEFAULT_ARGS` → `HARMFUL_ARGS`, `DEFAULT_DISABLED_RESOURCES` → `EXTRA_RESOURCES`.
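A small migration sketch based on the list above, assuming `page` is an already-fetched response:

```python
# Migration sketch for the v0.4 selection changes; `page` is an assumed,
# already-fetched response object.
# Old (pre-v0.4):
#     first_title = page.css_first('h2::text')
# New (v0.4) equivalents:
first_title = page.css('h2::text').get()       # first value as a TextHandler
maybe_node = page.css('h2::text').first        # Selector | None, never raises
all_links = page.css('::attr(href)').getall()  # all values as TextHandlers
```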
Other Changes
- Dependency changes: Replaced `tldextract` with `tld`, removed the internal `_html_utils.py` in favor of `w3lib.html.replace_entities`, and added `typing_extensions` as a hard requirement.
- Docs overhaul: A full switch from MkDocs to Zensical, a new spider documentation section, updates to all existing pages, and new API references.
Special thanks to our Discord community for all the continuous testing and feedback
Big shoutout to our biggest Sponsors
Release v0.3.14
A minor maintenance update to fix issues that happened with some devices in v0.3.13
- Disabled incognito mode in `StealthyFetcher` and its session classes, since it made cookies not persist across pages on Windows devices; this didn't happen on macOS and Linux (fixes #123; thanks to @frugality4121 for bringing it up and to @gembleman for pointing out the solution).
- Pinned down the last version of browserforge to solve the issue with old header models for users with an already-outdated browserforge version.
Special thanks to our Discord community for all the continuous testing and feedback
Big shoutout to our biggest Sponsors
Release v0.3.13
This is a big update with improvements across the board, but also many breaking changes, for good reasons. Please read the notes below before updating.
- For many reasons, we decided to stop using Camoufox entirely from now on; we might switch back to it in the future if its development continues. If you prefer to keep using Camoufox as before this release, there are instructions for that in this section.
- Previously, we used patchright for the stealth mode inside `DynamicFetcher` and its session classes. We have now removed the stealth mode from them and use patchright inside `StealthyFetcher` and its session classes instead, with a lot of improvements on top of patchright, as you will see, improving the overall stealth.

This makes `StealthyFetcher` and its session classes 101% faster than before; they use less memory and disk space, are ~400 lines of code shorter, and, most importantly, are more stable than when we used Camoufox.

This will also shorten the installation time of the `scrapling install` command, reduce the size of the Docker image, make tests run more smoothly in GitHub's CI, and make Scrapling less confusing for new users.
Breaking changes
- The `stealth` argument was removed from the `DynamicFetcher` class and its session class, while the `hide_canvas` argument was moved to `StealthyFetcher` and its session classes (see the sketch after this list).
- The `disable_webgl` argument has been moved from `DynamicFetcher` to the `StealthyFetcher` class and renamed `allow_webgl`. The same applies to all session classes.
- The `StealthyFetcher` class is now essentially the new stealthy version of `DynamicFetcher`, so the following arguments were removed: `block_images`, `humanize`, `addons`, `os_randomize`, `disable_ads`, and `geoip`. I tried to replicate them in Chromium, but each had its own problem. This might change in upcoming releases before v0.4.
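As an illustration of the moves above, a post-upgrade call might look like this; the argument values are illustrative:

```python
# Sketch: after v0.3.13, the stealth arguments live on StealthyFetcher.
# hide_canvas and allow_webgl are the moved/renamed arguments named above.
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch("https://example.com", hide_canvas=True, allow_webgl=True)
```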
Now to the good news: we have improved and fixed a lot of stuff :)
Improvements
- You already know that the `StealthyFetcher` class and its session classes are now 101% faster than before, but the `DynamicFetcher` class and its session class are now 20% faster as well.
- The Cloudflare solver algorithm has been improved to finish faster and handle more cases. Also, thanks to the new refactor, expect the solver to solve the captcha twice as fast!
- All fetchers now use less memory.
- The MCP server now uses fewer tokens to save more money!
- The Docker image is now 60% smaller.
- The whole documentation website has been updated with the new material. It was also made more explicit: many sections were shortened, more examples were added, missing arguments were included, the API reference section was updated with graphs, and many other improvements were made. The website now loads 130% faster, uses less data, and is better for SEO.
Fixes
- Added the arguments that were previously missing from the Web Scraping Shell shortcuts and made them more accurate.
- Fixed the issue where the `google_search` argument created a Google referrer even when the URL was a localhost address or an IP.
Special thanks to our Discord community for all the continuous testing and feedback
Big shoutout to our biggest Sponsors
Release v0.3.12
What's Changed
- Added a new argument called `timezone_id` to the `DynamicSession`/`AsyncDynamicSession` classes, which lets you set the browser's timezone so that it matches the timezone of the proxy/VPN you are using. That way, websites can't detect that you are using a proxy through the timezone-mismatch technique (see the sketch after this list).
- Improved the automated conversion of responses to JSON.
- Renamed the internal function `__create__` to `start` inside the fetchers' session classes to make them easier to use outside the `with` context.
- Updated `curl_cffi` and other dependencies to the latest versions.
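A minimal sketch of the new argument; the timezone value is illustrative and should match your actual proxy's location:

```python
# Sketch: set timezone_id to match your proxy/VPN exit node (added in v0.3.12).
# "Europe/Amsterdam" is illustrative.
from scrapling.fetchers import DynamicSession

with DynamicSession(timezone_id="Europe/Amsterdam") as session:
    page = session.fetch("https://example.com")
```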
Special thanks to our Discord community for all the continuous testing and feedback
Big shoutout to our biggest Sponsors
Release v0.3.11
What's Changed
- Added better logic for handling timeout errors when the `network_idle` argument is used on an unstable website (websites with media playing, etc.).
- Fixed the autocompletion for the `stealthy_fetch` shortcut in the Web Scraping Shell.
Special thanks to our Discord community for all the continuous testing and feedback
Big shoutout to our biggest Sponsors
Release v0.3.10
A maintenance update with many significant changes and possible breaking changes
- Solved all encoding issues using a better approach that handles web pages where the encoding is not correctly declared (thanks to @Kemsty2's efforts in pointing that out in #110 and #111).
- Solved a logical issue with overriding session-level parameters with request-level parameters in all browser-based fetchers; it had been present since v0.3.
- Fixed the signatures of the shortcuts in the interactive Web Scraping Shell, giving the shortcuts a perfect autocompletion experience. This issue had also been present since v0.3.
- Bumped the version of the MaxMind database, which improves the `geoip` argument for `StealthyFetcher` and its session classes.
- Updated all used browser versions to the latest available ones.
- BREAKING: All fetchers have gone through a big refactor, which resulted in some interesting things that might break your code:
  - The Scrapling codebase is now smaller by ~750 lines, with many changes that will make maintenance much easier in the future and use slightly fewer resources.
  - Validation for all fetchers and their session classes is now much faster, which will be reflected in their overall speed.
  - To achieve this, all fetchers no longer accept positional arguments other than the `url` argument; the rest must be keyword arguments. So your code must look like `Fetcher.get('https://google.com', stealthy_headers=True)`, not `Fetcher.get('https://google.com', True)`, if you were doing that for some reason!
  - An annoying inconsistency between browser-based fetchers and their session classes since v0.3 was that the argument for passing custom parser settings per request was called `custom_config` on the fetchers but `selector_config` on the session classes. This refactor allowed us to unify the naming to `selector_config` without breaking your code, so the main argument is now `selector_config`, with backward compatibility for the `custom_config` argument. Autocompletion support is available only for `selector_config` (a small sketch follows this list).
  - Also, to achieve all of this, the type hints of the fetchers' functions are now generated dynamically, so if you don't get proper autocompletion in your IDE, make sure you are using a modern version of it. We have tested almost all known IDEs/editors.
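For illustration, here is a per-request parser setting passed through the unified argument; the `keep_comments` key is an assumption, not a documented setting:

```python
# Sketch: per-request parser settings via the unified selector_config argument.
# The 'keep_comments' key is illustrative, not a confirmed setting name.
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://example.com', selector_config={'keep_comments': True})
```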
We have also updated all benchmark tables with the current numbers against the latest versions of all alternative libraries.
Special thanks to our Discord community for all the continuous testing and feedback
Big shoutout to our biggest Sponsors
Release v0.3.9
A new update with many important changes
New Stuff and quality of life changes
- The `impersonate` argument in `Fetcher` and `FetcherSession` can now accept a list of browsers; the library will choose a random browser from the list with each request:

```python
from scrapling.fetchers import FetcherSession

with FetcherSession(impersonate=['chrome', 'firefox', 'safari']) as s:
    s.get('https://github.com/D4Vinci/Scrapling')
```

- A new argument to the `clean` method in `TextHandler` to easily remove HTML entities from the current text.
- Huge improvements to the documentation, with more precise explanations of many parts and automatic translations of the main `README.md` file.
Bug Fixes
- Fixed a big issue with retrieving responses from browser-based fetchers. There is now intelligent content-type detection that ensures `response.body` contains the rendered browser content only if the content is HTML; otherwise, it contains the raw content of the last request made. This lets you download binary and text-based files without finding them wrapped in HTML tags, while still retrieving the rendered content you want from a website when fetching it (see the sketch below).
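A sketch of what this fix enables; the URL and filename are illustrative:

```python
# Sketch: with the v0.3.9 fix, non-HTML responses keep their raw bytes in
# response.body, so files can be saved through a browser-based fetcher.
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch("https://example.com/report.pdf")  # illustrative URL
with open("report.pdf", "wb") as f:
    f.write(page.body)  # raw bytes of the last request, not HTML-wrapped
```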
Misc
- Updated the contributing guide to make it clearer and easier to follow.
- Added a new workflow to enforce code-quality tools (the same ones used as pre-commit hooks).
Special thanks to our Discord community for all the continuous testing and feedback
Big shoutout to our biggest Sponsors
Release v0.3.8
A new update with many important changes
New Stuff and quality of life changes
- For all browser-based fetchers: websites that never finish loading their requests will no longer crash your code when you use `network_idle` with them.
- The logic for collecting/checking page content in browser-based fetchers has been changed to make browsers as stable on Windows systems as they are on Linux/macOS (the difference in behaviour comes from Playwright's different implementation on Windows).
- Refactored all the validation logic, which made all requests from all browser-based fetchers 8-15% faster.
- A new option called `extra_flags` has been added to `DynamicFetcher` and its session class to let users add custom Chrome flags to the existing ones when launching the browser (see the sketch after this list).
- Reverted the route logic for catching responses (changed in the last version) to the old routing version when `page_action` is used. This was added to collect the latest version of a page's content in case `page_action` changes it without making a request (thanks to @gembleman for pointing it out in #100 and #102).
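A minimal sketch of the new option; the flag shown is illustrative, not a recommendation:

```python
# Sketch: adding custom Chrome flags via extra_flags (added in v0.3.8).
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch("https://example.com", extra_flags=["--lang=en-US"])
```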
Bug Fixes
- Fixed a typo in `load_dom` in `DynamicSession`'s `async_fetch`.
- Fixed an issue with the Cloudflare solver that made it wait forever for embedded captchas that don't disappear after solving. Now it waits 30 seconds for the captcha to disappear, then assumes it's the type that doesn't disappear (fixes #100).
Misc
- The Docker image is now automatically pushed to Docker Hub and GitHub's container registry for user convenience.
- Added a new documentation page showing how to use the Scrapeless browser with Scrapling.
Special thanks to our Discord community for all the continuous testing and feedback
Big shoutout to our biggest Sponsors
Release v0.3.7
A new update with many important changes
New Stuff and quality of life changes
- Reworked the `solve_cloudflare` argument in `StealthyFetcher` so it can solve all kinds of custom Turnstile implementations.
- Refactored the entire codebase to satisfy Pyright, so expect a flawless IDE experience in all editors, with many bugs solved along the way.
- Refactored the requests logic to be cleaner and faster (also solves #97).
- Added a new option, `user_data_dir`, to all browser-based session classes so users can reuse browser session data (cookies/storage/etc.) from previous sessions. Leaving it unset causes Playwright to use a random directory on each run, as happened before (see the sketch after this list).
- Added a new customization option, `additional_args`, to `DynamicFetcher` and its session class so users can pass extra arguments to Playwright's context, as was already possible with `StealthyFetcher`.
- Improved the route logic for collecting the last navigation response in all browsers, which allows the raw responses to be passed to the parser before being processed by the browsers, as before. This is very helpful with text/JSON responses.
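A minimal sketch of reusing session data; the session class shown and the directory path are assumptions for illustration:

```python
# Sketch: persisting browser session data across runs with user_data_dir
# (added in v0.3.7). StealthySession and the path are illustrative.
from scrapling.fetchers import StealthySession

with StealthySession(user_data_dir="./browser-profile") as session:
    page = session.fetch("https://example.com")  # cookies/storage persist in ./browser-profile
```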
Bug Fixes
- The rework of the route logic solved an issue with retrieving the content of unstable websites on some Windows devices.
- The refactors in this version solved a lot of bugs that were hard to spot before, along with weird autocompletion issues in some IDEs.
- Many fixes to the documentation website.
Special thanks to our Discord community for all the continuous testing and feedback
Big shoutout to our biggest Sponsors
Release v0.3.6
New Stuff
- Improved the `solve_cloudflare` argument in `StealthyFetcher` and its session classes so it can solve all types of both Turnstile and interstitial Cloudflare challenges.
- The MCP server now has the option to use Streamable HTTP, so you can easily expose the server.
- Added Docker support; an image is now built and pushed to Docker Hub automatically with each release (it contains all browsers).
Bug Fixes
- Fixed an encoding issue with the parser that happened in some cases (the famous `invalid start byte` error).
- Restructured multiple parts of the library to fix some memory leaks, so enjoy noticeably lower memory usage depending on your config (also solves #92).
- Improved the type annotations in many parts of the code for a better IDE experience (also solves #93).
Special thanks to our Discord community for all the continuous testing and feedback