Release v0.3.2 · D4Vinci/Scrapling

Release Notes for v0.3.2

🚀 New Stuff

Optional fetcher dependencies: All fetchers are now part of optional dependency groups, which reduces the core package size. The base scrapling package now contains only the parser; to use the fetchers or the command-line tools, install the extra with: pip install "scrapling[fetchers]". See the detailed installation instructions for more.
Per-page configuration in sessions: Session classes for browser fetchers now accept individual configuration for each page fetched in a session. All fetch-level parameters are now validated the same way as session-level ones. More details are on the documentation website.
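Since the base install no longer ships the fetchers, code that may run in a parser-only environment can guard the import. The sketch below is one way to do that. The helper name fetchers_available is hypothetical and not part of Scrapling, and it assumes that a missing extra surfaces as an ImportError when scrapling.fetchers is imported.

```python
def fetchers_available() -> bool:
    """Return True if the optional fetchers can be imported."""
    try:
        # Needs the extra: pip install "scrapling[fetchers]"
        from scrapling.fetchers import Fetcher  # noqa: F401
    except ImportError:
        # Assumption: this covers both a missing scrapling package
        # and missing fetcher dependencies
        return False
    return True


if __name__ == "__main__":
    if fetchers_available():
        print("Fetchers are installed.")
    else:
        print('Parser only. Install the extra with: pip install "scrapling[fetchers]"')
```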

Example:
```
from scrapling.fetchers import StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:
    page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
```
Improved browser-based fetchers
- A new option, load_dom, controls whether to wait for JavaScript execution to finish on the page. It is enabled by default, which matches the previous behavior.
```
from scrapling.fetchers import DynamicSession

with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:
    page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
```
- Stealth mode is now more reliable in DynamicFetcher and its session classes.
- Both DynamicFetcher and StealthyFetcher now use fewer resources by automatically finding and closing the default tab opened by persistent contexts in the Playwright API.
- Fixed a critical logic bug in the page rotation of browser-based fetchers: previous pages are now replaced with fresh ones, because tabs reused in rotation could be contaminated by settings from earlier use.
- StealthyFetcher and its session classes are now slightly faster (5%)

Enhanced .body property: It now returns the passed content as-is without processing, which enables file downloads and handling of non-HTML responses. Below is an example of downloading a photo:

```
from scrapling.fetchers import Fetcher

page = Fetcher.get('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/poster.png')
with open(file='poster.png', mode='wb') as f:
    f.write(page.body)
```
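
Because .body is passed through untouched, the same pattern should work for any binary payload. A minimal sketch of a reusable helper follows. The name save_body is hypothetical and not part of Scrapling, and it assumes .body is a bytes object, as the binary-mode write in the photo example implies.

```python
from pathlib import Path


def save_body(body: bytes, destination: str) -> int:
    """Write raw response bytes to `destination` and return the byte count."""
    path = Path(destination)
    # Create the parent directory if needed so nested paths work
    path.parent.mkdir(parents=True, exist_ok=True)
    # write_bytes opens the file in binary mode, writes, and closes it
    return path.write_bytes(body)
```

With a fetched page, this could be called as save_body(page.body, 'downloads/poster.png').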

🐛 Bug Fixes

Encoding issues resolved: Fixed multiple encoding problems that occurred with some websites in the parser, MCP mode, and the extract commands (also resolves #80 and #81).
Faster parsing: A number of internal changes make the library faster, as reflected in the updated benchmarks.

🔨 Misc

Updated benchmarks: Refreshed the performance benchmarks to compare the current speed against the latest versions of similar libraries.
Refactored a lot of the code and replaced dead code with better implementations: Less code, cleaner code, easier maintenance.
Added YouTube video: Included video content for the MCP documentation.
A new issue template: A simple new template for users who can't use the current templates.
CI workflow optimization: Tests workflow now skips runs when only documentation or non-code files are changed.
Updated dependencies: Bumped up various dependencies to the latest versions.
Code style improvements: Applied new ruff rules across all files.
Pre-commit hooks: Updated pre-commit configuration.

🎯 Breaking Changes

Removed max_pages parameter from sync StealthySession to match DynamicSession (it's meaningless to have in the sync version)
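
Code that builds session arguments from a shared configuration may still carry max_pages. One way to migrate is to strip it before constructing the sync session. This is only a sketch: sync_session_kwargs is a hypothetical helper, not part of Scrapling, and the option names in the sample dictionary are taken from the examples in these notes.

```python
# Parameters these notes list as removed from the sync session
REMOVED_SYNC_OPTIONS = frozenset({"max_pages"})


def sync_session_kwargs(options: dict) -> dict:
    """Return a copy of `options` without the removed sync-session parameters."""
    return {key: value for key, value in options.items() if key not in REMOVED_SYNC_OPTIONS}


shared = {"headless": True, "solve_cloudflare": True, "max_pages": 3}
print(sync_session_kwargs(shared))
# {'headless': True, 'solve_cloudflare': True}
```

The filtered dictionary could then be unpacked into the constructor, for example StealthySession(**sync_session_kwargs(shared)). The original dictionary is left unchanged, so it can still be used where max_pages applies.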

🙏 Special thanks to our Discord community for all the continuous testing and feedback
