Release v0.3.2

@github-actions github-actions released this 15 Sep 01:29
· 372 commits to main since this release
fac92ef

Release Notes for v0.3.2

🚀 New Stuff

  • Optional fetcher dependencies: All fetchers are now part of optional dependency groups, which reduces the core package size. The base scrapling module is now the parser only; to use the fetchers or the command-line options, install the extras with: pip install "scrapling[fetchers]". See the detailed installation instructions in the documentation.

  • Per-page configuration in sessions: Session classes for browser fetchers now accept individual configuration for each page fetched within a session. All fetch-level parameters are now validated the same way as session-level ones. More details are available on the documentation website.

    Example:

    from scrapling.fetchers import StealthySession

    with StealthySession(headless=True, solve_cloudflare=True) as session:
        page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
  • Improved browser-based fetchers

    • A new option, load_dom, controls whether to wait for JavaScript execution to finish on pages (enabled by default, which matches the previous behavior)
      from scrapling.fetchers import DynamicSession

      with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session:
          page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
    • Stealth mode is now more reliable in DynamicFetcher and its session classes.
    • Both DynamicFetcher and StealthyFetcher now use fewer resources (they automatically find and close the default tab opened by persistent contexts in the Playwright API)
    • Fixed a critical logic bug in the page rotation of browser-based fetchers: previous pages are now replaced with fresh ones. (Tabs reused in rotation could be contaminated by settings previously applied to them)
    • StealthyFetcher and its session classes are now slightly faster (about 5%)
  • Enhanced .body property: Now returns the passed content as-is without processing, enabling file downloads and handling non-HTML requests. Below is an example of downloading a photo:

    from scrapling.fetchers import Fetcher
    
    page = Fetcher.get('https://raw.githubusercontent.com/D4Vinci/Scrapling/main/images/poster.png')
    with open(file='poster.png', mode='wb') as f:
        f.write(page.body)
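
Because the fetchers now live in an optional dependency group, code that may run on a parser-only install can guard the import. The following is a minimal sketch, not part of the library: optional_import is a hypothetical helper, and it assumes that missing fetcher dependencies surface as an ImportError when scrapling.fetchers is imported.

```python
import importlib


def optional_import(module_name):
    """Return the imported module, or None if it (or one of its dependencies) cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Assumption: a parser-only install raises ImportError for the fetchers module.
        return None


# "scrapling.fetchers" is the import path used in the examples above.
fetchers = optional_import("scrapling.fetchers")
if fetchers is None:
    print('Fetchers are unavailable; install them with: pip install "scrapling[fetchers]"')
```

With this guard in place, a script can fall back to parser-only behavior or exit with a clear installation hint instead of failing with a raw traceback.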

🐛 Bug Fixes

  • Encoding issues resolved: Fixed multiple encoding problems that occurred with some websites in the parser, MCP mode, and the extract commands (also resolves #80 and #81)
  • Faster parsing: Cumulative changes across the codebase make the library faster, as reflected in the updated benchmarks

🔨 Misc

  • Updated benchmarks: Refreshed performance benchmarks to compare the current speed improvements to the latest versions of similar libraries
  • Refactored a lot of the code and replaced dead code with better implementations: less code, cleaner code, easier maintenance
  • Added YouTube video: Included video content for MCP documentation.
  • A new issue template: A simple new template for users who can't use the existing templates.
  • CI workflow optimization: Tests workflow now skips runs when only documentation or non-code files are changed.
  • Updated dependencies: Bumped up various dependencies to the latest versions.
  • Code style improvements: Applied new ruff rules across all files.
  • Pre-commit hooks: Updated pre-commit configuration.

🎯 Breaking Changes

  • Removed the max_pages parameter from the sync StealthySession to match DynamicSession (it serves no purpose in the sync version)
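
For code that builds session options in a shared dictionary, one way to adapt to this removal is to filter the dropped keyword before constructing the session. This is only a sketch under stated assumptions: strip_removed_options and REMOVED_SYNC_OPTIONS are hypothetical names introduced here, not part of Scrapling, and the option values are illustrative.

```python
# Keyword arguments removed from the sync StealthySession in v0.3.2.
REMOVED_SYNC_OPTIONS = {"max_pages"}


def strip_removed_options(options):
    """Return a copy of `options` without keyword arguments that were removed."""
    return {key: value for key, value in options.items() if key not in REMOVED_SYNC_OPTIONS}


# Options written for an earlier release (illustrative values).
legacy_options = {"headless": True, "solve_cloudflare": True, "max_pages": 3}
session_options = strip_removed_options(legacy_options)

# The filtered dictionary can then be passed on, e.g. StealthySession(**session_options).
print(session_options)
```

Filtering into a new dictionary leaves the original options untouched, so the same configuration can still be reused where max_pages remains supported.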

🙏 Special thanks to our Discord community for all the continuous testing and feedback


Big shoutout to our biggest Sponsors