Releases: openzim/python-scraperlib
5.3.0
5.2.0
Added
- Add utility to fetch and prepare ZIM illustration (#254)
- New `zim.dedup.Deduplicator` class to handle automatic deduplication of content before adding to the ZIM (#33)
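To illustrate the general idea of deduplicating content before it is stored, here is a minimal sketch based on content hashing. It does not reproduce the `zim.dedup.Deduplicator` API: the `ContentDeduplicator` class and its `canonical_path()` method are hypothetical names used only for this example, and it assumes that identical bytes should always resolve to the first path under which they were seen.

```python
import hashlib


class ContentDeduplicator:
    """Hypothetical helper (not scraperlib's API): maps identical content to one path."""

    def __init__(self) -> None:
        # SHA-256 digest of the content -> first path registered for that content
        self._path_by_digest: dict[str, str] = {}

    def canonical_path(self, path: str, content: bytes) -> str:
        """Return the path that should hold this content.

        The first path seen for a given content is kept; later paths carrying
        the same bytes resolve to that first path, so a caller could add a
        redirect instead of storing the payload again.
        """
        digest = hashlib.sha256(content).hexdigest()
        return self._path_by_digest.setdefault(digest, path)


dedup = ContentDeduplicator()
first = dedup.canonical_path("images/logo.png", b"same bytes")
second = dedup.canonical_path("assets/logo-copy.png", b"same bytes")
other = dedup.canonical_path("images/banner.png", b"different bytes")
```

In this sketch `second` resolves to the path registered by `first`, while `other` keeps its own path because its content differs.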
Changed
- Upgrade dependencies, especially wombat 3.9.1 (#262, #263)
- Backport changes in wabac.js around JS rewriting rules (#259, #265)
Fixed
5.1.1
5.1.0
5.0.0
5.0.0rc4
5.0.0rc3
5.0.0rc2
5.0.0rc1
This is a major release with many breaking changes, but most of them are easy to fix.
It focuses on type safety with the introduction of runtime checks: any call to the zimscraperlib API must match the type definition or an exception will be raised.
Documentation is available as docstrings and on https://python-scraperlib.readthedocs.io
Main changes include:
- ZIM metadata handling has completely changed, with new types for each kind of metadata.
- `i18n` module has been redesigned around a single main class, `Language`
- New `rewriting` module for HTML/CSS/JS (the latter being done at runtime via Wombat)
- Now supporting only Python 3.12
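To make the runtime-check behaviour concrete, the sketch below shows what "a call must match the type definition or an exception is raised" means in practice. It is a deliberately simplified, standard-library-only stand-in and not how beartype or scraperlib implement it: the `check_types` decorator and the `set_title()` function are hypothetical, and only plain-class annotations such as `str` or `int` are checked.

```python
import functools
import inspect
from typing import get_type_hints


def check_types(func):
    """Simplified stand-in for a runtime type checker (plain classes only)."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Generic or union annotations are skipped in this sketch.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{func.__name__}() argument {name!r} must be "
                    f"{expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@check_types
def set_title(title: str, max_length: int = 30) -> str:
    """Hypothetical function standing in for a type-protected API call."""
    return title[:max_length]


ok = set_title("Wikipedia", max_length=4)

try:
    set_title(123)  # wrong type: rejected at call time
    rejected = False
except TypeError:
    rejected = True
```

The practical consequence for scraper authors is that a value of the wrong type fails at the call site rather than later, deep inside the library.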
Added
- Documentation using `mkdocs`, published on readthedocs.com (#92)
- `rewriting` module to rewrite URLs in content for generic scrapers
  - `rewriting.css` to rewrite URLs in CSS files
  - `rewriting.html` to rewrite URLs in HTML files
  - `rewriting.js` to rewrite URLs in JS files (at runtime, using `wombat`)
  - `wombat-setup` javascript module in `javascript/`
- `typing` module with custom types:
  - `Callback` to use where we expect callbacks
  - `SupportsWrite`, `SupportsRead`, `SupportsSeeking`, `SupportsSeekableRead` and `SupportsSeekableWrite`: protocols for IO type annotations
- `zim.metadata` module with a type-based approach for each kind of metadata and helpers for custom ones
  - [`zim.metadata`] `APPLY_RECOMMENDATIONS`: general flag to toggle openZIM-recommended constraints
  - [`zim.metadata`] Type-based classes: `Metadata`, `TextBasedMetadata`, `TextListBasedMetadata`, `DateBasedMetadata`, `IllustrationBasedMetadata`
  - [`zim.metadata`] Usage-based classes: `NameMetadata`, `LanguageMetadata`, `DefaultIllustrationMetadata`, etc.
  - [`zim.metadata`] `StandardMetadataList` to package the standard metadata
  - See details for additional API endpoints and variables
- [`constants`] `DEFAULT_WEB_REQUESTS_TIMEOUT` exposed for `download` module
- [`download`] `stream_file()` now accepts `timeout: int` param (defaults to constant timeout) (#222)
- [`filesystem`] `path_from` context manager to acquire a pathlib `Path` from `Path` or `TemporaryDirectory`
- [`i18n`] `Language`, `get_language()` and `get_language_or_none()`. See breaking changes
- [`image.optimization`] `OptimizePngOptions` dataclass to store PNG options
- [`image.optimization`] `OptimizeJpgOptions` dataclass to store JPEG options
- [`image.optimization`] `OptimizeGifOptions` dataclass to store GIF options
- [`image.optimization`] `OptimizeOptions` dataclass to store cross-formats options
- [`inputs`] `unique_values()` to deduplicate a list while preserving order
- [`logging`] `DEFAULT_FORMAT_WITH_THREADS` as many scrapers use threads
- [`video.encoding`] `reencode()`'s `existing_tmp_path` param
- [`zim.filesystem`] `validate_folder_writable()` to ensure one can write into a folder (#200)
- [`zim.creator`] `Creator._get_first_language_metadata_value()` to retrieve first language from metadata
- [`zim.items`] `no_indexing_indexdata()` to get an IndexData that disables indexing
- [`zim.items`] `URLItem.get_mimetype()` now only returning `str`
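As an illustration of what "deduplicate a list while preserving order" means for `unique_values()`, here is a minimal standalone sketch. It is not the library's implementation: `unique_in_order()` is a hypothetical name, and the sketch assumes the items are hashable.

```python
from collections.abc import Hashable, Iterable


def unique_in_order(items: Iterable[Hashable]) -> list[Hashable]:
    """Return items without duplicates, keeping the first occurrence of each.

    dict preserves insertion order, so dict.fromkeys() keeps only the first
    position at which each value appears.
    """
    return list(dict.fromkeys(items))


languages = unique_in_order(["eng", "fra", "eng", "deu", "fra"])
```

A plain `set()` would also remove duplicates but would not keep the original order, which matters for ordered metadata such as a list of languages or tags.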
Changed (Breaking)
- Entire API is now type-protected using beartype. Any call to scraperlib that doesn't satisfy the annotated types will raise an exception
- [`constants`] `MANDATORY_ZIM_METADATA_KEYS` and `DEFAULT_DEV_ZIM_METADATA` moved to `zim/metadata`
- [`download`] `YoutubeDownloader.download`'s `options` parameter now expects a `dict[str, Any]` instead of `dict`
- [`download`] `YoutubeConfig` options now limited to `str | bool | int | None`
- [`download`] `_get_retry_adapter()` now exposed as `get_retry_adapter()`
- [`download`] `stream_file`'s `byte_stream` param now more flexible, accepting `SupportsWrite[bytes] | SupportsSeekableWrite[bytes]`
- [`download`] `stream_file`'s `proxies` param now accepting `dict[str, str]` instead of `dict`
- [`filesystem`] `delete_callback()` is now a simple callback accepting an `fpath` and deleting it (doesn't chain other callbacks anymore).
- [`filesystem`] `delete_callback()` doesn't fail on missing file (#192)
- [`i18n`] Redesigned API around a single object: `Language`, which is initialized with any acceptable code. Raises `NotFoundError` on 639-3 matching failure
  - `find_language_names()` is retained but only accepts a `query: str`
  - added `get_language()` and `get_language_or_none()` as shortcuts around `Language`
  - `is_valid_iso_639_3()` is retained
- [`image.conversion`] `convert_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src` and `dst`.
- [`image.conversion`] `convert_svg2png()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src` and `dst`.
- [`image.optimization`] `optimize_png()` now accepts `options: OptimizePngOptions` instead of individual params.
- [`image.optimization`] `optimize_jpeg()` now accepts `options: OptimizeJpgOptions` instead of individual params.
- [`image.optimization`] `optimize_webp()` now accepts `options: OptimizeWebpOptions` instead of individual params.
- [`image.optimization`] `optimize_gif()` now accepts `options: OptimizeGifOptions` instead of individual params.
- [`image.presets`] All presets now use the new options dataclass instead of ClassVar dict
- [`image.probing`] `format_for()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src`.
- [`image.probing`] `is_valid_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `image`.
- [`image.utils`] `save_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `dst`.
- [`video.config`] `Config` was mostly not using type annotations.
- [`video.config`] `Config` options only expecting `str | None`
- [`video.presets`] All options only expecting `str | None`
- [`video.encoding`] `reencode()` now always returning a `tuple[bool, CompletedProcess]`
- [`zim._libkiwix`] `MimetypeAndCounter` now expects specific types for `mimetype: str` and `value: int`
- [`zim.filesystem`] `make_zim_file()` `publisher` param now properly expects a `str`
- [`zim.filesystem`] `IncorrectZIMPathError` renamed to `IncorrectPathError`
- [`zim.filesystem`] `MissingZIMFolderError` renamed to `MissingFolderError`
- [`zim.filesystem`] `NotADirectoryZIMFolderError` renamed to `NotADirectoryFolderError`
- [`zim.filesystem`] `NotWritableZIMFolderError` renamed to `NotWritableFolderError`
- [`zim.filesystem`] `IncorrectZIMFilenameError` renamed to `IncorrectFilenameError`
- [`zim.filesystem`] `validate_zimfile_creatable()` renamed to `validate_file_creatable()`
- [`zim.items`] `Item` and `StaticItem` now expecting `hints` as `dict[libzim.writer.Hint, int]` instead of `dict`
- [`zim.items`] `Item.get_hints()` now returning `dict[libzim.writer.Hint, int]` instead of `dict`
- [`zim.items`] `URLItem.download_for_size()` now specifying type annotations and reordered params
- [`zim.providers`] `FileLikeProvider.gen_blob()` and `URLProvider.gen_blob()` now properly annotate their return type (`Generator[libzim.writer.Blob, None, None]`)
- [`zim.providers`] `URLProvider.get_size_of()` param `url` now explicitly expects a `str`
- [`zim.creator`] `Creator.config_metadata()` signature changed, now mainly accepting a `StandardMetadataList`
- [`zim.creator`] `Creator.config_dev_metadata()` signature changed to accept new metadata types
- [`zim.creator`] `Creator.add_item_for()`'s `callback` renamed to `callbacks` and accepting `Callback`
- [`zim.creator`] `Creator.add_item()`'s `callback` renamed to `callbacks` and accepting `Callback`
Changed
- [deps] `iso639-lang` now requires at least v2.4.0
- [`download`] `stream_file()` now returns `tuple[int, requests.structures.CaseInsensitiveDict[str]]` instead of `tuple[int, requests.structures.CaseInsensitiveDict]`
- [`download`] `stream_file()` now accepts both `fpath` and `byte_stream` params (writes to both)
- [`image.utils`] `save_image()` now accepts `Any` `**params`.
- [`zim.archive`] `Archive.counters` now returning `CounterMap` (compatible with previous `dict[str, int]`)
Fixed
- Direct dependencies are now properly referenced: pillow, urllib3, piexif, idna (#226)
- [`download`] `YoutubeDownloader.download` now respects its return type (`bool | Future[Any]`)
- [`image.conversion`] `convert_image()` `**params` properly declared as accepting `None`.
- [`logging`] `getLogger()`'s `console` now properly accepting `TextIO | io.StringIO | None`
- [`video.probing`] `get_media_info()` type annotation for `src_path`
- [`zim.archive`] `Archive.get_item()` return type (`libzim.reader.Item`)
Removed
- Support for Python 3.8/3.9/3.10/3.11. Only Python 3.12 is supported now.
- [`i18n`] `Lang` (See breaking changes)
- [`i18n`] `get_iso_lang_data()` (See breaking changes)
- [`i18n`] `update_with_macro()` (See breaking changes)
- [`i18n`] `get_language_details()` (See breaking changes)
- [`uri`] `rebuild_uri` `failsafe` param (was only handling incorrect types)
- [`video.encoding`] `reencode()`'s `with_process` param
- [`zim.creator`] `Creator.validate_metadata()`
- [`zim.creator`] `Creator.convert_and_check_metadata()`
4.0.0
Added
- Add utility function to compute ZIM Tags #164, including deduplication #156
- Metadata does not automatically drop control characters #159
- New `indexing.IndexData` class to hold title, content and keywords to pass to libzim to index an item
- Automatically index PDF documents content #167
- Automatically set proper title on PDF documents #168
- Expose new `optimization.get_optimization_method` to get the proper optimization method to call for a given image format
- Add `optimization.get_optimization_method` to get the proper optimization method to call for a given image format
- New `creator.Creator.convert_and_check_metadata` to convert metadata to bytes or str for known use cases and check proper type is passed to libzim
- Add svg2png image conversion function #113
- Add `conversion.convert_svg2png` image conversion function + support for SVG in `probing.format_for` #113
- Add `i18n.Lang` class used as typed result of i18n operations #151
Changed
- BREAKING Renamed `zimscraperlib.image.convertion` to `zimscraperlib.image.conversion` to fix typo
- BREAKING Many changes in type hints to match the real underlying code
- BREAKING Force all boolean arguments (and some other non-obvious parameters) to be keyword-only in function calls for clarity / disambiguation (see ruff rule FBT002)
- Prefer `IO[bytes]` over `io.BytesIO` when possible since it is more generic
- BREAKING `i18n.NotFound` renamed to `i18n.NotFoundError`
- BREAKING `types.get_mime_for_name` now returns `str | None`
- BREAKING `creator.Creator.add_metadata` and `creator.Creator.validate_metadata` now only accept `bytes | str` as value (it must have been converted before call)
- BREAKING second argument of `creator.Creator.add_metadata` has been renamed to `value` instead of `content` to align with other methods
- When a type issue arises in metadata checks, wrong value type is displayed in exception
- BREAKING `i18n.get_language_details()`, `i18n.get_iso_lang_data()`, `i18n.find_language_names()` and `i18n.update_with_macro` now process / return a new typed `Lang` class #151
- BREAKING Rename `i18n.NotFound` to `i18n.NotFoundError`
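The keyword-only change for boolean arguments can be illustrated with a short sketch. The `describe_download()` function and its parameters are hypothetical and are not part of scraperlib; the sketch only shows the Python mechanism (a bare `*` in the signature) and what a caller needs to change.

```python
def describe_download(url: str, *, verify_ssl: bool = True, quiet: bool = False) -> str:
    """Hypothetical function: every parameter after the bare `*` is keyword-only."""
    flags = []
    if not verify_ssl:
        flags.append("no-verify")
    if quiet:
        flags.append("quiet")
    suffix = " (" + ", ".join(flags) + ")" if flags else ""
    return "GET " + url + suffix


# Booleans must now be named at the call site, which documents their meaning.
summary = describe_download("https://example.org/file.zim", quiet=True)

try:
    # Pre-change style: positional booleans are ambiguous and now rejected.
    describe_download("https://example.org/file.zim", False, True)
    positional_accepted = True
except TypeError:
    positional_accepted = False
```

Callers that previously passed such flags positionally only need to add the parameter name to each flag.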
Removed
- BREAKING Remove translation features in `i18n`: `Locale` class + `_` and `setlocale` functions #134