refactor!: Introduce new storage client system #1194
Publishing some comments, not finished with reviewing the whole change yet.
That was a loooot of work. Very nice!
I think it is good. Just please add a section to upgrading_to_v0x.md that summarizes all the breaking changes in this PR.
That's excellent work!
I like this a lot. I do need to revisit the request queue related code though, it feels like we're throwing out the baby with the bathwater.
Pull Request Overview
This PR refactors the storage client system by removing legacy implementations and utilities, consolidating configuration, and updating documentation and examples to use the new clients.
- Consolidated storage-related settings in `Configuration` and removed deprecated options.
- Replaced legacy file utilities with `infer_mime_type`, `atomic_write`, and export-to-stream functions.
- Updated service locator to default to `FileSystemStorageClient` and revised examples to use new storage clients.
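For readers unfamiliar with the atomic-write pattern mentioned above, here is a minimal standard-library sketch. It is not the `atomic_write` helper from this PR; the function name `atomic_write_sketch` and its signature are assumptions made for illustration only.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_sketch(path: Path, data: str) -> None:
    """Hypothetical helper: write text so readers never see a half-written file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the temporary file next to the target so os.replace stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=f".{path.name}.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp_file:
            tmp_file.write(data)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())  # push the bytes to disk before the rename
        # os.replace swaps the new file into place in a single step.
        os.replace(tmp_name, path)
    except BaseException:
        # Do not leave the temporary file behind if anything failed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The point of the pattern is that a crash mid-write leaves either the old file or the new one, never a truncated mix, which matters for a file-system storage client that persists records as individual files.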
Reviewed Changes
Copilot reviewed 92 out of 92 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/crawlee/configuration.py | Simplified storage configuration fields to align with new clients. |
| src/crawlee/_utils/file.py | Removed old file helpers; added atomic writes, MIME inference, and stream exports. |
| src/crawlee/_service_locator.py | Changed default storage client to FileSystemStorageClient. |
| docs/deployment/code_examples/google/google_example.py | Updated cloud function example to use MemoryStorageClient. |
| docs/guides/request_loaders.mdx | Documentation updated to reflect handled_count and total_count properties. |
Comments suppressed due to low confidence (1)
docs/deployment/code_examples/google/google_example.py:19
- The example uses `timedelta` but does not import it. Add `from datetime import timedelta` at the top of the file to avoid a NameError.
request_handler_timeout=timedelta(seconds=30),
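A minimal sketch of the suggested fix, shown in isolation. The variable below is only a stand-in for the keyword argument quoted in the comment; the rest of the example file is not reproduced here.

```python
from datetime import timedelta  # the import the comment asks for

# Stand-in for the `request_handler_timeout=...` keyword argument in the example.
request_handler_timeout = timedelta(seconds=30)
```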
Benchmark
| Setup | Average crawler runtime |
|---|---|
| Crawlee Py - Old memory client | 4.270311s |
| Crawlee Py - Old file-system client | 5.046287s |
| Crawlee Py - New memory client | 1.625904s |
| Crawlee Py - New file-system client | 4.416033s |
| Crawlee TS - Memory client | 2.2503s |
| Crawlee TS - File-system client | 3.2483s |
| Scrapy - memory* | 1.477900s |
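To put the averages in relative terms, the following sketch computes old-to-new runtime ratios for the Python clients. The numbers are copied from the benchmark; the dictionary keys and the `speedup` helper are names made up for this illustration.

```python
# Average crawler runtimes in seconds, copied from the benchmark figures.
average_runtime_s = {
    "py_old_memory": 4.270311,
    "py_old_file_system": 5.046287,
    "py_new_memory": 1.625904,
    "py_new_file_system": 4.416033,
    "ts_memory": 2.2503,
    "ts_file_system": 3.2483,
    "scrapy_memory": 1.477900,
}


def speedup(old_key: str, new_key: str) -> float:
    """Return how many times faster the new setup is than the old one."""
    return average_runtime_s[old_key] / average_runtime_s[new_key]


memory_speedup = speedup("py_old_memory", "py_new_memory")
file_system_speedup = speedup("py_old_file_system", "py_new_file_system")

print(f"memory client: {memory_speedup:.2f}x faster")
print(f"file-system client: {file_system_speedup:.2f}x faster")
```

These ratios only restate the reported averages; the per-run timings are not available here, so no variance can be derived.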
end of an era 🎉
### Description
- Integration of the Crawlee v1 changes, mostly new storages & storage clients (introduced in apify/crawlee-python#1194).

### Issues
- Closes: #469
- Closes: #540

### Testing
- The current test set covers the changes.

---------

Co-authored-by: Josef Prochazka <[email protected]>
Description
- The `Configuration.persist_storage` and `Configuration.persist_metadata` options were removed.
- Storage client parameters (e.g. `purge_on_start`, or `token` and `base_api_url` for the Apify client) are configured via the `Configuration`.
- Every storage and its client provides a `purge` method (which clears all items but preserves the storage and metadata) and a `drop` method (which removes the entire storage, metadata included).

Dataset
- `id`
- `name`
- `metadata`
- `open`
- `purge` (new method)
- `drop`
- `push_data`
- `get_data`
- `iterate_items`
- `list_items` (new method)
- `export_to`
- `from_storage_object` method has been removed. Use the `open` method with `name` or `id` instead.
- `get_info` -> `metadata` property
- `storage_object` -> `metadata` property
- `set_metadata` method has been removed (it wasn't propagated to clients)
- `write_to_json` method has been removed, use `export_to` instead
- `write_to_csv` method has been removed, use `export_to` instead

Key-value store
- `id`
- `name`
- `metadata`
- `open`
- `purge` (new method)
- `drop`
- `get_value`
- `set_value`
- `delete_value` (new method; the Apify platform's set_value supports setting an empty value for a key, so having a separate method for deleting is useful)
- `iterate_keys`
- `list_keys` (new method)
- `get_public_url`
- `get_auto_saved_value`
- `persist_autosaved_values`
- `from_storage_object` method has been removed. Use the `open` method with `name` or `id` instead.
- `get_info` -> `metadata` property
- `storage_object` -> `metadata` property
- `set_metadata` method has been removed (it wasn't propagated to clients)

Request queue
- `id`
- `name`
- `metadata`
- `open`
- `purge` (new method)
- `drop`
- `add_request`
- `add_requests_batched` -> `add_requests`
- `fetch_next_request`
- `get_request`
- `mark_request_as_handled`
- `reclaim_request`
- `is_empty`
- `is_finished`
- `from_storage_object` method has been removed. Use the `open` method with `name` or `id` instead.
- `get_info` -> `metadata` property
- `storage_object` -> `metadata` property
- `set_metadata` method has been removed (it wasn't propagated to clients)
- `get_handled_count` method has been removed. Use `metadata.handled_request_count` instead.
- `get_total_count` method has been removed. Use `metadata.total_request_count` instead.
- `resource_directory` from the `RequestQueueMetadata` was removed, use `path_to...` property instead.
- `RequestQueueHead` model has been removed. Use `RequestQueueHeadWithLocks` instead.
- `add_requests` contains a `forefront` arg (the Apify API supports it)

BaseDatasetClient
- `metadata`
- `open`
- `purge`
- `drop`
- `push_data`
- `get_data`
- `iterate_items`

BaseKeyValueStoreClient
- `metadata`
- `open`
- `purge`
- `drop`
- `get_value`
- `set_value`
- `delete_value`
- `iterate_keys`
- `get_public_url`

BaseRequestQueueClient
- `metadata`
- `open`
- `purge`
- `drop`
- `add_requests_batch` -> `add_batch_of_requests` (one backend method for 2 frontend methods)
- `get_request`
- `fetch_next_request`
- `mark_request_as_handled`
- `reclaim_request`
- `is_empty`
- `RequestQueueHeadWithLocks` -> `RequestQueueHead`
- `BatchRequestsOperationResponse` -> `AddRequestsResponse`
- (… `_sequence` field in the FS Request)

Issues
- `MemoryStorageClient` and `FilesystemStorageClient` #92
- `creation_management` module #147
- `push_data` annotations to use `JsonSerializable` type #1191

Testing
- … (`file-system` and `memory`), ensuring every storage test runs against every client implementation.
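The difference between `purge` and `drop` described in the Description can be illustrated with a toy in-memory model. This is a sketch only: `ToyStorageRegistry` and all of its methods are hypothetical names and do not reflect the actual Crawlee classes or signatures.

```python
class ToyStorageRegistry:
    """Hypothetical stand-in for a storage backend; not a Crawlee class."""

    def __init__(self) -> None:
        # Maps storage name -> {"metadata": {...}, "items": [...]}.
        self._storages: dict[str, dict] = {}

    def open(self, name: str) -> None:
        # Create the storage on first use, reuse it afterwards.
        self._storages.setdefault(name, {"metadata": {"name": name}, "items": []})

    def push(self, name: str, item: dict) -> None:
        self._storages[name]["items"].append(item)

    def item_count(self, name: str) -> int:
        return len(self._storages[name]["items"])

    def exists(self, name: str) -> bool:
        return name in self._storages

    def purge(self, name: str) -> None:
        # Clears all items but keeps the storage and its metadata.
        self._storages[name]["items"].clear()

    def drop(self, name: str) -> None:
        # Removes the entire storage, metadata included.
        del self._storages[name]


registry = ToyStorageRegistry()
registry.open("default")
registry.push("default", {"url": "https://example.com"})

registry.purge("default")
still_exists_after_purge = registry.exists("default")
count_after_purge = registry.item_count("default")

registry.drop("default")
exists_after_drop = registry.exists("default")
```

After `purge` the storage can be reused immediately with its metadata intact, while after `drop` it has to be opened again before use.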