---
id: request-storage
title: Request storage
description: How to store the requests your crawler will go through
---

import ApiLink from '@site/src/components/ApiLink';

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

import RqBasicExample from '!!raw-loader!./code/request_storage_rq_basic.py';
import RqWithCrawlerExample from '!!raw-loader!./code/request_storage_rq_with_crawler.py';
import RqWithCrawlerExplicitExample from '!!raw-loader!./code/request_storage_rq_with_crawler_explicit.py';

import RlBasicExample from '!!raw-loader!./code/request_storage_rl_basic.py';
import RlWithCrawlerExample from '!!raw-loader!./code/request_storage_rl_with_crawler.py';

import RsHelperAddRequestsExample from '!!raw-loader!./code/request_storage_helper_add_requests.py';
import RsHelperEnqueueLinksExample from '!!raw-loader!./code/request_storage_helper_enqueue_links.py';

import RsDoNotPurgeExample from '!!raw-loader!./code/request_storage_do_not_purge.py';
import RsPurgeExplicitlyExample from '!!raw-loader!./code/request_storage_purge_explicitly.py';
This guide explains the different types of request storage available in Crawlee, how to store the requests that your crawler will process, and which storage type to choose based on your needs.

## Request providers overview

All request storage types in Crawlee implement the same interface - <ApiLink to="class/RequestProvider">`RequestProvider`</ApiLink>. This unified interface allows them to be used in a consistent manner, regardless of the storage backend. The request providers are managed by storage clients - subclasses of <ApiLink to="class/BaseStorageClient">`BaseStorageClient`</ApiLink>. For instance, <ApiLink to="class/MemoryStorageClient">`MemoryStorageClient`</ApiLink> stores data in memory and can also offload them to the local directory. Data is stored in the following directory structure:

```text
{CRAWLEE_STORAGE_DIR}/{request_provider}/{QUEUE_ID}/
```

:::note

The local directory is specified by the `CRAWLEE_STORAGE_DIR` environment variable and defaults to `./storage`. `{QUEUE_ID}` is the name or ID of the specific request storage; it defaults to `default`, unless you override it by setting the `CRAWLEE_DEFAULT_REQUEST_QUEUE_ID` environment variable.

:::
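
For example, with the default configuration, the path for the default request queue typically resolves to the following (the exact `{request_provider}` segment is determined by the storage client; for `MemoryStorageClient` it is `request_queues`):

```text
./storage/request_queues/default/
```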

## Request queue

The <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> is the primary storage for URLs in Crawlee, especially useful for deep crawling. It supports dynamic addition and removal of URLs, making it ideal for recursive tasks where URLs are discovered and added during the crawling process (e.g., following links across multiple pages), as well as for large-scale and complex crawls. Each Crawlee project has a **default request queue**, which can be used to store URLs during a specific run.

The following code demonstrates the usage of the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>:

<Tabs groupId="request_queue">
    <TabItem value="request_queue_basic_example" label="Basic usage" default>
        <CodeBlock className="language-python">
            {RqBasicExample}
        </CodeBlock>
    </TabItem>
    <TabItem value="request_queue_with_crawler" label="Usage with Crawler">
        <CodeBlock className="language-python">
            {RqWithCrawlerExample}
        </CodeBlock>
    </TabItem>
    <TabItem value="request_queue_with_crawler_explicit" label="Explicit usage with Crawler">
        <CodeBlock className="language-python">
            {RqWithCrawlerExplicitExample}
        </CodeBlock>
    </TabItem>
</Tabs>
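
Beyond adding URLs, the queue also exposes the consumer side of the interface, which crawlers normally drive for you. As a minimal standalone sketch - assuming the `add_request`, `fetch_next_request`, and `mark_request_as_handled` methods of the current API - manually processing a queue might look like this:

```python
import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    # Open the default request queue (it is created on first access).
    request_queue = await RequestQueue.open()

    # Producer side: add a request to the queue.
    await request_queue.add_request('https://crawlee.dev')

    # Consumer side: fetch requests one by one and mark them as handled.
    # A crawler runs this loop for you under the hood.
    request = await request_queue.fetch_next_request()
    while request is not None:
        print(f'Processing {request.url} ...')
        await request_queue.mark_request_as_handled(request)
        request = await request_queue.fetch_next_request()


asyncio.run(main())
```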

## Request list

The <ApiLink to="class/RequestList">`RequestList`</ApiLink> is a simpler, lightweight storage option, used when all the URLs to be crawled are known upfront. It represents a list of URLs to crawl, stored in the memory of a crawler run (or optionally in the default <ApiLink to="class/KeyValueStore">`KeyValueStore`</ApiLink> associated with the run, if specified). It is suitable for crawling a large, predetermined set of URLs, when no new URLs need to be added during the run. The URLs can be provided either in code or parsed from a text file hosted on the web. The <ApiLink to="class/RequestList">`RequestList`</ApiLink> is typically created for a single crawler run, and its usage must be explicitly specified.

:::warning

The <ApiLink to="class/RequestList">`RequestList`</ApiLink> class is in an early stage of development and is not fully
implemented. It is currently intended mainly for testing purposes and small-scale projects. The current
implementation is in-memory only and is very limited. It will be (re)implemented in the future.
For more details, see the GitHub issue [crawlee-python#99](https://github.com/apify/crawlee-python/issues/99).
For production use, we recommend the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>.

:::

The following code demonstrates the usage of the <ApiLink to="class/RequestList">`RequestList`</ApiLink>:

<Tabs groupId="request_list">
    <TabItem value="request_list_basic_example" label="Basic usage" default>
        <CodeBlock className="language-python">
            {RlBasicExample}
        </CodeBlock>
    </TabItem>
    <TabItem value="request_list_with_crawler" label="Usage with Crawler">
        <CodeBlock className="language-python">
            {RlWithCrawlerExample}
        </CodeBlock>
    </TabItem>
</Tabs>
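
As a rough sketch of the pattern - assuming the `RequestList` constructor accepts a list of URLs and that crawlers take a request provider via the `request_provider` argument - wiring a request list to a crawler might look like this:

```python
import asyncio

from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext
from crawlee.storages import RequestList


async def main() -> None:
    # All URLs must be known upfront; nothing can be added during the run.
    request_list = RequestList(['https://crawlee.dev', 'https://apify.com'])

    # Use the request list as the crawler's request provider.
    crawler = HttpCrawler(request_provider=request_list)

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run()


asyncio.run(main())
```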

{/*

## Which one to choose?

TODO: write this section, once https://github.com/apify/crawlee-python/issues/99 is resolved

*/}

## Request-related helpers

Crawlee provides several helper functions to simplify interactions with request storages:

- The <ApiLink to="class/AddRequestsFunction">`add_requests`</ApiLink> function allows you to manually add specific URLs to the configured request storage. In this case, you must explicitly provide the URLs to be added. If you need to specify further details of the request, such as a `label` or `user_data`, you have to pass instances of the <ApiLink to="class/Request">`Request`</ApiLink> class to the helper (see the sketch after the examples below).
- The <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> function is designed to discover new URLs on the current page and add them to the request storage. It can be used with default settings, requiring no arguments, or you can customize its behavior by specifying link element selectors, choosing different enqueue strategies, or applying include/exclude filters to control which URLs are added. See the [Crawl website with relative links](../examples/crawl-website-with-relative-links) example for more details.

<Tabs groupId="request_helpers">
    <TabItem value="request_helper_add_requests" label="Add requests" default>
        <CodeBlock className="language-python">
            {RsHelperAddRequestsExample}
        </CodeBlock>
    </TabItem>
    <TabItem value="request_helper_enqueue_links" label="Enqueue links">
        <CodeBlock className="language-python">
            {RsHelperEnqueueLinksExample}
        </CodeBlock>
    </TabItem>
</Tabs>
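
For cases where a plain URL is not enough, here is a minimal sketch of passing full <ApiLink to="class/Request">`Request`</ApiLink> objects to `add_requests` - assuming `Request.from_url` accepts `label` and `user_data` keyword arguments, as in the current API:

```python
import asyncio

from crawlee import Request
from crawlee.http_crawler import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def default_handler(context: HttpCrawlingContext) -> None:
        # Attach a label (used for routing) and arbitrary user data.
        await context.add_requests([
            Request.from_url(
                'https://crawlee.dev/docs',
                label='DOCS',
                user_data={'depth': 1},
            ),
        ])

    @crawler.router.handler('DOCS')
    async def docs_handler(context: HttpCrawlingContext) -> None:
        context.log.info(f'Docs page: {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


asyncio.run(main())
```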

## Cleaning up the storages

Default storages are purged before the crawler starts, unless explicitly configured otherwise; for that case, see <ApiLink to="class/Configuration#purge_on_start">`Configuration.purge_on_start`</ApiLink>. This cleanup happens as soon as a storage is accessed - either when you open a storage (e.g. using <ApiLink to="class/RequestQueue#open">`RequestQueue.open`</ApiLink>) or when you interact with one through a helper function (e.g. <ApiLink to="class/AddRequestsFunction">`add_requests`</ApiLink> or <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink>, which implicitly open the request storage).

<CodeBlock className="language-python">
    {RsDoNotPurgeExample}
</CodeBlock>

If you do not explicitly interact with storages in your code, the purging will occur automatically when the <ApiLink to="class/BasicCrawler#run">`BasicCrawler.run`</ApiLink> method is invoked.

If you need to purge storages earlier, you can call <ApiLink to="class/MemoryStorageClient#purge_on_start">`MemoryStorageClient.purge_on_start`</ApiLink> directly. This method triggers the purging process for the underlying storage implementation you are currently using.

<CodeBlock className="language-python">
    {RsPurgeExplicitlyExample}
</CodeBlock>