
Commit 734af8c

docs: Update Request loaders guide (#1376)
- Adaptive crawler - prefer combining with Parsel.
- Highlight some important parts of the code samples.
- Prefer Impit over Httpx.
- Expose `FingerprintGenerator` in the public API docs.
- Add a note to the request loaders guide to highlight the usage with crawlers.
1 parent a513ac0 commit 734af8c
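
The commit prefers ImpitHttpClient over HttpxHttpClient, and the diffs below apply that swap inside the SitemapRequestLoader examples. A minimal sketch of applying the same preference to a crawler's own HTTP client, assuming ParselCrawler accepts an http_client argument (that argument is not shown in this diff):

import asyncio

from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
from crawlee.http_clients import ImpitHttpClient


async def main() -> None:
    # Assumption: http_client is a ParselCrawler constructor argument; the
    # commit itself only swaps the client inside SitemapRequestLoader.
    crawler = ParselCrawler(http_client=ImpitHttpClient())

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())

Impit is used here only as a drop-in replacement; nothing else in the crawler setup changes.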

File tree: 12 files changed, +147 -89 lines


docs/guides/code_examples/playwright_crawler_adaptive/handler.py

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@


 async def main() -> None:
-    crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser()
+    crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser()

     @crawler.router.default_handler
     async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:

docs/guides/code_examples/request_loaders/rl_basic_example.py

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ async def main() -> None:
     # Fetch and process requests from the queue.
     while request := await request_list.fetch_next_request():
         # Do something with it...
+        print(f'Processing {request.url}')

         # And mark it as handled.
         await request_list.mark_request_as_handled(request)

docs/guides/code_examples/request_loaders/rl_tandem_example.py

Lines changed: 13 additions & 0 deletions
@@ -8,9 +8,11 @@ async def main() -> None:
     # Create a static request list.
     request_list = RequestList(['https://crawlee.dev', 'https://apify.com'])

+    # highlight-start
     # Convert the request list to a request manager using the to_tandem method.
     # It is a tandem with the default request queue.
     request_manager = await request_list.to_tandem()
+    # highlight-end

     # Create a crawler and pass the request manager to it.
     crawler = ParselCrawler(
@@ -20,9 +22,20 @@ async def main() -> None:

     @crawler.router.default_handler
     async def handler(context: ParselCrawlingContext) -> None:
+        context.log.info(f'Processing {context.request.url}')
+
         # New links will be enqueued directly to the queue.
         await context.enqueue_links()

+        # Extract data using Parsel's XPath and CSS selectors.
+        data = {
+            'url': context.request.url,
+            'title': context.selector.xpath('//title/text()').get(),
+        }
+
+        # Push extracted data to the dataset.
+        await context.push_data(data)
+
     await crawler.run()

docs/guides/code_examples/request_loaders/rl_tandem_example_explicit.py

Lines changed: 11 additions & 0 deletions
@@ -23,9 +23,20 @@ async def main() -> None:

     @crawler.router.default_handler
     async def handler(context: ParselCrawlingContext) -> None:
+        context.log.info(f'Processing {context.request.url}')
+
         # New links will be enqueued directly to the queue.
         await context.enqueue_links()

+        # Extract data using Parsel's XPath and CSS selectors.
+        data = {
+            'url': context.request.url,
+            'title': context.selector.xpath('//title/text()').get(),
+        }
+
+        # Push extracted data to the dataset.
+        await context.push_data(data)
+
     await crawler.run()


docs/guides/code_examples/request_loaders/sitemap_basic_example.py

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+import asyncio
+import re
+
+from crawlee.http_clients import ImpitHttpClient
+from crawlee.request_loaders import SitemapRequestLoader
+
+
+async def main() -> None:
+    # Create an HTTP client for fetching the sitemap.
+    http_client = ImpitHttpClient()
+
+    # Create a sitemap request loader with filtering rules.
+    sitemap_loader = SitemapRequestLoader(
+        sitemap_urls=['https://crawlee.dev/sitemap.xml'],
+        http_client=http_client,
+        include=[re.compile(r'.*docs.*')],  # Only include URLs containing 'docs'.
+        max_buffer_size=500,  # Keep up to 500 URLs in memory before processing.
+    )
+
+    while request := await sitemap_loader.fetch_next_request():
+        # Do something with it...
+        print(f'Processing {request.url}')
+
+        # And mark it as handled.
+        await sitemap_loader.mark_request_as_handled(request)
+
+
+if __name__ == '__main__':
+    asyncio.run(main())

docs/guides/code_examples/request_loaders/sitemap_example.py

Lines changed: 0 additions & 28 deletions
This file was deleted.

docs/guides/code_examples/request_loaders/sitemap_tandem_example.py

Lines changed: 41 additions & 28 deletions
@@ -2,38 +2,51 @@
 import re

 from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
-from crawlee.http_clients import HttpxHttpClient
+from crawlee.http_clients import ImpitHttpClient
 from crawlee.request_loaders import SitemapRequestLoader


 async def main() -> None:
-    # Create an HTTP client for fetching sitemaps
-    async with HttpxHttpClient() as http_client:
-        # Create a sitemap request loader with URL filtering
-        sitemap_loader = SitemapRequestLoader(
-            sitemap_urls=['https://crawlee.dev/sitemap.xml'],
-            http_client=http_client,
-            # Include only URLs that contain 'docs'
-            include=[re.compile(r'.*docs.*')],
-            max_buffer_size=500,  # Buffer up to 500 URLs in memory
-        )
-
-        # Convert the sitemap loader to a request manager using the to_tandem method.
-        # It is a tandem with the default request queue.
-        request_manager = await sitemap_loader.to_tandem()
-
-        # Create a crawler and pass the request manager to it.
-        crawler = ParselCrawler(
-            request_manager=request_manager,
-            max_requests_per_crawl=10,  # Limit the max requests per crawl.
-        )
-
-        @crawler.router.default_handler
-        async def handler(context: ParselCrawlingContext) -> None:
-            # New links will be enqueued directly to the queue.
-            await context.enqueue_links()
-
-        await crawler.run()
+    # Create an HTTP client for fetching the sitemap.
+    http_client = ImpitHttpClient()
+
+    # Create a sitemap request loader with filtering rules.
+    sitemap_loader = SitemapRequestLoader(
+        sitemap_urls=['https://crawlee.dev/sitemap.xml'],
+        http_client=http_client,
+        include=[re.compile(r'.*docs.*')],  # Only include URLs containing 'docs'.
+        max_buffer_size=500,  # Keep up to 500 URLs in memory before processing.
+    )
+
+    # highlight-start
+    # Convert the sitemap loader into a request manager linked
+    # to the default request queue.
+    request_manager = await sitemap_loader.to_tandem()
+    # highlight-end
+
+    # Create a crawler and pass the request manager to it.
+    crawler = ParselCrawler(
+        request_manager=request_manager,
+        max_requests_per_crawl=10,  # Limit the max requests per crawl.
+    )
+
+    @crawler.router.default_handler
+    async def handler(context: ParselCrawlingContext) -> None:
+        context.log.info(f'Processing {context.request.url}')
+
+        # New links will be enqueued directly to the queue.
+        await context.enqueue_links()
+
+        # Extract data using Parsel's XPath and CSS selectors.
+        data = {
+            'url': context.request.url,
+            'title': context.selector.xpath('//title/text()').get(),
+        }
+
+        # Push extracted data to the dataset.
+        await context.push_data(data)
+
+    await crawler.run()


 if __name__ == '__main__':
docs/guides/code_examples/request_loaders/sitemap_tandem_example_explicit.py

Lines changed: 41 additions & 30 deletions
@@ -2,41 +2,52 @@
 import re

 from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
-from crawlee.http_clients import HttpxHttpClient
+from crawlee.http_clients import ImpitHttpClient
 from crawlee.request_loaders import RequestManagerTandem, SitemapRequestLoader
 from crawlee.storages import RequestQueue


 async def main() -> None:
-    # Create an HTTP client for fetching sitemaps
-    async with HttpxHttpClient() as http_client:
-        # Create a sitemap request loader with URL filtering
-        sitemap_loader = SitemapRequestLoader(
-            sitemap_urls=['https://crawlee.dev/sitemap.xml'],
-            http_client=http_client,
-            # Include only URLs that contain 'docs'
-            include=[re.compile(r'.*docs.*')],
-            max_buffer_size=500,  # Buffer up to 500 URLs in memory
-        )
-
-        # Open the default request queue.
-        request_queue = await RequestQueue.open()
-
-        # And combine them together to a single request manager.
-        request_manager = RequestManagerTandem(sitemap_loader, request_queue)
-
-        # Create a crawler and pass the request manager to it.
-        crawler = ParselCrawler(
-            request_manager=request_manager,
-            max_requests_per_crawl=10,  # Limit the max requests per crawl.
-        )
-
-        @crawler.router.default_handler
-        async def handler(context: ParselCrawlingContext) -> None:
-            # New links will be enqueued directly to the queue.
-            await context.enqueue_links()
-
-        await crawler.run()
+    # Create an HTTP client for fetching the sitemap.
+    http_client = ImpitHttpClient()
+
+    # Create a sitemap request loader with filtering rules.
+    sitemap_loader = SitemapRequestLoader(
+        sitemap_urls=['https://crawlee.dev/sitemap.xml'],
+        http_client=http_client,
+        include=[re.compile(r'.*docs.*')],  # Only include URLs containing 'docs'.
+        max_buffer_size=500,  # Keep up to 500 URLs in memory before processing.
+    )
+
+    # Open the default request queue.
+    request_queue = await RequestQueue.open()
+
+    # And combine them together to a single request manager.
+    request_manager = RequestManagerTandem(sitemap_loader, request_queue)
+
+    # Create a crawler and pass the request manager to it.
+    crawler = ParselCrawler(
+        request_manager=request_manager,
+        max_requests_per_crawl=10,  # Limit the max requests per crawl.
+    )
+
+    @crawler.router.default_handler
+    async def handler(context: ParselCrawlingContext) -> None:
+        context.log.info(f'Processing {context.request.url}')
+
+        # New links will be enqueued directly to the queue.
+        await context.enqueue_links()
+
+        # Extract data using Parsel's XPath and CSS selectors.
+        data = {
+            'url': context.request.url,
+            'title': context.selector.xpath('//title/text()').get(),
+        }
+
+        # Push extracted data to the dataset.
+        await context.push_data(data)
+
+    await crawler.run()


 if __name__ == '__main__':

docs/guides/request_loaders.mdx

Lines changed: 5 additions & 1 deletion
@@ -10,7 +10,7 @@ import TabItem from '@theme/TabItem';
 import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

 import RlBasicExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/rl_basic_example.py';
-import SitemapExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/sitemap_example.py';
+import SitemapExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/sitemap_basic_example.py';
 import RlTandemExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/rl_tandem_example.py';
 import RlExplicitTandemExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/rl_tandem_example_explicit.py';
 import SitemapTandemExample from '!!raw-loader!roa-loader!./code_examples/request_loaders/sitemap_tandem_example.py';
@@ -102,6 +102,10 @@ RequestManager --|> RequestManagerTandem

 The <ApiLink to="class/RequestLoader">`RequestLoader`</ApiLink> interface defines the foundation for fetching requests during a crawl. It provides abstract methods for basic operations like retrieving, marking, and checking the status of requests. Concrete implementations, such as <ApiLink to="class/RequestList">`RequestList`</ApiLink>, build on this interface to handle specific scenarios. You can create your own custom loader that reads from an external file, web endpoint, database, or any other specific data source. For more details, refer to the <ApiLink to="class/RequestLoader">`RequestLoader`</ApiLink> API reference.

+:::info NOTE
+To learn how to use request loaders in your crawlers, see the [Request manager tandem](#request-manager-tandem) section below.
+:::
+
 ### Request list

 The <ApiLink to="class/RequestList">`RequestList`</ApiLink> can accept an asynchronous generator as input, allowing requests to be streamed rather than loading them all into memory at once. This can significantly reduce memory usage, especially when working with large sets of URLs.
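
The paragraph above notes that RequestList can accept an asynchronous generator so URLs are streamed rather than held in memory all at once. A minimal sketch of that usage, assuming the constructor accepts the generator in place of the plain list used in rl_basic_example.py, with hypothetical URLs for illustration:

import asyncio
from collections.abc import AsyncGenerator

from crawlee.request_loaders import RequestList


async def generate_urls() -> AsyncGenerator[str, None]:
    # URLs are produced lazily, so they never have to sit in memory all at once.
    for page in range(1, 4):
        yield f'https://crawlee.dev/?page={page}'  # Hypothetical URLs for illustration.


async def main() -> None:
    # Assumption: the constructor accepts an async generator in place of a plain list.
    request_list = RequestList(generate_urls())

    while request := await request_list.fetch_next_request():
        print(f'Processing {request.url}')
        await request_list.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())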

docs/guides/service_locator.mdx

Lines changed: 1 addition & 1 deletion
@@ -33,7 +33,7 @@ There are three core services that are managed by the service locator:

 ### Configuration

-<ApiLink to="class/Configuration">`Configuration`</ApiLink> is a class that provides access to application-wide settings and parameters. It allows you to configure various aspects of Crawlee, such as timeouts, logging level, persistance intervals, and various other settings. The configuration can be set directly in the code or via environment variables.
+<ApiLink to="class/Configuration">`Configuration`</ApiLink> is a class that provides access to application-wide settings and parameters. It allows you to configure various aspects of Crawlee, such as timeouts, logging level, persistence intervals, and various other settings. The configuration can be set directly in the code or via environment variables.

 ### StorageClient
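
The corrected paragraph says Configuration can be set directly in code or via environment variables. A minimal sketch of the in-code path, assuming log_level is a Configuration field and that the service locator exposes set_configuration() as described in this guide (neither detail appears in this diff):

from crawlee import service_locator
from crawlee.configuration import Configuration


def configure_crawlee() -> None:
    # Assumption: 'log_level' is a Configuration field; the equivalent
    # environment variable would carry the CRAWLEE_ prefix.
    config = Configuration(log_level='DEBUG')

    # Assumption: the service locator exposes set_configuration() for
    # registering the configuration before any crawler is created.
    service_locator.set_configuration(config)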
