Scrapy plugin for Zyte API.
- Python 3.7+
- Scrapy 2.0.1+
pip install scrapy-zyte-api
To enable this plugin:
- Set the http and https keys in the DOWNLOAD_HANDLERS Scrapy setting to "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler".
- Add "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware" to the DOWNLOADER_MIDDLEWARES Scrapy setting with any value, e.g. 1000.
- Set the REQUEST_FINGERPRINTER_CLASS Scrapy setting to "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter".
- Set the TWISTED_REACTOR Scrapy setting to "twisted.internet.asyncioreactor.AsyncioSelectorReactor".
- Set your Zyte API key as either the ZYTE_API_KEY Scrapy setting or as an environment variable of the same name.
For example, in the settings.py file of your Scrapy project:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
    "https": "scrapy_zyte_api.ScrapyZyteAPIDownloadHandler",
}
DOWNLOADER_MIDDLEWARES = {
    "scrapy_zyte_api.ScrapyZyteAPIDownloaderMiddleware": 1000,
}
REQUEST_FINGERPRINTER_CLASS = "scrapy_zyte_api.ScrapyZyteAPIRequestFingerprinter"
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
ZYTE_API_KEY = "YOUR_API_KEY"

The ZYTE_API_ENABLED setting, which is True by default, can be set to
False to disable this plugin.
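If you want to switch the plugin on or off per environment, one option is to derive ZYTE_API_ENABLED from an environment variable in settings.py. The sketch below is only an illustration: the variable name USE_ZYTE_API and the helper env_flag are assumptions made for this example and are not part of scrapy-zyte-api.

```python
import os


def env_flag(name: str, default: bool = True) -> bool:
    """Read a boolean flag from the environment (hypothetical helper)."""
    value = os.environ.get(name)
    if value is None:
        return default
    # Treat common "off" spellings, and the empty string, as False.
    return value.strip().lower() not in {"0", "false", "no", "off", ""}


# USE_ZYTE_API is an assumed variable name; the plugin itself does not read it.
ZYTE_API_ENABLED = env_flag("USE_ZYTE_API", default=True)
```

With this in place, running a crawl with USE_ZYTE_API=0 in the environment would leave the plugin disabled without editing the settings file.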
You can send requests through Zyte API in one of the following ways:
- Send all requests through Zyte API by default, letting Zyte API parameters be chosen automatically based on your Scrapy request parameters. See Using transparent mode below.
- Send specific requests through Zyte API, setting all Zyte API parameters manually, keeping full control of what is sent to Zyte API. See Sending requests with manually-defined parameters below.
- Send specific requests through Zyte API, letting Zyte API parameters be chosen automatically based on your Scrapy request parameters. See Sending requests with automatically-mapped parameters below.
Zyte API response parameters are mapped into Scrapy response parameters where possible. See Response mapping below for details.
Set the ZYTE_API_TRANSPARENT_MODE Scrapy setting to True to handle
Scrapy requests as follows:
- By default, requests are sent through Zyte API with automatically-mapped parameters. See Sending requests with automatically-mapped parameters below for details about automatic request parameter mapping.
  You do not need to set the zyte_api_automap request meta key to True, but you can set it to a dictionary to extend your Zyte API request parameters.
- Requests with the zyte_api request meta key set to a dict are sent through Zyte API with manually-defined parameters. See Sending requests with manually-defined parameters below.
- Requests with the zyte_api_automap request meta key set to False are not sent through Zyte API.
For example:
import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"
    start_urls = ["https://quotes.toscrape.com/"]
    custom_settings = {
        "ZYTE_API_TRANSPARENT_MODE": True,
    }

    def parse(self, response):
        print(response.text)
        # "<html>…</html>"

To send a Scrapy request through Zyte API with manually-defined parameters,
define your Zyte API parameters in the zyte_api key in Request.meta as a dict.
The only exception is the url parameter, which should not be defined as a
Zyte API parameter. The value from Request.url is used automatically.
For example:
import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"

    def start_requests(self):
        yield scrapy.Request(
            url="https://quotes.toscrape.com/",
            meta={
                "zyte_api": {
                    "browserHtml": True,
                }
            },
        )

    def parse(self, response):
        print(response.text)
        # "<html>…</html>"

Note that response headers are necessary for raw response decoding. When
defining parameters manually and requesting httpResponseBody extraction,
remember to also request httpResponseHeaders extraction:
import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"

    def start_requests(self):
        yield scrapy.Request(
            url="https://quotes.toscrape.com/",
            meta={
                "zyte_api": {
                    "httpResponseBody": True,
                    "httpResponseHeaders": True,
                }
            },
        )

    def parse(self, response):
        print(response.text)
        # "<html>…</html>"

To learn more about Zyte API parameters, see the data extraction usage and API reference pages of the Zyte API documentation.
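To illustrate why the headers matter, the sketch below decodes a raw Zyte API response by hand using only the standard library. It assumes that httpResponseBody is Base64-encoded and that httpResponseHeaders is a list of name/value dicts, mirroring the customHttpRequestHeaders format shown elsewhere in this document. decode_body is a hypothetical helper, and the decoding that the plugin performs may be more elaborate.

```python
from base64 import b64decode
from email.message import Message


def decode_body(api_response: dict, fallback: str = "utf-8") -> str:
    """Decode httpResponseBody using the charset from Content-Type, if any."""
    body = b64decode(api_response["httpResponseBody"])
    charset = None
    for header in api_response.get("httpResponseHeaders", []):
        if header["name"].lower() == "content-type":
            # Let the standard library parse the "charset=" parameter.
            message = Message()
            message["Content-Type"] = header["value"]
            charset = message.get_content_charset()
    return body.decode(charset or fallback)


raw = {
    "httpResponseBody": "PGh0bWw+4oCmPC9odG1sPg==",
    "httpResponseHeaders": [
        # Assumed header format and value, for illustration only.
        {"name": "Content-Type", "value": "text/html; charset=utf-8"},
    ],
}
print(decode_body(raw))
```

Without the headers, the sketch can only guess the encoding through its fallback argument, which is the situation the note above warns about.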
To send a Scrapy request through Zyte API letting Zyte API parameters be
automatically chosen based on the parameters of that Scrapy request, set the
zyte_api_automap key in
Request.meta
to True.
For example:
import scrapy


class SampleQuotesSpider(scrapy.Spider):
    name = "sample_quotes"

    def start_requests(self):
        yield scrapy.Request(
            url="https://quotes.toscrape.com/",
            meta={
                "zyte_api_automap": True,
            },
        )

    def parse(self, response):
        print(response.text)
        # "<html>…</html>"

See also Using transparent mode above and Automated request parameter mapping below.
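The three ways of sending requests can be summarized as a small decision function. This is a simplified model of the rules described above, not the plugin's actual code: the function name routing and its return labels are made up for this sketch, and giving zyte_api precedence when both meta keys are set is an assumption.

```python
from typing import Any, Dict


def routing(meta: Dict[str, Any], transparent_mode: bool = False) -> str:
    """Return "manual", "automap" or "bypass" for a request meta dict."""
    # A dict in zyte_api (even an empty one) means manually-defined parameters.
    if isinstance(meta.get("zyte_api"), dict):
        return "manual"
    # In transparent mode, automatic mapping is the default for every request.
    automap = meta.get("zyte_api_automap", transparent_mode)
    if automap is True or isinstance(automap, dict):
        return "automap"
    # Otherwise the request does not go through Zyte API.
    return "bypass"
```

For instance, routing({"zyte_api_automap": False}, transparent_mode=True) returns "bypass", matching the opt-out described for transparent mode.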
Zyte API responses are mapped with one of the following classes:
- scrapy_zyte_api.responses.ZyteAPITextResponse, a subclass of scrapy.http.TextResponse, is used to map text responses, i.e. responses with browserHtml or responses with both httpResponseBody and httpResponseHeaders with a text body (e.g. plain text, HTML, JSON).
- scrapy_zyte_api.responses.ZyteAPIResponse, a subclass of scrapy.http.Response, is used to map any other response.
Zyte API response parameters are mapped into response class attributes where possible:
- url becomes response.url.
- statusCode becomes response.status.
- httpResponseHeaders becomes response.headers.
- browserHtml and httpResponseBody are mapped into both response.text (str) and response.body (bytes).
  If none of these parameters were present, e.g. if the only requested output was screenshot, response.text and response.body would be empty.
  If a future version of Zyte API supported requesting both outputs on the same request, and both parameters were present, browserHtml would be the one mapped into response.text and response.body.
Both response classes have a raw_api_response attribute that contains a
dict with the complete, raw response from Zyte API, where you can find all
Zyte API response parameters, including those that are not mapped into other
response class attributes.
For example, for a request for httpResponseBody and
httpResponseHeaders, you would get:
def parse(self, response):
    print(response.url)
    # "https://quotes.toscrape.com/"
    print(response.status)
    # 200
    print(response.headers)
    # {b"Content-Type": [b"text/html"], …}
    print(response.text)
    # "<html>…</html>"
    print(response.body)
    # b"<html>…</html>"
    print(response.raw_api_response)
    # {
    #     "url": "https://quotes.toscrape.com/",
    #     "statusCode": 200,
    #     "httpResponseBody": "PGh0bWw+4oCmPC9odG1sPg==",
    #     "httpResponseHeaders": […],
    # }

For a request for screenshot, on the other hand, the response would look
as follows:
def parse(self, response):
    print(response.url)
    # "https://quotes.toscrape.com/"
    print(response.status)
    # 200
    print(response.headers)
    # {}
    print(response.text)
    # ""
    print(response.body)
    # b""
    print(response.raw_api_response)
    # {
    #     "url": "https://quotes.toscrape.com/",
    #     "statusCode": 200,
    #     "screenshot": "iVBORw0KGgoAAAANSUh…",
    # }
    from base64 import b64decode
    print(b64decode(response.raw_api_response["screenshot"]))
    # b'\x89PNG\r\n\x1a\n\x00\x00\x00\r…'

When you enable automated request parameter mapping, be it through transparent mode (see Using transparent mode above) or for a specific request (see Sending requests with automatically-mapped parameters above), Zyte API parameters are chosen as follows by default:
- httpResponseBody and httpResponseHeaders are set to True.
  This is subject to change without prior notice in future versions of scrapy-zyte-api, so please account for the following:
  - If you are requesting a binary resource, such as a PDF file or an image file, set httpResponseBody to True explicitly in your requests:
        Request(
            url="https://toscrape.com/img/zyte.png",
            meta={
                "zyte_api_automap": {"httpResponseBody": True},
            },
        )
    In the future, we may stop setting httpResponseBody to True by default, and instead use a different, new Zyte API parameter that only works for non-binary responses (e.g. HTML, JSON, plain text).
  - If you need to access response headers, be it through response.headers or through response.raw_api_response["httpResponseHeaders"], set httpResponseHeaders to True explicitly in your requests:
        Request(
            url="https://toscrape.com/",
            meta={
                "zyte_api_automap": {"httpResponseHeaders": True},
            },
        )
    At the moment we request response headers because some response headers are necessary to properly decode the response body as text. In the future, Zyte API may be able to handle this decoding automatically, so we would stop setting httpResponseHeaders to True by default.
- Request.url becomes url, same as in requests with manually-defined parameters.
- If Request.method is something other than "GET", it becomes httpRequestMethod.
- Request.headers become customHttpRequestHeaders.
- Request.body becomes httpRequestBody.
For example, the following Scrapy request:
Request(
    method="POST",
    url="https://httpbin.org/anything",
    headers={"Content-Type": "application/json"},
    body=b'{"foo": "bar"}',
)

Results in a request to the Zyte API data extraction endpoint with the following parameters:
{
    "httpResponseBody": true,
    "httpResponseHeaders": true,
    "url": "https://httpbin.org/anything",
    "httpRequestMethod": "POST",
    "customHttpRequestHeaders": [{"name": "Content-Type", "value": "application/json"}],
    "httpRequestBody": "eyJmb28iOiAiYmFyIn0="
}

You may set the zyte_api_automap key in Request.meta to a dict of Zyte API
parameters to extend or override choices made by automated request parameter
mapping.
Setting browserHtml or screenshot to True unsets
httpResponseBody and httpResponseHeaders, and makes Request.headers
become requestHeaders instead of customHttpRequestHeaders. For example,
the following Scrapy request:
Request(
    url="https://quotes.toscrape.com",
    headers={"Referer": "https://example.com/"},
    meta={"zyte_api_automap": {"browserHtml": True}},
)

Results in a request to the Zyte API data extraction endpoint with the following parameters:
{
    "browserHtml": true,
    "url": "https://quotes.toscrape.com",
    "requestHeaders": {"referer": "https://example.com/"}
}

When mapping headers, headers not supported by Zyte API are excluded from the mapping by default. Use the following Scrapy settings to change which headers are included or excluded from header mapping:
- ZYTE_API_SKIP_HEADERS determines headers that must not be mapped as customHttpRequestHeaders, and its default value is: ["Cookie", "User-Agent"]
- ZYTE_API_BROWSER_HEADERS determines headers that can be mapped as requestHeaders. It is a dict, where keys are header names and values are the key that represents them in requestHeaders. Its default value is: {"Referer": "referer"}
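The default mapping rules above can be approximated in a few lines of plain Python. The sketch below is a simplified model, not the plugin's implementation: the automap function and its signature are invented for illustration, header names are compared case-insensitively as an assumption, and the edge cases around unsupported parameter combinations are deliberately not modeled.

```python
from base64 import b64encode
from typing import Any, Dict, Optional

# Lowercased copies of the default setting values, for case-insensitive matching.
SKIP_HEADERS = {"cookie", "user-agent"}    # ZYTE_API_SKIP_HEADERS
BROWSER_HEADERS = {"referer": "referer"}   # ZYTE_API_BROWSER_HEADERS


def automap(
    url: str,
    method: str = "GET",
    headers: Optional[Dict[str, str]] = None,
    body: bytes = b"",
    overrides: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Approximate the default request parameter mapping (hypothetical helper)."""
    params: Dict[str, Any] = dict(overrides or {})
    browser = bool(params.get("browserHtml") or params.get("screenshot"))
    if not browser:
        # HTTP requests ask for the body and headers by default.
        params.setdefault("httpResponseBody", True)
        params.setdefault("httpResponseHeaders", True)
    params["url"] = url
    if method != "GET":
        params["httpRequestMethod"] = method
    headers = headers or {}
    if browser:
        # Browser requests only accept the headers listed in BROWSER_HEADERS.
        browser_headers = {
            BROWSER_HEADERS[name.lower()]: value
            for name, value in headers.items()
            if name.lower() in BROWSER_HEADERS
        }
        if browser_headers:
            params["requestHeaders"] = browser_headers
    else:
        custom_headers = [
            {"name": name, "value": value}
            for name, value in headers.items()
            if name.lower() not in SKIP_HEADERS
        ]
        if custom_headers:
            params["customHttpRequestHeaders"] = custom_headers
    if body:
        # Request bodies travel Base64-encoded.
        params["httpRequestBody"] = b64encode(body).decode("ascii")
    return params
```

Calling automap with the POST request and the browserHtml request shown above yields the same parameter sets as the two JSON examples.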
To maximize support for potential future changes in Zyte API, automated request parameter mapping allows some parameter values and parameter combinations that Zyte API does not currently support, and may never support:
- Request.method becomes httpRequestMethod even for unsupported httpRequestMethod values, and even if httpResponseBody is unset.
- You can set customHttpRequestHeaders or requestHeaders to True to force their mapping from Request.headers in scenarios where they would not be mapped otherwise.
  Conversely, you can set customHttpRequestHeaders or requestHeaders to False to prevent their mapping from Request.headers.
- Request.body becomes httpRequestBody even if httpResponseBody is unset.
- You can set httpResponseBody to False (which unsets the parameter), and not set browserHtml or screenshot to True. In this case, Request.headers is mapped as requestHeaders.
- You can set httpResponseBody to True and also set browserHtml or screenshot to True. In this case, Request.headers is mapped both as customHttpRequestHeaders and as requestHeaders, and browserHtml is used as the Scrapy response body.
Often the same configuration needs to be used for all Zyte API requests. For
example, all requests may need to set the same geolocation, or the spider only
uses browserHtml requests.
The following settings allow you to define Zyte API parameters to be included in all requests:
- ZYTE_API_DEFAULT_PARAMS is a dict of parameters to be combined with manually-defined parameters. See Sending requests with manually-defined parameters above.
  You may set the zyte_api request meta key to an empty dict to only use default parameters for that request.
- ZYTE_API_AUTOMAP_PARAMS is a dict of parameters to be combined with automatically-mapped parameters. See Sending requests with automatically-mapped parameters above.
For example, if you set ZYTE_API_DEFAULT_PARAMS to
{"geolocation": "US"} and zyte_api to {"browserHtml": True},
{"url": "…", "geolocation": "US", "browserHtml": True} is sent to Zyte API.
Parameters in these settings are merged with request-specific parameters, with request-specific parameters taking precedence.
ZYTE_API_DEFAULT_PARAMS has no effect on requests that use automated
request parameter mapping, and ZYTE_API_AUTOMAP_PARAMS has no effect on
requests that use manually-defined parameters.
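The merge behavior described above amounts to a dictionary update in which request-specific values win. The following is a minimal sketch under that reading; merge_params is a hypothetical name, and the plugin may handle additional cases that are not shown here.

```python
from typing import Any, Dict


def merge_params(
    defaults: Dict[str, Any], request_params: Dict[str, Any], url: str
) -> Dict[str, Any]:
    """Combine setting-level defaults with request-specific parameters."""
    # Later entries override earlier ones, so request parameters take precedence.
    return {"url": url, **defaults, **request_params}


print(merge_params({"geolocation": "US"}, {"browserHtml": True}, "https://example.com"))
```

The same function models both settings: pass ZYTE_API_DEFAULT_PARAMS together with the zyte_api dict, or ZYTE_API_AUTOMAP_PARAMS together with the automatically-mapped parameters.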
When using transparent mode (see Using transparent mode above), be careful
of which parameters you define through ZYTE_API_AUTOMAP_PARAMS. In
transparent mode, all Scrapy requests go through Zyte API, even requests that
Scrapy sends automatically, such as those for robots.txt files when
ROBOTSTXT_OBEY is True, or those for sitemaps when using a sitemap
spider. Certain parameters, like browserHtml or screenshot, are not
meant to be used for every single request.
API requests are retried automatically using the default retry policy of python-zyte-api.
API requests that exceed retries are dropped. You cannot manage API request retries through Scrapy downloader middlewares.
Use the ZYTE_API_RETRY_POLICY setting or the zyte_api_retry_policy
request meta key to override the default python-zyte-api retry policy with a
custom retry policy.
A custom retry policy must be an instance of tenacity.AsyncRetrying.
Scrapy settings must be picklable, which retry policies are not, so you cannot assign retry
policy objects directly to the ZYTE_API_RETRY_POLICY setting, and must use
their import path string instead.
When setting a retry policy through request meta, you can assign the
zyte_api_retry_policy request meta key either the retry policy object
itself or its import path string. If you need your requests to be serializable,
however, you may also need to use the import path string.
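An import path string is simply the dotted module path followed by the attribute name. As a rough illustration of how such a string can be turned back into an object, here is a standard-library sketch; load_object below is a stand-in written for this example (Scrapy ships a helper of the same name in scrapy.utils.misc), and it is not meant to describe how the plugin resolves the setting internally.

```python
from importlib import import_module
from typing import Any


def load_object(path: str) -> Any:
    """Resolve "package.module.NAME" to the object it names."""
    module_path, _, name = path.rpartition(".")
    if not module_path:
        raise ValueError(f"Not a full import path: {path!r}")
    # Import the module, then look the attribute up on it.
    return getattr(import_module(module_path), name)
```

Because only the string is stored, the setting or request meta stays picklable, and the retry policy object is created when the module is imported.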
For example, to also retry HTTP 521 errors the same as HTTP 520 errors, you can subclass RetryFactory as follows:
# project/retry_policies.py
from tenacity import retry_if_exception, RetryCallState
from zyte_api.aio.errors import RequestError
from zyte_api.aio.retry import RetryFactory


def is_http_521(exc: BaseException) -> bool:
    return isinstance(exc, RequestError) and exc.status == 521


class CustomRetryFactory(RetryFactory):
    retry_condition = (
        RetryFactory.retry_condition
        | retry_if_exception(is_http_521)
    )

    def wait(self, retry_state: RetryCallState) -> float:
        if is_http_521(retry_state.outcome.exception()):
            return self.temporary_download_error_wait(retry_state=retry_state)
        return super().wait(retry_state)

    def stop(self, retry_state: RetryCallState) -> bool:
        if is_http_521(retry_state.outcome.exception()):
            return self.temporary_download_error_stop(retry_state)
        return super().stop(retry_state)


CUSTOM_RETRY_POLICY = CustomRetryFactory().build()

# project/settings.py
ZYTE_API_RETRY_POLICY = "project.retry_policies.CUSTOM_RETRY_POLICY"

Stats from python-zyte-api are exposed as Scrapy stats with the
scrapy-zyte-api prefix.
The request fingerprinter class of this plugin ensures that Scrapy 2.7 and later generate unique request fingerprints for Zyte API requests based on some of their parameters.
For example, a request for browserHtml and a request for screenshot
with the same target URL are considered different requests. Similarly, requests
with the same target URL but different actions are also considered
different requests.
The request fingerprinter class of this plugin generates request fingerprints for Zyte API requests based on the following Zyte API parameters:
- url (canonicalized)
  For URLs that include a URL fragment, like https://example.com#foo, URL canonicalization keeps the URL fragment if browserHtml or screenshot are enabled.
- Request attribute parameters (httpRequestBody, httpRequestMethod)
- Output parameters (browserHtml, httpResponseBody, httpResponseHeaders, screenshot)
- Rendering option parameters (actions, javascript, screenshotOptions)
- geolocation
The following Zyte API parameters are not taken into account for request fingerprinting:
- Request header parameters (customHttpRequestHeaders, requestHeaders)
- Metadata parameters (echoData, jobId)
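To make the effect of these two lists concrete, the sketch below hashes only the parameters that count towards the fingerprint. It is a simplified model rather than the plugin's algorithm: the fingerprint function, the SHA-1 over JSON encoding, and the reduced canonicalization (which only drops the fragment for non-browser requests) are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Dict
from urllib.parse import urldefrag

# Parameters listed above as affecting the fingerprint (url is handled apart).
FINGERPRINT_KEYS = (
    "httpRequestBody", "httpRequestMethod",
    "browserHtml", "httpResponseBody", "httpResponseHeaders", "screenshot",
    "actions", "javascript", "screenshotOptions",
    "geolocation",
)


def fingerprint(params: Dict[str, Any]) -> str:
    """Hash only the fingerprint-relevant subset of Zyte API parameters."""
    url = params["url"]
    if not (params.get("browserHtml") or params.get("screenshot")):
        # Fragments only matter when a browser renders the page.
        url = urldefrag(url).url
    relevant = {key: params[key] for key in FINGERPRINT_KEYS if key in params}
    relevant["url"] = url
    payload = json.dumps(relevant, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()
```

In this model, two requests that differ only in customHttpRequestHeaders or echoData share a fingerprint, while a browserHtml request and a screenshot request for the same URL do not.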
You can assign a request fingerprinter class to the
ZYTE_API_FALLBACK_REQUEST_FINGERPRINTER_CLASS Scrapy setting to configure
a custom request fingerprinter class to use for requests that do not go through
Zyte API:
ZYTE_API_FALLBACK_REQUEST_FINGERPRINTER_CLASS = "custom.RequestFingerprinter"

By default, requests that do not go through Zyte API use the default request fingerprinter class of the installed Scrapy version.
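As a rough idea of what a class such as custom.RequestFingerprinter could look like, here is a minimal sketch. It assumes Scrapy's request fingerprinter interface, that is, a fingerprint() method that receives the request and returns bytes. Hashing only the method and URL is a simplification chosen for brevity (it ignores the request body, for example), so treat it as a starting point rather than a recommended implementation.

```python
import hashlib


class RequestFingerprinter:
    """Hypothetical fallback fingerprinter keyed on method and URL only."""

    def fingerprint(self, request) -> bytes:
        # Any object with .method and .url attributes works in this sketch.
        data = f"{request.method} {request.url}".encode("utf-8")
        return hashlib.sha1(data).digest()
```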
If you have a Scrapy version older than Scrapy 2.7, Zyte API parameters are not taken into account for request fingerprinting. This can cause some Scrapy components, like the filter of duplicate requests or the HTTP cache extension, to interpret 2 different requests as being the same.
To avoid most issues, use automated request parameter mapping, either through
transparent mode or setting zyte_api_automap to True in
Request.meta, and then use Request attributes instead of
Request.meta as much as possible. Unlike Request.meta, Request
attributes do affect request fingerprints in Scrapy versions older than Scrapy
2.7.
For requests that must have the same Request attributes but should still
be considered different, such as browser-based requests with different URL
fragments, you can set dont_filter to True on the Request to prevent the
duplicate filter of Scrapy from filtering any of them out. For example:
yield Request(
    "https://toscrape.com#1",
    meta={"zyte_api_automap": {"browserHtml": True}},
    dont_filter=True,
)
yield Request(
    "https://toscrape.com#2",
    meta={"zyte_api_automap": {"browserHtml": True}},
    dont_filter=True,
)

Note, however, that for other Scrapy components, like the HTTP cache extensions, these 2 requests would still be considered identical.