Description
HTTP protocol implementations add all HTTP response header fields to the response metadata. The FetcherBolt merges the response metadata into the metadata which is passed forward in the topology as part of the tuples <url, content, metadata>. Existing key-value pairs (persisted in the status index) are overwritten, and the new values are later stored in the status index (if listed in "metadata.persist") or even transferred to outlinks (if listed in "metadata.transfer").
This makes it easy to reuse response header values in subsequent requests, e.g. cookies or "ETag".
However, web admins are free to send back any header. This may cause unwanted collisions with metadata keys used by crawler-internal classes, e.g. if the server responds with non-standard headers such as "HostName" or "Depth", or even with standard headers such as "Last-Modified", whose values must follow a specific format for internal use.
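To illustrate the collision, here is a minimal sketch using plain Java maps instead of the real metadata class. The class name, the method `mergeResponseHeaders`, and the sample values are hypothetical and only mimic the "merge and overwrite" behaviour described above; the actual FetcherBolt code may differ in details such as key case handling.

```java
import java.util.HashMap;
import java.util.Map;

public class HeaderCollisionSketch {

    // Hypothetical stand-in for the merge step: every response header is
    // copied into the tuple metadata, replacing any existing value.
    static Map<String, String> mergeResponseHeaders(
            Map<String, String> metadata, Map<String, String> responseHeaders) {
        Map<String, String> merged = new HashMap<>(metadata);
        merged.putAll(responseHeaders);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Depth", "2"); // assumed to be set by the crawler itself

        Map<String, String> headers = new HashMap<>();
        headers.put("Depth", "unlimited"); // arbitrary header sent by a server
        headers.put("ETag", "\"abc123\"");

        Map<String, String> merged = mergeResponseHeaders(metadata, headers);
        // The crawler's own value has been replaced by the server's header.
        System.out.println(merged.get("Depth"));
    }
}
```

In this sketch the crawler-internal value "2" is silently replaced by whatever the server sent, which is the kind of collision meant here.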
To avoid collisions, why not prefix protocol metadata, e.g. protocol.content-type or http.content-type? This would also make it clear which component sets the metadata, similar to the prefixes fetch. and parse. already in use. The drawback is that users would be required to update their configuration.
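A rough sketch of what the prefixing could look like, again with plain Java maps rather than the real metadata class. The class name, the method `prefixHeaders`, and the choice of "protocol." as the prefix are assumptions for illustration; lower-casing the header names is also an assumption, made only to match the protocol.content-type spelling above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class PrefixedHeadersSketch {

    // Hypothetical prefix; "http." would work the same way.
    static final String PREFIX = "protocol.";

    // Returns a copy of the response headers with every key prefixed
    // (and lower-cased, which is an assumption of this sketch).
    static Map<String, String> prefixHeaders(Map<String, String> responseHeaders) {
        Map<String, String> prefixed = new TreeMap<>();
        for (Map.Entry<String, String> e : responseHeaders.entrySet()) {
            String key = PREFIX + e.getKey().toLowerCase(Locale.ROOT);
            prefixed.put(key, e.getValue());
        }
        return prefixed;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new TreeMap<>();
        metadata.put("Depth", "2"); // assumed to be set by the crawler itself

        Map<String, String> headers = new TreeMap<>();
        headers.put("Depth", "unlimited"); // arbitrary header sent by a server
        headers.put("Content-Type", "text/html");

        // Prefixed keys can no longer clash with the crawler's own keys.
        metadata.putAll(prefixHeaders(headers));
        System.out.println(metadata);
    }
}
```

With this approach the crawler's "Depth" entry survives next to "protocol.depth". The price is the one already mentioned: entries in "metadata.persist" or "metadata.transfer" that refer to header names would have to be changed to the prefixed form.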