Prefix protocol metadata to avoid accidentally overwriting internal metadata fields #776

@sebastian-nagel

HTTP protocol implementations add all HTTP response header fields to the response metadata. The FetcherBolt merges the response metadata into the metadata which is passed forward in the topology as part of the tuples <url, content, metadata>. Existing key-value pairs (persisted in the status index) are overwritten, and the new values are eventually stored in the status index (if listed in "metadata.persist") or even transferred to outlinks ("metadata.transfer").

This makes it easy to reuse response header values in subsequent requests, e.g. cookies or "ETag".

However, webadmins are free to send any header back. This may cause unwanted collisions with metadata keys used by crawler-internal classes, e.g. if the server responds with non-standard headers such as "HostName" or "Depth", or even with standard headers such as "Last-modified", whose values must follow a specific format for internal use.
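To make the collision concrete, here is a minimal sketch that uses plain `java.util` maps instead of the crawler's own metadata class. The class name, the `merge` helper and the key "Depth" as an internal field are assumptions chosen for illustration; the real merge in the FetcherBolt may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataCollisionSketch {

    // Hypothetical stand-in for the merge step: every response header is copied
    // into the tuple metadata, replacing any value already stored under the same key.
    static Map<String, String> merge(Map<String, String> metadata,
                                     Map<String, String> responseHeaders) {
        Map<String, String> merged = new HashMap<>(metadata);
        merged.putAll(responseHeaders);
        return merged;
    }

    public static void main(String[] args) {
        // Metadata carried with the URL; "Depth" is assumed to be a crawler-internal key.
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Depth", "2");

        // Headers returned by the server; the webadmin happens to send "Depth" as well.
        Map<String, String> headers = new HashMap<>();
        headers.put("Depth", "not-a-number");
        headers.put("ETag", "\"abc123\"");

        Map<String, String> merged = merge(metadata, headers);
        System.out.println(merged.get("Depth")); // prints "not-a-number"
    }
}
```

The internal value "2" is silently lost, and any component that later parses the field as an integer would fail or misbehave.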

To avoid collisions, why not prefix protocol metadata, e.g. protocol.content-type or http.content-type? This would also make it clear which component sets the metadata, similar to the prefixes fetch. and parse. already used. The drawback is that users would be required to update their configuration.
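The proposal could look roughly like the following sketch, again with plain maps. The class name, the `mergePrefixed` helper, the choice of "protocol." over "http." and the lowercasing of header names are all assumptions made for illustration, not an existing API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class PrefixedProtocolMetadataSketch {

    // Assumed prefix; the issue suggests "protocol." or "http." as candidates.
    static final String PREFIX = "protocol.";

    // Hypothetical merge step: header names are lowercased (an assumption) and
    // prefixed, so they cannot overwrite unprefixed crawler-internal keys.
    static Map<String, String> mergePrefixed(Map<String, String> metadata,
                                             Map<String, String> responseHeaders) {
        Map<String, String> merged = new HashMap<>(metadata);
        for (Map.Entry<String, String> header : responseHeaders.entrySet()) {
            String key = PREFIX + header.getKey().toLowerCase(Locale.ROOT);
            merged.put(key, header.getValue());
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        metadata.put("Depth", "2"); // assumed crawler-internal key

        Map<String, String> headers = new HashMap<>();
        headers.put("Depth", "not-a-number");
        headers.put("Content-Type", "text/html");

        Map<String, String> merged = mergePrefixed(metadata, headers);
        System.out.println(merged.get("Depth"));                 // prints "2"
        System.out.println(merged.get("protocol.depth"));        // prints "not-a-number"
        System.out.println(merged.get("protocol.content-type")); // prints "text/html"
    }
}
```

Under this scheme, a configuration that persists or transfers a header value would have to list the prefixed key (for example protocol.etag instead of ETag), which is the configuration update mentioned above.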
