SURT are not created for HTTP CONNECT requests in WARC file #20

@ARiedijk

Hi, we are using this cdx-indexer tool and found that, when replaying our WACZ files in the ReplayWeb.page player, certain resources were sometimes not found even though they were present in the WARC files.

It turned out that our WARC files contain CONNECT requests, and their target URIs are not converted to a SURT. For example, url=distillery.wistia.com:443 is still distillery.wistia.com:443 after the surt.surt(url) call. The ReplayWeb.page player checks whether index.idx uses SURT keys with useSurt = prefix.indexOf(")/") > 0; in MultiWacz.js. If the last line of a block happens to be a CONNECT entry, that block is treated as surt = false in the CDX, and querying the browser DB with the upperBound method then does not work properly.
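To make the failure mode concrete, here is a minimal Python sketch of the check quoted above. It only mirrors the `indexOf(")/") > 0` expression from this issue; the function name `looks_like_surt` is made up for illustration and is not part of ReplayWeb.page or this indexer.

```python
def looks_like_surt(prefix: str) -> bool:
    # Mirrors the JavaScript check quoted above: prefix.indexOf(")/") > 0
    # str.find returns -1 when the substring is absent, like indexOf.
    return prefix.find(")/") > 0


# A regular SURT key passes the check.
print(looks_like_surt("com,wistia,distillery)/"))    # True

# The unconverted CONNECT target contains no ")/" and fails it.
print(looks_like_surt("distillery.wistia.com:443"))  # False
```

A single unconverted key in the position the player inspects is therefore enough to flip the result for that block.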

Given:

A WARC file containing this record:

WARC/1.0
Content-Length: 308
Content-Type: application/http;msgtype=request
WARC-Block-Digest: sha1:XDTRC67IG3EYGKYRBFK7BOYLBRJHW52X
WARC-Date: 2022-09-14T14:45:01Z
WARC-Payload-Digest: sha1:3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ
WARC-Record-ID: <urn:uuid:d083e59a-e1c5-4079-bb20-cf6115fa342d>
WARC-Target-URI: distillery.wistia.com:443
WARC-Type: request

CONNECT distillery.wistia.com:443 HTTP/1.1
Accept-Encoding: *, compress;q=0, br;q=0
Content-Length: 0
Host: distillery.wistia.com:443
Proxy-Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/105.0.5195.102 Safari/537.36

When running the cdxj_indexer with the following parameters:

main.py -p -o index.idx -c index.cdx.gz -s -d -l 1024 small.warc

Then the resulting index is:

!meta 0 {"format": "cdxj-gzip-1.0", "filename": "c:\\temp\\index.cdx.gz"}
distillery.wistia.com:443 20220914144501 {"offset": 0, "length": 371, "digest": "sha256:8e8d3aa0f13b077615de09a2d349121130ec5fca9783c97d10c07721e1d13585"}

Expected:

com,wistia,distillery)/ 20220914144501 {"offset": 0, "length": 377, "digest": "sha256:b75ede157ec02f31a25126270771b287d1ccc42554c9678ebc2c1446249a554d"}
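One possible direction, sketched here under assumptions: give the authority-form CONNECT target (`host:port`) a scheme before it is turned into a SURT key. Both helpers below are hypothetical and use only the standard library. `simple_surt` is a deliberately simplified stand-in for illustration, not the `surt` library's canonicalizer, and the choice of `https` for port 443 is an assumption.

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def connect_target_to_url(target: str) -> str:
    """Turn an authority-form target ('host:port') into a URL with a scheme."""
    if "://" in target:
        return target  # already a full URL, leave it alone
    host, _, port = target.rpartition(":")
    if not host or not port.isdigit():
        return target  # not in host:port form, leave it alone
    # Assumption: port 443 implies https, anything else http.
    scheme = "https" if port == "443" else "http"
    return f"{scheme}://{host}:{port}/"


def simple_surt(url: str) -> str:
    """Simplified SURT-style key: reversed host labels, then ')' and the path."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    key = ",".join(reversed(host.split(".")))
    # Keep the port only when it is not the default for the scheme.
    if parts.port and parts.port != DEFAULT_PORTS.get(parts.scheme):
        key += f":{parts.port}"
    return f"{key}){parts.path or '/'}"


print(simple_surt(connect_target_to_url("distillery.wistia.com:443")))
# com,wistia,distillery)/
```

With this sketch the CONNECT target from the record above yields the key shown under "Expected", which contains `)/` and so satisfies the player's check.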
