Problem when URL is malformed #8

@PedroG1515

Describe the bug

Indexing stops entirely as soon as a record with a malformed URL is encountered.

Steps to reproduce the bug

For the URL "http://eosims.asf.alaska.edu:12355.edu:80/", the cdxj-indexer fails with:

Traceback (most recent call last):
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/bin/cdx-indexer", line 8, in <module>
    sys.exit(main())
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/cdxindexer.py", line 469, in main
    minimal=cmd.minimal_cdxj)
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/cdxindexer.py", line 301, in write_multi_cdx_index
    for entry in entry_iter:
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/archiveindexer.py", line 339, in __call__
    for entry in entry_iter:
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/archiveindexer.py", line 172, in create_record_iter
    entry['urlkey'] = canonicalize(entry['url'], surt_ordered)
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/utils/canonicalize.py", line 48, in canonicalize
    raise UrlCanonicalizeException('Invalid Url: ' + url)
pywb.utils.canonicalize.UrlCanonicalizeException: Invalid Url: http://eosims.asf.alaska.edu:12355.edu:80/

The exception is not caught, so the whole process stops.

Expected behavior

Wouldn't it be better to handle errors record by record? If one record raises an error, the indexer could skip it and continue with the next record in the same WARC.
