Describe the bug
Indexing stops completely when a record contains a malformed URL.
Steps to reproduce the bug
For the url "http://eosims.asf.alaska.edu:12355.edu:80/" the cdxj-indexer returns:
Traceback (most recent call last):
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/bin/cdx-indexer", line 8, in <module>
    sys.exit(main())
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/cdxindexer.py", line 469, in main
    minimal=cmd.minimal_cdxj)
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/cdxindexer.py", line 301, in write_multi_cdx_index
    for entry in entry_iter:
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/archiveindexer.py", line 339, in __call__
    for entry in entry_iter:
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/indexer/archiveindexer.py", line 172, in create_record_iter
    entry['urlkey'] = canonicalize(entry['url'], surt_ordered)
  File "/mnt/c/Users/pgomes/Desktop/Code/venv/lib/python3.6/site-packages/pywb/utils/canonicalize.py", line 48, in canonicalize
    raise UrlCanonicalizeException('Invalid Url: ' + url)
pywb.utils.canonicalize.UrlCanonicalizeException: Invalid Url: http://eosims.asf.alaska.edu:12355.edu:80/
This exception stops the whole process.
Expected behavior
Wouldn't it be better to handle errors record by record? If one record fails, the indexer should skip it and continue with the next record in the same WARC.
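To illustrate the behavior I have in mind, here is a minimal, self-contained sketch of skipping a bad record instead of aborting. It does not use pywb itself: `InvalidUrlError`, `make_urlkey` and `iter_index_entries` are hypothetical stand-ins I made up for `UrlCanonicalizeException`, `canonicalize` and the loop in `create_record_iter`, and the key format is simplified. The real fix might look quite different.

```python
import logging
from urllib.parse import urlsplit

logger = logging.getLogger("indexer-sketch")


class InvalidUrlError(ValueError):
    """Hypothetical stand-in for pywb's UrlCanonicalizeException."""


def make_urlkey(url):
    """Hypothetical stand-in for canonicalize(); not pywb's real logic."""
    parts = urlsplit(url)
    try:
        parts.port  # accessing .port raises ValueError for a non-numeric port
    except ValueError as exc:
        raise InvalidUrlError("Invalid Url: " + url) from exc
    if not parts.hostname:
        raise InvalidUrlError("Invalid Url: " + url)
    # Simplified key (host + path); the real urlkey format is different.
    return parts.hostname.lower() + parts.path


def iter_index_entries(entries):
    """Yield entries with a urlkey, logging and skipping malformed URLs."""
    for entry in entries:
        try:
            entry["urlkey"] = make_urlkey(entry["url"])
        except InvalidUrlError as exc:
            # One bad record should not stop the rest of the WARC.
            logger.warning("Skipping record: %s", exc)
            continue
        yield entry


if __name__ == "__main__":
    records = [
        {"url": "http://example.com/a"},
        {"url": "http://eosims.asf.alaska.edu:12355.edu:80/"},
        {"url": "http://example.org/b"},
    ]
    for entry in iter_index_entries(records):
        print(entry["urlkey"])
```

With this pattern the malformed record is reported in the log and the two valid records are still indexed. Whether skipped records should only be logged or also counted and reported at the end is a separate design question.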