We've run into a few situations where WARCs delivered from Archive-It cause our indexing to break when using the --post-append option. The error is:
Traceback (most recent call last):
  File "/Users/edsu/Projects/cdxj-indexer/cdxj_indexer/main.py", line 553, in <module>
    main()
    ~~~~^^
  File "/Users/edsu/Projects/cdxj-indexer/cdxj_indexer/main.py", line 518, in main
    write_cdx_index(cmd.output, cmd.inputs, vars(cmd))
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/edsu/Projects/cdxj-indexer/cdxj_indexer/main.py", line 533, in write_cdx_index
    indexer.process_all()
    ~~~~~~~~~~~~~~~~~~~^^
  File "/Users/edsu/Projects/cdxj-indexer/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
    ~~~~~~~~~~~~~~~~~~~^^
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/Users/edsu/Projects/cdxj-indexer/cdxj_indexer/main.py", line 248, in process_one
    for record in wrap_it:
                  ^^^^^^^
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/cdxj_indexer/bufferiter.py", line 50, in buffering_record_iter
    join_req_resp(req, resp, post_append, url_key_func)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/cdxj_indexer/bufferiter.py", line 110, in join_req_resp
    query, append_str = append_method_query_from_req_resp(req, resp)
                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/cdxj_indexer/postquery.py", line 26, in append_method_query_from_req_resp
    return append_method_query(method, content_type, len_, stream, url)
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/cdxj_indexer/postquery.py", line 35, in append_method_query
    query = query_extract(content_type, len_, stream, url)
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/cdxj_indexer/postquery.py", line 107, in query_extract
    values.append((part.name, part.value))
                              ^^^^^^^^^^
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/multipart.py", line 878, in value
    return self.raw.decode(self.charset)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 76: invalid continuation byte
I've attached a minimal WARC file containing real request/response records that lets you reproduce the problem:
$ cdxj_indexer --post-append records.warc
------WebKitFormBoundaryXcDvJJ9bNZr1ZUTB
Content-Disposition: form-data; name="post_0"; filename="blob"
Content-Type: application/octet-stream
...
Attachment: record.warc.zip
It looks to me like there is a part in the request body that is an application/octet-stream containing binary data, but that is being parsed as UTF-8?

See: sul-dlss/was_robot_suite#805
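To illustrate the suspected cause, here is a minimal standard-library sketch. It is not cdxj-indexer or multipart code. The sample bytes are made up, and part_value_as_text is a hypothetical helper name. It shows that a 0xe0 byte followed by a non-continuation byte produces the same "invalid continuation byte" error under strict UTF-8 decoding, and that one possible workaround is to fall back to a lossy decode for binary parts. Whether replacing bytes, skipping the part, or encoding it some other way is the right behavior for --post-append is an open question.

```python
def part_value_as_text(raw: bytes, charset: str = "utf-8") -> str:
    """Hypothetical helper: decode a form part without raising on binary data."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        # Assumption: a lossy decode is acceptable for building the index key.
        # Undecodable bytes become U+FFFD instead of aborting the whole run.
        return raw.decode(charset, errors="replace")


# Made-up sample: 0xe0 opens a three-byte UTF-8 sequence, so the next byte
# must be a continuation byte (0x80-0xbf). 0x41 ("A") is not one.
sample = b"abc\xe0Adef"

try:
    sample.decode("utf-8")
except UnicodeDecodeError as err:
    print(err.reason)  # the same reason string as in the traceback

print(repr(part_value_as_text(sample)))
```

Text parts that are valid UTF-8 pass through the helper unchanged, so only binary parts such as the application/octet-stream blob would be affected.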