utf-8 error during post-append indexing #30

@edsu

We've run into a few situations where WARCs delivered from Archive-It cause our indexing to break when using the --post-append option. The error is:

Traceback (most recent call last):
  File "/Users/edsu/Projects/cdxj-indexer/cdxj_indexer/main.py", line 553, in <module>
    main()
    ~~~~^^
  File "/Users/edsu/Projects/cdxj-indexer/cdxj_indexer/main.py", line 518, in main
    write_cdx_index(cmd.output, cmd.inputs, vars(cmd))
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/edsu/Projects/cdxj-indexer/cdxj_indexer/main.py", line 533, in write_cdx_index
    indexer.process_all()
    ~~~~~~~~~~~~~~~~~~~^^
  File "/Users/edsu/Projects/cdxj-indexer/cdxj_indexer/main.py", line 214, in process_all
    super().process_all()
    ~~~~~~~~~~~~~~~~~~~^^
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/warcio/indexer.py", line 33, in process_all
    self.process_one(fh, out, filename)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
  File "/Users/edsu/Projects/cdxj-indexer/cdxj_indexer/main.py", line 248, in process_one
    for record in wrap_it:
                  ^^^^^^^
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/cdxj_indexer/bufferiter.py", line 50, in buffering_record_iter
    join_req_resp(req, resp, post_append, url_key_func)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/cdxj_indexer/bufferiter.py", line 110, in join_req_resp
    query, append_str = append_method_query_from_req_resp(req, resp)
                        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/cdxj_indexer/postquery.py", line 26, in append_method_query_from_req_resp
    return append_method_query(method, content_type, len_, stream, url)
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/cdxj_indexer/postquery.py", line 35, in append_method_query
    query = query_extract(content_type, len_, stream, url)
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/cdxj_indexer/postquery.py", line 107, in query_extract
    values.append((part.name, part.value))
                              ^^^^^^^^^^
  File "/Users/edsu/.pyenv/versions/3.13.0/lib/python3.13/site-packages/multipart.py", line 878, in value
    return self.raw.decode(self.charset)
           ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 76: invalid continuation byte

I've attached a minimal WARC file containing real request/response records that lets you reproduce the problem:

record.warc.zip

$ cdxj_indexer --post-append records.warc

It looks to me like the request body contains a part of type application/octet-stream holding binary data, and that part is being decoded as UTF-8?

------WebKitFormBoundaryXcDvJJ9bNZr1ZUTB
Content-Disposition: form-data; name="post_0"; filename="blob"
Content-Type: application/octet-stream

...

See: sul-dlss/was_robot_suite#805
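
To illustrate what I think is going on, here is a minimal sketch using only the standard library. It reproduces the same kind of decode failure on a few bytes of binary data, and shows one possible way to tolerate it. The helper name `safe_part_value` is hypothetical and not part of cdxj-indexer or multipart; falling back to `errors="replace"` is just an assumption about what an acceptable fix might look like, not a claim about how the library should handle it.

```python
def safe_part_value(raw: bytes, charset: str = "utf-8") -> str:
    """Hypothetical helper: decode a form part, tolerating binary data.

    Tries a strict decode first (which is what fails in the traceback
    above), then falls back to replacing undecodable bytes with U+FFFD.
    """
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        # Assumption: a lossy value is acceptable here.
        return raw.decode(charset, errors="replace")


# 0xe0 starts a three-byte UTF-8 sequence, but "A" is not a valid
# continuation byte, so a strict decode raises UnicodeDecodeError.
binary_part = b"abc\xe0A"

try:
    binary_part.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError as err:
    strict_decode_failed = True
    print("strict decode failed:", err.reason)

print(repr(safe_part_value(binary_part)))
```

Another option might be to skip parts whose Content-Type is not textual instead of decoding them at all, but I don't know which behavior would be preferable for the index key.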
