
Conversation

@qkaiser qkaiser commented May 5, 2022

In order to emit a single ValidChunk for concatenated XZ compression streams, we converted the handler to use Hyperscan, as we did for the bzip2 handler (see c970077).

The idea is that, for a given match, we use Hyperscan to look for end-of-stream markers after our current start offset. If we have a match, we attempt to decompress the candidate chunk using unblob's file iterator (iterate_file) to limit memory consumption. If decompression succeeds, we adjust the start and end offsets of our context and instruct Hyperscan to keep looking for further end-of-stream marker matches. Otherwise, we stop the search and return.
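The loop described above can be sketched with the standard library alone. This is a simplified, hypothetical illustration (`find_xz_chunk_end` is not the real handler's API): the actual implementation uses Hyperscan for the marker search and unblob's iterate_file instead of holding the whole buffer in memory.

```python
import lzma

# An XZ stream footer ends with the two magic bytes "YZ". The real handler
# matches an end-of-stream pattern with Hyperscan; this sketch just scans
# for the footer magic in an in-memory buffer.
XZ_FOOTER_MAGIC = b"YZ"


def find_xz_chunk_end(data: bytes, start: int) -> int:
    """Return the end offset of the XZ chunk starting at `start`,
    extended over concatenated streams; -1 if no valid stream ends."""
    end = -1
    search_from = start
    while True:
        hit = data.find(XZ_FOOTER_MAGIC, search_from)
        if hit == -1:
            # No more end-of-stream candidates: keep the last good end.
            return end
        candidate_end = hit + len(XZ_FOOTER_MAGIC)
        try:
            # lzma.decompress transparently handles concatenated XZ
            # streams and raises LZMAError on a truncated/invalid cut.
            lzma.decompress(data[start:candidate_end], format=lzma.FORMAT_XZ)
        except lzma.LZMAError:
            # Failed candidate: stop the search and return, as described.
            return end
        # Successful decompression: adjust the end offset and keep looking.
        end = candidate_end
        search_from = candidate_end
```

With two concatenated streams followed by trailing data, the function returns the offset just past the second stream's footer, so a single chunk covers both streams.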

Closes #362

@qkaiser qkaiser self-assigned this May 6, 2022
@qkaiser qkaiser added this to the v2.5 - Detailed metadata milestone May 6, 2022
@qkaiser qkaiser added enhancement New feature or request format:compression labels May 6, 2022

@martonilles martonilles left a comment

Maybe also add a test file for a multi-stream file?

@qkaiser qkaiser force-pushed the 362-fix-xz-chunk branch 4 times, most recently from 3906728 to 8352ee0 Compare May 12, 2022 14:48
@qkaiser qkaiser requested a review from martonilles May 12, 2022 14:48

@martonilles martonilles left a comment

I would also add a few unit tests, at least to cover the stream continuation and padding logic. An integration test with a multi-stream file would be nice as well.

@qkaiser qkaiser force-pushed the 362-fix-xz-chunk branch from 8352ee0 to 4bf7c27 Compare May 13, 2022 13:00
@qkaiser qkaiser commented May 13, 2022

@martonilles fixed the code following your remarks. I had forgotten to add the test file to my commit :) Let me know what you think. I'll squash the commits when approved.

@qkaiser qkaiser force-pushed the 362-fix-xz-chunk branch from 4bf7c27 to 7bb9fd5 Compare May 13, 2022 13:06
@qkaiser qkaiser force-pushed the 362-fix-xz-chunk branch from 7bb9fd5 to f0ce0ce Compare May 17, 2022 07:26
@qkaiser qkaiser commented May 17, 2022

@martonilles no obvious timing differences in pytest runtime (compared with previously merged branches).

qkaiser added 2 commits May 17, 2022 12:49
The presence of '#' characters in the ASCII column of legitimate hexdumps was leading to flaky tests. We fixed this by removing comment parsing in unhex().
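The failure mode can be illustrated with a simplified, hypothetical reimplementation (the names `unhex` and `unhex_with_comments` and the exact parsing logic are assumptions; the real helper lives in unblob's test utilities): stripping everything after a '#' also truncates hexdump lines whose ASCII column happens to contain '#', so those lines no longer match the line pattern and their bytes are silently dropped.

```python
import re

# Simplified `hexdump -C` line pattern (ignores the extra gap hexdump
# inserts between 8-byte groups).
LINE_RE = re.compile(
    r"^(?P<offset>[0-9a-f]{8})  (?P<hex>(?:[0-9a-f]{2} ?)+) *\|(?P<ascii>.*)\|$"
)


def unhex(dump: str) -> bytes:
    """Hypothetical hexdump parser with NO comment support."""
    out = bytearray()
    for line in dump.splitlines():
        m = LINE_RE.match(line)
        if m:
            out += bytes.fromhex(m["hex"].replace(" ", ""))
    return bytes(out)


def unhex_with_comments(dump: str) -> bytes:
    """Buggy variant: '#'-comment stripping runs first, so a '#' inside
    the ASCII column removes the closing pipe, the line no longer matches
    LINE_RE, and its bytes are silently lost."""
    stripped = "\n".join(line.split("#")[0] for line in dump.splitlines())
    return unhex(stripped)
```

A dump of the bytes b"#abc\n" parses correctly without comment handling, but returns nothing once comment stripping is applied, which is exactly the kind of data-dependent flakiness described above.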
In order to emit a single ValidChunk for concatenated XZ compression
streams, we converted the handler to use Hyperscan, as we did for the
bzip2 handler (see c970077).

The idea is that, for a given match, we use Hyperscan to look for
end-of-stream markers after our current start offset. If we have a match,
we read the whole chunk into memory and attempt to decompress it. If
decompression is successful, we adjust the start and end offsets of our
context and instruct Hyperscan to continue looking for other end-of-stream
marker matches. Otherwise, we stop the search and return.
@qkaiser qkaiser force-pushed the 362-fix-xz-chunk branch from bf21055 to a4d11dc Compare May 17, 2022 10:50
@martonilles martonilles merged commit 6bcfced into main May 17, 2022
@martonilles martonilles deleted the 362-fix-xz-chunk branch May 17, 2022 13:19

Development

Successfully merging this pull request may close these issues.

Stop emitting a ValidChunk for every xz stream
