
Conversation

@qkaiser qkaiser commented May 5, 2022

In order to emit a single ValidChunk for concatenated XZ compression streams, we converted the handler to use Hyperscan, as we did for the bzip2 handler (see c970077).

The idea is that, for a given match, we use Hyperscan to look for end-of-stream markers after our current start offset. If we have a match, we attempt to decompress the candidate chunk using unblob's file iterator (iterate_file) to limit memory consumption. If decompression succeeds, we adjust the start and end offsets of our context and instruct Hyperscan to keep looking for further end-of-stream marker matches. Otherwise, we stop the search and return.
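The loop described above can be sketched with the standard library alone. This is a simplified, hypothetical illustration (`find_xz_chunk_end` is not the real handler's API): the actual implementation uses Hyperscan for the marker search and unblob's iterate_file instead of holding the whole buffer in memory.

```python
import lzma

# An XZ stream footer ends with the two magic bytes "YZ". The real handler
# matches an end-of-stream pattern with Hyperscan; this sketch just scans
# for the footer magic in an in-memory buffer.
XZ_FOOTER_MAGIC = b"YZ"


def find_xz_chunk_end(data: bytes, start: int) -> int:
    """Return the end offset of the XZ chunk starting at `start`,
    extended over concatenated streams; -1 if no valid stream ends."""
    end = -1
    search_from = start
    while True:
        hit = data.find(XZ_FOOTER_MAGIC, search_from)
        if hit == -1:
            # No more end-of-stream candidates: keep the last good end.
            return end
        candidate_end = hit + len(XZ_FOOTER_MAGIC)
        try:
            # lzma.decompress transparently handles concatenated XZ
            # streams and raises LZMAError on a truncated/invalid cut.
            lzma.decompress(data[start:candidate_end], format=lzma.FORMAT_XZ)
        except lzma.LZMAError:
            # Failed candidate: stop the search and return, as described.
            return end
        # Successful decompression: adjust the end offset and keep looking.
        end = candidate_end
        search_from = candidate_end
```

With two concatenated streams followed by trailing data, the function returns the offset just past the second stream's footer, so a single chunk covers both streams.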

Closes #362

@qkaiser qkaiser self-assigned this May 6, 2022
@qkaiser qkaiser added this to the v2.5 - Detailed metadata milestone May 6, 2022
@qkaiser qkaiser added enhancement New feature or request format:compression labels May 6, 2022

@martonilles martonilles left a comment

Maybe also add a test file for a multi-stream file?

@qkaiser qkaiser force-pushed the 362-fix-xz-chunk branch 4 times, most recently from 3906728 to 8352ee0 Compare May 12, 2022 14:48
@qkaiser qkaiser requested a review from martonilles May 12, 2022 14:48

@martonilles martonilles left a comment

I would also add a few unit tests, at least to cover the stream continuation and padding logic. An integration test with a multi-stream file would be nice as well.

@qkaiser qkaiser force-pushed the 362-fix-xz-chunk branch from 8352ee0 to 4bf7c27 Compare May 13, 2022 13:00
@qkaiser qkaiser commented May 13, 2022

@martonilles fixed the code following your remarks. I had forgotten to add the test file to my commit :) Let me know what you think. I'll squash the commits when approved.

@qkaiser qkaiser force-pushed the 362-fix-xz-chunk branch from 4bf7c27 to 7bb9fd5 Compare May 13, 2022 13:06
@qkaiser qkaiser force-pushed the 362-fix-xz-chunk branch from 7bb9fd5 to f0ce0ce Compare May 17, 2022 07:26
@qkaiser qkaiser commented May 17, 2022

@martonilles no obvious timing differences in pytest runtime (compared with previously merged branches).

qkaiser added 2 commits May 17, 2022 12:49
The presence of '#' characters in the ASCII column of legitimate hexdumps was leading to flaky tests. We fixed this by removing comment parsing in unhex().
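The failure mode can be illustrated with a simplified, hypothetical reimplementation (the names `unhex` and `unhex_with_comments` and the exact parsing logic are assumptions; the real helper lives in unblob's test utilities): stripping everything after a '#' also truncates hexdump lines whose ASCII column happens to contain '#', so those lines no longer match the line pattern and their bytes are silently dropped.

```python
import re

# Simplified `hexdump -C` line pattern (ignores the extra gap hexdump
# inserts between 8-byte groups).
LINE_RE = re.compile(
    r"^(?P<offset>[0-9a-f]{8})  (?P<hex>(?:[0-9a-f]{2} ?)+) *\|(?P<ascii>.*)\|$"
)


def unhex(dump: str) -> bytes:
    """Hypothetical hexdump parser with NO comment support."""
    out = bytearray()
    for line in dump.splitlines():
        m = LINE_RE.match(line)
        if m:
            out += bytes.fromhex(m["hex"].replace(" ", ""))
    return bytes(out)


def unhex_with_comments(dump: str) -> bytes:
    """Buggy variant: '#'-comment stripping runs first, so a '#' inside
    the ASCII column removes the closing pipe, the line no longer matches
    LINE_RE, and its bytes are silently lost."""
    stripped = "\n".join(line.split("#")[0] for line in dump.splitlines())
    return unhex(stripped)
```

A dump of the bytes b"#abc\n" parses correctly without comment handling, but returns nothing once comment stripping is applied, which is exactly the kind of data-dependent flakiness described above.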
In order to emit a single ValidChunk for concatenated XZ compression
streams, we converted the handler to use Hyperscan, as we did for the
bzip2 handler (see c970077).

The idea is that, for a given match, we use Hyperscan to look for
end-of-stream markers after our current start offset. If we have a match,
we read the whole chunk into memory and attempt to decompress it. If
decompression is successful, we adjust the start and end offsets of our
context and instruct Hyperscan to continue looking for other end-of-stream
marker matches. Otherwise, we stop the search and return.
@qkaiser qkaiser force-pushed the 362-fix-xz-chunk branch from bf21055 to a4d11dc Compare May 17, 2022 10:50
@martonilles martonilles merged commit 6bcfced into main May 17, 2022
@martonilles martonilles deleted the 362-fix-xz-chunk branch May 17, 2022 13:19

Development

Successfully merging this pull request may close these issues.

Stop emitting a ValidChunk for every xz stream
