Complete rewrite of XZ handler. #363
Conversation
martonilles
left a comment
Maybe also add a test file for a multi-stream file?
Force-pushed from 3906728 to 8352ee0.
martonilles
left a comment
I would also add a few unit tests, at least to cover the stream continuation and padding logic.
An integration test for a multi-stream file would also be nice.
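For reference, a minimal sketch of how such a multi-stream fixture could be generated with Python's standard lzma module (the file name and sample contents here are illustrative, not part of this PR):

```python
import lzma

# Two independently compressed XZ streams, concatenated back to back. The .xz
# format allows optional stream padding (a multiple of 4 null bytes) between them.
first = lzma.compress(b"first stream contents", format=lzma.FORMAT_XZ)
second = lzma.compress(b"second stream contents", format=lzma.FORMAT_XZ)

with open("multi_stream.xz", "wb") as fixture:
    fixture.write(first)
    fixture.write(b"\x00" * 4)  # stream padding between the two streams
    fixture.write(second)
```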
@martonilles I fixed the code following your remarks. I had forgotten to add the test file to my commit :) Let me know what you think. I'll squash the commits when approved.
@martonilles No obvious timing differences in pytest runtime (compared with previously merged branches).
The presence of '#' characters in the ASCII part of legitimate hexdumps was leading to flaky tests. We fixed this by removing comment parsing from unhex().
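As a hypothetical illustration only (this is not unblob's actual unhex() implementation), a hexdump parser that reads just the offset and hex columns and does no comment handling is immune to '#' characters appearing in the ASCII gutter:

```python
import re

# Matches an 8-digit offset followed by space-separated hex byte pairs, as in
# `hexdump -C` output; the ASCII gutter at the end of the line is never parsed,
# so a '#' there cannot truncate the line.
HEX_LINE = re.compile(r"^[0-9a-fA-F]{8}\s+((?:[0-9a-fA-F]{2}\s+)+)")

def unhex(dump: str) -> bytes:
    out = bytearray()
    for line in dump.splitlines():
        match = HEX_LINE.match(line)
        if match:
            out.extend(bytes.fromhex(match.group(1).replace(" ", "")))
    return bytes(out)
```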
In order to emit a single ValidChunk for concatenated XZ compression streams, we converted the handler to use Hyperscan, like we did with the bzip2 handler (see c970077).
The idea is that for a given match, we use Hyperscan to look for end-of-stream markers after the current start offset. If we find one, we attempt to decompress the candidate chunk using unblob's file iterator (iterate_file) to limit memory consumption. If decompression succeeds, we adjust the start and end offsets of our context and instruct Hyperscan to continue looking for further end-of-stream marker matches. Otherwise, we stop the search and return.
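As a rough sketch of that search loop (simplified: a plain byte search stands in for Hyperscan, lzma.decompress stands in for the real extraction path, and the helper names are hypothetical; stream padding and false-positive footer matches inside compressed payloads are glossed over):

```python
import lzma
from typing import Optional

FOOTER_MAGIC = b"YZ"  # last two bytes of every XZ stream footer


def is_valid_xz(data: bytes) -> bool:
    """Return True if the candidate chunk decompresses cleanly as XZ."""
    try:
        lzma.decompress(data, format=lzma.FORMAT_XZ)
        return True
    except lzma.LZMAError:
        return False


def calculate_chunk_end(data: bytes, start: int) -> Optional[int]:
    """Extend the chunk end across concatenated streams until a candidate
    end offset no longer decompresses, then return the last valid end."""
    end = None
    search_from = start
    while True:
        hit = data.find(FOOTER_MAGIC, search_from)
        if hit == -1:
            return end  # no more end-of-stream candidates
        candidate_end = hit + len(FOOTER_MAGIC)
        if not is_valid_xz(data[start:candidate_end]):
            return end  # decompression failed: stop the search
        end = candidate_end            # valid chunk so far
        search_from = candidate_end    # continue after this stream
```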
Closes #362