Skip to content

Optimize LZ OutWindow.CopyBlock#907

Merged
adamhathcock merged 1 commit intoadamhathcock:masterfrom
jdpurcell:pr-copyblockoptim
Mar 20, 2025
Merged

Optimize LZ OutWindow.CopyBlock#907
adamhathcock merged 1 commit intoadamhathcock:masterfrom
jdpurcell:pr-copyblockoptim

Conversation

@jdpurcell
Copy link
Contributor

@jdpurcell jdpurcell commented Mar 19, 2025

After a bit of profiling, this function seemed like a good candidate for optimization. Testing on a large-ish archive (412 MB), I'm seeing extraction times like:

CPU Before After % of Original
Apple M3 31.1 s 26.8 s 86%
Core i7-6700k 37.2 s 33.8 s 91%

@jdpurcell jdpurcell marked this pull request as draft March 19, 2025 06:03
@jdpurcell
Copy link
Contributor Author

jdpurcell commented Mar 20, 2025

After more testing I found that my initial commit despite helping a lot with the Qt archive, could slow down others. An example is here (426 MB). I investigated the differences and determined that CopyBlock is called in two different ways, one directly, and one indirectly via CopyPending. Despite my best attempts, I couldn't figure out a way to avoid splitting them up. It's very counterintuitive, it almost seems like CopyPending could benefit more from this optimization since it's often copying more data on average, but I think the issue is about 7% of the time in my testing it gets called with a 1-length request, whereas CopyBlock never does. And there's no way to work around the performance hit with ifs; the only thing I can guess is that 7% is enough to really mess up the branch prediction. (EDIT: I failed to notice although CopyPending has the potential to copy more data judging by _pendingLen, it's often limited to just a single byte by the _total vs _limit comparison) For whatever reason, archives like the Qt one are biased way more to calling CopyBlock vs CopyPending, whereas some others are quite the opposite.

As much as I hated having to duplicate some of the code, I did my best to clean it up at the same time so it's not too bad overall in my opinion. After splitting it up like this, I still get the big performance gains from the Qt archive, I found a couple others that were quicker too, and then another handful including the one I just linked that at least don't regress in performance anymore (if anything maybe a slight increase but it's probably a drop in the bucket). Despite the perhaps-not-as-simple code, I feel it's worth it for the performance potential. If you agree, feel free to merge, and if not, no worries - I am already using a custom build and could continue to do so.

@jdpurcell jdpurcell marked this pull request as ready for review March 20, 2025 03:30
Copy link
Owner

@adamhathcock adamhathcock left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copying for a reason is always good.

@adamhathcock
Copy link
Owner

Thanks!

@adamhathcock adamhathcock merged commit 227f66f into adamhathcock:master Mar 20, 2025
4 checks passed
@jdpurcell jdpurcell deleted the pr-copyblockoptim branch March 20, 2025 13:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants