Describe the bug
I wrote a simple test to produce a file with 1 million 4-byte incrementing ints, and tried to compress it with various settings. I tried big and little endian, and I tried starting at both 0 and 0x12345678 to fill all bytes. I'm sure there are a lot more interesting cases, but this seemed like a good starting point and revealed a lot already. Zstd generally did fairly poorly, but there were some interesting results that give me a glimmer of hope that there is room for significant improvement.
Interesting findings:
- Compression levels 1 through 17 behave roughly the same regardless of input, reducing to roughly 67% of initial size.
- Something changes at level 18, and for most inputs it gives significantly better results (see the level-sweep sketch after the results table).
- `--format=xz -1` still compresses this data much better and faster than `--format=zstd -18`.
- Zstd is very sensitive to endianness and base value, while xz barely changes. (I wrote this before noticing that I wasn't passing `-1` when testing xz as I intended to. When I added it and reran the tests, only one result really changed size, and it got smaller...)
Results:
| endian | base | zstd -18 | xz -1 | notes |
|---|---|---|---|---|
| Big | 0 | 10.46% in 0.791s | 6.47% in 0.272s | best result for zstd |
| Big | 0x12345678 | 26.84% in 0.578s | 6.38% in 0.263s | |
| Little | 0 | 67.21% in 0.532s | 2.52% in 0.243s | Worst result for zstd, -18 is same compression as -1. Best result for xz, although default level compresses to 6.30% |
| Little | 0x12345678 | 25.43% in 0.568s | 6.37% in 0.257s | With a non-zero base, endianness doesn't matter |
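To illustrate the level 17 → 18 jump, here is a minimal level-sweep sketch. It assumes the third-party `zstandard` Python binding rather than the CLI used for the table above, so the exact numbers may differ slightly.

```python
# Level sweep over the generated test file, assuming `pip install zstandard`.
import zstandard

with open('test.bin', 'rb') as f:
    data = f.read()

for level in range(1, 20):
    compressed = zstandard.ZstdCompressor(level=level).compress(data)
    print(f'level {level:2d}: {len(compressed) / len(data):6.2%}')
```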
To Reproduce
I used this Python file to generate the test data, toggling the commented-out lines.
```python
from array import array
BASE = 0
#BASE = 0x12345678 # uncomment to test non-zero high bytes
a = array('i')
for i in range(1000000):
    a.append(BASE + i)
#a.byteswap() # uncomment to try big-endian
with open('test.bin', 'wb') as f:
    a.tofile(f)
```
I ran `python ints.py && time zstd --format=zstd -18 test.bin -o /dev/null; time zstd --format=xz -1 test.bin -o /dev/null` to see the compression ratios and times for zstd and xz. Other formats and levels were also tested, but that is the command I used to produce the table above.
Expected behavior
I know this isn't necessarily the target use case for zstd, but it still seemed to do much worse than I expected. In particular:
- The sensitivity to endianness is very surprising to me. It might indicate either a bug or low-hanging fruit for improvement.
- The stark change going from level 17 to level 18 for very regular data like this implies that there is some feature enabled at 18 that isn't enabled earlier. Maybe it should be? And if not, it would be nice to know what it is, so that use cases where this kind of data is likely can opt in to it separately without also paying for larger window sizes and the like.
- I was actually more surprised by how little xz was affected by the choice of base value at its default level, but the extent to which it affected zstd seemed excessive.
- It seemed interesting that the best case for `xz -1` was also the worst case for `zstd -18`.
Screenshots and charts
None
Desktop (please complete the following information):
- OS: Arch Linux in WSL2
- Version: `*** zstd command line interface 64-bits v1.5.1, by Yann Collet ***`
- Compiler/Flags: I used the default Arch packages. I think they are using gcc -O2.
- Other relevant hardware specs: 8-core i9-9880H
- Build system: pacman
Additional context
None