Poor compression of binary numeric data with surprising results #3014

@RedBeard0531

Description

Describe the bug

I wrote a simple test that produces a file containing 1 million 4-byte incrementing ints and tried to compress it with various settings. I tried big and little endian, and I tried starting at both 0 and 0x12345678 so that all bytes are exercised. I'm sure there are many more interesting cases, but this seemed like a good starting point and already revealed a lot. Zstd generally did fairly poorly, but there were some interesting results that give me a glimmer of hope that there is room for significant improvement.

Interesting findings:

  1. Compression levels 1 through 17 behave roughly the same regardless of input, reducing the file to roughly 67% of its initial size.
  2. Something changes at level 18, which gives significantly better results for most inputs.
  3. --format=xz -1 still compresses this data much better and faster than --format=zstd -18.
  4. Zstd is very sensitive to endianness and base value, while xz barely changes. (I wrote this before noticing that I wasn't passing -1 when testing xz as I intended. When I added it and reran the tests, only one result really changed size, and it got smaller...)
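
Zstd has no stdlib Python binding, but the shape of the xz result can be sanity-checked with the standard library, using zlib as a rough stand-in for the byte-oriented LZ family (a sketch, not the exact CLI invocations; 100k ints instead of 1 million for speed):

```python
import lzma
import sys
import zlib
from array import array

# Little-endian, base-0 variant from the table (100k ints for speed).
a = array('i', range(100_000))
if sys.byteorder != 'little':
    a.byteswap()  # force little-endian regardless of host byte order
raw = a.tobytes()

xz_like = lzma.compress(raw, preset=1)  # roughly comparable to `xz -1`
deflate = zlib.compress(raw, 6)         # generic LZ77-family stand-in

print(f"lzma preset 1: {len(xz_like) / len(raw):.2%}")
print(f"zlib level 6:  {len(deflate) / len(raw):.2%}")
```

On this input lzma lands in the low single digits while zlib, like zstd at most levels, stays far higher.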

Results:

| Endian | Base       | zstd -18         | xz -1           | Notes |
|--------|------------|------------------|-----------------|-------|
| Big    | 0          | 10.46% in 0.791s | 6.47% in 0.272s | Best result for zstd |
| Big    | 0x12345678 | 26.84% in 0.578s | 6.38% in 0.263s | |
| Little | 0          | 67.21% in 0.532s | 2.52% in 0.243s | Worst result for zstd; -18 gives the same compression as -1. Best result for xz, although its default level compresses to 6.30% |
| Little | 0x12345678 | 25.43% in 0.568s | 6.37% in 0.257s | With a non-zero base, endianness doesn't matter |

To Reproduce

I used this Python file to generate the test data, toggling the commented-out lines.

from array import array

BASE = 0
#BASE = 0x12345678 # uncomment to test non-zero high bytes

a = array('i')
for i in range(1000000):
    a.append(BASE + i)

#a.byteswap() # uncomment to test big-endian

with open('test.bin', 'wb') as f:
    a.tofile(f)

I ran `python ints.py && time zstd --format=zstd -18 test.bin -o /dev/null; time zstd --format=xz -1 test.bin -o /dev/null` to see the compression ratios and times for zstd and xz. Other formats and levels were also tested, but that is the command I used to produce the table above.
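
For what it's worth, the four variants in the table can also be generated host-independently in one pass; a sketch (the helper name `make_ints` is my own, not from the original repro):

```python
import sys
from array import array

def make_ints(base: int, big_endian: bool, n: int = 1_000_000) -> bytes:
    """Serialize n incrementing 4-byte ints starting at base."""
    a = array('i', (base + k for k in range(n)))
    # array('i') uses host byte order; swap when it differs from the target.
    if big_endian != (sys.byteorder == 'big'):
        a.byteswap()
    return a.tobytes()

# e.g. the little-endian, base-0 case from the table:
data = make_ints(0, big_endian=False)
print(len(data))  # 4 bytes per int on CPython, so 4,000,000 bytes
```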

Expected behavior

I know this isn't necessarily the target use case for zstd, but it still seemed to do much worse than I expected. In particular:

  1. The sensitivity to endianness is very surprising to me. It might indicate either a bug or a low hanging fruit for improvement.
  2. The stark change going from level 17 to level 18 for very regular data like this implies that there is some feature enabled at 18 that isn't enabled earlier. Maybe it should be? And if not, it would be nice to know what it is so that use cases where this is likely can opt in to it separately without also using larger window sizes and the like.
  3. I was actually more surprised by how little xz was affected by the choice of base value at its default level, but the extent to which the base value affected zstd seemed excessive.
  4. It seemed interesting that the best case for xz -1 was also the worst case for zstd -18.
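
As a possibly relevant aside on points 1 and 4: xz ships a delta filter (`--delta=dist=N`) designed for fixed-stride numeric data, and applying an equivalent byte-stride delta before a generic LZ stage makes this input almost free to compress in either byte order. A sketch using stdlib zlib as the stand-in compressor (an illustration of the data's structure, not a description of what zstd or xz actually did in the tests above):

```python
import sys
import zlib
from array import array

a = array('i', range(100_000))  # incrementing 4-byte ints, scaled down
if sys.byteorder != 'little':
    a.byteswap()
raw_le = a.tobytes()
a.byteswap()
raw_be = a.tobytes()

def delta(data: bytes, dist: int = 4) -> bytes:
    # Difference each byte against the byte `dist` positions back (mod 256),
    # keeping the first `dist` bytes verbatim so the transform is reversible.
    return data[:dist] + bytes(
        (data[i] - data[i - dist]) & 0xFF for i in range(dist, len(data))
    )

for name, data in (("little", raw_le), ("big", raw_be)):
    plain = len(zlib.compress(data, 6)) / len(data)
    filtered = len(zlib.compress(delta(data), 6)) / len(data)
    print(f"{name}-endian: raw {plain:.2%}, delta dist=4 {filtered:.2%}")
```

With dist=4 each byte lane of the int is differenced against itself, so the stream collapses to mostly `01 00 00 00` (or `00 00 00 01` for big-endian) regardless of base value; that is roughly the regularity that LZMA's modelling appears to exploit on its own.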

Screenshots and charts
None

Desktop (please complete the following information):

  • OS: Arch Linux in WSL2
  • Version: *** zstd command line interface 64-bits v1.5.1, by Yann Collet ***
  • Compiler/Flags: I used the default arch packages. I think they are using gcc -O2.
  • Other relevant hardware specs: 8-core i9-9880H
  • Build system: pacman

