Poor compression of binary numeric data with surprising results #3014

@RedBeard0531

Description

Describe the bug

I wrote a simple test that produces a file containing 1 million 4-byte incrementing ints and tried to compress it with various settings. I tried big and little endian, and I tried starting at both 0 and 0x12345678 so that all bytes are exercised. I'm sure there are many more interesting cases, but this seemed like a good starting point and already revealed a lot. Zstd generally did fairly poorly, but there were some interesting results that give me a glimmer of hope that there is room for significant improvement.

Interesting findings:

  1. Compression levels 1 through 17 behave roughly the same regardless of input, reducing the file to roughly 67% of its initial size.
  2. Something changes at level 18, which gives significantly better results for most inputs.
  3. --format=xz -1 still compresses this data much better and faster than --format=zstd -18.
  4. Zstd is very sensitive to endianness and base value, while xz barely changes. (I wrote this before noticing that I wasn't passing -1 when testing xz as I intended. When I added it and reran the tests, only one result really changed size, and it got smaller...)
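
Zstd has no stdlib Python binding, but the shape of the xz result can be sanity-checked with the standard library, using zlib as a rough stand-in for the byte-oriented LZ family (a sketch, not the exact CLI invocations; 100k ints instead of 1 million for speed):

```python
import lzma
import sys
import zlib
from array import array

# Little-endian, base-0 variant from the table (100k ints for speed).
a = array('i', range(100_000))
if sys.byteorder != 'little':
    a.byteswap()  # force little-endian regardless of host byte order
raw = a.tobytes()

xz_like = lzma.compress(raw, preset=1)  # roughly comparable to `xz -1`
deflate = zlib.compress(raw, 6)         # generic LZ77-family stand-in

print(f"lzma preset 1: {len(xz_like) / len(raw):.2%}")
print(f"zlib level 6:  {len(deflate) / len(raw):.2%}")
```

On this input lzma lands in the low single digits while zlib, like zstd at most levels, stays far higher.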

Results:

| Endian | Base       | zstd -18         | xz -1           | Notes |
|--------|------------|------------------|-----------------|-------|
| Big    | 0          | 10.46% in 0.791s | 6.47% in 0.272s | Best result for zstd |
| Big    | 0x12345678 | 26.84% in 0.578s | 6.38% in 0.263s | |
| Little | 0          | 67.21% in 0.532s | 2.52% in 0.243s | Worst result for zstd; -18 gives the same compression as -1. Best result for xz, although its default level compresses to 6.30% |
| Little | 0x12345678 | 25.43% in 0.568s | 6.37% in 0.257s | With a non-zero base, endianness doesn't matter |

To Reproduce

I used this Python file to generate the test data, toggling the commented-out lines.

from array import array

BASE = 0
#BASE = 0x12345678 # uncomment to test non-zero high bytes

a = array('i')
for i in range(1000000):
    a.append(BASE + i)

#a.byteswap() # uncomment to test big-endian

with open('test.bin', 'wb') as f:
    a.tofile(f)

I ran `python ints.py && time zstd --format=zstd -18 test.bin -o /dev/null; time zstd --format=xz -1 test.bin -o /dev/null` to see the compression ratios and times for zstd and xz. Other formats and levels were also tested, but that is the command I used to produce the table above.
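
For what it's worth, the four variants in the table can also be generated host-independently in one pass; a sketch (the helper name `make_ints` is my own, not from the original repro):

```python
import sys
from array import array

def make_ints(base: int, big_endian: bool, n: int = 1_000_000) -> bytes:
    """Serialize n incrementing 4-byte ints starting at base."""
    a = array('i', (base + k for k in range(n)))
    # array('i') uses host byte order; swap when it differs from the target.
    if big_endian != (sys.byteorder == 'big'):
        a.byteswap()
    return a.tobytes()

# e.g. the little-endian, base-0 case from the table:
data = make_ints(0, big_endian=False)
print(len(data))  # 4 bytes per int on CPython, so 4,000,000 bytes
```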

Expected behavior

I know this isn't necessarily the target use case for zstd, but it still seemed to do much worse than I expected. In particular:

  1. The sensitivity to endianness is very surprising to me. It might indicate either a bug or a low hanging fruit for improvement.
  2. The stark change going from level 17 to level 18 for very regular data like this implies that there is some feature enabled at 18 that isn't enabled earlier. Maybe it should be? And if not, it would be nice to know what it is so that use cases where this is likely can opt in to it separately without also using larger window sizes and the like.
  3. I was actually more surprised by how little xz was affected by the choice of base value at its default level, but the extent to which the base value affected zstd seemed excessive.
  4. It seemed interesting that the best case for xz -1 was also the worst case for zstd -18.
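
As a possibly relevant aside on points 1 and 4: xz ships a delta filter (`--delta=dist=N`) designed for fixed-stride numeric data, and applying an equivalent byte-stride delta before a generic LZ stage makes this input almost free to compress in either byte order. A sketch using stdlib zlib as the stand-in compressor (an illustration of the data's structure, not a description of what zstd or xz actually did in the tests above):

```python
import sys
import zlib
from array import array

a = array('i', range(100_000))  # incrementing 4-byte ints, scaled down
if sys.byteorder != 'little':
    a.byteswap()
raw_le = a.tobytes()
a.byteswap()
raw_be = a.tobytes()

def delta(data: bytes, dist: int = 4) -> bytes:
    # Difference each byte against the byte `dist` positions back (mod 256),
    # keeping the first `dist` bytes verbatim so the transform is reversible.
    return data[:dist] + bytes(
        (data[i] - data[i - dist]) & 0xFF for i in range(dist, len(data))
    )

for name, data in (("little", raw_le), ("big", raw_be)):
    plain = len(zlib.compress(data, 6)) / len(data)
    filtered = len(zlib.compress(delta(data), 6)) / len(data)
    print(f"{name}-endian: raw {plain:.2%}, delta dist=4 {filtered:.2%}")
```

With dist=4 each byte lane of the int is differenced against itself, so the stream collapses to mostly `01 00 00 00` (or `00 00 00 01` for big-endian) regardless of base value; that is roughly the regularity that LZMA's modelling appears to exploit on its own.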

Screenshots and charts
None

Desktop (please complete the following information):

  • OS: Arch Linux in WSL2
  • Version: *** zstd command line interface 64-bits v1.5.1, by Yann Collet ***
  • Compiler/Flags: I used the default arch packages. I think they are using gcc -O2.
  • Other relevant hardware specs: 8-core i9-9880H
  • Build system: pacman

