
higher zstd compression level resulting in larger compressed data #3793

Description

@revintec

Using zstd 1.5.5, the latest version as of writing:

prepare an int array (each int occupies 4 bytes, little-endian) containing [0, 30, 60, 90, ...] — 65536 ints, 65536*4 bytes — then compress it at various compression levels (simple compression, no dictionary):
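For reference, the input buffer described above can be reconstructed with a short stdlib-only Python sketch (the variable names are mine, not from the report; the compression step itself is library-specific and omitted here):

```python
import struct

# 65536 little-endian 32-bit unsigned ints: 0, 30, 60, 90, ...
data = struct.pack("<65536I", *range(0, 65536 * 30, 30))

# Sanity check against the "original size" line in the log below.
assert len(data) == 262144
```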

2023-10-17 02:14:22.792 TRACE original size: 262144
2023-10-17 02:14:22.851 TRACE level 0 204162
2023-10-17 02:14:22.852 TRACE level 1 204103
2023-10-17 02:14:22.852 TRACE level 2 204118
2023-10-17 02:14:22.853 TRACE level 3 204162
2023-10-17 02:14:22.854 TRACE level 4 204136
2023-10-17 02:14:22.856 TRACE level 5 204147
2023-10-17 02:14:22.858 TRACE level 6 204141
2023-10-17 02:14:22.860 TRACE level 7 204161
2023-10-17 02:14:22.862 TRACE level 8 204161
2023-10-17 02:14:22.863 TRACE level 9 204161
2023-10-17 02:14:22.865 TRACE level 10 204161
2023-10-17 02:14:22.868 TRACE level 11 204165
2023-10-17 02:14:22.871 TRACE level 12 204161
2023-10-17 02:14:22.877 TRACE level 13 204143
2023-10-17 02:14:22.893 TRACE level 14 83240
2023-10-17 02:14:22.907 TRACE level 15 83240
2023-10-17 02:14:22.923 TRACE level 16 83242
2023-10-17 02:14:22.940 TRACE level 17 83242
2023-10-17 02:14:22.958 TRACE level 18 142849
2023-10-17 02:14:22.976 TRACE level 19 142849
2023-10-17 02:14:22.998 TRACE level 20 142849
2023-10-17 02:14:23.017 TRACE level 21 142849
2023-10-17 02:14:23.035 TRACE level 22 142849

As seen in the output above, a higher compression level (18 and up) starts producing larger compressed data.

  1. Is this in line with expectations? I thought a higher compression level should result in smaller compressed data, but here it is over 70% larger (142849 vs. 83240 bytes). How can I produce the smallest possible output (ignoring compression time and/or memory consumption)?
    -- a search for "compression level size" in the issues turns up nothing relevant on the first page, and neither does Google :( sorry if this has already been brought up

And there's a related question I'm putting into the same issue (forgive me :)

  2. The input data is relatively simple (low entropy), so why isn't it compressed more? Are there any tweaks/flags I should enable? The original data is a TSDB timestamp series, and I'd prefer not to transform it (rearranging bytes or manually applying delta compression). Is there a recommended way to handle this semi-arithmetic-progression case? (n.b. the delta is not always the same; it may be 30, 30, 30, 300, 300, 30, 3600, 3600, 86400, 30, 30)
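To illustrate why delta preprocessing is usually suggested for this kind of series (even though the report prefers to avoid it), here is a stdlib-only sketch. The mixed-cadence deltas are modeled on the 30/300/3600/86400 values mentioned above, and zlib stands in for zstd since it ships with Python; the point is the preprocessing, not the specific codec:

```python
import struct
import zlib

# Hypothetical timestamp series with mixed cadences, repeated many times.
deltas = [30, 30, 30, 300, 300, 30, 3600, 3600, 86400, 30, 30] * 5000
timestamps = [0]
for d in deltas:
    timestamps.append(timestamps[-1] + d)

raw = struct.pack("<%dq" % len(timestamps), *timestamps)      # 64-bit LE timestamps
delta_encoded = struct.pack("<%dq" % len(deltas), *deltas)    # 64-bit LE deltas

# The delta stream is a short repeating pattern, so a generic LZ codec
# compresses it far better than the ever-increasing raw timestamps.
print("raw:", len(zlib.compress(raw)), "delta:", len(zlib.compress(delta_encoded)))
```

The same effect applies with zstd: a generic match-finder cannot model "each value is the previous plus a small step", but after delta encoding the stream becomes highly repetitive.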
