Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
23 changes: 16 additions & 7 deletions doc/zstd_compression_format.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@ Distribution of this document is unlimited.

### Version

0.3.7 (2020-12-09)
0.3.8 (2023-02-18)


Introduction
Expand Down Expand Up @@ -470,6 +470,7 @@ This field uses 2 lowest bits of first byte, describing 4 different block types
repeated `Regenerated_Size` times.
- `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
starting with a Huffman tree description.
In this mode, there are at least 2 different literals represented in the Huffman tree description.
See details below.
- `Treeless_Literals_Block` - This is a Huffman-compressed block,
using Huffman tree _from previous Huffman-compressed literals block_.
Expand Down Expand Up @@ -566,6 +567,7 @@ or from a dictionary.

### `Huffman_Tree_Description`
This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
The tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256.
The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
The size of `Huffman_Tree_Description` is determined during decoding process,
it must be used to determine where streams begin.
Expand Down Expand Up @@ -1197,7 +1199,7 @@ Huffman Coding
--------------
Zstandard Huffman-coded streams are read backwards,
similar to the FSE bitstreams.
Therefore, to find the start of the bitstream, it is therefore to
Therefore, to find the start of the bitstream, it is required to
know the offset of the last byte of the Huffman-coded stream.

After writing the last bit containing information, the compressor
Expand Down Expand Up @@ -1239,9 +1241,15 @@ Transformation from `Weight` to `Number_of_Bits` follows this formula :
```
Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
```
The last symbol's `Weight` is deduced from previously decoded ones,
by completing to the nearest power of 2.
This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
When a literal value is not present, it receives a `Weight` of 0.
The least frequent symbol receives a `Weight` of 1.
Consequently, the `Weight` 1 is necessarily present.
The most frequent symbol receives a `Weight` anywhere between 1 and 11 (max).
The last symbol's `Weight` is deduced from previously retrieved Weights,
by completing to the nearest power of 2. It's necessarily non 0.
If it's not possible to reach a clean power of 2 with a single `Weight` value,
the Huffman Tree Description is considered invalid.
This final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
`Max_Number_of_Bits` must be <= 11,
otherwise the representation is considered corrupted.

Expand All @@ -1254,7 +1262,7 @@ Let's presume the following Huffman tree must be described :

The tree depth is 4, since its longest elements uses 4 bits
(longest elements are the one with smallest frequency).
Value `5` will not be listed, as it can be determined from values for 0-4,
Literal value `5` will not be listed, as it can be determined from previous values 0-4,
nor will values above `5` as they are all 0.
Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
Weight formula is :
Expand All @@ -1274,7 +1282,7 @@ The `Weight` of `5` can be determined by advancing to the next power of 2.
The sum of `2^(Weight-1)` (excluding 0's) is :
`8 + 4 + 2 + 0 + 1 = 15`.
Nearest larger power of 2 value is 16.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 16-15 = 1`.
Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`.

#### Huffman Tree header

Expand Down Expand Up @@ -1683,6 +1691,7 @@ or at least provide a meaningful error code explaining for which reason it canno

Version changes
---------------
- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
- 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878
- 0.3.6 : clarifications for Dictionary_ID
- 0.3.5 : clarifications for Block_Maximum_Size
Expand Down