Commit e1ab691

Merge pull request #3514 from facebook/spec_huffman
Clarify zstd specification for Huffman blocks
2 parents 395a2c5 + 832f559 commit e1ab691

1 file changed: 16 additions and 7 deletions
doc/zstd_compression_format.md

Lines changed: 16 additions & 7 deletions
@@ -16,7 +16,7 @@ Distribution of this document is unlimited.
 
 ### Version
 
-0.3.7 (2020-12-09)
+0.3.8 (2023-02-18)
 
 
 Introduction
@@ -470,6 +470,7 @@ This field uses 2 lowest bits of first byte, describing 4 different block types
 repeated `Regenerated_Size` times.
 - `Compressed_Literals_Block` - This is a standard Huffman-compressed block,
 starting with a Huffman tree description.
+In this mode, there are at least 2 different literals represented in the Huffman tree description.
 See details below.
 - `Treeless_Literals_Block` - This is a Huffman-compressed block,
 using Huffman tree _from previous Huffman-compressed literals block_.
@@ -566,6 +567,7 @@ or from a dictionary.
 
 ### `Huffman_Tree_Description`
 This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`).
+The tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256.
 The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description).
 The size of `Huffman_Tree_Description` is determined during decoding process,
 it must be used to determine where streams begin.
@@ -1197,7 +1199,7 @@ Huffman Coding
 --------------
 Zstandard Huffman-coded streams are read backwards,
 similar to the FSE bitstreams.
-Therefore, to find the start of the bitstream, it is therefore to
+Therefore, to find the start of the bitstream, it is required to
 know the offset of the last byte of the Huffman-coded stream.
 
 After writing the last bit containing information, the compressor
@@ -1239,9 +1241,15 @@ Transformation from `Weight` to `Number_of_Bits` follows this formula :
 ```
 Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0
 ```
-The last symbol's `Weight` is deduced from previously decoded ones,
-by completing to the nearest power of 2.
-This power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
+When a literal value is not present, it receives a `Weight` of 0.
+The least frequent symbol receives a `Weight` of 1.
+Consequently, the `Weight` 1 is necessarily present.
+The most frequent symbol receives a `Weight` anywhere between 1 and 11 (max).
+The last symbol's `Weight` is deduced from previously retrieved Weights,
+by completing to the nearest power of 2. It's necessarily non 0.
+If it's not possible to reach a clean power of 2 with a single `Weight` value,
+the Huffman Tree Description is considered invalid.
+This final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree.
 `Max_Number_of_Bits` must be <= 11,
 otherwise the representation is considered corrupted.
 
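The deduction rule added in the hunk above can be sketched in Python as follows. This is a minimal illustration, not code from the zstd repository: the names `deduce_last_weight` and `MAX_NUMBER_OF_BITS` are hypothetical, and the sketch assumes that "completing to the nearest power of 2" means the smallest power of 2 strictly greater than the sum of `2^(Weight-1)` over the listed non-zero weights, which is how the worked example later in this diff treats it.

```python
MAX_NUMBER_OF_BITS = 11  # limit stated in the spec text above


def deduce_last_weight(listed_weights):
    """Return (last_weight, max_number_of_bits) for the listed weights.

    Raises ValueError when the description is invalid under the rules
    quoted in the hunk above. Hypothetical helper, for illustration only.
    """
    # Each non-zero Weight contributes 2^(Weight-1); Weight 0 contributes nothing.
    total = sum(1 << (w - 1) for w in listed_weights if w > 0)
    if total == 0:
        raise ValueError("no non-zero weight listed")

    # Smallest power of 2 strictly greater than total is 2^bit_length(total).
    max_number_of_bits = total.bit_length()
    if max_number_of_bits > MAX_NUMBER_OF_BITS:
        raise ValueError("Max_Number_of_Bits exceeds 11")

    # The gap must itself be a power of 2, otherwise no single Weight fills it.
    rest = (1 << max_number_of_bits) - total
    if rest & (rest - 1):
        raise ValueError("cannot reach a power of 2 with a single Weight")

    # rest == 2^(Weight-1), so Weight == log2(rest) + 1 == bit_length(rest).
    return rest.bit_length(), max_number_of_bits
```

Because `rest` is always at least 1, the deduced weight is never 0, which matches the "necessarily non 0" remark in the added text.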
@@ -1254,7 +1262,7 @@ Let's presume the following Huffman tree must be described :
 
 The tree depth is 4, since its longest elements uses 4 bits
 (longest elements are the one with smallest frequency).
-Value `5` will not be listed, as it can be determined from values for 0-4,
+Literal value `5` will not be listed, as it can be determined from previous values 0-4,
 nor will values above `5` as they are all 0.
 Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`.
 Weight formula is :
@@ -1274,7 +1282,7 @@ The `Weight` of `5` can be determined by advancing to the next power of 2.
 The sum of `2^(Weight-1)` (excluding 0's) is :
 `8 + 4 + 2 + 0 + 1 = 15`.
 Nearest larger power of 2 value is 16.
-Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = 16-15 = 1`.
+Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`.
 
 #### Huffman Tree header
 
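To make the arithmetic in the hunk above concrete, the sketch below recomputes `Weight[5]` with the corrected formula and then applies the `Number_of_Bits` formula quoted earlier in this diff. The weights `4, 3, 2, 0, 1` for literals `0` to `4` are inferred from the sum `8 + 4 + 2 + 0 + 1`; the helper name `weights_to_bits` is hypothetical and the code is only an illustration.

```python
import math


def weights_to_bits(weights, max_number_of_bits):
    """Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0"""
    return [max_number_of_bits + 1 - w if w else 0 for w in weights]


listed = [4, 3, 2, 0, 1]                        # weights of literals 0-4 (inferred)
total = sum(2 ** (w - 1) for w in listed if w)  # 8 + 4 + 2 + 1 = 15
next_power = 16                                 # nearest larger power of 2
max_number_of_bits = int(math.log2(next_power))       # 4
last_weight = int(math.log2(next_power - total)) + 1  # log_2(16 - 15) + 1 = 1

bits = weights_to_bits(listed + [last_weight], max_number_of_bits)
print(bits)  # [1, 2, 3, 0, 4, 4]
```

The old wording `16-15 = 1` only gave the right answer because the gap happened to be 1; with a gap of, say, 4, the weight would be `log_2(4) + 1 = 3`, which is why the commit spells out the logarithm.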
@@ -1683,6 +1691,7 @@ or at least provide a meaningful error code explaining for which reason it canno
 
 Version changes
 ---------------
+- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
 - 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878
 - 0.3.6 : clarifications for Dictionary_ID
 - 0.3.5 : clarifications for Block_Maximum_Size
