Commit a8b86d0

refactor documentation of the FSE decoding table build process
1 parent 75b0f5f commit a8b86d0

1 file changed

+100
-79
lines changed

doc/zstd_compression_format.md

Lines changed: 100 additions & 79 deletions
@@ -16,7 +16,7 @@ Distribution of this document is unlimited.
1616

1717
### Version
1818

19-
0.4.0 (2023-06-05)
19+
0.4.2 (2024-10-02)
2020

2121

2222
Introduction
@@ -1038,53 +1038,54 @@ and to compress Huffman headers.
10381038
FSE
10391039
---
10401040
FSE, short for Finite State Entropy, is an entropy codec based on [ANS].
1041-
FSE encoding/decoding involves a state that is carried over between symbols,
1042-
so decoding must be done in the opposite direction as encoding.
1041+
FSE encoding/decoding involves a state that is carried over between symbols.
1042+
Decoding must be done in the opposite direction from encoding.
10431043
Therefore, all FSE bitstreams are read from end to beginning.
10441044
Note that the order of the bits in the stream is not reversed,
1045-
we just read the elements in the reverse order they are written.
1045+
we just read the multi-bit elements in the reverse of the order in which they were encoded.
10461046

10471047
For additional details on FSE, see [Finite State Entropy].
10481048

10491049
[Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/
10501050

1051-
FSE decoding involves a decoding table which has a power of 2 size, and contain three elements:
1051+
FSE decoding is directed by a decoding table with a power of 2 size, each row containing three elements:
10521052
`Symbol`, `Num_Bits`, and `Baseline`.
10531053
The `log2` of the table size is its `Accuracy_Log`.
10541054
An FSE state value represents an index in this table.
10551055

10561056
To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value.
1057-
The next symbol in the stream is the `Symbol` indicated in the table for that state.
1057+
The first symbol in the stream is the `Symbol` indicated in the table for that state.
10581058
To obtain the next state value,
10591059
the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`.
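
The state transition described above can be sketched in C as follows. This is only an illustration: the `FseRow` type and the function names are hypothetical, and reading bits from the (backward-read) bitstream is deliberately left to the caller, which passes in the value of the `Num_Bits` bits it has just consumed.

```c
#include <assert.h>
#include <stdint.h>

/* One row of a decoding table, holding the three fields named in the text.
 * The type name and layout are assumptions made for this sketch. */
typedef struct {
    uint8_t  symbol;    /* Symbol emitted when the state points at this row */
    uint8_t  num_bits;  /* Num_Bits to consume to compute the next state */
    uint16_t baseline;  /* Baseline added to the value of those bits */
} FseRow;

/* The symbol for the current state is a plain table lookup. */
static uint8_t fse_peek_symbol(const FseRow *table, unsigned table_size,
                               unsigned state)
{
    assert(state < table_size);
    return table[state].symbol;
}

/* Next state = Baseline + value of the Num_Bits bits read by the caller. */
static unsigned fse_next_state(const FseRow *table, unsigned table_size,
                               unsigned state, unsigned bits_value)
{
    assert(state < table_size);
    assert(bits_value < (1u << table[state].num_bits));
    return table[state].baseline + bits_value;
}
```

A decoder loop would alternate between the two calls: look up the symbol for the current state, read `num_bits` bits, then move to the next state.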
10601060

10611061
[ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
10621062

10631063
### FSE Table Description
1064-
To decode FSE streams, it is necessary to construct the decoding table.
1065-
The Zstandard format encodes FSE table descriptions as follows:
1064+
To decode an FSE bitstream, it is necessary to build its FSE decoding table.
1065+
The decoding table is derived from a distribution of Probabilities.
1066+
The Zstandard format encodes distributions of Probabilities as follows:
10661067

1067-
An FSE distribution table describes the probabilities of all symbols
1068-
from `0` to the last present one (included)
1069-
on a normalized scale of `1 << Accuracy_Log` .
1070-
Note that there must be two or more symbols with nonzero probability.
1071-
1072-
It's a bitstream which is read forward, in __little-endian__ fashion.
1073-
It's not necessary to know bitstream exact size,
1074-
it will be discovered and reported by the decoding process.
1068+
The distribution of probabilities is described in a bitstream which is read forward,
1069+
in __little-endian__ fashion.
1070+
The number of bytes consumed from the bitstream to describe the distribution
1071+
is discovered at the end of the decoding process.
10751072

1076-
The bitstream starts by reporting on which scale it operates.
1073+
The bitstream starts by reporting on which scale the distribution operates.
10771074
Let `low4Bits` designate the lowest 4 bits of the first byte:
10781075
`Accuracy_Log = low4Bits + 5`.
10791076

1080-
Then follows each symbol value, from `0` to last present one.
1081-
The number of bits used by each field is variable.
1077+
An FSE distribution table describes the probabilities of all symbols
1078+
from `0` to the last present one (included) in natural order.
1079+
The sum of probabilities is normalized to reach a power of 2 total of `1 << Accuracy_Log` .
1080+
There must be two or more symbols with non-zero probabilities.
1081+
1082+
The number of bits used to decode each probability is variable.
10821083
It depends on :
10831084

10841085
- Remaining probabilities + 1 :
10851086
__example__ :
10861087
Presuming an `Accuracy_Log` of 8,
1087-
and presuming 100 probabilities points have already been distributed,
1088+
and presuming 100 probability points have already been distributed,
10881089
the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive).
10891090
Therefore, it may read up to `log2sup(157) == 8` bits, where `log2sup(N)`
10901091
is the smallest integer `T` that satisfies `(1 << T) > N`.
@@ -1098,115 +1099,133 @@ It depends on :
10981099
values from 98 to 157 use 8 bits.
10991100
This is achieved through this scheme :
11001101

1101-
| Value read | Value decoded | Number of bits used |
1102-
| ---------- | ------------- | ------------------- |
1103-
| 0 - 97 | 0 - 97 | 7 |
1104-
| 98 - 127 | 98 - 127 | 8 |
1105-
| 128 - 225 | 0 - 97 | 7 |
1106-
| 226 - 255 | 128 - 157 | 8 |
1102+
| 8-bit field read | Value decoded | Nb of bits consumed |
1103+
| ---------------- | ------------- | ------------------- |
1104+
| 0 - 97 | 0 - 97 | 7 |
1105+
| 98 - 127 | 98 - 127 | 8 |
1106+
| 128 - 225 | 0 - 97 | 7 |
1107+
| 226 - 255 | 128 - 157 | 8 |
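
The scheme in the table above can be sketched in C as follows. This is a minimal illustration, not the reference implementation: the function names are hypothetical, and `field` is assumed to already hold the next `log2sup(max_value)` bits of the stream, with the first-read bit in bit 0.

```c
#include <assert.h>

/* log2sup(n): smallest T such that (1 << T) > n, as defined in the text. */
static unsigned log2sup(unsigned n)
{
    unsigned t = 0;
    while ((1u << t) <= n) t++;
    return t;
}

/* Decode one value from a field of log2sup(max_value) bits.
 * max_value is "remaining probabilities + 1" (157 in the example).
 * Returns the decoded value; *bits_used receives the bits actually consumed. */
static unsigned decode_value(unsigned field, unsigned max_value,
                             unsigned *bits_used)
{
    assert(max_value >= 1);
    unsigned bits      = log2sup(max_value);            /* 8 in the example  */
    unsigned low_mask  = (1u << (bits - 1)) - 1;        /* 0x7F              */
    unsigned threshold = (1u << bits) - 1 - max_value;  /* 255 - 157 = 98    */
    field &= (1u << bits) - 1;

    if ((field & low_mask) < threshold) {
        /* Small values only need bits-1 bits; the top bit is not consumed. */
        *bits_used = bits - 1;
        return field & low_mask;
    }
    *bits_used = bits;
    /* Fields above low_mask map to the upper part of the value range. */
    return (field > low_mask) ? field - threshold : field;
}
```

With `max_value = 157`, a field of `226` decodes to `128` using 8 bits, while a field of `128` decodes to `0` using 7 bits, matching the rows of the table.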
11071108

1108-
Symbols probabilities are read one by one, in order.
1109+
Probability is derived from Value decoded using the following formula:
1110+
`Probability = Value - 1`
11091111

1110-
Probability is obtained from Value decoded by following formula :
1111-
`Proba = value - 1`
1112+
Consequently, a Probability of `0` is described by a Value `1`.
11121113

1113-
It means value `0` becomes negative probability `-1`.
1114-
`-1` is a special probability, which means "less than 1".
1115-
Its effect on distribution table is described in the [next section].
1116-
For the purpose of calculating total allocated probability points, it counts as one.
1114+
A Value `0` is used to signal a special case, named "Probability `-1`".
1115+
It describes a probability which should have been "less than 1".
1116+
Its effect on the decoding table building process is described in the [next section].
1117+
For the purpose of counting total allocated probability points, it counts as one.
11171118

11181119
[next section]:#from-normalized-distribution-to-decoding-tables
11191120

1120-
When a symbol has a __probability__ of `zero`,
1121+
Symbol probabilities are read one by one, in order.
1122+
After each probability is decoded, the total number of probability points is updated.
1123+
This is used to determine how many bits must be read to decode the probability of the next symbol.
1124+
1125+
When a symbol has a __probability__ of `zero` (decoded from reading a Value `1`),
11211126
it is followed by a 2-bit repeat flag.
11221127
This repeat flag tells how many zero probabilities follow the current one.
11231128
It provides a number ranging from 0 to 3.
11241129
If it is a 3, another 2-bit repeat flag follows, and so on.
11251130

1126-
When last symbol reaches cumulated total of `1 << Accuracy_Log`,
1127-
decoding is complete.
1128-
If this process results in a non-zero probability for a value outside of the
1129-
valid range of values that the FSE table is defined for, even if that value is
1130-
not used, then the data is considered corrupted. In the case of offset codes,
1131-
a decoder implementation may reject a frame containing a non-zero probability
1132-
for an offset code larger than the largest offset code supported by the decoder
1133-
implementation.
1131+
When the Probability for a symbol makes the cumulative total reach `1 << Accuracy_Log`,
1132+
then it's the last symbol, and decoding is complete.
11341133

11351134
Then the decoder can tell how many bytes were used in this process,
11361135
and how many symbols are present.
11371136
The bitstream consumes a round number of bytes.
11381137
Any remaining bit within the last byte is just unused.
11391138

1139+
If this process results in a non-zero probability for a symbol outside of the
1140+
valid range of symbols that the FSE table is defined for, even if that symbol is
1141+
not used, then the data is considered corrupted.
1142+
For the specific case of offset codes,
1143+
a decoder implementation may reject a frame containing a non-zero probability
1144+
for an offset code larger than the largest offset code supported by the decoder
1145+
implementation.
1146+
11401147
#### From normalized distribution to decoding tables
11411148

1142-
The distribution of normalized probabilities is enough
1149+
The normalized distribution of probabilities is enough
11431150
to create a unique decoding table.
1144-
1145-
It follows the following build rule :
1151+
It is generated using the following build rule :
11461152

11471153
The table has a size of `Table_Size = 1 << Accuracy_Log`.
1148-
Each cell describes the symbol decoded,
1149-
and instructions to get the next state (`Number_of_Bits` and `Baseline`).
1154+
Each row specifies the decoded symbol,
1155+
and instructions to reach the next state (`Number_of_Bits` and `Baseline`).
11501156

1151-
Symbols are scanned in their natural order for "less than 1" probabilities.
1152-
Symbols with this probability are being attributed a single cell,
1157+
Symbols are first scanned in their natural order for "less than 1" probabilities
1158+
(previously decoded from a Value of `0`).
1159+
Symbols with this special probability are each attributed a single row,
11531160
starting from the end of the table and retreating.
11541161
These symbols define a full state reset, reading `Accuracy_Log` bits.
11551162

1156-
Then, all remaining symbols, sorted in natural order, are allocated cells.
1157-
Starting from symbol `0` (if it exists), and table position `0`,
1158-
each symbol gets allocated as many cells as its probability.
1159-
Cell allocation is spread, not linear :
1160-
each successor position follows this rule :
1163+
Then, all remaining symbols, sorted in natural order, are allocated rows.
1164+
Starting from the smallest present symbol, and table position `0`,
1165+
each symbol gets allocated as many rows as its probability.
11611166

1167+
Row allocation is not linear; it follows this order, in modular arithmetic:
11621168
```
11631169
position += (tableSize>>1) + (tableSize>>3) + 3;
11641170
position &= tableSize-1;
11651171
```
11661172

1167-
A position is skipped if already occupied by a "less than 1" probability symbol.
1168-
`position` does not reset between symbols, it simply iterates through
1169-
each position in the table, switching to the next symbol when enough
1170-
states have been allocated to the current one.
1173+
Using the above ordering rule, each symbol gets allocated as many rows as its probability.
1174+
If a position is already occupied by a "less than 1" probability symbol,
1175+
it is simply skipped, and the next position is allocated instead.
1176+
Once enough rows have been allocated for the current symbol,
1177+
the allocation process continues, using the next symbol, in natural order.
1178+
This process guarantees that the table is entirely and exactly filled.
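
The two allocation passes described above can be sketched in C as follows. This is an illustration under stated assumptions: the function names and the `-1` error return are hypothetical, probabilities are passed as a plain array in which `-1` stands for "less than 1", and the `5..20` bound on `accuracy_log` is simply the range a 4-bit field plus 5 can express.

```c
#include <assert.h>
#include <stdint.h>

/* The position step from the text, kept as its own function. */
static unsigned next_position(unsigned position, unsigned table_size)
{
    position += (table_size >> 1) + (table_size >> 3) + 3;
    return position & (table_size - 1);
}

/* Fill symbols[0 .. table_size) from normalized probabilities.
 * probs[s] is -1 ("less than 1"), 0, or a positive number of rows.
 * Returns 0 on success, -1 if the probabilities do not sum to the table size. */
static int spread_symbols(const int16_t *probs, unsigned num_symbols,
                          unsigned accuracy_log, uint8_t *symbols)
{
    assert(accuracy_log >= 5 && accuracy_log <= 20);
    assert(num_symbols <= 256);
    unsigned table_size = 1u << accuracy_log;
    unsigned high = table_size;  /* rows at or above `high` are reserved */
    unsigned total = 0;
    unsigned position = 0;

    /* A "less than 1" probability counts as one probability point. */
    for (unsigned s = 0; s < num_symbols; s++) {
        if (probs[s] < -1) return -1;
        total += (probs[s] == -1) ? 1u : (unsigned)probs[s];
    }
    if (total != table_size) return -1;

    /* Pass 1: "less than 1" symbols take single rows from the end, retreating. */
    for (unsigned s = 0; s < num_symbols; s++)
        if (probs[s] == -1) symbols[--high] = (uint8_t)s;

    /* Pass 2: remaining symbols, in natural order, starting at position 0. */
    for (unsigned s = 0; s < num_symbols; s++) {
        for (int i = 0; i < probs[s]; i++) {
            symbols[position] = (uint8_t)s;
            do {  /* skip rows already given to "less than 1" symbols */
                position = next_position(position, table_size);
            } while (position >= high);
        }
    }
    assert(position == 0);  /* the walk returns to its start once the table is full */
    return 0;
}
```

Because the step is odd for any table of at least 16 rows, the walk visits every row exactly once before returning to position `0`, which is why the final position can serve as a consistency check.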
11711179

1172-
The process guarantees that the table is entirely filled.
1173-
Each cell corresponds to a state value, which contains the symbol being decoded.
1180+
Each row specifies a decoded symbol, and is accessed by the current state value.
1181+
It also specifies `Number_of_Bits` and `Baseline`, which are required to determine the next state value.
11741182

1175-
To add the `Number_of_Bits` and `Baseline` required to retrieve next state,
1176-
it's first necessary to sort all occurrences of each symbol in state order.
1177-
Lower states will need 1 more bit than higher ones.
1178-
The process is repeated for each symbol.
1183+
To correctly set these fields, it's necessary to sort all occurrences of each symbol in state value order,
1184+
and then attribute N+1 bits to lower rows, and N bits to higher rows,
1185+
following the process described below (using an example):
11791186

11801187
__Example__ :
1181-
Presuming a symbol has a probability of 5,
1182-
it receives 5 cells, corresponding to 5 state values.
1183-
These state values are then sorted in natural order.
1188+
Presuming an `Accuracy_Log` of 7,
1189+
let's imagine a symbol with a Probability of 5:
1190+
it receives 5 rows, corresponding to 5 state values between `0` and `127`.
1191+
1192+
In this example, the first state value happens to be `1` (after unspecified previous symbols).
1193+
The next 4 states are then determined using the above modular arithmetic rule,
1194+
which specifies adding `64+16+3 = 83` modulo `128` to jump to the next position,
1195+
producing the following series: `1`, `84`, `39`, `122`, `77`.
1196+
(note: the next symbol will then start at `32`).
11841197

1185-
Next power of 2 after 5 is 8.
1186-
Space of probabilities must be divided into 8 equal parts.
1187-
Presuming the `Accuracy_Log` is 7, it defines a space of 128 states.
1188-
Divided by 8, each share is 16 large.
1198+
These state values are then sorted in natural order,
1199+
resulting in the following series: `1`, `39`, `77`, `84`, `122`.
11891200

1190-
In order to reach 8 shares, 8-5=3 lowest states will count "double",
1201+
The next power of 2 after 5 is 8.
1202+
Therefore, the probability space will be divided into 8 equal parts.
1203+
Since the probability space is `1<<7 = 128` large, each share is `128/8 = 16` large.
1204+
1205+
In order to reach 8 shares, the `8-5 = 3` lowest states will count "double",
11911206
doubling their shares (32 in width), hence requiring one more bit.
11921207

1193-
Baseline is assigned starting from the higher states using fewer bits,
1194-
increasing at each state, then resuming at the first state,
1195-
each state takes its allocated width from Baseline.
1208+
Baseline is assigned starting from the lowest state using fewer bits,
1209+
continuing in natural state order, looping back at the beginning.
1210+
Each state takes its allocated range from Baseline, sized by its `Number_of_Bits`.
11961211

11971212
| state order | 0 | 1 | 2 | 3 | 4 |
11981213
| ---------------- | ----- | ----- | ------ | ---- | ------ |
11991214
| state value | 1 | 39 | 77 | 84 | 122 |
12001215
| width | 32 | 32 | 32 | 16 | 16 |
12011216
| `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 |
1202-
| range number | 2 | 4 | 6 | 0 | 1 |
1217+
| allocation order | 3 | 4 | 5 | 1 | 2 |
12031218
| `Baseline` | 32 | 64 | 96 | 0 | 16 |
12041219
| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
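
The `Number_of_Bits` and `Baseline` rows of the table above can be reproduced with a short closed form, sketched here in C. This is an assumption-laden illustration rather than the specification's own wording: the function names are hypothetical, and `k` is the rank of the row among the symbol's rows once they are sorted by state value (the "state order" row of the table).

```c
#include <assert.h>

/* floor(log2(n)) for n >= 1. */
static unsigned highest_set_bit(unsigned n)
{
    unsigned b = 0;
    while (n >>= 1) b++;
    return b;
}

/* Transition fields for the k-th row (k = 0 .. probability-1, in increasing
 * state value order) of a symbol that owns `probability` rows. */
static void fse_row_transition(unsigned accuracy_log, unsigned probability,
                               unsigned k,
                               unsigned *num_bits, unsigned *baseline)
{
    assert(probability >= 1 && k < probability);
    assert(probability <= (1u << accuracy_log));
    unsigned x = probability + k;  /* 5, 6, 7, 8, 9 in the example */
    /* Rows with x below the next power of 2 get one more bit than the others. */
    *num_bits = accuracy_log - highest_set_bit(x);
    /* Ranges are laid out so that the fewer-bits rows start at Baseline 0. */
    *baseline = (x << *num_bits) - (1u << accuracy_log);
}
```

For `accuracy_log = 7` and `probability = 5`, ranks `0` to `4` yield `(5, 32)`, `(5, 64)`, `(5, 96)`, `(4, 0)` and `(4, 16)`, the same values as the table. A symbol with a probability of `1` gets `Number_of_Bits = Accuracy_Log` and `Baseline = 0`.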
12051220

1206-
During decoding, the next state value is determined from current state value,
1207-
by reading the required `Number_of_Bits`, and adding the specified `Baseline`.
1221+
During decoding, the next state value is determined by using the current state value as the row number,
1222+
then reading the required `Number_of_Bits` from the bitstream, and adding the specified `Baseline`.
1223+
1224+
Note:
1225+
as a trivial example, it follows that, for a symbol with a Probability of `1`,
1226+
`Baseline` is necessarily `0`, and `Number_of_Bits` is necessarily `Accuracy_Log`.
12081227

1209-
See [Appendix A] for the results of this process applied to the default distributions.
1228+
See [Appendix A] for the outcome of this process applied to the default distributions.
12101229

12111230
[Appendix A]: #appendix-a---decoding-tables-for-predefined-codes
12121231

@@ -1716,6 +1735,8 @@ or at least provide a meaningful error code explaining for which reason it canno
17161735

17171736
Version changes
17181737
---------------
1738+
- 0.4.2 : refactor FSE table construction process, inspired by Donald Pian
1739+
- 0.4.1 : clarifications on a few error scenarios, by Eric Lasota
17191740
- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov
17201741
- 0.3.9 : clarifications for Huffman-compressed literal sizes.
17211742
- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.
