@@ -16,7 +16,7 @@ Distribution of this document is unlimited.
1616
1717### Version
1818
19- 0.4.0 (2023-06-05 )
19+ 0.4.2 (2024-10-02 )
2020
2121
2222Introduction
@@ -1038,53 +1038,54 @@ and to compress Huffman headers.
10381038FSE
10391039---
10401040FSE, short for Finite State Entropy, is an entropy codec based on [ ANS] .
1041- FSE encoding/decoding involves a state that is carried over between symbols,
1042- so decoding must be done in the opposite direction as encoding.
1041+ FSE encoding/decoding involves a state that is carried over between symbols.
1042+ Decoding must be done in the opposite direction as encoding.
10431043Therefore, all FSE bitstreams are read from end to beginning.
10441044Note that the order of the bits in the stream is not reversed,
1045- we just read the elements in the reverse order they are written .
1045+ we just read each multi-bit element in the reverse order in which it was encoded.
10461046
10471047For additional details on FSE, see [ Finite State Entropy] .
10481048
10491049[ Finite State Entropy ] :https://github.com/Cyan4973/FiniteStateEntropy/
10501050
1051- FSE decoding involves a decoding table which has a power of 2 size, and contain three elements:
1051+ FSE decoding is directed by a decoding table with a power of 2 size, each row containing three elements:
10521052` Symbol ` , ` Num_Bits ` , and ` Baseline ` .
10531053The ` log2 ` of the table size is its ` Accuracy_Log ` .
10541054An FSE state value represents an index in this table.
10551055
10561056To obtain the initial state value, consume ` Accuracy_Log ` bits from the stream as a __ little-endian__ value.
1057- The next symbol in the stream is the ` Symbol ` indicated in the table for that state.
1057+ The first symbol in the stream is the ` Symbol ` indicated in the table for that state.
10581058To obtain the next state value,
10591059the decoder should consume ` Num_Bits ` bits from the stream as a __ little-endian__ value and add it to ` Baseline ` .
10601060
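The state update described above can be sketched as a short loop. This is a minimal illustration, not the reference decoder: `decode_symbols` and the `read_bits` callback are hypothetical names, the callback hides the end-to-beginning bitstream reading, and the sketch assumes the number of symbols is known in advance.

```python
def decode_symbols(table, accuracy_log, read_bits, count):
    """Decode `count` symbols using an FSE decoding table.

    table     -- list of (symbol, num_bits, baseline) rows, indexed by state
    read_bits -- hypothetical callback: read_bits(n) returns the next n bits
                 of the stream as a little-endian integer
    """
    # The initial state is read as Accuracy_Log bits.
    state = read_bits(accuracy_log)
    symbols = []
    for i in range(count):
        symbol, num_bits, baseline = table[state]
        symbols.append(symbol)
        # Assumption of this sketch: no state update after the last symbol.
        if i + 1 < count:
            state = baseline + read_bits(num_bits)
    return symbols
```

Each row lookup yields one symbol, and the `Num_Bits` and `Baseline` of that same row select the next row.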
10611061[ ANS ] : https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems
10621062
10631063### FSE Table Description
1064- To decode FSE streams, it is necessary to construct the decoding table.
1065- The Zstandard format encodes FSE table descriptions as follows:
1064+ To decode an FSE bitstream, it is necessary to build its FSE decoding table.
1065+ The decoding table is derived from a distribution of Probabilities.
1066+ The Zstandard format encodes distributions of Probabilities as follows:
10661067
1067- An FSE distribution table describes the probabilities of all symbols
1068- from ` 0 ` to the last present one (included)
1069- on a normalized scale of ` 1 << Accuracy_Log ` .
1070- Note that there must be two or more symbols with nonzero probability.
1071-
1072- It's a bitstream which is read forward, in __ little-endian__ fashion.
1073- It's not necessary to know bitstream exact size,
1074- it will be discovered and reported by the decoding process.
1068+ The distribution of probabilities is described in a bitstream which is read forward,
1069+ in __ little-endian__ fashion.
1070+ The number of bytes consumed from the bitstream to describe the distribution
1071+ is discovered at the end of the decoding process.
10751072
1076- The bitstream starts by reporting on which scale it operates.
1073+ The bitstream starts by reporting on which scale the distribution operates.
10771074Let ` low4Bits ` designate the lowest 4 bits of the first byte :
10781075` Accuracy_Log = low4Bits + 5 ` .
10791076
1080- Then follows each symbol value, from ` 0 ` to last present one.
1081- The number of bits used by each field is variable.
1077+ An FSE distribution table describes the probabilities of all symbols
1078+ from ` 0 ` to the last present one (included) in natural order.
1079+ The sum of probabilities is normalized to reach a power of 2 total of ` 1 << Accuracy_Log ` .
1080+ There must be two or more symbols with non-zero probabilities.
1081+
1082+ The number of bits used to decode each probability is variable.
10821083It depends on :
10831084
10841085- Remaining probabilities + 1 :
10851086 __ example__ :
10861087 Presuming an ` Accuracy_Log ` of 8,
1087- and presuming 100 probabilities points have already been distributed,
1088+ and presuming 100 probability points have already been distributed,
10881089 the decoder may read any value from ` 0 ` to ` 256 - 100 + 1 == 157 ` (inclusive).
10891090 Therefore, it may read up to ` log2sup(157) == 8 ` bits, where ` log2sup(N) `
10901091 is the smallest integer ` T ` that satisfies ` (1 << T) > N ` .
@@ -1098,115 +1099,133 @@ It depends on :
10981099 values from 98 to 157 use 8 bits.
10991100 This is achieved through this scheme :
11001101
1101- | Value read | Value decoded | Number of bits used |
1102- | ---------- | ------------- | ------------------- |
1103- | 0 - 97 | 0 - 97 | 7 |
1104- | 98 - 127 | 98 - 127 | 8 |
1105- | 128 - 225 | 0 - 97 | 7 |
1106- | 226 - 255 | 128 - 157 | 8 |
1102+ | 8-bit field read | Value decoded | Nb of bits consumed |
1103+ | ---------------- | ------------- | ------------------- |
1104+ | 0 - 97 | 0 - 97 | 7 |
1105+ | 98 - 127 | 98 - 127 | 8 |
1106+ | 128 - 225 | 0 - 97 | 7 |
1107+ | 226 - 255 | 128 - 157 | 8 |
11071108
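The scheme in the table above can be sketched as follows. This is an illustrative reading of the table, not the reference implementation: `decode_value`, `field` and `max_value` are names chosen here, and `field` is assumed to hold the next `log2sup(max_value)` bits of the stream, peeked without being consumed.

```python
def log2sup(n):
    """Smallest integer T that satisfies (1 << T) > n."""
    return n.bit_length()


def decode_value(field, max_value):
    """Return (value, bits_consumed) for one variable-size field.

    max_value is "remaining probabilities + 1" (157 in the example above).
    """
    bits = log2sup(max_value)
    # Number of small values that fit in one bit less (98 in the example).
    threshold = (1 << bits) - 1 - max_value
    low = field & ((1 << (bits - 1)) - 1)
    if low < threshold:
        return low, bits - 1
    value = field & ((1 << bits) - 1)
    if value >= (1 << (bits - 1)):
        # Upper half of the full-size range maps down past the small values.
        value -= threshold
    return value, bits
```

With `max_value = 157`, a field of `128` decodes to Value `0` using 7 bits, and a field of `226` decodes to Value `128` using 8 bits, matching the rows of the table.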
1108- Symbols probabilities are read one by one, in order.
1109+ Probability is derived from Value decoded using the following formula:
1110+ ` Probability = Value - 1 `
11091111
1110- Probability is obtained from Value decoded by following formula :
1111- ` Proba = value - 1 `
1112+ Consequently, a Probability of ` 0 ` is described by a Value ` 1 ` .
11121113
1113- It means value ` 0 ` becomes negative probability ` -1 ` .
1114- ` -1 ` is a special probability, which means "less than 1".
1115- Its effect on distribution table is described in the [ next section] .
1116- For the purpose of calculating total allocated probability points, it counts as one.
1114+ A Value ` 0 ` is used to signal a special case, named "Probability ` -1 ` " .
1115+ It describes a probability which should have been "less than 1".
1116+ Its effect on the decoding table building process is described in the [ next section] .
1117+ For the purpose of counting total allocated probability points, it counts as one.
11171118
11181119[ next section ] :#from-normalized-distribution-to-decoding-tables
11191120
1120- When a symbol has a __ probability__ of ` zero ` ,
1121+ Symbols probabilities are read one by one, in order.
1122+ After each probability is decoded, the total number of probability points is updated.
1123+ This is used to determine how many bits must be read to decode the probability of the next symbol.
1124+
1125+ When a symbol has a __ probability__ of ` zero ` (decoded from reading a Value ` 1 ` ),
11211126it is followed by a 2-bits repeat flag.
11221127This repeat flag tells how many probabilities of zeroes follow the current one.
11231128It provides a number ranging from 0 to 3.
11241129If it is a 3, another 2-bits repeat flag follows, and so on.
11251130
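The repeat flag rule can be sketched in a few lines. This is an illustration under stated assumptions: `count_extra_zeros` is a hypothetical helper, and `flags` stands for the successive 2-bit flag values as they would be read from the bitstream.

```python
def count_extra_zeros(flags):
    """Return (extra_zero_symbols, flags_consumed).

    flags -- iterable of 2-bit repeat flag values (0..3), in stream order
    """
    extra = 0
    consumed = 0
    for flag in flags:
        extra += flag
        consumed += 1
        if flag != 3:
            # Only a flag value of 3 announces another flag.
            break
    return extra, consumed
```

For example, the flag sequence `3, 3, 1` describes 7 additional symbols with a probability of zero, on top of the one that triggered the first flag.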
1126- When last symbol reaches cumulated total of ` 1 << Accuracy_Log ` ,
1127- decoding is complete.
1128- If this process results in a non-zero probability for a value outside of the
1129- valid range of values that the FSE table is defined for, even if that value is
1130- not used, then the data is considered corrupted. In the case of offset codes,
1131- a decoder implementation may reject a frame containing a non-zero probability
1132- for an offset code larger than the largest offset code supported by the decoder
1133- implementation.
1131+ When the Probability for a symbol makes cumulated total reach ` 1 << Accuracy_Log ` ,
1132+ then it's the last symbol, and decoding is complete.
11341133
11351134Then the decoder can tell how many bytes were used in this process,
11361135and how many symbols are present.
11371136The bitstream consumes a round number of bytes.
11381137Any remaining bit within the last byte is just unused.
11391138
1139+ If this process results in a non-zero probability for a symbol outside of the
1140+ valid range of symbols that the FSE table is defined for, even if that symbol is
1141+ not used, then the data is considered corrupted.
1142+ For the specific case of offset codes,
1143+ a decoder implementation may reject a frame containing a non-zero probability
1144+ for an offset code larger than the largest offset code supported by the decoder
1145+ implementation.
1146+
11401147#### From normalized distribution to decoding tables
11411148
1142- The distribution of normalized probabilities is enough
1149+ The normalized distribution of probabilities is enough
11431150to create a unique decoding table.
1144-
1145- It follows the following build rule :
1151+ It is generated using the following build rule :
11461152
11471153The table has a size of ` Table_Size = 1 << Accuracy_Log ` .
1148- Each cell describes the symbol decoded,
1149- and instructions to get the next state (` Number_of_Bits ` and ` Baseline ` ).
1154+ Each row specifies the decoded symbol ,
1155+ and instructions to reach the next state (` Number_of_Bits ` and ` Baseline ` ).
11501156
1151- Symbols are scanned in their natural order for "less than 1" probabilities.
1152- Symbols with this probability are being attributed a single cell,
1157+ Symbols are first scanned in their natural order for "less than 1" probabilities
1158+ (previously decoded from a Value of ` 0 ` ).
1159+ Symbols with this special probability are each attributed a single row,
11531160starting from the end of the table and retreating.
11541161These symbols define a full state reset, reading ` Accuracy_Log ` bits.
11551162
1156- Then, all remaining symbols, sorted in natural order, are allocated cells.
1157- Starting from symbol ` 0 ` (if it exists), and table position ` 0 ` ,
1158- each symbol gets allocated as many cells as its probability.
1159- Cell allocation is spread, not linear :
1160- each successor position follows this rule :
1163+ Then, all remaining symbols, sorted in natural order, are allocated rows.
1164+ Starting from the smallest present symbol, and table position ` 0 ` ,
1165+ each symbol gets allocated as many rows as its probability.
11611166
1167+ Row allocation is not linear; it follows this order, in modular arithmetic:
11621168```
11631169position += (tableSize>>1) + (tableSize>>3) + 3;
11641170position &= tableSize-1;
11651171```
11661172
1167- A position is skipped if already occupied by a "less than 1" probability symbol.
1168- ` position ` does not reset between symbols, it simply iterates through
1169- each position in the table, switching to the next symbol when enough
1170- states have been allocated to the current one.
1173+ Using above ordering rule, each symbol gets allocated as many rows as its probability.
1174+ If a position is already occupied by a "less than 1" probability symbol,
1175+ it is simply skipped, and the next position is allocated instead.
1176+ Once enough rows have been allocated for the current symbol,
1177+ the allocation process continues, using the next symbol, in natural order.
1178+ This process guarantees that the table is entirely and exactly filled.
11711179
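The row allocation process can be sketched as below. It is a minimal illustration of the rules in this section, assuming a valid normalized distribution; `spread_symbols` is a hypothetical name, and `probabilities` is assumed to be a list indexed by symbol, with `-1` standing for "less than 1".

```python
def spread_symbols(probabilities, accuracy_log):
    """Return a list giving the symbol assigned to each table row."""
    table_size = 1 << accuracy_log
    table = [None] * table_size

    # "Less than 1" symbols get one row each, from the end of the table.
    high = table_size - 1
    for symbol, prob in enumerate(probabilities):
        if prob == -1:
            table[high] = symbol
            high -= 1

    step = (table_size >> 1) + (table_size >> 3) + 3
    mask = table_size - 1
    position = 0
    for symbol, prob in enumerate(probabilities):
        for _ in range(max(prob, 0)):
            table[position] = symbol
            position = (position + step) & mask
            # Skip rows already taken by "less than 1" symbols.
            while position > high:
                position = (position + step) & mask
    return table
```

With an `Accuracy_Log` of 7 and the probabilities `[91, 5, 32]` (values picked here only so that the second symbol starts at position `1`), the second symbol lands on rows `1`, `39`, `77`, `84` and `122`.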
1172- The process guarantees that the table is entirely filled .
1173- Each cell corresponds to a state value , which contains the symbol being decoded .
1180+ Each row specifies a decoded symbol, and is accessed by current state value .
1181+ It also specifies ` Number_of_Bits ` and ` Baseline ` , which are required to determine next state value .
11741182
1175- To add the ` Number_of_Bits ` and ` Baseline ` required to retrieve next state,
1176- it's first necessary to sort all occurrences of each symbol in state order.
1177- Lower states will need 1 more bit than higher ones.
1178- The process is repeated for each symbol.
1183+ To correctly set these fields, it's necessary to sort all occurrences of each symbol in state value order,
1184+ and then attribute N+1 bits to lower rows, and N bits to higher rows,
1185+ following the process described below (using an example):
11791186
11801187__ Example__ :
1181- Presuming a symbol has a probability of 5,
1182- it receives 5 cells, corresponding to 5 state values.
1183- These state values are then sorted in natural order.
1188+ Presuming an ` Accuracy_Log ` of 7,
1189+ let's imagine a symbol with a Probability of 5:
1190+ it receives 5 rows, corresponding to 5 state values between ` 0 ` and ` 127 ` .
1191+
1192+ In this example, the first state value happens to be ` 1 ` (after unspecified previous symbols).
1193+ The next 4 states are then determined using above modular arithmetic rule,
1194+ which specifies to add ` 64+16+3 = 83 ` modulo ` 128 ` to jump to next position,
1195+ producing the following series: ` 1 ` , ` 84 ` , ` 39 ` , ` 122 ` , ` 77 ` (modular arithmetic).
1196+ (note: the next symbol will then start at ` 32 ` ).
11841197
1185- Next power of 2 after 5 is 8.
1186- Space of probabilities must be divided into 8 equal parts.
1187- Presuming the ` Accuracy_Log ` is 7, it defines a space of 128 states.
1188- Divided by 8, each share is 16 large.
1198+ These state values are then sorted in natural order,
1199+ resulting in the following series: ` 1 ` , ` 39 ` , ` 77 ` , ` 84 ` , ` 122 ` .
11891200
1190- In order to reach 8 shares, 8-5=3 lowest states will count "double",
1201+ The next power of 2 after 5 is 8.
1202+ Therefore, the probability space will be divided into 8 equal parts.
1203+ Since the probability space is ` 1<<7 = 128 ` large, each share is ` 128/8 = 16 ` large.
1204+
1205+ In order to reach 8 shares, the ` 8-5 = 3 ` lowest states will count "double",
11911206doubling their shares (32 in width), hence requiring one more bit.
11921207
1193- Baseline is assigned starting from the higher states using fewer bits,
1194- increasing at each state, then resuming at the first state,
1195- each state takes its allocated width from Baseline.
1208+ Baseline is assigned starting from the lowest state using fewer bits,
1209+ continuing in natural state order, looping back at the beginning.
1210+ Each state takes its allocated range from Baseline, sized by its ` Number_of_Bits ` .
11961211
11971212| state order | 0 | 1 | 2 | 3 | 4 |
11981213| ---------------- | ----- | ----- | ------ | ---- | ------ |
11991214| state value | 1 | 39 | 77 | 84 | 122 |
12001215| width | 32 | 32 | 32 | 16 | 16 |
12011216| ` Number_of_Bits ` | 5 | 5 | 5 | 4 | 4 |
1202- | range number | 2 | 4 | 6 | 0 | 1 |
1217+ | allocation order | 3 | 4 | 5 | 1 | 2 |
12031218| ` Baseline ` | 32 | 64 | 96 | 0 | 16 |
12041219| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
12051220
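The assignment shown in the table can be reproduced with a short sketch. It follows the share-based description of this section rather than any particular implementation; `assign_bits_and_baselines` is a hypothetical name, and `states` is assumed to be the sorted list of rows allocated to one symbol.

```python
def assign_bits_and_baselines(states, accuracy_log):
    """Map each state of one symbol to (Number_of_Bits, Baseline)."""
    count = len(states)
    # Smallest power of 2 that is >= count (8 for a probability of 5).
    parts = 1 << (count - 1).bit_length()
    # Number of lowest states that count "double".
    num_doubles = parts - count
    narrow_bits = accuracy_log - (parts.bit_length() - 1)

    rows = {}
    baseline = 0
    # Start at the lowest state using fewer bits, then loop back to the start.
    order = list(range(num_doubles, count)) + list(range(num_doubles))
    for i in order:
        bits = narrow_bits + 1 if i < num_doubles else narrow_bits
        rows[states[i]] = (bits, baseline)
        baseline += 1 << bits
    return rows
```

Calling it with the states `[1, 39, 77, 84, 122]` and an `Accuracy_Log` of 7 gives the `Number_of_Bits` and `Baseline` rows of the table. A symbol holding a single state receives `Accuracy_Log` bits and a `Baseline` of `0`.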
1206- During decoding, the next state value is determined from current state value,
1207- by reading the required ` Number_of_Bits ` , and adding the specified ` Baseline ` .
1221+ During decoding, the next state value is determined by using current state value as row number,
1222+ then reading the required ` Number_of_Bits ` from the bitstream, and adding the specified ` Baseline ` .
1223+
1224+ Note:
1225+ as a trivial example, it follows that, for a symbol with a Probability of ` 1 ` ,
1226+ ` Baseline ` is necessarily ` 0 ` , and ` Number_of_Bits ` is necessarily ` Accuracy_Log ` .
12081227
1209- See [ Appendix A] for the results of this process applied to the default distributions.
1228+ See [ Appendix A] for the outcome of this process applied to the default distributions.
12101229
12111230[ Appendix A ] : #appendix-a---decoding-tables-for-predefined-codes
12121231
@@ -1716,6 +1735,8 @@ or at least provide a meaningful error code explaining for which reason it canno
17161735
17171736Version changes
17181737---------------
1738+ - 0.4.2 : refactor FSE table construction process, inspired by Donald Pian
1739+ - 0.4.1 : clarifications on a few error scenarios, by Eric Lasota
17191740- 0.4.0 : fixed imprecise behavior for nbSeq==0, detected by Igor Pavlov
17201741- 0.3.9 : clarifications for Huffman-compressed literal sizes.
17211742- 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions.