Commit 60e376b
huff0: Assembler improvements (#736)
Main changes:
* Compute out[id * dstEvery + i] statically. This shaves four
instructions off the main loops. (It also frees up a register.)
* Track "exhausted" by addition instead of OR. This gets rid of an
additional instruction. The variable is now also zeroed inside the
loop as a dependency hint.
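
As a rough illustration of the two changes above, here is a minimal Go sketch. It is not the assembler from this commit: the function names (`storeDynamic`, `storeStatic`, `anyExhaustedOR`, `anyExhaustedAdd`) and the `remaining`/`need` parameters are hypothetical, and the sketch assumes four interleaved output streams. It only shows why a constant stream id lets the offset be fixed up front, and why summing 0/1 flags yields the same answer as OR-ing them.

```go
package main

import "fmt"

// storeDynamic recomputes id*dstEvery+i for every store, with id as a
// runtime loop variable.
func storeDynamic(out []byte, dstEvery, i int, vals [4]byte) {
	for id, v := range vals {
		out[id*dstEvery+i] = v
	}
}

// storeStatic unrolls over the stream id, so each store uses a fixed
// multiple of dstEvery. In generated assembler the id is known at
// generation time, which is the sense of "statically" assumed here.
func storeStatic(out []byte, dstEvery, i int, vals [4]byte) {
	out[i] = vals[0]
	out[dstEvery+i] = vals[1]
	out[2*dstEvery+i] = vals[2]
	out[3*dstEvery+i] = vals[3]
}

// anyExhaustedOR combines the per-stream "exhausted" flags with OR.
func anyExhaustedOR(remaining [4]int, need int) bool {
	exhausted := false
	for _, r := range remaining {
		exhausted = exhausted || r < need
	}
	return exhausted
}

// anyExhaustedAdd sums the same flags as 0/1 values. The counter starts
// at zero on every call, and a non-zero sum means at least one stream
// ran short, so the result matches the OR version.
func anyExhaustedAdd(remaining [4]int, need int) bool {
	exhausted := 0
	for _, r := range remaining {
		if r < need {
			exhausted++
		}
	}
	return exhausted != 0
}

func main() {
	out := make([]byte, 4*4) // four streams, dstEvery = 4
	storeStatic(out, 4, 1, [4]byte{'a', 'b', 'c', 'd'})
	fmt.Printf("%q\n", out)
	fmt.Println(anyExhaustedAdd([4]int{8, 8, 3, 8}, 4))
}
```

Both pairs of functions are meant to be behaviourally identical. Any saving comes from the instruction sequence the assembler can use, which this Go version does not attempt to reproduce.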
Benchmark results show small speedups on some datasets, mainly in the 4X decoders (up to +5.57%); the geometric mean improves by 0.96%:
```
name old speed new speed delta
Decompress1XTable/digits-8 350MB/s ± 0% 350MB/s ± 1% ~ (p=0.764 n=10+9)
Decompress1XTable/gettysburg-8 270MB/s ± 1% 268MB/s ± 1% -0.72% (p=0.001 n=10+10)
Decompress1XTable/twain-8 329MB/s ± 1% 328MB/s ± 0% ~ (p=0.035 n=10+9)
Decompress1XTable/low-ent.10k-8 387MB/s ± 1% 386MB/s ± 0% ~ (p=0.027 n=10+8)
Decompress1XTable/superlow-ent-10k-8 377MB/s ± 0% 375MB/s ± 0% -0.48% (p=0.000 n=10+10)
Decompress1XTable/crash2-8 17.0MB/s ± 0% 16.9MB/s ± 0% -0.36% (p=0.004 n=9+10)
Decompress1XTable/endzerobits-8 53.3MB/s ± 0% 53.0MB/s ± 0% -0.55% (p=0.000 n=10+9)
Decompress1XTable/endnonzero-8 11.3MB/s ± 0% 11.3MB/s ± 1% ~ (p=0.060 n=10+10)
Decompress1XTable/case1-8 22.0MB/s ± 0% 21.9MB/s ± 1% ~ (p=0.015 n=9+9)
Decompress1XTable/case2-8 18.1MB/s ± 1% 18.1MB/s ± 1% ~ (p=0.202 n=10+9)
Decompress1XTable/case3-8 19.1MB/s ± 1% 19.2MB/s ± 1% ~ (p=0.056 n=9+10)
Decompress1XTable/pngdata.001-8 374MB/s ± 0% 374MB/s ± 0% ~ (p=0.148 n=10+10)
Decompress1XTable/normcount2-8 54.4MB/s ± 1% 54.4MB/s ± 1% ~ (p=0.617 n=10+10)
Decompress1XNoTable/digits/100-8 280MB/s ± 0% 280MB/s ± 1% ~ (p=0.951 n=9+10)
Decompress1XNoTable/digits/10000-8 366MB/s ± 1% 367MB/s ± 0% ~ (p=0.090 n=10+9)
Decompress1XNoTable/digits/262143-8 348MB/s ± 1% 349MB/s ± 0% ~ (p=0.043 n=10+10)
Decompress1XNoTable/gettysburg/100-8 276MB/s ± 0% 277MB/s ± 1% +0.44% (p=0.009 n=10+10)
Decompress1XNoTable/gettysburg/10000-8 363MB/s ± 1% 363MB/s ± 0% ~ (p=0.041 n=10+7)
Decompress1XNoTable/gettysburg/262143-8 349MB/s ± 1% 350MB/s ± 0% ~ (p=0.123 n=10+10)
Decompress1XNoTable/twain/100-8 267MB/s ± 0% 268MB/s ± 0% ~ (p=0.052 n=10+10)
Decompress1XNoTable/twain/10000-8 357MB/s ± 3% 363MB/s ± 0% +1.74% (p=0.000 n=10+10)
Decompress1XNoTable/twain/262143-8 320MB/s ± 2% 329MB/s ± 0% +3.09% (p=0.000 n=10+10)
Decompress1XNoTable/low-ent.10k/100-8 183MB/s ± 1% 184MB/s ± 0% ~ (p=0.211 n=9+10)
Decompress1XNoTable/low-ent.10k/10000-8 377MB/s ± 3% 385MB/s ± 1% +2.14% (p=0.000 n=10+10)
Decompress1XNoTable/low-ent.10k/262143-8 386MB/s ± 1% 389MB/s ± 1% +0.84% (p=0.005 n=10+10)
Decompress1XNoTable/superlow-ent-10k/262143-8 382MB/s ± 2% 389MB/s ± 1% +1.89% (p=0.001 n=10+10)
Decompress1XNoTable/crash2/100-8 276MB/s ± 2% 278MB/s ± 0% ~ (p=0.180 n=10+8)
Decompress1XNoTable/crash2/10000-8 373MB/s ± 1% 374MB/s ± 1% ~ (p=0.315 n=10+10)
Decompress1XNoTable/crash2/262143-8 373MB/s ± 1% 375MB/s ± 0% ~ (p=0.165 n=10+8)
Decompress1XNoTable/endzerobits/100-8 184MB/s ± 0% 184MB/s ± 1% ~ (p=0.845 n=9+9)
Decompress1XNoTable/endzerobits/10000-8 384MB/s ± 1% 386MB/s ± 0% +0.61% (p=0.007 n=10+10)
Decompress1XNoTable/endzerobits/262143-8 387MB/s ± 2% 389MB/s ± 0% ~ (p=0.963 n=9+8)
Decompress1XNoTable/endnonzero/100-8 181MB/s ± 2% 183MB/s ± 0% ~ (p=0.017 n=9+10)
Decompress1XNoTable/endnonzero/10000-8 385MB/s ± 0% 382MB/s ± 1% -0.88% (p=0.001 n=8+10)
Decompress1XNoTable/endnonzero/262143-8 387MB/s ± 1% 385MB/s ± 2% ~ (p=0.143 n=10+10)
Decompress1XNoTable/case1/100-8 278MB/s ± 2% 282MB/s ± 1% ~ (p=0.013 n=10+9)
Decompress1XNoTable/case1/10000-8 373MB/s ± 1% 373MB/s ± 0% ~ (p=0.274 n=10+8)
Decompress1XNoTable/case1/262143-8 374MB/s ± 1% 374MB/s ± 0% ~ (p=0.589 n=10+9)
Decompress1XNoTable/case2/100-8 274MB/s ± 0% 274MB/s ± 0% -0.26% (p=0.002 n=10+9)
Decompress1XNoTable/case2/10000-8 378MB/s ± 0% 377MB/s ± 0% ~ (p=0.093 n=10+10)
Decompress1XNoTable/case2/262143-8 377MB/s ± 1% 376MB/s ± 1% ~ (p=0.225 n=10+10)
Decompress1XNoTable/case3/100-8 266MB/s ± 0% 265MB/s ± 0% -0.20% (p=0.007 n=10+9)
Decompress1XNoTable/case3/10000-8 371MB/s ± 0% 372MB/s ± 0% ~ (p=0.211 n=10+9)
Decompress1XNoTable/case3/262143-8 373MB/s ± 0% 374MB/s ± 0% ~ (p=0.073 n=10+10)
Decompress1XNoTable/pngdata.001/100-8 239MB/s ± 0% 239MB/s ± 0% ~ (p=0.889 n=9+10)
Decompress1XNoTable/pngdata.001/10000-8 384MB/s ± 0% 384MB/s ± 0% ~ (p=0.228 n=10+8)
Decompress1XNoTable/pngdata.001/262143-8 377MB/s ± 0% 379MB/s ± 0% +0.56% (p=0.000 n=10+10)
Decompress1XNoTable/normcount2/100-8 281MB/s ± 1% 282MB/s ± 1% ~ (p=0.015 n=10+10)
Decompress1XNoTable/normcount2/10000-8 368MB/s ± 0% 370MB/s ± 0% +0.37% (p=0.004 n=10+10)
Decompress1XNoTable/normcount2/262143-8 371MB/s ± 0% 371MB/s ± 0% ~ (p=0.034 n=8+10)
Decompress4XNoTable/digits/100-8 200MB/s ± 1% 201MB/s ± 0% ~ (p=0.274 n=8+10)
Decompress4XNoTable/digits/10000-8 603MB/s ± 0% 622MB/s ± 1% +3.20% (p=0.000 n=8+10)
Decompress4XNoTable/digits/262143-8 578MB/s ± 0% 595MB/s ± 1% +2.87% (p=0.000 n=8+10)
Decompress4XNoTable/gettysburg/100-8 260MB/s ± 0% 260MB/s ± 1% ~ (p=0.011 n=8+10)
Decompress4XNoTable/gettysburg/10000-8 643MB/s ± 0% 657MB/s ± 1% +2.19% (p=0.000 n=10+9)
Decompress4XNoTable/gettysburg/262143-8 572MB/s ± 0% 589MB/s ± 0% +2.93% (p=0.000 n=8+10)
Decompress4XNoTable/twain/100-8 206MB/s ± 1% 206MB/s ± 1% ~ (p=0.436 n=10+10)
Decompress4XNoTable/twain/10000-8 639MB/s ± 1% 653MB/s ± 1% +2.25% (p=0.000 n=10+10)
Decompress4XNoTable/twain/262143-8 516MB/s ± 0% 522MB/s ± 1% +1.09% (p=0.004 n=10+10)
Decompress4XNoTable/low-ent.10k/100-8 207MB/s ± 1% 207MB/s ± 0% ~ (p=1.000 n=10+9)
Decompress4XNoTable/low-ent.10k/10000-8 631MB/s ± 0% 653MB/s ± 0% +3.42% (p=0.000 n=10+9)
Decompress4XNoTable/low-ent.10k/262143-8 685MB/s ± 1% 696MB/s ± 0% +1.61% (p=0.000 n=10+10)
Decompress4XNoTable/superlow-ent-10k/262143-8 684MB/s ± 1% 695MB/s ± 1% +1.51% (p=0.000 n=9+10)
Decompress4XNoTable/case1/100-8 208MB/s ± 1% 207MB/s ± 0% ~ (p=0.353 n=10+10)
Decompress4XNoTable/case1/10000-8 601MB/s ± 0% 621MB/s ± 1% +3.22% (p=0.000 n=10+10)
Decompress4XNoTable/case1/262143-8 613MB/s ± 1% 632MB/s ± 0% +3.14% (p=0.000 n=10+10)
Decompress4XNoTable/case2/100-8 210MB/s ± 2% 208MB/s ± 2% ~ (p=0.315 n=10+9)
Decompress4XNoTable/case2/10000-8 618MB/s ± 0% 636MB/s ± 0% +2.95% (p=0.000 n=10+10)
Decompress4XNoTable/case2/262143-8 635MB/s ± 0% 651MB/s ± 0% +2.56% (p=0.000 n=7+10)
Decompress4XNoTable/case3/100-8 199MB/s ± 1% 200MB/s ± 1% ~ (p=0.055 n=10+10)
Decompress4XNoTable/case3/10000-8 615MB/s ± 0% 633MB/s ± 1% +2.94% (p=0.000 n=10+10)
Decompress4XNoTable/case3/262143-8 620MB/s ± 0% 639MB/s ± 1% +3.00% (p=0.000 n=10+10)
Decompress4XNoTable/pngdata.001/100-8 212MB/s ± 0% 211MB/s ± 1% ~ (p=0.211 n=10+9)
Decompress4XNoTable/pngdata.001/10000-8 649MB/s ± 0% 667MB/s ± 1% +2.76% (p=0.000 n=10+10)
Decompress4XNoTable/pngdata.001/262143-8 646MB/s ± 0% 660MB/s ± 0% +2.28% (p=0.000 n=9+10)
Decompress4XNoTable/normcount2/100-8 261MB/s ± 1% 262MB/s ± 1% ~ (p=0.031 n=9+9)
Decompress4XNoTable/normcount2/10000-8 589MB/s ± 1% 613MB/s ± 0% +3.99% (p=0.000 n=10+9)
Decompress4XNoTable/normcount2/262143-8 585MB/s ± 3% 617MB/s ± 1% +5.57% (p=0.000 n=10+10)
Decompress4XNoTableTableLog8/digits-8 579MB/s ± 2% 610MB/s ± 0% +5.33% (p=0.000 n=10+10)
Decompress4XTable/digits-8 584MB/s ± 1% 607MB/s ± 1% +3.89% (p=0.000 n=10+10)
Decompress4XTable/gettysburg-8 370MB/s ± 0% 373MB/s ± 1% +0.67% (p=0.009 n=10+10)
Decompress4XTable/twain-8 512MB/s ± 2% 523MB/s ± 1% +2.08% (p=0.000 n=9+10)
Decompress4XTable/low-ent.10k-8 656MB/s ± 1% 677MB/s ± 1% +3.21% (p=0.000 n=10+10)
Decompress4XTable/superlow-ent-10k-8 603MB/s ± 4% 626MB/s ± 1% +3.91% (p=0.000 n=9+10)
Decompress4XTable/case1-8 21.1MB/s ± 0% 21.0MB/s ± 0% -0.55% (p=0.000 n=9+9)
Decompress4XTable/case2-8 17.6MB/s ± 0% 17.6MB/s ± 1% ~ (p=0.736 n=9+10)
Decompress4XTable/case3-8 18.7MB/s ± 1% 18.7MB/s ± 1% ~ (p=0.642 n=10+10)
Decompress4XTable/pngdata.001-8 648MB/s ± 0% 657MB/s ± 0% +1.50% (p=0.000 n=10+8)
Decompress4XTable/normcount2-8 49.7MB/s ± 1% 49.7MB/s ± 1% ~ (p=0.839 n=10+10)
[Geo mean] 271MB/s 274MB/s +0.96%
```
1 parent 272358c commit 60e376b
2 files changed, +340 -350 lines changed