Conversation
@Narsil Narsil commented Jun 16, 2025

Rebase of #1618
Similar improvements observed on aarch64 (M3).

I removed the public AHash interface. It shouldn't be part of the public API imho (yes, that means many copies, but AHash should stay internal so we can modify it later).

Benchmarking bert-encode/WordPiece BERT encode: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 53.0s, or reduce sample count to 10.
bert-encode/WordPiece BERT encode
                        time:   [2.4283 s 2.4629 s 2.4996 s]
                        thrpt:  [2.4756 MiB/s 2.5126 MiB/s 2.5483 MiB/s]
                 change:
                        time:   [−10.096% −7.8618% −5.6204%] (p = 0.00 < 0.05)
                        thrpt:  [+5.9551% +8.5326% +11.230%]
                        Performance has improved.
Benchmarking bert-encode/WordPiece BERT encode batch: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 14.0s, or reduce sample count to 10.
bert-encode/WordPiece BERT encode batch
                        time:   [682.73 ms 695.67 ms 712.54 ms]
                        thrpt:  [8.6845 MiB/s 8.8951 MiB/s 9.0637 MiB/s]
                 change:
                        time:   [−20.705% −15.747% −11.501%] (p = 0.00 < 0.05)
                        thrpt:  [+12.995% +18.690% +26.112%]
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe

bert-train-small/WordPiece Train vocabulary (small)
                        time:   [16.811 ms 16.945 ms 17.219 ms]
                        thrpt:  [422.19 KiB/s 429.00 KiB/s 432.42 KiB/s]
                 change:
                        time:   [−19.603% −17.208% −15.049%] (p = 0.00 < 0.05)
                        thrpt:  [+17.715% +20.785% +24.384%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking bert-train-big/WordPiece Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.0s.
bert-train-big/WordPiece Train vocabulary (big)
                        time:   [693.77 ms 725.44 ms 761.39 ms]
                        thrpt:  [8.1273 MiB/s 8.5300 MiB/s 8.9195 MiB/s]
                 change:
                        time:   [−23.882% −19.613% −14.933%] (p = 0.00 < 0.05)
                        thrpt:  [+17.555% +24.398% +31.375%]
                        Performance has improved.

     Running benches/bpe_benchmark.rs (target/release/deps/bpe_benchmark-49e383f79a782561)
Gnuplot not found, using plotters backend
Benchmarking bpe-encode/BPE GPT2 encode: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 42.5s, or reduce sample count to 10.
bpe-encode/BPE GPT2 encode
                        time:   [2.1310 s 2.1845 s 2.2545 s]
                        thrpt:  [2.7448 MiB/s 2.8327 MiB/s 2.9038 MiB/s]
                 change:
                        time:   [−10.569% −7.0124% −3.0404%] (p = 0.00 < 0.05)
                        thrpt:  [+3.1357% +7.5412% +11.818%]
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe
Benchmarking bpe-encode/BPE GPT2 encode batch: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 13.1s, or reduce sample count to 10.
bpe-encode/BPE GPT2 encode batch
                        time:   [691.33 ms 722.61 ms 756.43 ms]
                        thrpt:  [8.1806 MiB/s 8.5636 MiB/s 8.9509 MiB/s]
                 change:
                        time:   [−1.6564% +3.9238% +9.2610%] (p = 0.19 > 0.05)
                        thrpt:  [−8.4760% −3.7756% +1.6843%]
                        No change in performance detected.
Benchmarking bpe-encode/BPE GPT2 encode, no cache: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 63.6s, or reduce sample count to 10.
bpe-encode/BPE GPT2 encode, no cache
                        time:   [2.8375 s 2.9186 s 3.0156 s]
                        thrpt:  [2.0520 MiB/s 2.1202 MiB/s 2.1808 MiB/s]
                 change:
                        time:   [−20.187% −16.796% −13.055%] (p = 0.00 < 0.05)
                        thrpt:  [+15.016% +20.186% +25.293%]
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Benchmarking bpe-encode/BPE GPT2 encode batch, no cache: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 15.5s, or reduce sample count to 10.
bpe-encode/BPE GPT2 encode batch, no cache
                        time:   [725.56 ms 739.44 ms 755.25 ms]
                        thrpt:  [8.1934 MiB/s 8.3686 MiB/s 8.5286 MiB/s]
                 change:
                        time:   [−14.984% −12.535% −10.017%] (p = 0.00 < 0.05)
                        thrpt:  [+11.132% +14.332% +17.625%]
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild

bpe-train-small/BPE Train vocabulary (small)
                        time:   [15.891 ms 16.001 ms 16.265 ms]
                        thrpt:  [446.94 KiB/s 454.31 KiB/s 457.46 KiB/s]
                 change:
                        time:   [−12.175% −10.302% −8.0192%] (p = 0.00 < 0.05)
                        thrpt:  [+8.7183% +11.485% +13.862%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

Benchmarking bpe-train-large/BPE Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.4s.
bpe-train-large/BPE Train vocabulary (big)
                        time:   [689.49 ms 698.18 ms 707.34 ms]
                        thrpt:  [8.7484 MiB/s 8.8632 MiB/s 8.9748 MiB/s]
                 change:
                        time:   [−17.372% −15.491% −13.530%] (p = 0.00 < 0.05)
                        thrpt:  [+15.647% +18.331% +21.024%]
                        Performance has improved.

     Running benches/layout_benchmark.rs (target/release/deps/layout_benchmark-a6fe7afd38a14b28)
Gnuplot not found, using plotters backend
TemplateProcessing single encode
                        time:   [990.18 ns 999.44 ns 1.0108 µs]
                        change: [−9.0658% −4.0826% +1.2059%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 3 outliers among 20 measurements (15.00%)
  1 (5.00%) high mild
  2 (10.00%) high severe

TemplateProcessing pair encode
                        time:   [2.8234 µs 2.8585 µs 2.9072 µs]
                        change: [−23.722% −18.359% −11.946%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe

     Running benches/llama3_benchmark.rs (target/release/deps/llama3_benchmark-ded5fd102cdc68a7)
Gnuplot not found, using plotters backend
Benchmarking llama3-encode/llama3-offsets: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.5s.
llama3-encode/llama3-offsets
                        time:   [748.54 ms 775.86 ms 812.84 ms]
                        thrpt:  [7.6129 MiB/s 7.9757 MiB/s 8.2669 MiB/s]
                 change:
                        time:   [−31.686% −27.760% −23.226%] (p = 0.00 < 0.05)
                        thrpt:  [+30.252% +38.427% +46.383%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Benchmarking llama3-encode/llama3-encode: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 23.3s.
llama3-encode/llama3-encode
                        time:   [2.3489 s 2.4322 s 2.5156 s]
                        thrpt:  [2.4598 MiB/s 2.5442 MiB/s 2.6345 MiB/s]
                 change:
                        time:   [−25.364% −22.444% −19.384%] (p = 0.00 < 0.05)
                        thrpt:  [+24.044% +28.939% +33.984%]
                        Performance has improved.
Benchmarking llama3-encode/llama3-batch: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.8s.
llama3-encode/llama3-batch
                        time:   [659.48 ms 669.77 ms 680.18 ms]
                        thrpt:  [9.0977 MiB/s 9.2391 MiB/s 9.3832 MiB/s]
                 change:
                        time:   [−48.552% −43.991% −40.447%] (p = 0.00 < 0.05)
                        thrpt:  [+67.918% +78.544% +94.371%]
                        Performance has improved.
Found 3 outliers among 10 measurements (30.00%)
  2 (20.00%) low mild
  1 (10.00%) high mild
Benchmarking llama3-encode/BPE Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 11.3s.
llama3-encode/BPE Train vocabulary (big)
                        time:   [1.1127 s 1.1630 s 1.2157 s]
                        thrpt:  [5.0901 MiB/s 5.3208 MiB/s 5.5612 MiB/s]
                 change:
                        time:   [−18.982% −13.983% −8.7779%] (p = 0.00 < 0.05)
                        thrpt:  [+9.6225% +16.256% +23.429%]
                        Performance has improved.

     Running benches/unigram_benchmark.rs (target/release/deps/unigram_benchmark-38775ac8622e7ad7)
Gnuplot not found, using plotters backend
Benchmarking unigram-train-large/BPE Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 12.2s.
unigram-train-large/BPE Train vocabulary (big)
                        time:   [1.1742 s 1.2173 s 1.2603 s]
                        thrpt:  [4.9102 MiB/s 5.0836 MiB/s 5.2700 MiB/s]
                 change:
                        time:   [−9.9761% −5.0169% −0.0522%] (p = 0.10 > 0.05)
                        thrpt:  [+0.0522% +5.2818% +11.082%]
                        No change in performance detected.

cargo bench  1552.15s user 126.69s system 351% cpu 7:58.04 total

@Narsil Narsil requested a review from McPatate June 16, 2025 19:51
@Narsil Narsil force-pushed the consolidated-optimization-ahash-dary-compact-str branch from d4ec430 to 3d4cfd3 on June 17, 2025 12:10
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

Must have been quite tedious, thanks a lot!

Collaborator

is this to get a real-world use case?

Contributor Author

Nope, just some random file left over in my directory apparently; must have been from fixing an issue :)

@Narsil Narsil force-pushed the consolidated-optimization-ahash-dary-compact-str branch from 0cddfce to 0490240 on June 19, 2025 08:35
@Narsil Narsil merged commit be25814 into main Jun 19, 2025
30 checks passed
@Narsil Narsil deleted the consolidated-optimization-ahash-dary-compact-str branch June 19, 2025 09:45