Conversation
@Narsil Narsil commented Jun 16, 2025

Rebase of #1618
Similar improvements observed on aarch64 (M3).

I removed the public AHash interface. It shouldn't be part of the public API imho (yes, that means many copies, but AHash should stay internal so we can modify it later).

Benchmarking bert-encode/WordPiece BERT encode: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 53.0s, or reduce sample count to 10.
bert-encode/WordPiece BERT encode
                        time:   [2.4283 s 2.4629 s 2.4996 s]
                        thrpt:  [2.4756 MiB/s 2.5126 MiB/s 2.5483 MiB/s]
                 change:
                        time:   [−10.096% −7.8618% −5.6204%] (p = 0.00 < 0.05)
                        thrpt:  [+5.9551% +8.5326% +11.230%]
                        Performance has improved.
Benchmarking bert-encode/WordPiece BERT encode batch: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 14.0s, or reduce sample count to 10.
bert-encode/WordPiece BERT encode batch
                        time:   [682.73 ms 695.67 ms 712.54 ms]
                        thrpt:  [8.6845 MiB/s 8.8951 MiB/s 9.0637 MiB/s]
                 change:
                        time:   [−20.705% −15.747% −11.501%] (p = 0.00 < 0.05)
                        thrpt:  [+12.995% +18.690% +26.112%]
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe

bert-train-small/WordPiece Train vocabulary (small)
                        time:   [16.811 ms 16.945 ms 17.219 ms]
                        thrpt:  [422.19 KiB/s 429.00 KiB/s 432.42 KiB/s]
                 change:
                        time:   [−19.603% −17.208% −15.049%] (p = 0.00 < 0.05)
                        thrpt:  [+17.715% +20.785% +24.384%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high mild

Benchmarking bert-train-big/WordPiece Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.0s.
bert-train-big/WordPiece Train vocabulary (big)
                        time:   [693.77 ms 725.44 ms 761.39 ms]
                        thrpt:  [8.1273 MiB/s 8.5300 MiB/s 8.9195 MiB/s]
                 change:
                        time:   [−23.882% −19.613% −14.933%] (p = 0.00 < 0.05)
                        thrpt:  [+17.555% +24.398% +31.375%]
                        Performance has improved.

     Running benches/bpe_benchmark.rs (target/release/deps/bpe_benchmark-49e383f79a782561)
Gnuplot not found, using plotters backend
Benchmarking bpe-encode/BPE GPT2 encode: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 42.5s, or reduce sample count to 10.
bpe-encode/BPE GPT2 encode
                        time:   [2.1310 s 2.1845 s 2.2545 s]
                        thrpt:  [2.7448 MiB/s 2.8327 MiB/s 2.9038 MiB/s]
                 change:
                        time:   [−10.569% −7.0124% −3.0404%] (p = 0.00 < 0.05)
                        thrpt:  [+3.1357% +7.5412% +11.818%]
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high severe
Benchmarking bpe-encode/BPE GPT2 encode batch: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 13.1s, or reduce sample count to 10.
bpe-encode/BPE GPT2 encode batch
                        time:   [691.33 ms 722.61 ms 756.43 ms]
                        thrpt:  [8.1806 MiB/s 8.5636 MiB/s 8.9509 MiB/s]
                 change:
                        time:   [−1.6564% +3.9238% +9.2610%] (p = 0.19 > 0.05)
                        thrpt:  [−8.4760% −3.7756% +1.6843%]
                        No change in performance detected.
Benchmarking bpe-encode/BPE GPT2 encode, no cache: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 63.6s, or reduce sample count to 10.
bpe-encode/BPE GPT2 encode, no cache
                        time:   [2.8375 s 2.9186 s 3.0156 s]
                        thrpt:  [2.0520 MiB/s 2.1202 MiB/s 2.1808 MiB/s]
                 change:
                        time:   [−20.187% −16.796% −13.055%] (p = 0.00 < 0.05)
                        thrpt:  [+15.016% +20.186% +25.293%]
                        Performance has improved.
Found 1 outliers among 20 measurements (5.00%)
  1 (5.00%) high mild
Benchmarking bpe-encode/BPE GPT2 encode batch, no cache: Warming up for 3.0000 s
Warning: Unable to complete 20 samples in 5.0s. You may wish to increase target time to 15.5s, or reduce sample count to 10.
bpe-encode/BPE GPT2 encode batch, no cache
                        time:   [725.56 ms 739.44 ms 755.25 ms]
                        thrpt:  [8.1934 MiB/s 8.3686 MiB/s 8.5286 MiB/s]
                 change:
                        time:   [−14.984% −12.535% −10.017%] (p = 0.00 < 0.05)
                        thrpt:  [+11.132% +14.332% +17.625%]
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  2 (10.00%) high mild

bpe-train-small/BPE Train vocabulary (small)
                        time:   [15.891 ms 16.001 ms 16.265 ms]
                        thrpt:  [446.94 KiB/s 454.31 KiB/s 457.46 KiB/s]
                 change:
                        time:   [−12.175% −10.302% −8.0192%] (p = 0.00 < 0.05)
                        thrpt:  [+8.7183% +11.485% +13.862%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe

Benchmarking bpe-train-large/BPE Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 7.4s.
bpe-train-large/BPE Train vocabulary (big)
                        time:   [689.49 ms 698.18 ms 707.34 ms]
                        thrpt:  [8.7484 MiB/s 8.8632 MiB/s 8.9748 MiB/s]
                 change:
                        time:   [−17.372% −15.491% −13.530%] (p = 0.00 < 0.05)
                        thrpt:  [+15.647% +18.331% +21.024%]
                        Performance has improved.

     Running benches/layout_benchmark.rs (target/release/deps/layout_benchmark-a6fe7afd38a14b28)
Gnuplot not found, using plotters backend
TemplateProcessing single encode
                        time:   [990.18 ns 999.44 ns 1.0108 µs]
                        change: [−9.0658% −4.0826% +1.2059%] (p = 0.13 > 0.05)
                        No change in performance detected.
Found 3 outliers among 20 measurements (15.00%)
  1 (5.00%) high mild
  2 (10.00%) high severe

TemplateProcessing pair encode
                        time:   [2.8234 µs 2.8585 µs 2.9072 µs]
                        change: [−23.722% −18.359% −11.946%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 20 measurements (10.00%)
  1 (5.00%) high mild
  1 (5.00%) high severe

     Running benches/llama3_benchmark.rs (target/release/deps/llama3_benchmark-ded5fd102cdc68a7)
Gnuplot not found, using plotters backend
Benchmarking llama3-encode/llama3-offsets: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 8.5s.
llama3-encode/llama3-offsets
                        time:   [748.54 ms 775.86 ms 812.84 ms]
                        thrpt:  [7.6129 MiB/s 7.9757 MiB/s 8.2669 MiB/s]
                 change:
                        time:   [−31.686% −27.760% −23.226%] (p = 0.00 < 0.05)
                        thrpt:  [+30.252% +38.427% +46.383%]
                        Performance has improved.
Found 1 outliers among 10 measurements (10.00%)
  1 (10.00%) high severe
Benchmarking llama3-encode/llama3-encode: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 23.3s.
llama3-encode/llama3-encode
                        time:   [2.3489 s 2.4322 s 2.5156 s]
                        thrpt:  [2.4598 MiB/s 2.5442 MiB/s 2.6345 MiB/s]
                 change:
                        time:   [−25.364% −22.444% −19.384%] (p = 0.00 < 0.05)
                        thrpt:  [+24.044% +28.939% +33.984%]
                        Performance has improved.
Benchmarking llama3-encode/llama3-batch: Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 6.8s.
llama3-encode/llama3-batch
                        time:   [659.48 ms 669.77 ms 680.18 ms]
                        thrpt:  [9.0977 MiB/s 9.2391 MiB/s 9.3832 MiB/s]
                 change:
                        time:   [−48.552% −43.991% −40.447%] (p = 0.00 < 0.05)
                        thrpt:  [+67.918% +78.544% +94.371%]
                        Performance has improved.
Found 3 outliers among 10 measurements (30.00%)
  2 (20.00%) low mild
  1 (10.00%) high mild
Benchmarking llama3-encode/BPE Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 11.3s.
llama3-encode/BPE Train vocabulary (big)
                        time:   [1.1127 s 1.1630 s 1.2157 s]
                        thrpt:  [5.0901 MiB/s 5.3208 MiB/s 5.5612 MiB/s]
                 change:
                        time:   [−18.982% −13.983% −8.7779%] (p = 0.00 < 0.05)
                        thrpt:  [+9.6225% +16.256% +23.429%]
                        Performance has improved.

     Running benches/unigram_benchmark.rs (target/release/deps/unigram_benchmark-38775ac8622e7ad7)
Gnuplot not found, using plotters backend
Benchmarking unigram-train-large/BPE Train vocabulary (big): Warming up for 3.0000 s
Warning: Unable to complete 10 samples in 5.0s. You may wish to increase target time to 12.2s.
unigram-train-large/BPE Train vocabulary (big)
                        time:   [1.1742 s 1.2173 s 1.2603 s]
                        thrpt:  [4.9102 MiB/s 5.0836 MiB/s 5.2700 MiB/s]
                 change:
                        time:   [−9.9761% −5.0169% −0.0522%] (p = 0.10 > 0.05)
                        thrpt:  [+0.0522% +5.2818% +11.082%]
                        No change in performance detected.

cargo bench  1552.15s user 126.69s system 351% cpu 7:58.04 total

@Narsil Narsil requested a review from McPatate June 16, 2025 19:51
@Narsil Narsil force-pushed the consolidated-optimization-ahash-dary-compact-str branch from d4ec430 to 3d4cfd3 on June 17, 2025 12:10
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker (Collaborator) left a comment

Must have been quite tedious, thanks a lot!

Collaborator

is this to get a real-world use case?

Contributor Author

Nope, just some random file left over in my directory apparently; must have been from fixing an issue :)

@Narsil Narsil force-pushed the consolidated-optimization-ahash-dary-compact-str branch from 0cddfce to 0490240 on June 19, 2025 08:35
@Narsil Narsil merged commit be25814 into main Jun 19, 2025
30 checks passed
@Narsil Narsil deleted the consolidated-optimization-ahash-dary-compact-str branch June 19, 2025 09:45