
Faster adaptive_p sampling#1165

Merged
ikawrakow merged 8 commits into main from ik/adaptive_p_2 on Jan 19, 2026
Conversation

@ikawrakow
Owner

This PR further optimizes adaptive_p sampling compared to PR #1161. For more context, see the discussion there.
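For background, samplers such as adaptive_p operate on the probability distribution obtained from the model's logits, and the probability computation is essentially a softmax. A scalar Python sketch of that step (an illustration only, not the PR's C++ code, which as discussed below uses AVX2):

```python
import math

def softmax_probs(logits):
    """Scalar reference for turning logits into probabilities.
    Subtracting the maximum logit first keeps exp() from overflowing;
    a vectorized (e.g. AVX2) version processes several logits per
    instruction but computes the same result."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]
```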

To actually measure the time spent in the adaptive_p sampler, one needs to add up the time spent in all of its functions, not just the final sampling time, which is fast. This is done in this PR and also in #1161, but I also modified the main branch (not pushed here) to be able to compare.
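One way to do this kind of bookkeeping is to wrap every sampler entry point with a timer and accumulate across all calls. A minimal Python sketch (hypothetical names, not the actual llama.cpp timing code):

```python
import time

class TimedSampler:
    """Accumulates time across *all* of a sampler's functions, not just
    the final (fast) token draw. Hypothetical illustration only."""

    def __init__(self, sampler):
        self.sampler = sampler
        self.total_ns = 0   # total time spent inside the sampler
        self.calls = 0      # number of timed calls

    def _timed(self, fn, *args):
        t0 = time.perf_counter_ns()
        result = fn(*args)
        self.total_ns += time.perf_counter_ns() - t0
        self.calls += 1
        return result

    def apply(self, logits):
        # e.g. the adaptive_p filtering pass over the candidates
        return self._timed(self.sampler.apply, logits)

    def sample(self, probs):
        # the final sampling step (fast on its own)
        return self._timed(self.sampler.sample, probs)

    def ms_per_token(self, n_tokens):
        return self.total_ns / 1e6 / max(n_tokens, 1)
```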

Here are the results of a quick experiment with Qwen3-30B-A3B-Q8_0, adaptive_p enabled, and the prompt "Give me an extended summary of the history of Bulgaria". We see a massive improvement (~17X) between the main branch and #1161, and an additional 2.4X speedup in this PR.

Main branch

llama_print_timings:        load time =    3036.16 ms
llama_print_timings:      sample time =   26234.31 ms /  2617 runs   (   10.02 ms per token,    99.75 tokens per second)
llama_print_timings: prompt eval time =      69.45 ms /    19 tokens (    3.66 ms per token,   273.59 tokens per second)
llama_print_timings:        eval time =   16612.06 ms /  2616 runs   (    6.35 ms per token,   157.48 tokens per second)
llama_print_timings:       total time =   56298.48 ms /  2635 tokens

PR #1161

llama_print_timings:        load time =    2938.55 ms
llama_print_timings:      sample time =    1524.91 ms /  2617 runs   (    0.58 ms per token,  1716.16 tokens per second)
llama_print_timings: prompt eval time =      70.19 ms /    19 tokens (    3.69 ms per token,   270.69 tokens per second)
llama_print_timings:        eval time =   16584.15 ms /  2616 runs   (    6.34 ms per token,   157.74 tokens per second)
llama_print_timings:       total time =   33213.94 ms /  2635 tokens

This PR

llama_print_timings:        load time =    2954.32 ms
llama_print_timings:      sample time =     627.39 ms /  2667 runs   (    0.24 ms per token,  4250.91 tokens per second)
llama_print_timings: prompt eval time =      69.66 ms /    19 tokens (    3.67 ms per token,   272.75 tokens per second)
llama_print_timings:        eval time =   16910.88 ms /  2666 runs   (    6.34 ms per token,   157.65 tokens per second)
llama_print_timings:       total time =   38529.50 ms /  2685 tokens

@ikawrakow
Owner Author

OK, I added an AVX2 implementation of the probability computation. With that, I get for the above test case:

llama_print_timings:        load time =    2985.43 ms
llama_print_timings:      sample time =     476.95 ms /  2667 runs   (    0.18 ms per token,  5591.72 tokens per second)
llama_print_timings: prompt eval time =      70.72 ms /    19 tokens (    3.72 ms per token,   268.67 tokens per second)
llama_print_timings:        eval time =   16940.82 ms /  2666 runs   (    6.35 ms per token,   157.37 tokens per second)
llama_print_timings:       total time =   33074.71 ms /  2685 tokens

This is, somewhat disappointingly, only 0.58/0.18 = 3.2 times faster than #1161. But if we take into account the time spent in the other samplers (0.07 ms per token in this example), the speedup compared to #1161 becomes (0.58 - 0.07)/(0.18 - 0.07) = 4.6 times. If I compare to the previous main branch, the speedup is (10.02 - 0.07)/(0.18 - 0.07) = 90.5 times!
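The arithmetic above, spelled out (all figures are the per-token sample times from the logs; 0.07 ms is the portion spent in samplers other than adaptive_p):

```python
other = 0.07     # ms/token in samplers other than adaptive_p
pr1161 = 0.58    # ms/token, PR #1161
this_pr = 0.18   # ms/token, this PR with AVX2
main = 10.02     # ms/token, previous main branch

naive = pr1161 / this_pr                        # raw ratio, ~3.2
vs_1161 = (pr1161 - other) / (this_pr - other)  # net of other samplers, ~4.6
vs_main = (main - other) / (this_pr - other)    # vs old main, ~90.5

print(f"{naive:.1f}x, {vs_1161:.1f}x, {vs_main:.1f}x")  # → 3.2x, 4.6x, 90.5x
```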

@ikawrakow merged commit 98b30e5 into main on Jan 19, 2026