llama-bench: enable having different number of threads for tg and pp#284

Merged
ikawrakow merged 2 commits into main from ik/llama_bench_tgb
Mar 25, 2025

@ikawrakow
Owner

All applications in the examples folder except llama-bench accept -t (number of threads for token generation) and -tb (number of threads for prompt processing, a.k.a. prefill) as command line arguments. This is handy because peak TG performance is often reached at a lower number of threads than the number of cores, whereas using all cores is good for maximum prompt processing speed. llama-bench, inherited from upstream, has its own command line argument parsing, which offers only -t and not -tb.

This PR adds a new command line argument to llama-bench: -tgb (or --threads-gen-batch). One can use it as, e.g.,

./bin/llama-bench -tgb 4,16 -p 512 -n 128 other_arguments

where 4 threads will be used for the tg128 test, and 16 threads will be used for the pp512 test. For tests that combine prefill and generation (-pg, -gp), the batch number of threads will be used for prefill, and the gen number of threads will be used for token generation. One can also specify multiple pairs of {t_gen, t_batch} for the -tgb argument, separating them with a semicolon (quote the value so that the shell does not treat the semicolon as a command separator). E.g.,

./bin/llama-bench -tgb "2,16;4,16;8,32"

The -t argument continues to work as before. It adds a pair with the same integer for both entries to the list of {t_gen, t_batch} thread pairs.
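
To make the pairing semantics concrete, here is a minimal Python sketch of how a -tgb value and a -t value map to {t_gen, t_batch} pairs as described above. It is not the llama-bench parser (which is C++); the function names `parse_tgb` and `pair_from_t` are hypothetical and exist only for this illustration.

```python
def parse_tgb(spec):
    """Split a -tgb style value such as "2,16;4,16;8,32" into (t_gen, t_batch) pairs.

    Illustrative only: mirrors the format described in this PR, not the actual C++ code.
    """
    pairs = []
    for item in spec.split(";"):          # pairs are separated by semicolons
        fields = item.split(",")          # each pair is "t_gen,t_batch"
        if len(fields) != 2:
            raise ValueError(f"expected 't_gen,t_batch', got {item!r}")
        t_gen, t_batch = int(fields[0]), int(fields[1])
        pairs.append((t_gen, t_batch))
    return pairs


def pair_from_t(t):
    """A plain -t value contributes a pair with the same count for gen and batch."""
    return (t, t)


if __name__ == "__main__":
    print(parse_tgb("2,16;4,16;8,32"))  # [(2, 16), (4, 16), (8, 32)]
    print(pair_from_t(8))               # (8, 8)
```

Each resulting pair corresponds to one benchmark configuration, with the first number applied to token generation and the second to prompt processing.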

Caveat: For -p the batch number of threads is added to the table. For all other tests the gen number of threads is printed. This is of course appropriate for -n and -gp, but it becomes confusing for -pg, where both the batch and the gen number of threads matter for the reported performance. I guess it would be better to print both thread numbers in this case, but this is not done in this PR.
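
The reporting rule in the caveat can be summarized with a small Python sketch. It assumes a pure prompt test is one with no generated tokens (`n_gen == 0`); the function name `threads_column` and that way of identifying test types are assumptions made for illustration, not taken from the llama-bench source.

```python
def threads_column(n_prompt, n_gen, t_gen, t_batch):
    """Return the thread count shown in the results table for one test.

    Assumed rule, following the caveat above: pure prompt tests (-p) show
    t_batch; every other test (-n, -gp, -pg) shows t_gen.
    """
    if n_gen == 0:
        return t_batch   # -p: only prompt processing runs
    return t_gen         # -n, -gp, -pg: the gen thread count is printed


if __name__ == "__main__":
    print(threads_column(512, 0, 4, 16))    # pp512 -> 16
    print(threads_column(0, 128, 4, 16))    # tg128 -> 4
    print(threads_column(512, 128, 4, 16))  # -pg 512,128 -> 4, although prefill ran with 16
```

The last case is the confusing one: the table shows 4, while the prefill part of the test actually ran with 16 threads.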

@ikawrakow ikawrakow merged commit a22250d into main Mar 25, 2025
@ubergarm
Contributor

Thanks for this one, it should help optimize the big Xeon 6980P, given that previous testing suggests pp likes more threads than tg.
