llama-bench: enable having different number of threads for tg and pp #284
Merged
Thanks for this one, should help optimize the big Xeon 6980P given previous testing suggests that pp likes more threads than tg.
All applications in the `examples` folder except `llama-bench` accept `-t` (to specify the number of threads for token generation) and `-tb` (to specify the number of threads for prompt processing, a.k.a. prefill) as command line arguments. This is handy because TG peak performance is often reached at a lower number of threads, so one wants to use that instead of the number of cores, which is good for maximum prompt processing speed. `llama-bench`, inherited from upstream, has its own command line argument parsing, where only `-t` is available but not `-tb`.

This PR adds a new command line argument to `llama-bench`: `-tgb` (or `--threads-gen-batch`). For example, with a gen/batch pair of 4 and 16, 4 threads will be used for the `tg128` test and 16 threads will be used for the `pp512` test. For tests that are a combination of prefill and gen (`-pg`, `-gp`), the batch number of threads will be used for prefill, and the gen number of threads will be used for token generation. One can also specify multiple pairs of `{t_gen, t_batch}` for the `-tgb` argument, separating them with a semicolon.

The `-t` argument continues to work as before. It adds a pair of the same integer to the list of `{t_gen, t_batch}` thread pairs.

Caveat: For `-p` the batch number of threads is added to the table. For all other tests the gen number of threads is printed. This is of course appropriate for `-n` and `-gp`, but it becomes confusing for `-pg`, where the batch and gen number of threads both matter for the reported performance. I guess it would be better to print both thread numbers in this case, but this is not done in this PR.
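
The thread-pair semantics described above can be sketched in Python as follows. This is only an illustration, not the `llama-bench` implementation (which is C++). All function names are hypothetical, and the comma between the two values of a pair is an assumption, since the description only states that pairs are separated by semicolons.

```python
from typing import List, Optional, Tuple

# A pair is (t_gen, t_batch): threads for token generation, threads for prefill.
ThreadPair = Tuple[int, int]


def parse_tgb(value: str) -> List[ThreadPair]:
    """Parse a -tgb style value into a list of (t_gen, t_batch) pairs.

    Assumption: pairs are written as "t_gen,t_batch" and separated by ';',
    e.g. "4,16;8,32". Only the ';' separator is stated in the PR text.
    """
    pairs: List[ThreadPair] = []
    for chunk in value.split(";"):
        t_gen, t_batch = (int(x) for x in chunk.split(","))
        pairs.append((t_gen, t_batch))
    return pairs


def pair_from_t(n_threads: int) -> ThreadPair:
    """Model -t: a single integer becomes a pair of the same integer."""
    return (n_threads, n_threads)


def threads_for_test(
    pair: ThreadPair, n_prompt: int, n_gen: int
) -> Tuple[Optional[int], Optional[int]]:
    """Return (prefill_threads, gen_threads) used by a test.

    Prefill uses the batch thread count and generation uses the gen thread
    count. A phase that does not occur in the test yields None.
    """
    t_gen, t_batch = pair
    prefill = t_batch if n_prompt > 0 else None
    gen = t_gen if n_gen > 0 else None
    return prefill, gen


def reported_threads(pair: ThreadPair, n_prompt: int, n_gen: int) -> int:
    """Thread count shown in the results table, per the caveat above:
    the batch count for a pure prompt test, the gen count for everything else.
    """
    t_gen, t_batch = pair
    if n_prompt > 0 and n_gen == 0:
        return t_batch
    return t_gen
```

Under these assumptions, `parse_tgb("4,16")` yields one pair for which a `pp512` test runs with 16 threads and a `tg128` test with 4. For a combined prefill-plus-generation test, `reported_threads` returns only the gen count even though both counts affect the measured speed, which is the ambiguity the caveat points out.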