llama-bench : add test measuring token generation rate at given prompt length #11126
fairydreaming wants to merge 3 commits into ggml-org:master
Conversation
slaren left a comment:
The other printers (sql, json, etc) would also need to be updated.
@slaren Can you be more specific?
The test type needs to be exported in these printers as well, since …
I guess another option is to add a "test" column in all printers with the same values as displayed in the default console output. Any specific reason it's not included there?
Yes, that's what I meant when I said that the test type would need to be exported in these printers. There isn't a test column/field at the moment because it is not necessary.
```cpp
switch (test_kind) {
    case TEST_KIND_PP:
        snprintf(buf, sizeof(buf), "pp%d", n_prompt);
        break;
    case TEST_KIND_TG:
        snprintf(buf, sizeof(buf), "tg%d", n_gen);
        break;
    case TEST_KIND_PG:
        snprintf(buf, sizeof(buf), "pp%d+tg%d", n_prompt, n_gen);
        break;
    case TEST_KIND_GP:
        snprintf(buf, sizeof(buf), "tg%d@pp%d", n_gen, n_prompt);
        break;
    default:
        snprintf(buf, sizeof(buf), "unknown");
        break;
}
```
This formatting should only be applied to the markdown printer. The other printers are intended to be used programmatically, so it should be a simple enum that can be parsed easily, without the token counts. The token counts can be obtained from the n_prompt and n_gen parameters already.
This seems to be already added in a more general form (context depth) in #13096, so I'm closing this.
I needed a test that would measure the token generation rate after processing a prompt of a given length, so I decided to add a new kind of test to the llama-bench tool.
This PR adds a `-gp <pp,tg>` option that allows specifying a prompt length and the number of tokens generated after processing the prompt. This new test works almost the same way as the old `-pg` test, but it doesn't take the prompt length and prompt processing time into account when calculating the result; only the token generation rate is reported. Test results are labeled differently to avoid confusion with `-pg` test results: I used the `@` character to emphasize that the result indicates the token generation rate AT a given prompt length.
Example:
```
$ ./bin/llama-bench --numa distribute -t 32 -m /mnt/md0/models/deepseek-v3-Q4_K_S.gguf -p 0 -n 0 -gp 128,32 -gp 256,32 -r 3
```
Hopefully this is more intuitive compared to the averaged prompt processing + token generation rate in `-pg` test results.