llama-bench : add test measuring token generation rate at given prompt length #11126
fairydreaming wants to merge 3 commits into ggml-org:master
Conversation
slaren left a comment:
The other printers (sql, json, etc) would also need to be updated.
@slaren Can you be more specific?
The test type needs to be exported in these printers as well, since …
I guess another option is to add a "test" column in all printers with the same values as displayed in the default console output. Any specific reason it's not included there?
Yes, that's what I meant when I said that the test type would need to be exported in these printers. There isn't a test column/field at the moment because it is not necessary.
```cpp
switch (test_kind) {
    case TEST_KIND_PP:
        snprintf(buf, sizeof(buf), "pp%d", n_prompt);
        break;
    case TEST_KIND_TG:
        snprintf(buf, sizeof(buf), "tg%d", n_gen);
        break;
    case TEST_KIND_PG:
        snprintf(buf, sizeof(buf), "pp%d+tg%d", n_prompt, n_gen);
        break;
    case TEST_KIND_GP:
        snprintf(buf, sizeof(buf), "tg%d@pp%d", n_gen, n_prompt);
        break;
    default:
        snprintf(buf, sizeof(buf), "unknown");
        break;
}
```
This formatting should only be applied to the markdown printer. The other printers are intended to be used programmatically, so it should be a simple enum that can be parsed easily, without the token counts. The token counts can be obtained from the n_prompt and n_gen parameters already.
This seems to be already added in a more general form (context depth) in #13096, so I'm closing this.
I needed a test that would measure the token generation rate after processing a prompt of a given length, so I decided to add a new kind of test to the llama-bench tool.
This PR adds a `-gp <pp,tg>` option that allows specifying a prompt length and the number of tokens generated after processing the prompt. This new test works almost the same way as the old `-pg` test, but it doesn't take the prompt length and prompt processing time into account when calculating the result; only the token generation rate is reported. Test results are labeled differently to avoid confusion with `-pg` test results: I used the `@` character to emphasize that the result indicates the token generation rate AT a given prompt length.
Example:
```
$ ./bin/llama-bench --numa distribute -t 32 -m /mnt/md0/models/deepseek-v3-Q4_K_S.gguf -p 0 -n 0 -gp 128,32 -gp 256,32 -r 3
```
Hopefully this is more intuitive compared to the averaged prompt processing + token generation rate in `-pg` test results.