Draft
51 commits
c05df17
working llama-eval mc and math suite
gatbontonpc Jan 11, 2026
c2d83ca
multi source llama-eval
gatbontonpc Jan 12, 2026
89cab3d
Add readme
gatbontonpc Jan 12, 2026
8839037
add checkpointing
gatbontonpc Jan 16, 2026
07d5e1e
examples: add llama-server simulator for testing eval scripts
ggerganov Jan 31, 2026
23d4e21
examples: refactor test-simulator.sh for better readability
ggerganov Jan 31, 2026
c87af1d
docs: update llama-eval-discussion.md with session work summary
ggerganov Jan 31, 2026
5cc2258
examples: add simplified llama-eval-new.py for AIME evaluation
ggerganov Jan 31, 2026
a80814e
docs: remove README.md from llama-eval
ggerganov Jan 31, 2026
5a1be6c
examples: implement flexible grader system for answer validation
ggerganov Jan 31, 2026
9453f9d
examples: use HF_HUB_OFFLINE to avoid HF Hub warnings
ggerganov Jan 31, 2026
87f8930
examples: remove HF_HUB_OFFLINE to allow dataset download
ggerganov Jan 31, 2026
c2619c1
examples: use cached dataset path to avoid HF Hub requests
ggerganov Jan 31, 2026
04f6872
examples: use cached dataset path in simulator to avoid HF Hub requests
ggerganov Jan 31, 2026
37b26ca
docs: update llama-eval-discussion.md with session work summary
ggerganov Jan 31, 2026
62b04ce
examples: add threading support and model parameter to llama-eval-new.py
ggerganov Jan 31, 2026
a939f4c
docs: update llama-eval-discussion.md with threading and model parame…
ggerganov Jan 31, 2026
e79e8d0
examples: add task summary table to llama-eval-new.py
ggerganov Jan 31, 2026
812ae13
eval : print progress
ggerganov Jan 31, 2026
fb1481d
eval : add prompts
ggerganov Jan 31, 2026
9695e6f
test : fix path
ggerganov Feb 2, 2026
8156d54
sim : fix answer matching
ggerganov Feb 2, 2026
fd90796
eval : support multiple dataset runs
ggerganov Feb 2, 2026
68dde88
minor
ggerganov Feb 15, 2026
d2b1030
improve grader
ggerganov Feb 15, 2026
7751ae2
docs
ggerganov Feb 15, 2026
1db8428
remove old files
ggerganov Feb 15, 2026
e8a8075
datasets : add gsm8k
ggerganov Feb 15, 2026
cffd268
add gpqa + sampling + docs
ggerganov Feb 15, 2026
73e61d5
rename
ggerganov Feb 16, 2026
f762a71
grader : improve example answers
ggerganov Feb 16, 2026
c631565
cont
ggerganov Feb 16, 2026
99e3c3d
datasets : add aime2025
ggerganov Feb 16, 2026
52759bf
grader : update prompt
ggerganov Feb 16, 2026
db10dda
grade : improve regex + logs
ggerganov Feb 16, 2026
350e7c1
datasets : fix aime2025
ggerganov Feb 16, 2026
de956a6
cleanup
ggerganov Feb 16, 2026
c6d70b9
add AGENTS.md
ggerganov Feb 16, 2026
ad3a54e
ignore errors
ggerganov Feb 16, 2026
e6e777c
resume eval
ggerganov Feb 16, 2026
60a501e
cleanup
ggerganov Feb 16, 2026
7b84af8
fix counts
ggerganov Feb 16, 2026
6c41664
simplify
ggerganov Feb 16, 2026
e2e998a
fix prompts
ggerganov Feb 16, 2026
013963c
add html
ggerganov Feb 16, 2026
9c29be1
store full response
ggerganov Feb 16, 2026
2ffa45e
add tokens
ggerganov Feb 16, 2026
7f04986
resoning and error handling
ggerganov Feb 16, 2026
c0c3e42
refactor
ggerganov Feb 16, 2026
a3405d4
track total time
ggerganov Feb 23, 2026
1c128d9
remove junk
ggerganov Mar 29, 2026
5 changes: 5 additions & 0 deletions examples/llama-eval/README.md
@@ -0,0 +1,5 @@
# llama-eval

Simple evaluation tool for llama.cpp with support for multiple datasets.

TODO: add usage