Add IFBench eval suite (ifeval_ood + ifeval_mt) aligned with oe-eval-internal by finbarrtimbers · Pull Request #171 · allenai/olmo-eval

finbarrtimbers · 2026-05-05T14:25:34Z

Adds the IFBench instruction-following benchmark to olmo-eval-internal, matching the ifbench::tulu suite from oe-eval-internal.

Validation

uv run olmo-eval beaker launch -y \
    -m Qwen/Qwen3-4B \
    --harness default \
    -B ai2/oe-adapt \
    -o provider.kind=vllm_server \
    -t ifbench \
    --gpus 1 -c h100 -w ai2/open-instruct-dev -p urgent

Reference (allenai/oe-eval-internal):

uv run python oe_eval/launch.py \
    --model Qwen/Qwen3-4B \
    --model-type vllm \
    --model-args '{"trust_remote_code": true, "max_length": 32768}' \
    --task ifbench::tulu \
    --gpus 1 --cluster ai2/jupiter \
    --beaker-workspace ai2/open-instruct-dev \
    --beaker-priority urgent --beaker-budget ai2/oe-adapt

Results (Qwen3-4B, `prompt_level_loose_acc`)

| sub-task | n | oe-eval-internal | this PR |
| --- | --- | --- | --- |
| `ifeval_mt_wildchat_unused_withRewrite` | 1774 | 0.527 | 0.544 |
| `ifeval_mt_ood_wildchat_unused_withRewrite` | 1387 | 0.358 | 0.360 |
| `ifeval_ood` | 300 | 0.213 | 0.240 |
| `ifbench` (suite average) | — | — | 0.381 |
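
As a quick sanity check on the table, the sketch below recomputes the suite average and the per-sub-task gap between the two harnesses. The numbers are copied from the table. The helper names (`macro_average`, `weighted_average`, `deltas`) are hypothetical and are not part of olmo-eval or oe-eval-internal. The reported 0.381 is consistent with an unweighted mean over the three sub-tasks; whether the suite actually aggregates that way is an assumption drawn from the arithmetic, not from the code.

```python
# Scores copied from the results table (Qwen3-4B, prompt_level_loose_acc).
# "reference" is the oe-eval-internal column, "this_pr" is the olmo-eval column.
RESULTS = {
    "ifeval_mt_wildchat_unused_withRewrite": {"n": 1774, "reference": 0.527, "this_pr": 0.544},
    "ifeval_mt_ood_wildchat_unused_withRewrite": {"n": 1387, "reference": 0.358, "this_pr": 0.360},
    "ifeval_ood": {"n": 300, "reference": 0.213, "this_pr": 0.240},
}


def macro_average(results, key="this_pr"):
    """Unweighted mean over sub-tasks (assumed aggregation, see lead-in)."""
    return sum(row[key] for row in results.values()) / len(results)


def weighted_average(results, key="this_pr"):
    """Mean weighted by the number of prompts n, shown only for contrast."""
    total = sum(row["n"] for row in results.values())
    return sum(row["n"] * row[key] for row in results.values()) / total


def deltas(results):
    """Per-sub-task difference: this PR minus the oe-eval-internal reference."""
    return {
        name: round(row["this_pr"] - row["reference"], 3)
        for name, row in results.items()
    }


if __name__ == "__main__":
    print(f"macro average:    {macro_average(RESULTS):.3f}")
    print(f"weighted average: {weighted_average(RESULTS):.3f}")
    for name, delta in deltas(RESULTS).items():
        print(f"{name}: {delta:+.3f}")
```

The unweighted mean rounds to 0.381, matching the table, while the n-weighted mean rounds to 0.444 and does not. The per-sub-task gaps are +0.017, +0.002 and +0.027, with the largest on `ifeval_ood`, the smallest sub-task at n = 300.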

Beaker run: https://beaker.org/ex/01KQTMJNA0YMFJ3AVSSAVG4G1A

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1e850b7565

undfined

looks good!

finbarrtimbers added 3 commits May 4, 2026 14:51

added ifbench

6c00d58

Add missing IFBench scorer, metrics, task and tests Co-Authored-By: C…

bb07320

…laude Opus 4.7 <noreply@anthropic.com>

Align ifbench with oe-eval-internal: ifeval_ood + ifeval_mt suite, ve…

1e850b7

…ndored IFEval registry Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

chatgpt-codex-connector Bot reviewed May 5, 2026

Comment thread src/olmo_eval/common/scorers/ifeval.py

finbarrtimbers requested a review from ValentinaPy May 11, 2026 20:52

finbarrtimbers added 2 commits May 13, 2026 08:40

Replace vendored IFBench with ifbench package dependency Co-Authored-…

2f79d28

…By: Claude Opus 4.7 <noreply@anthropic.com>

Fix prompt_to_repeat fallback in IFEval scorer Co-Authored-By: Claude…

62985dc

… Opus 4.7 <noreply@anthropic.com>

finbarrtimbers requested review from undfined and removed request for ValentinaPy May 13, 2026 14:54

undfined approved these changes May 13, 2026

Comment thread src/olmo_eval/common/scorers/ifeval.py Outdated

finbarrtimbers added 3 commits May 14, 2026 09:07

Make ifbench import lazy in IFEvalScorer Co-Authored-By: Claude Opus …

091b652

…4.7 <noreply@anthropic.com>

Make Beaker budget optional; fall back to workspace's bound budget Co…

a75363c

…-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Revert "Make Beaker budget optional; fall back to workspace's bound b…

7245874

…udget Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>" This reverts commit a75363c.

finbarrtimbers merged commit f7f808a into main May 14, 2026
4 checks passed

finbarrtimbers deleted the finbarr/ifbench branch May 14, 2026 15:27

finbarrtimbers merged 8 commits into main from finbarr/ifbench
