
Conversation

@davidheineman (Member) commented on Mar 10, 2025

This PR adds BPB (bits per byte) over the correct continuation for GSM, Minerva, MBPP, and HumanEval.

The BPB task configs use the setup implemented in https://github.com/allenai/oe-eval-internal/pull/374, and the implementation is the same as in allenai/OLMo#777.

Evaluation Setup

The "gold continuation" appears as follows:

Question: Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?
Answer: Janet sells 16 - 3 - 4 = 9 duck eggs a day. She makes 9 * 2 = $18 every day at the farmer’s market. So the answer is 18.
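
For context, BPB here is the model's negative log-likelihood over the gold continuation, converted to bits and normalized by the continuation's length in UTF-8 bytes. Below is a minimal sketch of that computation, assuming the continuation is the answer text scored given the question as a prefix; the function and argument names are illustrative, not the task-config code in this PR or in oe-eval-internal.

```python
import math

def bits_per_byte(token_logprobs: list[float], continuation: str) -> float:
    """Bits-per-byte of a gold continuation (illustrative sketch).

    token_logprobs: natural-log probabilities the model assigns to each token
        of the gold continuation, conditioned on the question prefix.
    continuation: the gold continuation string (e.g. the "Answer: ..." text).
    """
    total_nll_nats = -sum(token_logprobs)          # negative log-likelihood in nats
    total_nll_bits = total_nll_nats / math.log(2)  # convert nats -> bits
    num_bytes = len(continuation.encode("utf-8"))  # normalize by UTF-8 byte count
    return total_nll_bits / num_bytes
```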

Sanity check

I tested this with the in-loop trainer:

torchrun --nproc-per-node=4 src/scripts/train/OLMo2-1B.py train test_run .


@undfined left a comment


lgtm

Nit: do the JSON configs need to be unformatted? If not, formatting them would make them a lot easier to read/scan down the road.

One other nit: we're mixing and matching the codex_ prefix for these tasks here and in oe-eval-internal. It would be nice to be consistent.

@davidheineman (Member, Author)

Thanks for taking a look! Updated mbpp -> codex_mbpp.

Agreed on your first nit, but I won't format the .json files in this PR, since the other .json files are not formatted.

@davidheineman merged commit 2f01ec8 into main on Mar 10, 2025
7 checks passed
@davidheineman deleted the new-gen-tasks branch on March 10, 2025 at 23:22