Conversation

@davidheineman (Member) commented Apr 2, 2025

Our in-loop evals subtract 1 from the denominator when normalizing by continuation length.

This means our current in-loop evals do not match the OLMES standard used in our downstream evals ([the downstream normalization is here](https://github.com/allenai/oe-eval-internal/blob/main/oe_eval/metrics/metric.py#L229-L231)). It appears the -1 term is an artifact from when we originally implemented in-loop evals.

This PR matches the in-loop normalization to the downstream normalization.
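For concreteness, a minimal sketch of the difference. The function names and the exact bits-per-byte formulation are illustrative assumptions, not the in-loop eval API; the only detail taken from this PR is the `(length - 1)` vs. `length` denominator:

```python
import math

# Hypothetical helpers; the real code lives in the in-loop eval metrics and in
# oe_eval/metrics/metric.py. Only the (num - 1) vs. num denominator comes from
# this PR, the rest is a standard formulation.

def logprob_per_char(logprob: float, num_chars: int, broken: bool = False) -> float:
    """Length-normalize a continuation's summed log-likelihood by characters."""
    denom = (num_chars - 1) if broken else num_chars  # the -1 is the bug being removed
    return logprob / denom

def bits_per_byte(logprob: float, num_bytes: int, broken: bool = False) -> float:
    """Bits-per-byte of the continuation (lower is better)."""
    denom = (num_bytes - 1) if broken else num_bytes
    return -logprob / (denom * math.log(2))
```

Because the broken variant divides by a smaller number, its bits-per-byte comes out higher, which is consistent with the sanity check below.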

Sanity Check

I have a gist comparing the in-loop vs. downstream evals on `/oe-training-default/ai2-llm/checkpoints/mayeec/olmo-cookbook-1b-5xC-dclm-baseline-natural-9a234fde/step53971`. For the downstream task `arc_challenge:rc::olmes:full`, this should match the in-loop key `arc_challenge_test_rc_5shot`:

https://gist.github.com/davidheineman/737ae4f50da7fe74c4e66d165025c68c:

```
(base) dhei»dhei-mbp ~/ai2 ⋈ python sanity_check.py
========== Bits-per-byte ==========
Downstream eval:       0.8809
In-loop eval (fixed):  0.8811
In-loop eval (broken): 0.9653

========== Accuracy per char ==========
Downstream eval:       0.3643
In-loop eval (fixed):  0.3643
In-loop eval (broken): 0.3601
Instance-level diff from downstream: broken=9.0, fixed=6.0
```
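The small accuracy-per-char gap comes from ranking: dividing by `(length - 1)` penalizes shorter continuations relatively more, so the argmax answer can flip on a handful of instances. A toy illustration (the numbers are made up for the example):

```python
# One instance with two candidate answers: (summed log-likelihood, length in chars).
candidates = {"A": (-18.5, 20), "B": (-9.0, 10)}

def pick(broken: bool) -> str:
    """Return the answer with the highest length-normalized log-likelihood."""
    denom = (lambda n: n - 1) if broken else (lambda n: n)
    return max(candidates, key=lambda k: candidates[k][0] / denom(candidates[k][1]))

print(pick(broken=False))  # B: -9.0/10 = -0.900 beats -18.5/20 = -0.925
print(pick(broken=True))   # A: -18.5/19 ≈ -0.974 beats -9.0/9  = -1.000
```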

@kyleclo left a comment


LGTM; can you update the changelog and bump the version in this PR too?

@davidheineman merged commit 7363946 into main Apr 3, 2025
8 checks passed
@davidheineman deleted the in-loop-norm-fix branch April 3, 2025 16:29
davidheineman added a commit to allenai/OLMo-core that referenced this pull request Apr 4, 2025
Merge and update in-loop evals to match allenai/OLMo-in-loop-evals#6

Description from the `ai2-olmo-eval==0.7.1` PR:
> our current in-loop does not match the OLMES standard (in our downstream evals), [the downstream normalization is here](https://github.com/allenai/oe-eval-internal/blob/main/oe_eval/metrics/metric.py#L229-L231). It appears the -1 term is an artifact from when we originally implemented in-loop evals.

This will keep the original version of the metrics and add a version with the fixed normalization (`_v2`).

---------

Co-authored-by: Dirk Groeneveld <[email protected]>