Conversation

@davidheineman (Member) commented Apr 2, 2025

Our in-loop evals subtract 1 from the denominator when normalizing by continuation length.

This means our current in-loop evals do not match the OLMES standard used in our downstream evals ([the downstream normalization is here](https://github.com/allenai/oe-eval-internal/blob/main/oe_eval/metrics/metric.py#L229-L231)). It appears the -1 term is an artifact from when we originally implemented in-loop evals.

This PR matches the in-loop normalization to the downstream normalization.
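For concreteness, a minimal sketch of the difference. The function names and the exact bits-per-byte formulation are illustrative assumptions, not the in-loop eval API; the only detail taken from this PR is the `(length - 1)` vs. `length` denominator:

```python
import math

# Hypothetical helpers; the real code lives in the in-loop eval metrics and in
# oe_eval/metrics/metric.py. Only the (num - 1) vs. num denominator comes from
# this PR, the rest is a standard formulation.

def logprob_per_char(logprob: float, num_chars: int, broken: bool = False) -> float:
    """Length-normalize a continuation's summed log-likelihood by characters."""
    denom = (num_chars - 1) if broken else num_chars  # the -1 is the bug being removed
    return logprob / denom

def bits_per_byte(logprob: float, num_bytes: int, broken: bool = False) -> float:
    """Bits-per-byte of the continuation (lower is better)."""
    denom = (num_bytes - 1) if broken else num_bytes
    return -logprob / (denom * math.log(2))
```

Because the broken variant divides by a smaller number, its bits-per-byte comes out higher, which is consistent with the sanity check below.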

Sanity Check

I have a gist comparing the in-loop vs. downstream evals on `/oe-training-default/ai2-llm/checkpoints/mayeec/olmo-cookbook-1b-5xC-dclm-baseline-natural-9a234fde/step53971`. For the downstream task `arc_challenge:rc::olmes:full`, this should match the in-loop key `arc_challenge_test_rc_5shot`:

https://gist.github.com/davidheineman/737ae4f50da7fe74c4e66d165025c68c:

```
(base) dhei»dhei-mbp ~/ai2 ⋈ python sanity_check.py
========== Bits-per-byte ==========
Downstream eval:       0.8809
In-loop eval (fixed):  0.8811
In-loop eval (broken): 0.9653

========== Accuracy per char ==========
Downstream eval:       0.3643
In-loop eval (fixed):  0.3643
In-loop eval (broken): 0.3601
Instance-level diff from downstream: broken=9.0, fixed=6.0
```
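The small accuracy-per-char gap comes from ranking: dividing by `(length - 1)` penalizes shorter continuations relatively more, so the argmax answer can flip on a handful of instances. A toy illustration (the numbers are made up for the example):

```python
# One instance with two candidate answers: (summed log-likelihood, length in chars).
candidates = {"A": (-18.5, 20), "B": (-9.0, 10)}

def pick(broken: bool) -> str:
    """Return the answer with the highest length-normalized log-likelihood."""
    denom = (lambda n: n - 1) if broken else (lambda n: n)
    return max(candidates, key=lambda k: candidates[k][0] / denom(candidates[k][1]))

print(pick(broken=False))  # B: -9.0/10 = -0.900 beats -18.5/20 = -0.925
print(pick(broken=True))   # A: -18.5/19 ≈ -0.974 beats -9.0/9  = -1.000
```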

@kyleclo left a comment


LGTM; can you update the changelog and bump the version in this PR too?

@davidheineman merged commit 7363946 into main Apr 3, 2025
8 checks passed
@davidheineman deleted the in-loop-norm-fix branch April 3, 2025 16:29
davidheineman added a commit to allenai/OLMo-core that referenced this pull request Apr 4, 2025
Merge and update in-loop evals to match allenai/OLMo-in-loop-evals#6

Description from the `ai2-olmo-eval==0.7.1` PR:
> our current in-loop does not match the OLMES standard (in our downstream evals), [the downstream normalization is here](https://github.com/allenai/oe-eval-internal/blob/main/oe_eval/metrics/metric.py#L229-L231). It appears the -1 term is an artifact from when we originally implemented in-loop evals.

This will keep the original version of the metrics and add a version with the fixed normalization (`_v2`).

---------

Co-authored-by: Dirk Groeneveld <[email protected]>