fix bias in eval per-char and per-byte normalization #6
Our in-loop evals subtract 1 from the denominator when normalizing by continuation length. This means our current in-loop evals do not match the OLMES standard (used in our downstream evals); the downstream normalization is here. The -1 term appears to be an artifact from when we originally implemented in-loop evals. This PR matches the in-loop normalization to the downstream normalization.
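For concreteness, here is a minimal sketch of the change, assuming a summed continuation log-probability; the function and argument names are illustrative, not the repo's actual API:

```python
def normalized_loglikelihood(logprob: float, continuation: str, mode: str = "char") -> float:
    """Normalize a continuation's summed log-probability by its length."""
    if mode == "char":
        length = len(continuation)
    elif mode == "byte":
        length = len(continuation.encode("utf-8"))
    else:
        raise ValueError(f"unknown normalization mode: {mode}")
    # Previously the in-loop evals divided by (length - 1); OLMES-style
    # downstream evals divide by the full length, so we do the same here.
    return logprob / length
```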
## Sanity Check
I have a gist comparing the in-loop vs. downstream evals on `/oe-training-default/ai2-llm/checkpoints/mayeec/olmo-cookbook-1b-5xC-dclm-baseline-natural-9a234fde/step53971`. For the downstream task `arc_challenge:rc::olmes:full`, this should match the in-loop key `arc_challenge_test_rc_5shot`: https://gist.github.com/davidheineman/737ae4f50da7fe74c4e66d165025c68c
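To reproduce the comparison locally, a sketch along these lines should work (the metric-dump file names are hypothetical; the key names are the ones above):

```python
import json
import math

# Hypothetical file names -- in practice these come from the in-loop
# trainer's metrics dump and the downstream OLMES results file.
with open("in_loop_metrics.json") as f:
    in_loop = json.load(f)
with open("downstream_metrics.json") as f:
    downstream = json.load(f)

# With the -1 removed from the denominator, the two normalized scores
# should agree up to small numerical differences.
assert math.isclose(
    in_loop["arc_challenge_test_rc_5shot"],
    downstream["arc_challenge:rc::olmes:full"],
    rel_tol=1e-3,
)
```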