Fix aliased list references in _tokenize_fn #324
Open
Mr-Neutr0n wants to merge 1 commit into tatsu-lab:main
`input_ids = labels = [...]` assigns both variables to the same list object, so mutating one silently mutates the other. The same applies to `input_ids_lens = labels_lens = [...]`. The current call-site in `preprocess()` masks this with a `copy.deepcopy`, but the bug is latent: any future caller that reads `labels` from the returned dict will get the exact same object as `input_ids`, leading to silent data corruption when the label-masking step overwrites source tokens with IGNORE_INDEX. Build separate lists for `input_ids`/`labels` and `input_ids_lens`/`labels_lens` so each is an independent object.
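A minimal sketch of the "separate lists" approach described above, not the actual patch. The function name `tokenize_independent` and the `toy_encode` helper are hypothetical stand-ins (plain Python lists replace tokenizer output), and `-100` is assumed as the `IGNORE_INDEX` value for illustration. The sketch also copies each inner sequence, which may go further than the change in this PR.

```python
from typing import Callable, Dict, List, Sequence

IGNORE_INDEX = -100  # assumed sentinel value, for illustration only


def tokenize_independent(
    strings: Sequence[str],
    encode: Callable[[str], List[int]],
) -> Dict[str, list]:
    """Hypothetical stand-in for the fixed function: returns four distinct lists."""
    encoded = [encode(s) for s in strings]
    # Each comprehension builds its own list object, so nothing is aliased.
    input_ids = [list(ids) for ids in encoded]
    labels = [list(ids) for ids in encoded]
    input_ids_lens = [len(ids) for ids in encoded]
    labels_lens = [len(ids) for ids in encoded]
    return dict(
        input_ids=input_ids,
        labels=labels,
        input_ids_lens=input_ids_lens,
        labels_lens=labels_lens,
    )


def toy_encode(text: str) -> List[int]:
    """Toy encoder (hypothetical): one integer per character."""
    return [ord(ch) for ch in text]


out = tokenize_independent(["ab", "cde"], toy_encode)
out["labels"][0][0] = IGNORE_INDEX  # mask a label token in place
print(out["input_ids"][0])  # the source tokens are unaffected
```

With independent lists, masking `labels` in place leaves `input_ids` untouched even when the caller skips `copy.deepcopy`.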
Summary
- `input_ids = labels = [...]` in `_tokenize_fn` assigns both variables to the same list object, so mutating one silently mutates the other. The same applies to `input_ids_lens = labels_lens = [...]`.
- `preprocess()` masks this with a `copy.deepcopy`, but the bug is latent: any future caller that reads `labels` from the returned dict will get the exact same object as `input_ids`, leading to silent data corruption when the label-masking step overwrites source tokens with `IGNORE_INDEX`.
- Build separate lists for `input_ids`/`labels` and `input_ids_lens`/`labels_lens` so each is an independent object.

Repro
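A minimal illustration of the aliasing, using plain Python lists as stand-ins for tokenized sequences. The variable names, the values, and the `-100` sentinel are assumptions chosen for illustration.

```python
IGNORE_INDEX = -100  # assumed sentinel value, for illustration only

# Chained assignment binds both names to the same list object.
a = b = [[1, 2, 3], [4, 5, 6]]
aliased = a is b

# Replacing an element through one name is visible through the other.
b[0] = [IGNORE_INDEX, IGNORE_INDEX, 3]
seen_through_a = a[0]

print(aliased)         # True
print(seen_through_a)  # [-100, -100, 3]
```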
The same thing happens in `_tokenize_fn` with `input_ids` and `labels`.

Test plan