Skip to content

Fix aliased list references in _tokenize_fn#324

Open
Mr-Neutr0n wants to merge 1 commit intotatsu-lab:mainfrom
Mr-Neutr0n:fix/tokenize-fn-aliasing-bug
Open

Fix aliased list references in _tokenize_fn#324
Mr-Neutr0n wants to merge 1 commit intotatsu-lab:mainfrom
Mr-Neutr0n:fix/tokenize-fn-aliasing-bug

Conversation

@Mr-Neutr0n
Copy link
Copy Markdown

Summary

  • input_ids = labels = [...] in _tokenize_fn assigns both variables to the same list object, so mutating one silently mutates the other. The same applies to input_ids_lens = labels_lens = [...].
  • The current call-site in preprocess() masks this with a copy.deepcopy, but the bug is latent: any future caller that reads labels from the returned dict will get the exact same object as input_ids, leading to silent data corruption when the label-masking step overwrites source tokens with IGNORE_INDEX.
  • This PR builds separate lists for input_ids/labels and input_ids_lens/labels_lens so each is an independent object.

Repro

# Demonstrates the aliasing problem:
a = b = [1, 2, 3]
a[0] = 99
print(b)  # [99, 2, 3] — b was mutated too

The same thing happens in _tokenize_fn with input_ids and labels.

Test plan

  • Verified that the fix produces correct, independent lists
  • No functional changes to the training pipeline (only removes a latent aliasing hazard)

`input_ids = labels = [...]` assigns both variables to the same
list object, so mutating one silently mutates the other.  The same
applies to `input_ids_lens = labels_lens = [...]`.

The current call-site in `preprocess()` masks this with a
`copy.deepcopy`, but the bug is latent: any future caller that
reads `labels` from the returned dict will get the exact same
object as `input_ids`, leading to silent data corruption when
the label-masking step overwrites source tokens with IGNORE_INDEX.

Build separate lists for `input_ids`/`labels` and
`input_ids_lens`/`labels_lens` so each is an independent object.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant