
Commit 7104eee

soluwalana, chtruong814, and github-advanced-security[bot]
authored and committed
feat - GPTSFTChatDataset alignment with OpenAI Messages, compatibility with packed sequences (NVIDIA-NeMo#13367)
* feat - GPTSFTChatDataset alignment with OpenAI Messages, compatibility with packed sequences
* Apply isort and black reformatting
* Update test case
* Fix f-string lint issue
* Fix lint line length
* More lint fixes
* Add missing parameters
* Don't download the model every time
* Code lint
* Roll back change on whitespace in the sentencepiece tokenizer
* Ensure all loss masks are labeled as loss_mask; update tests
* Do not overwrite chat_template by default
* Can't relative-import test function
* Remove duplicate function left in for rebase only
* Label the loss mask as loss_mask (as there is an attention mask as well)
* Default use_hf_tokenzier_chat_template = True
* Add original build_samples_mapping back in
* Incorporate PR feedback
* Fix tests
* Skip eval unit test (NVIDIA-NeMo#13635)
* Map directly to the NeMo tokenizer path
* Roll back change pointing to image cache
* Potential fix for code scanning alert no. 14984: unused import
* Correct path in TestData

---------

Signed-off-by: Sam O <[email protected]>
Signed-off-by: Sam Oluwalana <[email protected]>
Signed-off-by: soluwalana <[email protected]>
Signed-off-by: Charlie Truong <[email protected]>
Signed-off-by: jianbinc <[email protected]>
Co-authored-by: soluwalana <[email protected]>
Co-authored-by: Charlie Truong <[email protected]>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
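For context on the commit title: "OpenAI Messages" refers to conversations expressed as a list of `{"role": ..., "content": ...}` dicts, which a chat template flattens into a single training/prompt string. The sketch below is illustrative only (the helper name and the `<|role|>` delimiters are mine, not from this commit); it approximates in plain Python what a Jinja chat template does via `apply_chat_template`.

```python
from typing import Dict, List


def render_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    # Each message contributes a role-tagged segment; a real HF chat
    # template would render this via Jinja instead of f-strings.
    parts = [f"<|{m['role']}|>\n{m['content']}\n" for m in messages]
    if add_generation_prompt:
        # Trailing assistant tag cues the model to generate the reply.
        parts.append("<|assistant|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
print(render_chat(messages))
```

The same message list, fed through different templates, yields different token streams, which is why the dataset needs to agree with the tokenizer's template.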
1 parent 9eca236 commit 7104eee

File tree

16 files changed: +1488, -500 lines


nemo/collections/common/tokenizers/huggingface/auto_tokenizer.py

Lines changed: 14 additions & 2 deletions
@@ -47,6 +47,7 @@ def __init__(
         use_fast: Optional[bool] = True,
         trust_remote_code: Optional[bool] = False,
         include_special_tokens: bool = False,
+        chat_template: Optional[str] = None,
     ):
         """
         Args:
@@ -68,14 +69,18 @@ def __init__(
             use_fast: whether to use fast HuggingFace tokenizer
             include_special_tokens: when True, converting text to ids will include special tokens / prompt tokens (if
                 any), yielding self.tokenizer(text).input_ids
+            chat_template: the chat template string used to format "messages" via the underlying HF tokenizer's
+                apply_chat_template function
         """
         try:
-            self._initialize_tokenizer(pretrained_model_name, vocab_file, merges_file, use_fast, trust_remote_code)
+            self._initialize_tokenizer(
+                pretrained_model_name, vocab_file, merges_file, use_fast, trust_remote_code, chat_template
+            )
             assert self.tokenizer, "tokenizer not initialized"
         except Exception:
             try:
                 self._initialize_tokenizer(
-                    pretrained_model_name, vocab_file, merges_file, not use_fast, trust_remote_code
+                    pretrained_model_name, vocab_file, merges_file, not use_fast, trust_remote_code, chat_template
                 )
                 assert self.tokenizer, "tokenizer not initialized"
             except Exception as e:
@@ -168,6 +173,7 @@ def _initialize_tokenizer(
         merges_file: Optional[str] = None,
         use_fast: Optional[bool] = False,
         trust_remote_code: Optional[bool] = False,
+        chat_template: Optional[str] = None,
     ):
         # this logic deals with different huggingface tokenizers having different positional args
         if vocab_file is None:
@@ -192,6 +198,12 @@ def _initialize_tokenizer(
             trust_remote_code=trust_remote_code,
         )

+        if chat_template is not None:
+            if getattr(self.tokenizer, 'chat_template', None) is not None:
+                logging.info("You are overwriting tokenizer's chat template, confirm this is intended.")
+            self.tokenizer.chat_template = chat_template
+            self.tokenizer.chat_template_format = "jinja"
+
     @property
     def vocab_size(self):
         """
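The final hunk guards against silently clobbering a tokenizer's built-in chat template: it logs before overriding. Below is a standalone sketch of that guard, using a stub class in place of a real HuggingFace tokenizer (the stub and its template string are mine, for illustration only).

```python
import logging


class TokenizerStub:
    """Stand-in for an HF tokenizer that ships its own chat template."""

    chat_template = "{{ built_in_template }}"


def set_chat_template(tokenizer, chat_template):
    # Mirrors the diff above: if the caller supplies a template and the
    # tokenizer already has one, log a warning, then override and record
    # the template format.
    if chat_template is not None:
        if getattr(tokenizer, "chat_template", None) is not None:
            logging.info("You are overwriting tokenizer's chat template, confirm this is intended.")
        tokenizer.chat_template = chat_template
        tokenizer.chat_template_format = "jinja"


tok = TokenizerStub()
set_chat_template(tok, "{{ messages | join('\\n') }}")
print(tok.chat_template)
```

Passing `chat_template=None` (the default) leaves the tokenizer's own template untouched, which matches the "do not overwrite chat_template by default" commit in the message above.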
