Commit 7104eee
feat - GPTSFTChatDataset alignment with OpenAI Messages, compatibility with packed sequences (NVIDIA-NeMo#13367)
* feat - GPTSFTChatDataset alignment with OpenAI Messages, compatibility with packed sequences
Signed-off-by: Sam O <[email protected]>
* feat - GPTSFTChatDataset alignment with OpenAI Messages, compatibility with packed sequences
Signed-off-by: Sam Oluwalana <[email protected]>
* Apply isort and black reformatting
Signed-off-by: soluwalana <[email protected]>
* lint
Apply isort and black reformatting
Signed-off-by: soluwalana <[email protected]>
Signed-off-by: Sam Oluwalana <[email protected]>
* update test case
Signed-off-by: Sam Oluwalana <[email protected]>
* F string lint issue
Signed-off-by: Sam Oluwalana <[email protected]>
* Lint line length
Signed-off-by: Sam Oluwalana <[email protected]>
* Apply isort and black reformatting
Signed-off-by: soluwalana <[email protected]>
* Apply isort and black reformatting
Signed-off-by: Sam Oluwalana <[email protected]>
* More lint
Signed-off-by: Sam Oluwalana <[email protected]>
* Apply isort and black reformatting
Signed-off-by: soluwalana <[email protected]>
* More lint
Signed-off-by: Sam Oluwalana <[email protected]>
* Apply isort and black reformatting
Signed-off-by: soluwalana <[email protected]>
* Missing parameters
Signed-off-by: Sam Oluwalana <[email protected]>
* Don't download the model every time
Signed-off-by: Sam Oluwalana <[email protected]>
* Code lint
Signed-off-by: Sam Oluwalana <[email protected]>
* Rollback change on whitespace in sentencepiece tokenizer
Signed-off-by: Sam Oluwalana <[email protected]>
* Ensure all loss_masks are labeled as loss_mask
Update Tests
Signed-off-by: Sam Oluwalana <[email protected]>
do not overwrite chat_template by default
Signed-off-by: Alexandros Koumparoulis <[email protected]>
Can't relative import test function
Signed-off-by: Sam Oluwalana <[email protected]>
* Apply isort and black reformatting
Signed-off-by: soluwalana <[email protected]>
* Remove duplicate function left in for rebase only
Signed-off-by: Sam Oluwalana <[email protected]>
* label loss mask actually loss mask (as there is an attention mask as well)
Signed-off-by: Sam Oluwalana <[email protected]>
* Lint line length
Signed-off-by: Sam Oluwalana <[email protected]>
* Apply isort and black reformatting
Signed-off-by: soluwalana <[email protected]>
* default use_hf_tokenzier_chat_template = True
Signed-off-by: Sam Oluwalana <[email protected]>
* Add original build_samples_mapping back in
Signed-off-by: Sam Oluwalana <[email protected]>
* PR feedback incorporation
Signed-off-by: Sam Oluwalana <[email protected]>
* Apply isort and black reformatting
Signed-off-by: soluwalana <[email protected]>
* PR feedback incorporation
Signed-off-by: Sam Oluwalana <[email protected]>
* Apply isort and black reformatting
Signed-off-by: soluwalana <[email protected]>
* Fix tests
Signed-off-by: Sam O <[email protected]>
* Apply isort and black reformatting
Signed-off-by: soluwalana <[email protected]>
* PR changes
Signed-off-by: Sam Oluwalana <[email protected]>
* Apply isort and black reformatting
Signed-off-by: soluwalana <[email protected]>
* Skip eval unit test (NVIDIA-NeMo#13635)
Signed-off-by: Charlie Truong <[email protected]>
* Map directly to the NeMo tokenizer path
Signed-off-by: Sam Oluwalana <[email protected]>
* Rollback change pointing to image cache
Signed-off-by: Sam Oluwalana <[email protected]>
* Potential fix for code scanning alert no. 14984: Unused import
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: Sam O <[email protected]>
Signed-off-by: Sam Oluwalana <[email protected]>
* correct path in TestData
Signed-off-by: Sam Oluwalana <[email protected]>
* correct path in TestData
Signed-off-by: Sam Oluwalana <[email protected]>
---------
Signed-off-by: Sam O <[email protected]>
Signed-off-by: Sam Oluwalana <[email protected]>
Signed-off-by: soluwalana <[email protected]>
Signed-off-by: Charlie Truong <[email protected]>
Signed-off-by: Sam O <[email protected]>
Co-authored-by: soluwalana <[email protected]>
Co-authored-by: Charlie Truong <[email protected]>
Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
Signed-off-by: jianbinc <[email protected]>
1 parent 9eca236 commit 7104eee
File tree
16 files changed, +1488 -500 lines changed
- nemo
  - collections
    - common/tokenizers
      - huggingface
    - llm/gpt
      - data
      - model
    - nlp
      - models/language_modeling
      - modules/common
      - parts
  - export
  - utils
    - loggers
- tests
  - collections/llm/gpt/data
  - evaluation
Lines changed: 14 additions & 2 deletions
Additions at new lines 50, 72-73, 76-78, 83, 176, and 201-206; deletions at original lines 73 and 78.