[Tokenizer] Add chat template #8226
Conversation
Thanks for your contribution!

Remove
Codecov Report — Attention: Patch coverage is

```
@@            Coverage Diff             @@
##           develop    #8226      +/-  ##
===========================================
- Coverage    55.37%   55.36%    -0.01%
===========================================
  Files          613      614        +1
  Lines        95870    95412      -458
===========================================
- Hits         53084    52824      -260
+ Misses       42786    42588      -198
```

View full report in Codecov by Sentry.
wj-Mcat left a comment:
You may also need to do the following:
- Update tokenizer_config.json for every tokenizer that supports chat-template.
- Test the existing multi-turn conversation training and inference flows to make sure the whole pipeline still works correctly.
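For reference, an HF-style `chat_template` entry in tokenizer_config.json might look like this (the Jinja template below is a generic illustration, not the one from this PR):

```json
{
  "chat_template": "{% for message in messages %}{{ message['role'] }}: {{ message['content'] }}\n{% endfor %}"
}
```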
```python
) -> str | dict[str, numpy.ndarray | paddle.Tensor]:
    if isinstance(conversation, str):
        conversations = [{"role": "user", "content": conversation}]
    elif isinstance(conversation, list):
```
Also, please test whether both the old and the new chat_template behave as expected in Predictor, and verify that gradio_ui can use both the old and the new chat_template.
```python
    result["conversations"] = conversation_ids
    return result


def _encode_chat_inputs(
```
Once decoupled from the previously designed training-and-inference-unified ChatTemplate, this function has very limited applicability and is essentially unusable on its own.
So I would advise against putting the encode_chat_inputs logic inside the tokenizer; move it into the preprocessing step instead.
That means the changes here could end up being fairly large.
Considering that the encode_chat_inputs function is currently widely used, removing it could have a fairly large impact. Could we consider the following strategy instead:
- By default, split tgt/src so that src does not contain the bot start token, i.e. tgt contains the complete user turn plus the bot start token.
- If a tokenizer needs different behavior, override it in that tokenizer class, e.g. for qwen.
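Under one reading of that default, the split could be sketched roughly as follows (a minimal sketch; the helper name and the `<|assistant|>` token string are hypothetical placeholders, not values from this PR):

```python
def split_src_tgt(rendered: str, bot_start_token: str = "<|assistant|>"):
    """Split one rendered dialogue round into (src, tgt) so that src
    does not contain the bot start token; everything from the bot
    start token onward goes into tgt.

    NOTE: bot_start_token is a hypothetical placeholder, not the
    actual token used by any particular tokenizer in the PR.
    """
    idx = rendered.find(bot_start_token)
    if idx == -1:
        # No bot start token found: treat the whole string as src.
        return rendered, ""
    return rendered[:idx], rendered[idx:]


src, tgt = split_src_tgt("<|user|>hi<|assistant|>hello")
```

A tokenizer that needs a different boundary (e.g. qwen) would then override only this splitting step in its own class.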
PR types
Function optimization
PR changes
APIs
Description
Add `chat_template` in the config file to load the template.
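To illustrate the loading flow, here is a minimal, self-contained sketch: a `chat_template` string is read from a config dict and used to render a conversation. Real templates are Jinja strings rendered by the tokenizer; the simple `{role}: {content}` format below is a stand-in just to show the idea, not the PR's implementation.

```python
import json

# Hypothetical minimal config; real tokenizer_config.json files carry a
# Jinja chat_template string rather than a str.format pattern.
config = json.loads('{"chat_template": "{role}: {content}\\n"}')
template = config["chat_template"]

conversation = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]

# Render each turn with the loaded template and concatenate.
rendered = "".join(template.format(**turn) for turn in conversation)
```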