[Tokenizer] Add chat template #8226
Conversation
Thanks for your contribution!

Remove
Codecov Report — Attention: Patch coverage is

```
@@            Coverage Diff             @@
##           develop    #8226      +/-  ##
===========================================
- Coverage    55.37%   55.36%    -0.01%
===========================================
  Files          613      614        +1
  Lines        95870    95412      -458
===========================================
- Hits         53084    52824      -260
+ Misses       42786    42588      -198
```

View full report in Codecov by Sentry.
wj-Mcat left a comment:
You may also need to do the following:
- Update tokenizer_config.json for every tokenizer that supports chat-template.
- Test the existing multi-turn conversation training and inference flows to make sure the whole pipeline still works correctly.
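For reference, an HF-style `chat_template` entry in tokenizer_config.json might look like this (the Jinja template below is a generic illustration, not the one from this PR):

```json
{
  "chat_template": "{% for message in messages %}{{ message['role'] }}: {{ message['content'] }}\n{% endfor %}"
}
```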
```python
) -> str | dict[str, numpy.ndarray | paddle.Tensor]:
    if isinstance(conversation, str):
        conversations = [{"role": "user", "content": conversation}]
    elif isinstance(conversation, list):
```
Also, please test whether both the old and the new chat_template behave as expected in Predictor, and verify that gradio_ui can use both the old and the new chat_template.
```python
    result["conversations"] = conversation_ids
    return result


def _encode_chat_inputs(
```
Once decoupled from the previously designed training-and-inference-unified ChatTemplate, this function has very limited applicability and is essentially unusable on its own.
So I would advise against putting the encode_chat_inputs logic inside the tokenizer; move it into the preprocessing step instead.
That means the changes here could end up being fairly large.
Considering that the encode_chat_inputs function is currently widely used, removing it could have a fairly large impact. Could we consider the following strategy instead:
- By default, split tgt/src so that src does not contain the bot start token, i.e. tgt contains the complete user turn plus the bot start token.
- If a tokenizer needs different behavior, override it in that tokenizer class, e.g. for qwen.
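Under one reading of that default, the split could be sketched roughly as follows (a minimal sketch; the helper name and the `<|assistant|>` token string are hypothetical placeholders, not values from this PR):

```python
def split_src_tgt(rendered: str, bot_start_token: str = "<|assistant|>"):
    """Split one rendered dialogue round into (src, tgt) so that src
    does not contain the bot start token; everything from the bot
    start token onward goes into tgt.

    NOTE: bot_start_token is a hypothetical placeholder, not the
    actual token used by any particular tokenizer in the PR.
    """
    idx = rendered.find(bot_start_token)
    if idx == -1:
        # No bot start token found: treat the whole string as src.
        return rendered, ""
    return rendered[:idx], rendered[idx:]


src, tgt = split_src_tgt("<|user|>hi<|assistant|>hello")
```

A tokenizer that needs a different boundary (e.g. qwen) would then override only this splitting step in its own class.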
PR types
Function optimization
PR changes
APIs
Description
Add `chat_template` in the config file to load the template.
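To illustrate the loading flow, here is a minimal, self-contained sketch: a `chat_template` string is read from a config dict and used to render a conversation. Real templates are Jinja strings rendered by the tokenizer; the simple `{role}: {content}` format below is a stand-in just to show the idea, not the PR's implementation.

```python
import json

# Hypothetical minimal config; real tokenizer_config.json files carry a
# Jinja chat_template string rather than a str.format pattern.
config = json.loads('{"chat_template": "{role}: {content}\\n"}')
template = config["chat_template"]

conversation = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]

# Render each turn with the loaded template and concatenate.
rendered = "".join(template.format(**turn) for turn in conversation)
```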