[Embedding] Add inf-cl in embedding trainer by jie-z-0607 · Pull Request #9673 · PaddlePaddle/PaddleNLP

jie-z-0607 · 2024-12-23T08:29:08Z

PR types

Function optimization

PR changes

Others

Description

Adds inf_cl_loss to embedding training. It effectively reduces GPU memory consumption at very large batch sizes.
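Inf-CL-style losses are generally described as saving memory by walking the query-by-key similarity matrix in tiles instead of materializing it whole. The sketch below illustrates only that general idea in plain Python; it is not the Triton kernel used in this PR, and the names `dense_row_loss`, `tiled_row_loss`, and `tile` are hypothetical. It computes the softmax cross-entropy for one query twice, once over all logits and once tile by tile with a running maximum and a running sum, and checks that both give the same value.

```python
import math


def dense_row_loss(logits, label):
    """Cross-entropy for one query, using every logit at once."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[label]


def tiled_row_loss(logits, label, tile=2):
    """Same loss, but visiting the logits `tile` entries at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    for start in range(0, len(logits), tile):
        chunk = logits[start:start + tile]
        new_max = max(running_max, max(chunk))
        # Rescale the sum accumulated so far to the new maximum,
        # then add the contribution of the current tile.
        running_sum *= math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in chunk)
        running_max = new_max
    return running_max + math.log(running_sum) - logits[label]


logits = [2.0, -1.0, 0.5, 3.0, 1.5]
assert math.isclose(dense_row_loss(logits, 3), tiled_row_loss(logits, 3, tile=2))
```

Only one tile of logits is alive at any time in `tiled_row_loss`, which is why the peak memory of this style of computation does not grow with the number of keys.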

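Testing shows that the inf-cl operator aligns well with the original loss function:

For example, with the data type set to bf16, group_size set to 1, and gradient_accumulation_steps set to 4, the convergence curves of inf_cl_loss and the original contrastive_loss are as follows:

Testing also shows that, at very large batch sizes, the inf-cl operator effectively reduces GPU memory consumption during embedding training:

On 8 A100 (80G) GPUs, with the data type set to bf16, group_size set to 4, and gradient_accumulation_steps set to 4096, the memory usage of inf_cl_loss and the original contrastive_loss compares as follows: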

Configuration	GPU memory usage (per GPU)	Time to complete the first step
Without inf-cl; embedding_negatives_cross_device=True	42238MiB; 42526MiB; 42526MiB; 42470MiB; 42470MiB; 42526MiB; 42526MiB; 42182MiB	48min42s
With inf-cl; embedding_negatives_cross_device=False	29630MiB; 28392MiB; 28372MiB; 28308MiB; 28320MiB; 28384MiB; 28316MiB; 28070MiB	49min56s

On 8 A100 (80G) GPUs, with the data type set to bf16, group_size set to 1, and gradient_accumulation_steps set to 16384 (a total batch_size of 128K), the memory usage of inf_cl_loss and the original contrastive_loss compares as follows:

Configuration	GPU memory usage (per GPU)	Time to complete the first step
Without inf-cl; embedding_negatives_cross_device=True	Out of memory
With inf-cl; embedding_negatives_cross_device=False	46324MiB; 45192MiB; 44926MiB; 45180MiB; 44674MiB; 45022MiB; 45032MiB; 44904MiB	2h23min46s
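The "128K" figure can be sanity-checked with a little arithmetic. The sketch below assumes a per-device batch size of 1, which the PR does not state but which is the value that makes 8 GPUs times 16384 accumulation steps come out at 128K. It also assumes, purely for illustration, that a naive loss would hold one full fp32 query-by-key similarity matrix; how the original contrastive_loss actually allocates memory is not shown in this PR. Both function names are hypothetical.

```python
def global_batch_size(num_gpus, grad_accum_steps, per_device_batch):
    """Number of queries contributing to one optimizer step."""
    return num_gpus * grad_accum_steps * per_device_batch


def dense_similarity_gib(num_queries, group_size, bytes_per_elem=4):
    """Size in GiB of a fully materialized query-by-key matrix (assumed fp32)."""
    num_keys = num_queries * group_size
    return num_queries * num_keys * bytes_per_elem / 2**30


batch = global_batch_size(num_gpus=8, grad_accum_steps=16384, per_device_batch=1)
print(batch)                                      # 131072, i.e. 128 * 1024
print(dense_similarity_gib(batch, group_size=1))  # 64.0
```

Under those assumptions a single dense matrix would already take 64 GiB, before any model weights, activations, or gradients are counted.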

paddle-bot · 2024-12-23T08:29:12Z

Thanks for your contribution!

codecov · 2024-12-23T09:02:15Z

Codecov Report

Attention: Patch coverage is 13.95349% with 37 lines in your changes missing coverage. Please review.

Project coverage is 52.76%. Comparing base (1842d6d) to head (8f55e52).
Report is 268 commits behind head on develop.

Files with missing lines	Patch %	Lines
paddlenlp/transformers/contrastive_loss.py	18.18%	27 Missing ⚠️
paddlenlp/trl/embedding_trainer.py	0.00%	10 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           develop    #9673      +/-   ##
===========================================
- Coverage    53.18%   52.76%   -0.43%     
===========================================
  Files          718      718              
  Lines       113340   112338    -1002     
===========================================
- Hits         60282    59276    -1006     
- Misses       53058    53062       +4


ZHUI · 2024-12-23T09:13:11Z

+__all__ = ["Simple_Inf_cl_loss", "Matryoshka_Inf_cl_loss"]
+
+
+class Simple_Inf_cl_loss(nn.Layer):


Please add some comments.

ZHUI · 2024-12-23T09:15:02Z

 from paddle.base import core
 from paddle.distributed import fleet

+from ops.src.paddlenlp_kernel.triton.inf_cl.inf_cl_loss import (


Suggested change

-from ops.src.paddlenlp_kernel.triton.inf_cl.inf_cl_loss import (
+from paddlenlp_kernel.triton.inf_cl.inf_cl_loss import (

ZHUI · 2024-12-23T09:26:05Z

 from paddle.base import core
 from paddle.distributed import fleet

+from ops.src.paddlenlp_kernel.triton.inf_cl.inf_cl_loss import (


This package is not installed by default, so the import needs to be wrapped in a try/except.

ZHUI · 2024-12-24T08:29:06Z

+        group_size = p_reps.shape[0] // q_reps.shape[0]  # Number of keys per query
+        labels = paddle.arange(q_reps.shape[0], dtype="int64")  # Generate labels for queries
+        labels = labels * group_size  # Adjust labels based on group size
+        loss = cal_inf_loss(q_reps, p_reps, labels=labels, scale=None, head_dim=self.head_dim)
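To make the label computation in the hunk above concrete, here is a plain-Python sketch. It assumes the passages are laid out so that each query's group of `group_size` passages is contiguous and the positive passage comes first in its group, which is what multiplying `arange` by `group_size` implies. `positive_labels` is a hypothetical helper name, not part of the PR.

```python
def positive_labels(num_queries, num_passages):
    """Index of each query's positive passage in the flat passage list."""
    group_size = num_passages // num_queries  # passages per query
    return [i * group_size for i in range(num_queries)]


print(positive_labels(3, 12))  # [0, 4, 8]
```

With 3 queries and 12 passages, every query owns 4 consecutive passages, so the positives sit at positions 0, 4, and 8.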


Please move the import code here, and raise an error directly if the package is missing.

try:
    from paddlenlp_kernel.triton.inf_cl import cal_inf_loss
except ImportError:
    logger.warning(
        "Paddlenlp_kernels are not available, which means the inf_cl loss cannot be used. If you wish to use the inf_cl loss, please follow the instructions in the README.md on the `ops`."
    )
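The pattern the reviewer asks for, deferring the import to the point of use and failing with an actionable message, can be sketched generically as follows. `require_attr` is a hypothetical helper and not part of PaddleNLP; it relies only on `importlib` from the standard library.

```python
import importlib


def require_attr(module_name, attr, hint):
    """Import `attr` from `module_name`, or raise ImportError mentioning `hint`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(f"{module_name} is not available. {hint}") from exc
    return getattr(module, attr)


# Succeeds for a module that is installed:
sqrt = require_attr("math", "sqrt", hint="")
```

In the loss layer this would presumably be called from inside `forward`, for example `require_attr("paddlenlp_kernel.triton.inf_cl", "cal_inf_loss", hint="See the README under ops.")`, so that users who never select the inf_cl loss do not need the optional package.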

add inf-cl in embedding trainer

dd7fc8a

paddle-bot Bot added the contributor label Dec 23, 2024

paddle-bot Bot assigned KB-Ding Dec 23, 2024

ZHUI reviewed Dec 23, 2024

jie-z-0607 added 4 commits December 23, 2024 18:00

add annotations and fix import

3b06655

rename inf_cl_loss and fix warning

e3c55c3

rename simple_inf_cl

6b6a108

Change inf_cl location

d69ac4a

ZHUI reviewed Dec 24, 2024

jie-z-0607 added 2 commits December 24, 2024 16:38

Change import location

f05dd61

Change error information

8f55e52

jie-z-0607 requested a review from ZHUI December 24, 2024 09:00

DesmonDay approved these changes Dec 25, 2024

ZHUI changed the title ~~add inf-cl in embedding trainer~~ [Embedding] Add inf-cl in embedding trainer Dec 25, 2024

ZHUI merged commit 40fa402 into PaddlePaddle:develop Dec 25, 2024

ZHUI merged 7 commits into PaddlePaddle:develop from jie-z-0607:add_inf-cl_in_embedding