repair npu matmulv2_grad and comm_init_hccl #33719

Baibaifan · 2021-06-22T06:57:58Z

PR types

Bug fixes

PR changes

OPs

Describe

1. Repair the NPU matmul_v2 grad op so that it supports two 3-D inputs producing a 2-D gradient (3*3->2), and add a unit test.
2. Repair the NPU comm_init_hccl op by sending fake data to build the connection.

Precision of matmul_v2 grad on NPU and GPU in fp16 over 5 epochs.

npu:

gpu:

paddle-bot-old · 2021-06-22T06:58:02Z

Thanks for your contribution!
Please wait for the CI result first. See Paddle CI Manual for details.

pangyoki · 2021-06-23T02:53:35Z

paddle/fluid/operators/matmul_v2_op_npu.cc

-              NpuOpRunner("BatchMatMul", {*x, *dout}, {*dy},
-                          {{"adj_x1", true}, {"adj_x2", false}});
-          runner_dy.Run(stream);
+          if ((x->dims().size() == 3) && (dout->dims().size() == 3) &&


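When x has 3 dims and y has 2 dims, is the forward pass also unable to use BatchMatMul?

The forward pass can use it. The dimension check is here because the output is 2-D while both inputs are 3-D, so they need to be converted first.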

pangyoki

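With the fp16 type, the matmul v2 op may receive a 3-D input multiplied by a 2-D input. The BatchMatMul NPU op does not support this case for the fp32 type.
At present the fp32 data type is not used with 3-D times 2-D inputs, so fp32 support has not been added for now.
Handling of this case for the fp32 type needs to be added later.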
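
To make the shape conversion discussed above concrete, here is a minimal CPU-only sketch of the math. It assumes Out = X @ Y with X of shape [B, M, K], Y of shape [K, N], and dOut of shape [B, M, N]. The function name `MatmulGradY3Dx2D` and the flat row-major layout are illustrative assumptions, not code from this PR. The idea is that folding the batch dimension into the row dimension turns the two 3-D inputs into 2-D matrices, so a single transposed matrix product yields the 2-D dY.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): gradient of Y for Out = X @ Y,
// where X is [B, M, K], Y is [K, N] and dOut is [B, M, N], all stored
// row-major in flat vectors.
// Because X and dOut are contiguous, viewing them as [B*M, K] and [B*M, N]
// needs no data movement; dY = X2d^T @ dOut2d then has shape [K, N].
std::vector<float> MatmulGradY3Dx2D(const std::vector<float>& x,
                                    const std::vector<float>& dout,
                                    std::size_t B, std::size_t M,
                                    std::size_t K, std::size_t N) {
  assert(x.size() == B * M * K);
  assert(dout.size() == B * M * N);
  const std::size_t rows = B * M;  // batch and row dims folded together
  std::vector<float> dy(K * N, 0.0f);
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t k = 0; k < K; ++k) {
      const float xv = x[r * K + k];
      for (std::size_t n = 0; n < N; ++n) {
        // Accumulate X2d^T[k, r] * dOut2d[r, n] into dY[k, n].
        dy[k * N + n] += xv * dout[r * N + n];
      }
    }
  }
  return dy;
}
```

On the NPU the same reshaping would presumably let a plain 2-D matmul with the first operand transposed replace BatchMatMul for this case, but that mapping is an assumption here rather than something the thread confirms.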

wanghuancoder

LGTM for unittest.skipIf

void-main · 2021-06-23T06:45:47Z

paddle/fluid/operators/collective/c_comm_init_hccl_op.cc

+
+    //  Build comm
+    float* buff;
+    int32_t size = 20;


Why is this 20?

It is only used for initialization.

void-main · 2021-06-23T06:46:43Z

paddle/fluid/operators/collective/c_comm_init_hccl_op.cc

+    for (int32_t idx = 0; idx < size; idx++) {
+      input[idx] = 1.0;
+    }
+    aclrtMalloc(reinterpret_cast<void**>(&buff), size * sizeof(float),


Calls like this need to be checked for success; they should be wrapped with ACLCHECK.

Noted; this will be addressed in the next PR.
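
As an illustration of the reviewer's suggestion, the sketch below shows the general pattern of wrapping a status-returning call in a check macro. Everything here is a stand-in: `FakeDeviceMalloc`, `CHECK_STATUS`, and `AllocOnes` are hypothetical names, host `malloc` replaces the device allocation, and the real ACLCHECK macro and `aclrtMalloc` signature are not reproduced.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a device runtime call that returns 0 on success.
int FakeDeviceMalloc(void** ptr, std::size_t bytes) {
  assert(ptr != nullptr);
  if (bytes == 0) return 1;  // reject empty allocations
  *ptr = std::malloc(bytes);
  return *ptr == nullptr ? 2 : 0;
}

// Hypothetical check macro: throws if the wrapped call reports failure.
#define CHECK_STATUS(call)                                            \
  do {                                                                \
    const int status_ = (call);                                       \
    if (status_ != 0) {                                               \
      throw std::runtime_error(std::string(#call) + " failed with " + \
                               std::to_string(status_));              \
    }                                                                 \
  } while (0)

// Allocates `count` floats and fills them with 1.0f, similar in spirit to
// the buffer setup in the diff, but with the allocation result checked.
float* AllocOnes(std::size_t count) {
  void* raw = nullptr;
  CHECK_STATUS(FakeDeviceMalloc(&raw, count * sizeof(float)));
  float* buff = static_cast<float*>(raw);
  for (std::size_t i = 0; i < count; ++i) buff[i] = 1.0f;
  return buff;
}
```

With a wrapper of this kind, a failed allocation surfaces immediately at the call site instead of showing up later as a hang or a corrupted buffer during communicator setup.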

repair npu matmul_grad and comm_init_hccl

04edd14

Baibaifan closed this Jun 22, 2021

Baibaifan reopened this Jun 22, 2021

pangyoki reviewed Jun 23, 2021

pangyoki approved these changes Jun 23, 2021

wanghuancoder approved these changes Jun 23, 2021

void-main reviewed Jun 23, 2021

gongweibao merged commit 9bf00cd into PaddlePaddle:develop Jun 23, 2021

pangyoki mentioned this pull request Jun 24, 2021

change TensorCopy to ShareDataWith in matmul_grad op #33755

Merged

5 participants