
improve performance of DepthwiseConv(NHWC) #31677

Merged
zhangting2020 merged 11 commits into PaddlePaddle:develop from OuyangChao:depthwise_conv
Apr 7, 2021

Conversation

@OuyangChao
Contributor

@OuyangChao OuyangChao commented Mar 16, 2021

PR types

Performance optimization

PR changes

OPs

Describe

Improve the performance of DepthwiseConv with the NHWC data format by replacing the input and output transposes with a single filter transpose and a native NHWC kernel.

Forward of DepthwiseConv(NHWC)

import paddle
import paddle.nn as nn

x_var = paddle.uniform((8, 64, 64, 1024), dtype='float32', min=-1., max=1.)
conv = nn.Conv2D(1024, 1024, (3, 3), stride=1, padding=1, dilation=1, groups=1024, data_format='NHWC')
y_var = conv(x_var)
  • Before: input transpose (NHWC → NCHW) + NCHW kernel + output transpose (NCHW → NHWC)
  • This PR: filter transpose (CHW → HWC) + NHWC kernel
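The two paths can be illustrated with a small NumPy sketch (an illustration of the data layouts, not the CUDA kernels themselves; the function names, stride-1 "same" padding, and NumPy itself are assumptions for brevity). A depthwise convolution computed directly on NHWC data gives the same result as the old path of transposing to NCHW, running an NCHW kernel, and transposing back:

```python
import numpy as np

def depthwise_conv_nhwc(x, w):
    """Stride-1, 'same'-padded depthwise conv directly on NHWC data.
    x: (N, H, W, C); w: (C, KH, KW) with one filter per channel."""
    n, h, wd, c = x.shape
    _, kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw), (0, 0)))
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            # w[:, i, j] has shape (C,) and broadcasts over the last (channel) axis
            out += xp[:, i:i + h, j:j + wd, :] * w[:, i, j]
    return out

def depthwise_conv_via_nchw(x, w):
    """The 'before' path: NHWC -> NCHW, NCHW kernel, NCHW -> NHWC."""
    xt = x.transpose(0, 3, 1, 2)                      # input transpose
    n, c, h, wd = xt.shape
    _, kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(xt, ((0, 0), (0, 0), (ph, ph), (pw, pw)))
    out = np.zeros_like(xt)
    for i in range(kh):
        for j in range(kw):
            out += xp[:, :, i:i + h, j:j + wd] * w[:, i, j][None, :, None, None]
    return out.transpose(0, 2, 3, 1)                  # output transpose

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 5, 4))
w = rng.standard_normal((4, 3, 3))
assert np.allclose(depthwise_conv_nhwc(x, w), depthwise_conv_via_nchw(x, w))
```

The NHWC path avoids the two large activation transposes; only the (much smaller) filter needs rearranging, which is where the speedup below comes from.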

Tested with GeForce GTX Titan X

| id | input_shape (NHWC) | filter_size (CHW) | stride | padding | dilation | groups | before | this PR | improve |
|----|--------------------|-------------------|--------|---------|----------|--------|--------|---------|---------|
| 0  | (8, 64, 64, 1024)  | (1024, 3, 3)      | 1      | 1       | 1        | 1024   | 4.52ms | 2.00ms  | +55.75% |
| 1  | (8, 64, 64, 2048)  | (2048, 3, 3)      | 1      | 1       | 1        | 2048   | 9.21ms | 4.09ms  | +55.59% |
| 2  | (8, 64, 64, 1024)  | (2048, 3, 3)      | 1      | 1       | 1        | 1024   | 9.41ms | 3.65ms  | +61.21% |
| 3  | (8, 64, 64, 1024)  | (1024, 3, 3)      | 2      | 1       | 1        | 1024   | 2.72ms | 0.86ms  | +68.38% |
| 4  | (8, 64, 64, 1024)  | (1024, 5, 5)      | 1      | 1       | 1        | 1024   | 14.47ms | 7.24ms | +49.97% |
| 5  | (8, 256, 256, 64)  | (64, 3, 3)        | 1      | 1       | 1        | 64     | 4.51ms | 2.09ms  | +53.66% |
| 6  | (8, 64, 128, 2048) | (2048, 3, 3)      | 1      | 12      | 12       | 2048   | 17.44ms | 10.65ms | +38.93% |
| 7  | (8, 64, 128, 2048) | (2048, 3, 3)      | 1      | 24      | 24       | 2048   | 17.02ms | 9.27ms  | +45.53% |
| 8  | (8, 64, 128, 2048) | (2048, 3, 3)      | 1      | 36      | 36       | 2048   | 15.91ms | 7.05ms  | +55.69% |

Backward of DepthwiseConv(NHWC)

import paddle
import paddle.nn as nn

x_var = paddle.uniform((8, 64, 64, 1024), dtype='float32', min=-1., max=1.)
x_var.stop_gradient = False
conv = nn.Conv2D(1024, 1024, (3, 3), stride=1, padding=1, dilation=1, groups=1024, data_format='NHWC')
y_var = conv(x_var)
paddle.grad(y_var, x_var)
  • Before: input transpose (NHWC → NCHW) + NCHW kernel + output transpose (NCHW → NHWC)
  • This PR: filter transpose (CHW → HWC) + NHWC kernel
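The input-gradient side of the backward pass can also be sketched in NumPy (illustrative only; the function names and stride-1 "same" padding are assumptions, not the PR's CUDA code). For this configuration the input gradient is a depthwise correlation of the output gradient with the spatially flipped filter, which the sketch verifies against a finite-difference check:

```python
import numpy as np

def dwconv(x, w):
    """Stride-1, 'same'-padded depthwise conv on NHWC data (illustrative)."""
    n, h, wd, c = x.shape
    _, kh, kw = w.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((0, 0), (ph, ph), (pw, pw), (0, 0)))
    out = np.zeros_like(x)
    for i in range(kh):
        for j in range(kw):
            out += xp[:, i:i + h, j:j + wd, :] * w[:, i, j]
    return out

def dwconv_input_grad(dy, w):
    # For stride 1 and 'same' padding, the input gradient is a depthwise
    # correlation of the upstream gradient with the spatially flipped filter.
    return dwconv(dy, w[:, ::-1, ::-1])

rng = np.random.default_rng(1)
x = rng.standard_normal((1, 4, 4, 2))
w = rng.standard_normal((2, 3, 3))
dy = rng.standard_normal(x.shape)        # upstream gradient of loss = sum(out * dy)
dx = dwconv_input_grad(dy, w)

# finite-difference check of one interior element
eps, idx = 1e-6, (0, 2, 2, 1)
xp_, xm_ = x.copy(), x.copy()
xp_[idx] += eps
xm_[idx] -= eps
num = (np.sum(dwconv(xp_, w) * dy) - np.sum(dwconv(xm_, w) * dy)) / (2 * eps)
assert abs(num - dx[idx]) < 1e-4
```

Since the backward pass runs both a filter-gradient and an input-gradient kernel over the full activations, removing the activation transposes helps here as well, though the relative gains below are smaller than in forward.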

Tested with GeForce GTX Titan X

| id | input_shape (NHWC) | filter_size (CHW) | stride | padding | dilation | groups | before | this PR | improve |
|----|--------------------|-------------------|--------|---------|----------|--------|--------|---------|---------|
| 0  | (8, 64, 64, 1024)  | (1024, 3, 3)      | 1      | 1       | 1        | 1024   | 9.11ms | 6.03ms  | +33.81% |
| 1  | (8, 64, 64, 2048)  | (2048, 3, 3)      | 1      | 1       | 1        | 2048   | 18.48ms | 11.90ms | +35.61% |
| 2  | (8, 64, 64, 1024)  | (2048, 3, 3)      | 1      | 1       | 1        | 1024   | 29.35ms | 14.29ms | +51.31% |
| 3  | (8, 64, 64, 1024)  | (1024, 3, 3)      | 2      | 1       | 1        | 1024   | 5.74ms | 3.33ms  | +41.99% |
| 4  | (8, 64, 64, 1024)  | (1024, 5, 5)      | 1      | 1       | 1        | 1024   | 21.47ms | 18.20ms | +15.23% |
| 5  | (8, 256, 256, 64)  | (64, 3, 3)        | 1      | 1       | 1        | 64     | 8.81ms | 6.33ms  | +28.15% |
| 6  | (8, 64, 128, 2048) | (2048, 3, 3)      | 1      | 12      | 12       | 2048   | 34.35ms | 21.86ms | +36.36% |
| 7  | (8, 64, 128, 2048) | (2048, 3, 3)      | 1      | 24      | 24       | 2048   | 33.46ms | 19.09ms | +42.95% |
| 8  | (8, 64, 128, 2048) | (2048, 3, 3)      | 1      | 36      | 36       | 2048   | 31.62ms | 17.00ms | +46.24% |

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@CLAassistant

CLAassistant commented Mar 16, 2021

CLA assistant check
All committers have signed the CLA.

@OuyangChao OuyangChao changed the title from "improve performance of DepthwiseConv(NWHC)" to "improve performance of DepthwiseConv(NHWC)" on Mar 18, 2021
@zhangting2020
Contributor

zhangting2020 commented Mar 31, 2021

Tested the above cases on V100:

  • before:

    • forward: KernelDepthwiseConvSp + 2 * TilingSwapDim1And2
    • backward: KernelDepthwiseConvFilterGradSp + KernelDepthwiseConvInputGradSp + 2 * TilingSwapDim1And2
    • (profiler timeline screenshot)
  • after:

    • forward: KernelDepthwiseConvSp + TransposeNormalKernel
    • backward: KernelDepthwiseConvFilterGradSp + KernelDepthwiseConvInputGradSp + 2 * TransposeNormalKernel
    • (profiler timeline screenshot)

Contributor

The original code here seems to cause an error when input_channels is not equal to output_channels. We will add a case to the unit tests.

Contributor Author

Yes, it should be input_channels here.

Contributor

Could you describe why this change was made?

Contributor Author
@OuyangChao OuyangChao commented Apr 1, 2021

To improve gld_efficiency, filter_data is transposed from CHW to HWC in this PR. So the weight pointer for (h_f, w_f, c_out) should be const T* weight = filter_data + weight_offset * output_channels + c_out, where weight_offset equals h_f * filter_width + w_f.
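This indexing can be checked with a small NumPy sketch (illustrative only; the variable names mirror the kernel identifiers, and NumPy stands in for the raw device pointer). After the CHW → HWC transpose, all channels for a given filter tap are contiguous, so the element for (h_f, w_f, c_out) sits at weight_offset * output_channels + c_out in the flat buffer:

```python
import numpy as np

output_channels, filter_height, filter_width = 4, 3, 3
filter_chw = np.arange(output_channels * filter_height * filter_width,
                       dtype=np.float32).reshape(output_channels,
                                                 filter_height, filter_width)

# the transpose done by this PR: CHW -> HWC, so channels are contiguous
filter_hwc = np.ascontiguousarray(filter_chw.transpose(1, 2, 0))
filter_data = filter_hwc.ravel()

# index as in the comment above: weight_offset = h_f * filter_width + w_f
h_f, w_f, c_out = 2, 1, 3
weight_offset = h_f * filter_width + w_f
assert filter_data[weight_offset * output_channels + c_out] == filter_chw[c_out, h_f, w_f]
```

With this layout, adjacent threads handling consecutive c_out values read consecutive addresses, which is what raises gld_efficiency (coalesced global loads) on NHWC data.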

Contributor

@luotao1 luotao1 left a comment


LGTM

@zhangting2020 zhangting2020 merged commit 363b25a into PaddlePaddle:develop Apr 7, 2021
