Add LRN efficient GPU implement.#5894

Merged
gongweibao merged 8 commits into PaddlePaddle:develop from gongweibao:lrngpu
Dec 6, 2017
Conversation

@gongweibao (Contributor)

Fix #5066

@gongweibao gongweibao changed the title from "Add effient GPU implement" to "Add LRN efficient GPU implement." Nov 24, 2017
@gongweibao gongweibao requested a review from hedaoyuan November 24, 2017 09:22
template <typename T>
struct LRNFunctor<platform::CPUPlace, T> {
void operator()(const framework::ExecutionContext& ctx,
const framework::Tensor* input, framework::Tensor* out,
Contributor:

For input arguments: const framework::Tensor&

https://google.github.io/styleguide/cppguide.html#Reference_Arguments

Contributor Author:

Done!
Thanks!

const int end = start + n;

auto e_mid = framework::EigenTensor<T, 4>::From(*mid);
e_mid.device(ctx.GetEigenDevice<platform::CPUPlace>()) = e_mid.constant(k);
Contributor:

For the CPU implementation of Eigen, there is no need to use .device().

e_mid.setConstant(k);

Contributor Author:

Done!
Thanks!

Eigen::array<int, 4>({{1, 1, H, W}}));

s.device(ctx.GetEigenDevice<platform::CPUPlace>()) +=
alpha * r.square();
Contributor:

The same as above:

s += alpha * r.square();

Contributor Author:

Done!
Thanks!


auto out_e = framework::EigenVector<T>::Flatten(*out);
out_e.device(ctx.GetEigenDevice<platform::CPUPlace>()) =
x_v * e_mid.reshape(Eigen::DSizes<int, 1>(e_mid.size())).pow(-beta);
Contributor:

The same as above.

Contributor Author:

Done!
Thanks!

void operator()(const framework::ExecutionContext& ctx,
const framework::Tensor* x, const framework::Tensor* out,
const framework::Tensor* mid, framework::Tensor* x_g,
const framework::Tensor* out_g, int N, int C, int H, int W,
Contributor:

For the input arguments, the same as above comments.

Contributor Author:

Done!
Thanks!

T alpha, T beta) {
int img_size = N * H * W;
int block_size = 1024;
int grid_size = (img_size + 1024 - 1) / 1024;
Contributor:

Use block_size instead of the 1024 in line 69:

int grid_size = (img_size + block_size - 1) / block_size;

Contributor Author:

Done!
Thanks!


int input_size = N * H * W * C;
block_size = 1024;
grid_size = (input_size + 1024 - 1) / 1024;
Contributor:

Same as above: use block_size instead of the 1024 in line 79.

Contributor Author:

Done!
Thanks!

}
if (index >= size) {
accum -= in[(index - size) * step] * in[(index - size) * step];
}
Contributor:

In line 41 and line 44, a register can be used to hold the value loaded from global memory first, which avoids accessing global memory multiple times:

       if (index < C) {
         T val = in[index * step];
         accum += val * val;
       }
       if (index >= size) {
         T val = in[(index - size) * step];
         accum -= val * val;
       }


const auto& stream =
reinterpret_cast<const platform::CUDADeviceContext&>(ctx.device_context())
.stream();
Contributor Author:

Done!
Thanks!

int img_size = N * H * W;

int block_size = 1024;
int grid_size = (img_size + 1024 - 1) / 1024;
Contributor:

Same as above.

Contributor Author:

Done!
Thanks!

Contributor @qingqing01 left a comment:

LGTM.

@gongweibao gongweibao merged commit c7e739f into PaddlePaddle:develop Dec 6, 2017
@gongweibao gongweibao deleted the lrngpu branch December 6, 2017 08:38
