Merged
26 commits
8f532b0
Merge pull request #1 from PaddlePaddle/develop
AshburnLee Sep 8, 2020
5b5804d
Merge pull request #2 from PaddlePaddle/develop
AshburnLee Sep 17, 2020
cee2470
Merge pull request #3 from PaddlePaddle/develop
AshburnLee Sep 30, 2020
5be3a45
Merge pull request #4 from PaddlePaddle/develop
AshburnLee Oct 13, 2020
a1d92b7
Merge pull request #5 from PaddlePaddle/develop
AshburnLee Oct 20, 2020
e674a5d
Merge pull request #6 from PaddlePaddle/develop
AshburnLee Nov 15, 2020
855d00b
Merge pull request #7 from PaddlePaddle/develop
AshburnLee Nov 18, 2020
20a37a8
Merge branch 'develop' of https://github.com/PaddlePaddle/paddle into…
AshburnLee Mar 15, 2021
82328a7
temporary PR for log_softmax
AshburnLee Mar 15, 2021
f6ece4d
Logsoftmax formard case#1: axis=-1
AshburnLee Mar 16, 2021
0f56b5e
Merge branch 'develop' of https://github.com/PaddlePaddle/paddle into…
AshburnLee Mar 16, 2021
4d5533b
Changed copyright
AshburnLee Mar 16, 2021
060953b
Made modifications according to PR reviewers
AshburnLee Mar 17, 2021
eb14185
Merge branch 'develop' of https://github.com/PaddlePaddle/paddle into…
AshburnLee Mar 17, 2021
302f08d
Dealt with unittest precision errors
AshburnLee Mar 18, 2021
844b880
Merge branch 'develop' of https://github.com/PaddlePaddle/paddle into…
AshburnLee Mar 18, 2021
26e1850
change launch cinfigure and code style
AshburnLee Mar 23, 2021
66c48ae
Merge branch 'develop' of https://github.com/PaddlePaddle/paddle into…
AshburnLee Mar 23, 2021
f2a2f2e
Removed header file cuda_runtime.h for HIP support
AshburnLee Mar 23, 2021
ab96a80
Merge branch 'develop' of https://github.com/PaddlePaddle/paddle into…
AshburnLee Mar 23, 2021
bf320c7
Modified code according to review comments
AshburnLee Mar 24, 2021
c5404ce
Merge branch 'develop' of https://github.com/PaddlePaddle/paddle into…
AshburnLee Mar 24, 2021
c7d785e
Reply to review comments
AshburnLee Apr 8, 2021
480a52f
Merge branch 'develop' of https://github.com/PaddlePaddle/paddle into…
AshburnLee Apr 8, 2021
0c1aec6
cudaStream_t -> gpuStream_t
AshburnLee Apr 9, 2021
24cd730
Merge branch 'develop' of https://github.com/PaddlePaddle/paddle into…
AshburnLee Apr 9, 2021
218 changes: 211 additions & 7 deletions paddle/fluid/operators/log_softmax_op.cu
@@ -1,4 +1,4 @@
// Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
Contributor

This file was not newly added this year, so the copyright presumably does not need to change.

Contributor Author

Done.

// Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
Expand All @@ -12,15 +12,219 @@
// See the License for the specific language governing permissions and
// limitations under the License.

#include <cuda_runtime.h>
Contributor

cuda_runtime.h cannot be found on HIP. Try deleting this header; the code should still build and run. Alternatively, write:

#ifdef __HIPCC__
#include <hip/hip_runtime.h>
#else
#include <cuda_runtime.h>
#endif

Contributor Author

Done.

#include <cassert>
#include <limits>
#include "paddle/fluid/operators/log_softmax_op.h"
#include "paddle/fluid/platform/cuda_device_function.h"

namespace paddle {
namespace operators {

#define WARP_SIZE 32
int log2_ceil(int value);
Contributor

Apart from setter and getter functions inside classes, function names should all follow the AxxBxx naming style; please take a look at the Google C++ Style Guide.

Contributor Author

Done.
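
The diff only declares log2_ceil; its definition is not shown. A minimal host-side sketch of what such a helper might look like, using the reviewer's naming style, is given here. The name Log2Ceil and the precondition value >= 1 are assumptions for illustration, not the code that was merged.

```cpp
// Hypothetical helper (not taken from the PR): returns the smallest n such
// that (1 << n) >= value. Assumes value >= 1.
inline int Log2Ceil(int value) {
  int log2_value = 0;
  while ((1 << log2_value) < value) {
    ++log2_value;
  }
  return log2_value;
}
```

With this definition, 1 << Log2Ceil(n) is the next power of two at or above n, which is the quantity the launch code uses to pick a kernel instantiation.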


#define LAUNCH_SOFTMAX_WARP_FORWARD(L2E) \
case L2E: \
WarpLogSoftmaxForward<T, L2E><<<blocks, threads, 0>>>( \
dst, src, batch_count, softmax_elements_stride, softmax_elements); \
break;

template <typename T, int WARP_BATCH, int WARP_SIZE_SOFTMAX>
Contributor

  • Use AxxBxx-style camel case for names in templates.
  • WARP_BATCH here presumably means how many batches one warp is responsible for computing, so it might as well be called NumBatchBatchSize directly.

Contributor Author

Done.

__device__ __forceinline__ void warp_reduce_sum(T* sum) {
Contributor

A function name should accurately express what the function does, and it also needs to follow the Google C++ code style: warp_reduce_sum -> BatchWarpReduceSum

Same for the functions below.

Contributor Author

Done.

#pragma unroll
for (int offset = WARP_SIZE_SOFTMAX / 2; offset > 0; offset /= 2) {
#pragma unroll
for (int i = 0; i < WARP_BATCH; ++i) {
T sum_val = platform::CudaShuffleXorSync(0xFFFFFFFF, sum[i], offset);
sum[i] = sum[i] + sum_val;
}
}
}

template <typename T, int WARP_BATCH, int WARP_SIZE_SOFTMAX>
__device__ __forceinline__ void warp_reduce_max(T* sum) {
#pragma unroll
for (int offset = WARP_SIZE_SOFTMAX / 2; offset > 0; offset /= 2) {
#pragma unroll
for (int i = 0; i < WARP_BATCH; ++i) {
T max_val = platform::CudaShuffleXorSync(0xFFFFFFFF, sum[i], offset);
sum[i] = max(sum[i], max_val);
}
}
}
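
The two reductions above exchange values between lanes whose indices differ by offset (an XOR shuffle), halving offset each round. The following host-side sketch simulates that pattern on a plain vector so the data flow can be checked without a GPU. ButterflyReduce is a hypothetical name, and the vector stands in for the lanes of one warp; it does not call platform::CudaShuffleXorSync.

```cpp
#include <vector>

// Host-side simulation of an XOR-shuffle ("butterfly") reduction.
// Assumption: lanes.size() is a power of two, like the kernel's warp width.
template <typename T, typename Op>
std::vector<T> ButterflyReduce(std::vector<T> lanes, Op op) {
  const int width = static_cast<int>(lanes.size());
  for (int offset = width / 2; offset > 0; offset /= 2) {
    // Every lane reads the value held by lane (lane ^ offset).
    std::vector<T> shuffled(width);
    for (int lane = 0; lane < width; ++lane) {
      shuffled[lane] = lanes[lane ^ offset];
    }
    // Every lane combines its own value with the one it received.
    for (int lane = 0; lane < width; ++lane) {
      lanes[lane] = op(lanes[lane], shuffled[lane]);
    }
  }
  return lanes;
}
```

After log2(width) rounds every lane holds the full result, which is why the kernel can read sum[i] or max_value[i] from any thread without a separate broadcast.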

template <typename T, int log2_elements>
__global__ void WarpLogSoftmaxForward(T* dst, const T* src, int batch_size,
int stride, int element_count) {
constexpr int next_power_of_two = 1 << log2_elements;
constexpr int KERNEL_WARP_SIZE =
(next_power_of_two < WARP_SIZE) ? next_power_of_two : WARP_SIZE;
constexpr int WARP_ITERATIONS = next_power_of_two / KERNEL_WARP_SIZE;
constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1;
Contributor

Constant names need to be named consistently and must follow the constant naming rules in the Google C++ Style Guide.

Contributor Author

Done.


int first_batch = (blockDim.y * blockIdx.x + threadIdx.y) * WARP_BATCH;
int local_batches = batch_size - first_batch;
if (local_batches > WARP_BATCH) local_batches = WARP_BATCH;

int local_idx = threadIdx.x;
src += first_batch * stride + local_idx;
dst += first_batch * stride + local_idx;

// 1.load data from global memory
T elements[WARP_BATCH][WARP_ITERATIONS];
int idx = threadIdx.x + blockDim.x * threadIdx.y;

for (int i = 0; i < WARP_BATCH; ++i) {
int batch_element_count = (i >= local_batches) ? 0 : element_count;
for (int it = 0; it < WARP_ITERATIONS; ++it) {
int element_index = local_idx + it * KERNEL_WARP_SIZE;
if (element_index < batch_element_count) {
elements[i][it] = src[i * element_count + it * KERNEL_WARP_SIZE];
} else {
elements[i][it] = -std::numeric_limits<T>::infinity();
}
}
}

// 2.compute max_value
T max_value[WARP_BATCH];
#pragma unroll
for (int i = 0; i < WARP_BATCH; ++i) {
max_value[i] = elements[i][0];
#pragma unroll
for (int it = 1; it < WARP_ITERATIONS; ++it) {
max_value[i] =
(max_value[i] > elements[i][it]) ? max_value[i] : elements[i][it];
}
}
warp_reduce_max<T, WARP_BATCH, KERNEL_WARP_SIZE>(max_value);

T sum[WARP_BATCH]{0.0f};
#pragma unroll
for (int i = 0; i < WARP_BATCH; ++i) {
#pragma unroll
for (int it = 0; it < WARP_ITERATIONS; ++it) {
sum[i] += std::exp(elements[i][it] - max_value[i]);
Contributor

Would there be a problem with float16?

Contributor Author

It is because __shfl_xor_sync and __shfl_xor do not support fp16. It should be possible to handle this.

Contributor Author

Done, this has been handled.

}
}
warp_reduce_sum<T, WARP_BATCH, KERNEL_WARP_SIZE>(sum);

// 3.store result
#pragma unroll
for (int i = 0; i < WARP_BATCH; ++i) {
if (i >= local_batches) break;
Contributor

Write this kind of if statement across separate lines, and always add {}.

Contributor Author

Done.

sum[i] = std::log(sum[i]);
#pragma unroll
for (int it = 0; it < WARP_ITERATIONS; ++it) {
int element_index = local_idx + it * KERNEL_WARP_SIZE;
if (element_index < element_count) {
dst[i * element_count + it * KERNEL_WARP_SIZE] =
elements[i][it] - max_value[i] - sum[i];
} else {
break;
}
}
}
}
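
The kernel above computes x - max(x) - log(sum(exp(x - max(x)))) per row: load, find the maximum, accumulate the shifted exponentials, then store. A sequential reference for one row, of the kind one might compare the kernel against in a unit test, could look like the sketch below. LogSoftmaxReference is a hypothetical name and uses double throughout; it is not part of the PR.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sequential reference for one row of log-softmax (illustrative only).
// Subtracting the row maximum first keeps exp() from overflowing.
std::vector<double> LogSoftmaxReference(const std::vector<double>& x) {
  if (x.empty()) {
    return {};
  }
  const double max_value = *std::max_element(x.begin(), x.end());
  double sum = 0.0;
  for (double v : x) {
    sum += std::exp(v - max_value);
  }
  const double log_sum = std::log(sum);
  std::vector<double> out;
  out.reserve(x.size());
  for (double v : x) {
    out.push_back(v - max_value - log_sum);
  }
  return out;
}
```

Padding out-of-range slots with negative infinity, as the kernel does, is consistent with this formula: such slots contribute exp(-inf) = 0 to the sum and never win the max.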

template <typename T>
Contributor

Template setup:

template <typename T, typename AccT>
void LaunchSoftmaxForwardForLastAxis(....) {
    ...
}

Outer call: LaunchSoftmaxForwardForLastAxis<T, MPTypeTrait<T>::Type>(...), which takes care of double in the template call. For the definition of MPTypeTrait, see:

template <typename T>
class MPTypeTrait {
public:
using Type = T;
};
template <>
class MPTypeTrait<platform::float16> {
public:
using Type = float;
};

Contributor Author

Done. Thanks for the suggested solution!
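
The trait suggested in this thread maps each storage type to the type used for accumulation, so float16 inputs accumulate in float while float and double map to themselves. A self-contained sketch of that mapping follows. Float16 here is a stand-in struct so the example compiles without Paddle headers; it is an assumption, not platform::float16.

```cpp
#include <type_traits>

// Stand-in for platform::float16 (assumption, for illustration only).
struct Float16 {
  unsigned short bits;
};

// Primary template: accumulate in the storage type itself.
template <typename T>
struct MPTypeTrait {
  using Type = T;
};

// Specialization: half-precision storage accumulates in float.
template <>
struct MPTypeTrait<Float16> {
  using Type = float;
};

static_assert(std::is_same<MPTypeTrait<Float16>::Type, float>::value,
              "Float16 should accumulate in float");
static_assert(std::is_same<MPTypeTrait<double>::Type, double>::value,
              "double should accumulate in double");
```

Inside another template, the dependent name needs the typename keyword, for example typename MPTypeTrait<T>::Type.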

void LogSoftmaxForwardAxisLast(T* dst, const T* src, int softmax_elements,
Contributor

The main job of this function is to launch the CUDA kernel, so it could be called LaunchLogSoftmaxForwardForLastAxis.

Contributor Author

Done.

int softmax_elements_stride, int batch_count) {
assert(softmax_elements >= 0 && softmax_elements <= 1024);
Contributor

Use PADDLE_ENFORCE_XXX for checks.

Contributor Author

Done.

if (softmax_elements == 0) {
return;
} else {
int log2_elements = log2_ceil(softmax_elements);
const int next_power_of_two = 1 << log2_elements;
int warp_size =
(next_power_of_two < WARP_SIZE) ? next_power_of_two : WARP_SIZE;
int batches_per_warp = (next_power_of_two <= 128) ? 2 : 1;

// use 128 threads per block to maximize gpu utilization
constexpr int threads_per_block = 128;
int warps_per_block = (threads_per_block / warp_size);
int batches_per_block = warps_per_block * batches_per_warp;
int blocks = (batch_count + batches_per_block - 1) / batches_per_block;
dim3 threads(warp_size, warps_per_block, 1);

switch (log2_elements) {
LAUNCH_SOFTMAX_WARP_FORWARD(0); // 1
LAUNCH_SOFTMAX_WARP_FORWARD(1); // 2
LAUNCH_SOFTMAX_WARP_FORWARD(2); // 4
LAUNCH_SOFTMAX_WARP_FORWARD(3); // 8
LAUNCH_SOFTMAX_WARP_FORWARD(4); // 16
LAUNCH_SOFTMAX_WARP_FORWARD(5); // 32
LAUNCH_SOFTMAX_WARP_FORWARD(6); // 64
LAUNCH_SOFTMAX_WARP_FORWARD(7); // 128
LAUNCH_SOFTMAX_WARP_FORWARD(8); // 256
LAUNCH_SOFTMAX_WARP_FORWARD(9); // 512
LAUNCH_SOFTMAX_WARP_FORWARD(10); // 1024
default:
break;
}
}
}
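
The launch arithmetic in the function above can be checked on the host. The sketch below repeats the same formulas (warp width capped at 32, two rows per warp up to 128 elements, 128 threads per block) and returns the resulting grid shape. LaunchConfig and ComputeLaunchConfig are hypothetical names introduced for illustration.

```cpp
// Hypothetical container for the values the launcher derives.
struct LaunchConfig {
  int warp_size;         // threads cooperating on one row (threads.x)
  int warps_per_block;   // rows of threads per block (threads.y)
  int batches_per_warp;  // softmax rows handled by each warp
  int blocks;            // grid size
};

// Mirrors the arithmetic in the diff; assumes 1 <= softmax_elements <= 1024.
inline LaunchConfig ComputeLaunchConfig(int softmax_elements, int batch_count) {
  constexpr int kWarpSize = 32;
  constexpr int kThreadsPerBlock = 128;
  int log2_elements = 0;
  while ((1 << log2_elements) < softmax_elements) {
    ++log2_elements;
  }
  const int next_power_of_two = 1 << log2_elements;
  LaunchConfig config;
  config.warp_size =
      (next_power_of_two < kWarpSize) ? next_power_of_two : kWarpSize;
  config.batches_per_warp = (next_power_of_two <= 128) ? 2 : 1;
  config.warps_per_block = kThreadsPerBlock / config.warp_size;
  const int batches_per_block =
      config.warps_per_block * config.batches_per_warp;
  // Round up so the final, partially filled block is still launched.
  config.blocks = (batch_count + batches_per_block - 1) / batches_per_block;
  return config;
}
```

For example, 100 elements per row and 1000 rows give a 32 x 4 thread block handling 8 rows, so 125 blocks are launched.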

template <typename DeviceContext, typename T>
struct LogSoftmaxCUDAFunctor {
Contributor

This layer of wrapping seems unnecessary.

Contributor Author

Done.

void operator()(const DeviceContext& context, const framework::Tensor* X,
framework::Tensor* Out, const int axis) {
int along_axis = (axis < 0) ? axis + X->dims().size() : axis;
Contributor

CanonicalAxis has already converted the axis.

Contributor Author

Done. Removed.

int outer_size = 1;
const int dim_size = X->dims()[along_axis];
const auto* input_data = X->data<T>();
auto* output_data = Out->mutable_data<T>(context.GetPlace());

int inner_size = 1;
for (int i = 0; i < along_axis; i++) outer_size *= X->dims()[i];
for (int i = along_axis + 1; i < X->dims().size(); i++)
inner_size *= X->dims()[i];
Contributor

SizeToAxis and SizeFromAxis can compute outer_size and inner_size respectively.

Contributor Author

outer_size can be obtained with SizeToAxis(), but the inner_size computation differs from SizeFromAxis(). SizeOutAxis() is what should be called here. However, SizeOutAxis() is defined in another .cu file and cannot be called directly from this file. (nvcc is not run with --relocatable-device-code=true --compile; with that enabled the call would work.)

So inner_size is kept as is, and SizeToAxis() is used to get outer_size.

assert(X->numel() > 0);
assert(inner_size == 1 && dim_size <= 1024 && dim_size * sizeof(T) <= 4096);
LogSoftmaxForwardAxisLast<T>(output_data, input_data, dim_size, dim_size,
outer_size);
}
};
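
The axis handling discussed in the two threads above comes down to three small computations: normalize a negative axis, multiply the dimensions before the axis (outer_size), and multiply the dimensions after it (inner_size). The sketch below shows them on a std::vector<int> shape. The *Sketch names are hypothetical stand-ins and do not reproduce the signatures of Paddle's CanonicalAxis, SizeToAxis or SizeOutAxis.

```cpp
#include <vector>

// Maps a possibly negative axis into [0, rank). Assumes -rank <= axis < rank.
inline int CanonicalAxisSketch(int axis, int rank) {
  return (axis < 0) ? axis + rank : axis;
}

// Product of the dimensions before `axis` (the outer_size in the diff).
inline int SizeToAxisSketch(int axis, const std::vector<int>& dims) {
  int size = 1;
  for (int i = 0; i < axis; ++i) {
    size *= dims[i];
  }
  return size;
}

// Product of the dimensions after `axis` (the inner_size in the diff).
inline int SizeOutAxisSketch(int axis, const std::vector<int>& dims) {
  int size = 1;
  for (int i = axis + 1; i < static_cast<int>(dims.size()); ++i) {
    size *= dims[i];
  }
  return size;
}
```

For a shape of {2, 3, 4} and axis = -1, the canonical axis is 2, outer_size is 6 and inner_size is 1, which is the case the warp kernel in this PR accepts.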

template <typename T>
class LogSoftmaxKernel<platform::CUDADeviceContext, T>
: public framework::OpKernel<T> {
public:
void Compute(const framework::ExecutionContext& context) const override {
const auto* X = context.Input<framework::Tensor>("X");
auto* Out = context.Output<framework::Tensor>("Out");
Contributor

Variable naming: axx_bxx

Contributor Author

All variable names have been changed to this form.

Contributor

X and Out have not been changed yet.

const int rank = X->dims().size();
const int axis = CanonicalAxis(context.Attr<int>("axis"), rank);

int dim_size = X->dims()[axis];
int inner_size = 1;
for (int i = axis + 1; i < X->dims().size(); i++)
inner_size *= X->dims()[i];

Out->mutable_data<T>(context.GetPlace());
// execute CUDA kernel
if (inner_size == 1 && dim_size <= 1024) {
LogSoftmaxCUDAFunctor<platform::CUDADeviceContext, T>()(
context.template device_context<platform::CUDADeviceContext>(), X,
Out, axis);
} else {
// execute Eigen kernel
LogSoftmaxFunctor<platform::CUDADeviceContext, T>()(
context.template device_context<platform::CUDADeviceContext>(), X,
Out, axis);
}
}
};

} // operators
} // paddle

namespace ops = paddle::operators;
namespace plat = paddle::platform;
REGISTER_OP_CUDA_KERNEL(
log_softmax, ops::LogSoftmaxKernel<plat::CUDADeviceContext, float>,
ops::LogSoftmaxKernel<plat::CUDADeviceContext, double>,
ops::LogSoftmaxKernel<plat::CUDADeviceContext, plat::float16>);
REGISTER_OP_CUDA_KERNEL(log_softmax,
ops::LogSoftmaxKernel<plat::CUDADeviceContext, float>,
ops::LogSoftmaxKernel<plat::CUDADeviceContext, double>);
Contributor

Why was the float16 type removed?

Contributor Author

Done, float16 is now supported.

REGISTER_OP_CUDA_KERNEL(
log_softmax_grad, ops::LogSoftmaxGradKernel<plat::CUDADeviceContext, float>,
ops::LogSoftmaxGradKernel<plat::CUDADeviceContext, double>,
ops::LogSoftmaxGradKernel<plat::CUDADeviceContext, plat::float16>);
ops::LogSoftmaxGradKernel<plat::CUDADeviceContext, double>);