Conversation

@Avin0323 (Contributor) commented Apr 25, 2021

PR types

Others

PR changes

Others

Describe

relu now supports the bfloat16 data type.

Tested with V100 + CUDA 11.0 + cuDNN 8.0:

  • float32 computation: (screenshot)
  • bfloat16 computation, converted to float32: (screenshot)

The relative error is below 0.01.
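The relative-error claim can be reproduced offline with a minimal NumPy sketch (this emulates bfloat16 by truncating float32 to its upper 16 bits; it is not the benchmark script above, which ran the actual CUDA kernel):

```python
import numpy as np

def float32_to_bf16_bits(x):
    # Keep only the upper 16 bits of each float32: that is bfloat16 (truncated).
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_float32(bits):
    # Shift the bfloat16 bit pattern back into the high half of a float32.
    return (bits.astype(np.uint32) << 16).view(np.float32)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)

ref = np.maximum(x, 0.0)                                   # float32 relu
out = np.maximum(bf16_bits_to_float32(float32_to_bf16_bits(x)), 0.0)

rel_err = np.abs(out - ref) / (np.abs(ref) + 1e-8)
print("max relative error:", rel_err.max())                # stays below 0.01
```

Truncation to 8 significand bits bounds the relative error by 2^-7 ≈ 0.0078, consistent with the measured 0.01 bound.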

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.


/* =========================== relu register ============================ */
#ifdef PADDLE_WITH_HIP
REGISTER_ACTIVATION_GPU_KERNEL(relu, Relu, ReluGPUFunctor, ReluGradGPUFunctor);
@AshburnLee (Contributor) commented Apr 26, 2021:

Before this change, REGISTER_ACTIVATION_GPU_KERNEL was also executed on CUDA. After the change, is it only executed on HIP?

@Avin0323 (Contributor, author) replied Apr 26, 2021:

The logic of the REGISTER_ACTIVATION_GPU_KERNEL macro is: (screenshot)

bfloat16 currently fails to compile on the ROCM platform, so support on ROCM is deferred for now.

    # set delta as np.float16, will automatic convert to float32, float64
    delta = np.array(delta).astype(np.float16)
elif tensor_to_check_dtype == core.VarDesc.VarType.BF16:
    tensor_to_check_dtype = np.float32
Contributor:

According to L127~L132, this is supposed to convert paddle data types to the corresponding numpy data types, so core.VarDesc.VarType.BF16 should map to np.uint16.

Contributor (author):

BF16 is also a floating-point type. To keep the unit test code easier to read, float32 rather than uint16 is used as the type for checking results.

@AshburnLee
(Contributor)

Could you provide the unit test results in the Describe section?

@Avin0323 Avin0323 changed the title [WIP]relu supports bfloat16 data type relu supports bfloat16 data type Apr 26, 2021
@Avin0323
(Contributor, author)

> Could you provide the unit test results in the Describe section?

done

create_test_act_fp16_class(TestHardSwish)


#------------------ Test BF16 ----------------------
Contributor:

Please delete this comment line.

Contributor (author):

done

                    "core is not compiled with CUDA")
class TestActBF16(parent):
    def init_dtype(self):
        self.dtype = np.uint16
Contributor:

If dtype is set to np.uint16, how do we guarantee that a kernel for the int16 type is not being used? Could the string "bfloat16" be used instead? Right now setting dtype=uint16 actually means BF16 everywhere.

Contributor (author):

The test framework currently uses np.uint16 to test the bfloat16 type; it is converted to bfloat16 inside paddle, so setting dtype=np.uint16 does mean BF16 in practice.
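The convention described here (np.uint16 arrays holding raw bfloat16 bit patterns) can be illustrated as follows; the helper names are a sketch, not the framework's actual utilities. Values that are exactly representable in bfloat16 survive the round trip unchanged, while the dtype paddle sees is uint16:

```python
import numpy as np

def float32_to_uint16(x):
    # Store bfloat16 bit patterns in a uint16 array (the op-test convention).
    return (np.asarray(x, dtype=np.float32).view(np.uint32) >> 16).astype(np.uint16)

def uint16_to_float32(bits):
    # Expand the stored bit patterns back into float32 values.
    return (np.asarray(bits, dtype=np.uint16).astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, -2.5, 0.15625], dtype=np.float32)  # all exact in bfloat16
bits = float32_to_uint16(x)
print(bits.dtype)                 # uint16 on the numpy side, BF16 inside paddle
print(uint16_to_float32(bits))    # decodes back to the original values
```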

ops::ReluGradGradFunctor<plat::float16>>);
ops::ReluGradGradFunctor<plat::float16>>,
ops::ActivationDoubleGradKernel<plat::CUDADeviceContext,
ops::ReluGradGradFunctor<plat::bfloat16>>);
Contributor:

Has double grad been verified? If not, don't register it yet.

Contributor (author):

It is verified in the unit tests.

Contributor:

Please consider simplifying the registration code in a follow-up PR.

ops::ActivationGradGPUKernel<plat::CUDADeviceContext,
ops::ReluGradGPUFunctor<plat::float16>>,
ops::ActivationGradGPUKernel<plat::CUDADeviceContext,
ops::ReluGradGPUFunctor<plat::bfloat16>>);
Contributor:

relu now calls two CUDA kernels, ActivationKernelVec and ActivationGradKernelVec. The core computation is in the functor:

__device__ __forceinline__ typename CudaVecType<T>::type Compute(
    const typename CudaVecType<T>::type in) {
  // relu forward : out = max(x, 0)
  return in > zero_ ? in : zero_;
}

CudaVecType<bfloat16> is not specialized, so CudaVecType<bfloat16>::type is still plain bfloat16. For bfloat16's overload of operator>, the comparison first casts both operands to float, right? Please confirm:

HOSTDEVICE inline bool operator>(const bfloat16& a, const bfloat16& b) {
  return static_cast<float>(a) > static_cast<float>(b);
}

float16 does overload these operators.
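The cast-to-float detail matters: bfloat16 bit patterns cannot be compared as raw unsigned integers, because the sign bit makes every negative number's pattern larger than every positive one's. A small sketch, reusing the truncate-to-upper-16-bits encoding as a stand-in for paddle's bfloat16 type:

```python
import numpy as np

def to_bits(v):
    # bfloat16 bit pattern of a float32 scalar, as a Python int.
    return int(np.float32(v).view(np.uint32) >> 16)

def to_float(bits):
    # Decode a bfloat16 bit pattern back to a Python float.
    return float(np.uint32(bits << 16).view(np.float32))

a, b = to_bits(-1.0), to_bits(0.0)
print(a > b)                      # True for the raw patterns...
print(to_float(a) > to_float(b))  # ...but False once decoded, as the overload does
```

This is exactly why the operator> overload routes the comparison through float rather than comparing the stored representations.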

Contributor (author):

After merging the latest code, this part has changed and the specialized form is no longer used.

ops::MeanCUDAGradKernel<paddle::platform::CUDADeviceContext, plat::float16>,
ops::MeanCUDAGradKernel<paddle::platform::CUDADeviceContext,
plat::float16>);
plat::bfloat16>);
Contributor:

Why do fill_constant and mean need bfloat16 kernels registered? Is it because the unit test framework uses them? Apart from the op under test, could every other op in the test framework run in fp32? fill_constant is harmless, but mean may introduce a fairly large error.

Contributor (author):

The bfloat16 registrations for fill_constant and mean have been removed; a conversion-to-fp32 step was added instead.


paddle-bot-old bot commented May 4, 2021

Sorry to inform you that 7a26ab6's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@Xreki (Contributor) left a comment:

I verified the test programs constructed by the op unit test framework:

  • the program for the double-grad test: (screenshot)

  • the program for the bfloat16 backward test, in which only relu and relu_grad compute in bfloat16: (screenshot)

ops::ReluGradGradFunctor<plat::float16>>);
ops::ReluGradGradFunctor<plat::float16>>,
ops::ActivationDoubleGradKernel<plat::CUDADeviceContext,
ops::ReluGradGradFunctor<plat::bfloat16>>);
Contributor:

Please consider simplifying the registration code in a follow-up PR.

ops::CastOpKernel<paddle::platform::CUDADeviceContext,
paddle::platform::complex64>,
ops::CastOpKernel<paddle::platform::CUDADeviceContext,
paddle::platform::complex128>);
Contributor:

Consider simplifying the registration code in a follow-up PR.

Contributor (author):

OK.

numpy_tensor = np.array(tensor).astype(np.uint16)
numpy_tensor = numpy_tensor.flatten()
return struct.unpack('<f', struct.pack('<I', numpy_tensor[i] << 16))[0]
Contributor:

Is this converting uint16 to bf16? It could be wrapped into a fairly general function in a follow-up.
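A vectorized form of the suggested general helper might look like this (the name matches the convert_uint16_to_float used later in this PR; the implementation here is an illustrative sketch that avoids the per-element struct pack/unpack):

```python
import numpy as np

def convert_uint16_to_float(in_list):
    # Reinterpret uint16-stored bfloat16 bit patterns as float32 values
    # by shifting each pattern into the high half of a uint32.
    arr = np.asarray(in_list, dtype=np.uint16)
    return (arr.astype(np.uint32) << 16).view(np.float32)

bits = np.array([0x3F80, 0xC000, 0x0000], dtype=np.uint16)
print(convert_uint16_to_float(bits))   # [ 1. -2.  0.]
```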

Contributor (author):

OK.

                             no_grad_set)
fp32_grads = []
for grad in dygraph_grad:
    if grad.dtype == np.uint16:
Contributor:

Semantically this is bfloat16, so it would be better for the check to literally be on bfloat16 as well, since some ops may genuinely support uint16 computation. Or at least add some comments.

Same below.

Contributor (author):

OK.

for grad in dygraph_grad:
    if grad.dtype == np.uint16:
        grad = convert_uint16_to_float(grad)
        max_relative_error = 0.03
Contributor:

Was it set this large because a smaller value did not pass?

Contributor (author):

For now this is kept consistent with the bf16 tolerance settings used on CPU.

@luotao1 (Contributor) left a comment:

LGTM for skip unittest

@Xreki Xreki merged commit bcd40f2 into PaddlePaddle:develop May 18, 2021