Add Bfloat16 support on Ampere GPU with CUDA 11#32132
Xreki merged 16 commits into PaddlePaddle:develop from
Conversation
Update forked PaddlePaddle
Update my fork
update from PaddlePaddle
Update forked paddle repo
Update USERNAME/paddle
update Paddle USERNAME repo
update username repo
update local paddlepaddle
update paddlepaddle
Thanks for your contribution!
Xreki
left a comment
Please attach a screenshot in the PR description showing the unit test results on an Ampere-architecture GPU.
paddle/fluid/platform/bfloat16.h
Outdated
// #ifdef __HIPCC__
// #define PADDLE_CUDA_BF16
// #include <hip/hip_bf16.h>
// #endif
paddle/fluid/platform/bfloat16.h
Outdated
HOSTDEVICE inline explicit bfloat16(const T& val)
    : x(bfloat16(static_cast<float>(val)).x) {}

// Assignment operators
The assignment operators also need support for the __nv_bfloat16 type.
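For context, the constructors and operators discussed here all build on a float-to-bfloat16 conversion, which at its core keeps only the high 16 bits of the float's bit pattern. A minimal host-only sketch — the `bf16` name and the simple truncating rounding are illustrative, not Paddle's exact implementation (production code typically rounds to nearest-even):

```cpp
#include <cstdint>
#include <cstring>

// Illustrative stand-in for a bfloat16 type: keeps the high 16 bits of an
// IEEE-754 float (sign + 8 exponent bits + 7 mantissa bits).
struct bf16 {
  uint16_t x = 0;
  bf16() = default;
  explicit bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));   // safe type-punning
    x = static_cast<uint16_t>(bits >> 16);  // truncate low mantissa bits
  }
  explicit operator float() const {
    uint32_t bits = static_cast<uint32_t>(x) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
  }
};
```

Values whose mantissa fits in 7 bits (1.0f, -2.5f, powers of two) round-trip exactly through this conversion.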
paddle/fluid/platform/bfloat16.h
Outdated
// Arithmetic & Comparison operators on CUDA11 & Ampere-arch GPU
#if defined(__CUDACC__) && CUDA_VERSION >= 11000 && defined(__CUDA_ARCH__) && \
    __CUDA_ARCH__ >= 800
DEVICE inline __nv_bfloat16 operator+(const __nv_bfloat16& a,
CUDA 11 itself already defines these operators, in cuda_bf16.hpp. In float16.h these half operators were overloaded only for CUDA versions below 9; from CUDA 9 on, the toolchain provides them itself and they do not need to be defined in user code.
Removed the unit tests for the operators already defined in CUDA 11.
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License. */
paddle/fluid/platform/CMakeLists.txt
Outdated
IF(WITH_ROCM)
  hip_test(float16_gpu_test SRCS float16_test.cu DEPS lod_tensor)
  hip_test(bfloat16_gpu_test SRCS bfloat16_test.cu DEPS lod_tensor)
  out[0] = in1[0] sign in2[0]; \
}

#define ARITHMETIC_KERNEL_LAUNCH(op_type) \
If expanded, this code becomes quite long, with a lot of repetition.
The duplication can be avoided by factoring out common functions and defining functors.
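The suggestion above can be sketched as follows: instead of expanding a macro body per operator, write one generic element-wise routine parameterized by a functor. Host code shown for illustration; on the GPU the same idea applies with a templated __global__ kernel. The names here are illustrative, not Paddle's actual API.

```cpp
#include <cstddef>

// One-line functor per operation replaces a macro-expanded kernel body.
template <typename T>
struct AddFunctor {
  T operator()(T a, T b) const { return a + b; }
};

// Single generic element-wise routine shared by all operations.
template <typename T, typename Op>
void ElementwiseApply(const T* in1, const T* in2, T* out, size_t n, Op op) {
  for (size_t i = 0; i < n; ++i) out[i] = op(in1[i], in2[i]);
}
```

Adding a new operation then means adding one small functor, not another macro expansion.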
Xreki
left a comment
Since follow-up work depends on this PR, it is being merged first. Please submit a follow-up PR to fix the issues raised in this review.
#if defined(__CUDACC__) && CUDA_VERSION >= 11000
#define PADDLE_CUDA_BF16
#include <cuda_bf16.h>
#endif
A reminder: some macro definitions were copied directly from float16.h. If any file includes both float16.h and bfloat16.h, macro-redefinition errors will likely occur. This was not introduced by this PR, so no need to handle it for now.
#include <iostream>
#include "paddle/fluid/framework/lod_tensor.h"

#if defined(PADDLE_CUDA_BF16)
Actually, this unit test should not run only when PADDLE_CUDA_BF16 is defined: the bfloat16.h and float16.h implementations are compatible with all CUDA versions and GPU models. On CUDA versions or GPUs without native float16/bfloat16 support, computation automatically falls back to float.
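The fallback behavior described in this comment can be sketched as a standalone operator that converts to float, computes, and converts back, so it works on any CUDA version, any GPU, or on the CPU. The `sw_bf16` type and helper names below are illustrative, not Paddle's actual code:

```cpp
#include <cstdint>
#include <cstring>

// Software bfloat16: just the stored high 16 bits of a float.
struct sw_bf16 {
  uint16_t x;
};

inline sw_bf16 FloatToBf16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  return sw_bf16{static_cast<uint16_t>(bits >> 16)};
}

inline float Bf16ToFloat(sw_bf16 v) {
  uint32_t bits = static_cast<uint32_t>(v.x) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Fallback arithmetic: widen to float, compute, narrow back. This is why
// the tests can run without native bfloat16 hardware support.
inline sw_bf16 operator+(sw_bf16 a, sw_bf16 b) {
  return FloatToBf16(Bf16ToFloat(a) + Bf16ToFloat(b));
}
```

On hardware with native support, the same operator can instead be routed to an intrinsic under a compile-time guard, without changing callers.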
namespace paddle {
namespace platform {

TEST(bfloat16, convert_float32_to_bfloat16_on_gpu) {
Have all the CUDA kernels been deleted? Then these unit tests all run on the CPU, right? Please add the GPU unit tests back in the next PR.
framework::TensorCopy(src_tensor, gpu_place, gpu_ctx, &gpu_tensor);

// GPU LoDTensor to CPU LoDTensor
framework::TensorCopy(gpu_tensor, CPUPlace(), gpu_ctx, &dst_tensor);
This unit test copies from CPU to GPU, does nothing on the GPU, and copies back to CPU; it doesn't test anything meaningful.
PR types
Others
PR changes
Others
Describe
Add Bfloat16 support on Ampere GPUs with CUDA 11. Below is the test result on an RTX 3090 with CUDA 11.2:
All tests passed.