@thisjiang
Contributor

PR types

Bug fixes

PR changes

OPs

Describe

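Problem

Training decoupledsegnet and hardnet on a single GPU with batch_size=1 fails with an error. The problem was traced to PR32266.

Cause

Running cuda-memcheck showed an illegal memory access in ElemwiseGradBroadcast2CUDAKernel, which caused the core dump. Analysis showed that the size actually allocated for the Tensor did not match the size required. The fault lies in the need_pad_num == 0 check in SliceGradKernel: when the check is true, the kernel skips ahead without allocating memory of the appropriate size. After removing this check, training runs normally.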
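
The failure mode described above can be illustrated with a minimal Python sketch. It is not Paddle's SliceGradKernel: the function name `slice_grad_1d`, the 1-D layout, the `skip_when_no_padding` flag, and the reading of `need_pad_num` as a count of padded elements are all assumptions made for illustration. The gradient of a slice is the upstream gradient padded with zeros back to the input shape, so the output buffer has to be allocated even when no padding is needed.

```python
def slice_grad_1d(d_out, in_len, start, skip_when_no_padding=False):
    """Gradient of y = x[start:start + len(d_out)] for a 1-D input of length in_len.

    Hypothetical illustration only; names and semantics are assumptions,
    not taken from Paddle's source.
    """
    # Assumed meaning: how many input positions receive a zero gradient.
    need_pad_num = in_len - len(d_out)

    if skip_when_no_padding and need_pad_num == 0:
        # Faulty shortcut: returns before the output buffer is allocated,
        # so later consumers see no correctly sized gradient tensor.
        return None

    # Always allocate a buffer with the full input shape, zero-filled.
    d_in = [0.0] * in_len
    # Copy the upstream gradient into the sliced region.
    for i, g in enumerate(d_out):
        d_in[start + i] = g
    return d_in
```

With the shortcut disabled, the no-padding case still yields a properly sized buffer that is simply a copy of the upstream gradient, which mirrors the effect the PR attributes to removing the check.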

@paddle-bot-old

Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

Contributor

@wzzju wzzju left a comment

LGTM.

@wzzju wzzju merged commit d8e238d into PaddlePaddle:develop Jul 21, 2021
@thisjiang thisjiang deleted the solve_slice_inplace_bug branch July 21, 2021 09:11
thisjiang added a commit to thisjiang/Paddle that referenced this pull request Jul 29, 2021
lanxianghit pushed a commit that referenced this pull request Jul 29, 2021
…#34473

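The need_pad_num == 0 check in SliceGradKernel was faulty: it skipped ahead without allocating memory of the appropriate size. Removing the check lets training run normally.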