The GPU kernel of the cross_entropy op for soft labels can be further optimized by using a reduction: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/cross_entropy_op.cu#L38 @qingqing01 is also aware of this problem.
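
To make the suggestion concrete, here is a rough sketch of what a reduction-based kernel could look like. It is not the code in cross_entropy_op.cu. The kernel name `SoftLabelCrossEntropyReduce`, the constant `kBlockSize`, the row-major `[batch, class_num]` layout, and the choice of one thread block per sample are all assumptions made for illustration. Each thread accumulates a strided partial sum of `label * log(x)` over the classes, and the block then combines the partial sums with a shared-memory tree reduction.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical block size; must be a power of two for the tree reduction.
constexpr int kBlockSize = 256;

// Hypothetical kernel: one block computes the loss of one sample (row).
// y[row] = -sum_j label[row, j] * log(x[row, j])
// Handling of log(0) (clipping of inf/NaN) is omitted in this sketch.
__global__ void SoftLabelCrossEntropyReduce(float* y, const float* x,
                                            const float* label,
                                            int class_num) {
  __shared__ float partial[kBlockSize];
  const int row = blockIdx.x;
  const int tid = threadIdx.x;
  const float* x_row = x + static_cast<size_t>(row) * class_num;
  const float* label_row = label + static_cast<size_t>(row) * class_num;

  // Step 1: each thread sums a strided subset of the classes.
  float sum = 0.0f;
  for (int j = tid; j < class_num; j += blockDim.x) {
    sum += label_row[j] * logf(x_row[j]);
  }
  partial[tid] = sum;
  __syncthreads();

  // Step 2: tree reduction in shared memory, halving the stride each step.
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (tid < stride) partial[tid] += partial[tid + stride];
    __syncthreads();
  }

  // Step 3: thread 0 writes the negated sum for this row.
  if (tid == 0) y[row] = -partial[0];
}

int main() {
  const int batch = 2, class_num = 4;
  // Toy inputs: each row of x and label is a probability distribution.
  std::vector<float> h_x = {0.1f, 0.2f, 0.3f, 0.4f,
                            0.25f, 0.25f, 0.25f, 0.25f};
  std::vector<float> h_label = {0.0f, 0.0f, 0.5f, 0.5f,
                                0.1f, 0.2f, 0.3f, 0.4f};
  std::vector<float> h_y(batch);

  float *d_x = nullptr, *d_label = nullptr, *d_y = nullptr;
  const size_t bytes = h_x.size() * sizeof(float);
  cudaMalloc(&d_x, bytes);
  cudaMalloc(&d_label, bytes);
  cudaMalloc(&d_y, batch * sizeof(float));
  cudaMemcpy(d_x, h_x.data(), bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_label, h_label.data(), bytes, cudaMemcpyHostToDevice);

  // One block per sample, kBlockSize threads per block.
  SoftLabelCrossEntropyReduce<<<batch, kBlockSize>>>(d_y, d_x, d_label,
                                                     class_num);
  cudaMemcpy(h_y.data(), d_y, batch * sizeof(float), cudaMemcpyDeviceToHost);

  // Compare against a serial CPU reference for each row.
  for (int i = 0; i < batch; ++i) {
    float ref = 0.0f;
    for (int j = 0; j < class_num; ++j) {
      ref -= h_label[i * class_num + j] * std::log(h_x[i * class_num + j]);
    }
    std::printf("row %d: gpu=%f cpu=%f\n", i, h_y[i], ref);
  }

  cudaFree(d_x);
  cudaFree(d_label);
  cudaFree(d_y);
  return 0;
}
```

Whether this is faster in practice depends on the number of classes: with only a few classes most threads in the block sit idle, so a smaller block size or a warp-level reduction may be a better fit there.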