Add CTC align op #7527
```cpp
auto stream = ctx.cuda_device_context().stream();
ArgmaxCudaKernel<T, PADDLE_CUDA_NUM_THREADS><<<
    num_tokens, PADDLE_CUDA_NUM_THREADS, 0, stream>>>(seq_width, logits,
                                                      tokens);
```
Is this kernel computing the top 1? If so, it could reuse the top_k_op implementation.
You are right. I will remove the argmax computation from both the CPU and GPU kernels.
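As background to the exchange above: taking the top 1 along the class axis is just an argmax, i.e. the k = 1 special case of top-k, which is why the dedicated kernel is redundant. A quick numpy check of the equivalence (illustrative only, not the Paddle kernel):

```python
import numpy as np

# Toy logits: 2 time steps x 3 classes (made-up values).
logits = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])

# top-1 via argmax.
top1 = np.argmax(logits, axis=1)

# top-k with k = 1: partition on negated logits so the largest
# value lands in the first k positions, then squeeze the k axis.
k = 1
topk = np.argpartition(-logits, k - 1, axis=1)[:, :k].squeeze(axis=1)

assert (top1 == topk).all()  # top-1 == top-k with k = 1
```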
Please create an issue and add it to https://github.com/PaddlePaddle/Paddle/projects/39
```cpp
AddInput("Input",
         "(LoDTensor, default: LoDTensor<float>), the unscaled "
         "probabilities of variable-length sequences, which is a 2-D "
         "Tensor with LoD information. Its shape is "
```
```python
result = []
for token in np.argmax(softmax, axis=1):
    if (token != blank) and not (merge_repeated and token == prev_token):
        result.append(token)
```
Should there be a `prev_token = token` line at the end of the loop body?
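The reference loop above never updates `prev_token`, so `merge_repeated` has no effect. A minimal corrected sketch of the greedy (best-path) decode, assuming a 2-D `softmax` array of per-step class probabilities and an integer `blank` id:

```python
import numpy as np

def ctc_greedy_decode(softmax, blank, merge_repeated=True):
    """Best-path CTC decode: argmax per time step, optionally
    collapse consecutive repeats, then drop blanks."""
    result = []
    prev_token = -1  # sentinel: no previous token yet
    for token in np.argmax(softmax, axis=1):
        if (token != blank) and not (merge_repeated and token == prev_token):
            result.append(int(token))
        prev_token = token  # the update flagged in the review
    return result
```

For example, an argmax path of `[1, 1, 0, 2, 2]` with `blank=0` decodes to `[1, 2]` when `merge_repeated=True` and to `[1, 1, 2, 2]` when it is `False`.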
```python
def test_check_output(self):
    self.check_output()
```
Please add another test case for `merge_repeated = False`.
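Such a case could be sketched against a numpy-style reference as below; the class and function names here are illustrative, not the actual OpTest subclass from the PR:

```python
import unittest

def ctc_align_ref(tokens, blank, merge_repeated):
    """Reference: optionally collapse repeats, then strip blanks."""
    result = []
    prev_token = -1
    for token in tokens:
        if (token != blank) and not (merge_repeated and token == prev_token):
            result.append(int(token))
        prev_token = token
    return result

class TestCTCAlignNoMerge(unittest.TestCase):
    def test_merge_repeated_false(self):
        tokens = [0, 1, 1, 0, 2, 2, 0]
        # With merging disabled, repeated non-blank tokens survive.
        self.assertEqual(
            ctc_align_ref(tokens, blank=0, merge_repeated=False),
            [1, 1, 2, 2])
        # With merging enabled they collapse to a single token.
        self.assertEqual(
            ctc_align_ref(tokens, blank=0, merge_repeated=True),
            [1, 2])
```

The case can be run with `python -m unittest` against the module containing it.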
```cpp
CTCGreedyDecodeOpMaker(OpProto* proto, OpAttrChecker* op_checker)
    : OpProtoAndCheckerMaker(proto, op_checker) {
  AddInput("Input",
           "(LoDTensor, default: LoDTensor<float>), the unscaled "
```
```cpp
           "merge repeated elements between two blanks. ")
      .SetDefault(true);
  AddComment(R"DOC(
CTCGreedyDecoder is an implementation of the simple best path decoding
```
The documentation here needs more detail.
1. Remove 'top 1' (argmax) from the CPU and GPU kernels.
2. Add a new test case.
3. Refine the doc.
Please keep the name.
```cpp
auto stream = ctx.cuda_device_context().stream();
MergeAndDelCudaKernel<T><<<1, 1, 0, stream>>>(
    num_tokens, tokens, num_seq, input_lod[level].data(), blank,
    merge_repeated, dev_out_lod0_ptr, output_data);
```
The CUDA kernel is less efficient here. We can profile the speed when training the model, then decide whether to delete the GPU kernel in this op and in the edit-distance op.
1. Allocate memory for the output before computing.
2. Rename 'ctc_decode' to 'ctc_align'.
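For reference, the per-batch merge-and-delete step that the single-thread kernel performs can be sketched in Python (assuming a flat `tokens` array and level-0 LoD offsets; names are illustrative):

```python
def merge_and_del(tokens, lod0, blank, merge_repeated):
    """Sequential merge/delete over a LoD batch.

    tokens : flat list of argmax token ids for all sequences
    lod0   : level-0 offsets, e.g. [0, 4, 7] for two sequences
    Returns (output_tokens, output_lod0).
    """
    output, out_lod0 = [], [0]
    for seq in range(len(lod0) - 1):
        prev_token = -1
        for t in range(lod0[seq], lod0[seq + 1]):
            token = tokens[t]
            if token != blank and not (merge_repeated and token == prev_token):
                output.append(token)
            prev_token = token
        out_lod0.append(len(output))  # running prefix sum of output lengths
    return output, out_lod0
```

Each output offset depends on the lengths of all earlier sequences, which is why the kernel above runs with a single thread; computing the output LoD as a parallel prefix sum would be one way to parallelize it.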
Force-pushed from a1cdeb0 to 6089b50.
Have removed the debug code.