[LLM INFER] Optimize: fuse some kernels in postprocess #9201
Conversation
Codecov Report — Attention: patch coverage.
Additional details and impacted files:

@@           Coverage Diff            @@
##           develop    #9201   +/-  ##
===========================================
- Coverage    52.92%   52.90%   -0.03%
===========================================
  Files          661      661
  Lines       107069   106936     -133
===========================================
- Hits         56670    56571      -99
+ Misses       50399    50365      -34
for (int i = tid; i < bad_words_length; i += blockDim.x) {
  const int64_t bad_words_token_id = bad_words_list[i];
  if (bad_words_token_id >= length || bad_words_token_id < 0) continue;
  logits_now[bad_words_token_id] = -1e10;
If -1e10 is hard-coded here, then TypeName should be restricted to Float32 or Bfloat16 and must not accept Float16; but the operator is registered for all of these types, so there is an overflow risk. The model graph currently forces a cast to Float32, but this is easy for users to get wrong.
Could this be changed to choose the initial value according to the precision of the input type?
I think the most reasonable approach is to support all input types; but as a simpler fix, we could also register the operator only for specific precisions.
PR types
Performance optimization
PR changes
Others
Description
1. Fuse the get_padding_offset and remove_padding kernels
2. Fuse the stop_generation_multi_ends_v2 and update_inputs kernels, together with some preceding operations
3. Fuse the set_value_by_flags_and_idx_v2 and set_stop_value_multi_ends_v2 kernels
Test code has been added for all of them, and precision has been aligned at the operator level.