Merged
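【Problem】
- The fc kernel is implemented on top of cl::Buffer and performs poorly.
- softmax performs poorly on 2-D tensors because parallelism is very low. For example, for a tensor of shape 1x1000 with axis=1, only one thread is assigned to the whole computation.

【Work in this PR】
- fc: input/output/bias are stored as cl::Image2d, and weight is stored as cl::Buffer and read as half16. See [OpenCL][Kernel] Use FC replace conv1x1 #6365 for details. The corresponding unit test supports verification at both fp32 and fp16 precision.
- softmax: to address the poor performance on 2-D tensors, the thread allocation is changed so that the data along the axis dimension is processed in blocks of 32. This uses local memory, and the core idea is a parallel reduce. To handle channel counts that are not divisible by 4 efficiently, a mask is used instead of if/else branches.

【Results】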

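The MobileNetV1 model contains one fc and one softmax. Kernel time was measured on 6 devices covering Mali and Adreno GPUs, as shown in the table below (times in ms). fc is 1 to 3 times faster, and softmax is 44% to 302% faster.

FC performance was also measured separately on the 845 for different values of N.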
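The softmax change described under 【Work in this PR】 (blocks of 32, local memory, parallel reduce, mask instead of if/else) can be illustrated with a CPU emulation. This is only a sketch under stated assumptions, not the kernel from this PR: `kBlock`, `TreeReduce`, and `SoftmaxRowBlocked` are hypothetical names, the `local` array stands in for OpenCL local memory, and the inner 4-lane loop stands in for one vec4 load.

```cpp
#include <algorithm>
#include <cassert>
#include <cfloat>
#include <cmath>
#include <vector>

// Hypothetical names throughout; a CPU emulation, not the PR's OpenCL kernel.
constexpr int kBlock = 32;  // work-items cooperating on one row

// Tree reduction over the emulated local-memory array.
// On a GPU, a local-memory barrier would separate the halving steps.
template <typename Op>
float TreeReduce(float (&local)[kBlock], Op op) {
  for (int stride = kBlock / 2; stride > 0; stride /= 2) {
    for (int lid = 0; lid < stride; ++lid) {
      local[lid] = op(local[lid], local[lid + stride]);
    }
  }
  return local[0];
}

std::vector<float> SoftmaxRowBlocked(const std::vector<float>& row) {
  assert(!row.empty());
  const int n = static_cast<int>(row.size());
  const int padded = (n + 3) / 4 * 4;  // round up to a whole number of vec4 groups
  std::vector<float> x(row);
  x.resize(padded, 0.0f);  // padding lanes hold 0 and are masked out below

  float local[kBlock];

  // Pass 1: work-item `lid` scans vec4 groups lid, lid + 32, ... for a partial max.
  for (int lid = 0; lid < kBlock; ++lid) {
    float m = -FLT_MAX;
    for (int base = lid * 4; base < padded; base += kBlock * 4) {
      for (int lane = 0; lane < 4; ++lane) {
        const int idx = base + lane;
        const float mask = static_cast<float>(idx < n);  // 1 for real data, 0 for padding
        // Arithmetic blend instead of a branch: padding lanes become -FLT_MAX.
        m = std::max(m, mask * x[idx] + (1.0f - mask) * (-FLT_MAX));
      }
    }
    local[lid] = m;
  }
  const float row_max =
      TreeReduce(local, [](float a, float b) { return std::max(a, b); });

  // Pass 2: partial sums of exp(x - max); the mask zeroes padding lanes.
  for (int lid = 0; lid < kBlock; ++lid) {
    float s = 0.0f;
    for (int base = lid * 4; base < padded; base += kBlock * 4) {
      for (int lane = 0; lane < 4; ++lane) {
        const int idx = base + lane;
        const float mask = static_cast<float>(idx < n);
        // Masking the exponent too keeps padding lanes from overflowing to inf.
        s += mask * std::exp(mask * (x[idx] - row_max));
      }
    }
    local[lid] = s;
  }
  const float row_sum = TreeReduce(local, [](float a, float b) { return a + b; });

  // Pass 3: normalize the real (unpadded) elements.
  std::vector<float> out(n);
  for (int i = 0; i < n; ++i) out[i] = std::exp(x[i] - row_max) / row_sum;
  return out;
}
```

For a 1x1000 row, each of the 32 emulated work-items handles about 8 vec4 groups before the 5-step tree reduction, instead of one thread walking all 1000 elements.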

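【TODO】
Both kernels produce 2-D outputs. When the dims of their output tensors are expanded to 4-D, 1s are padded on the low dimensions rather than on the high dimensions as defined in the opencl converter. This affects the precision profile and still needs to be resolved. The follow-up plan is to change the opencl converter so that it consistently pads 1 on the low dimensions.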
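The difference between the two padding conventions can be sketched as follows. This is an illustration only: `PadSide` and `PadDimsTo4` are hypothetical names, not functions from the opencl converter, and it assumes "high dimensions" means the leading axes and "low dimensions" means the trailing axes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of the opencl converter.
enum class PadSide { kHigh, kLow };

// Expand dims of rank 1..4 to rank 4 by filling the missing axes with 1.
std::array<int64_t, 4> PadDimsTo4(const std::vector<int64_t>& dims, PadSide side) {
  assert(!dims.empty() && dims.size() <= 4);
  std::array<int64_t, 4> out = {1, 1, 1, 1};
  // kHigh: the 1s go in front, so the original dims are shifted to the end.
  // kLow: the original dims stay in front, and the 1s follow them.
  const std::size_t offset = (side == PadSide::kHigh) ? 4 - dims.size() : 0;
  for (std::size_t i = 0; i < dims.size(); ++i) out[offset + i] = dims[i];
  return out;
}
```

Under these assumptions, a 2-D output of shape {1, 1000} becomes {1, 1, 1, 1000} with `PadSide::kHigh` and {1, 1000, 1, 1} with `PadSide::kLow`, which is why a tool that expects one layout would misread a tensor laid out in the other.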