Conversation

Contributor

@RichardWooSJTU RichardWooSJTU commented Aug 19, 2022

PR types

New features

PR changes

OPs

Describe

Add the fused_multi_transformer_int8 op to support quantized inference without TRT. The reason for running quantized inference through native inference instead of TRT is that the two TensorRT engines introduced by the while op cannot share weights, which doubles GPU memory usage. With native inference we can manage weights flexibly, though inference performance is slightly inferior to TRT. To gain better performance, we made the following attempts:

  1. Use cuBLASLt INT8 GEMM with the IMMA kernel as the backend of the FC layer.
  2. Modify AttnLayernorm/FusedDropoutHelper/FusedDropoutLayerNormHelper to coexist with quantization/de-quantization to speed up inference. We reuse the original classes and functions while adding 2 typenames (InType, OutType) to indicate whether quantization or dequantization is used, and 3 arguments (quant_out_scale_data, quant_out_scale_offset, quant_in_scale_data) to help with Q/DQ (see the sketch after this list). Some notes:
    a. The above template parameters and arguments all have default values, so existing call sites do not need to be modified.
    b. The above changes only apply when dropout-rate != 1.0 and only to the 3 classes.
    c. Only pre-layernorm is fully tested.
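
For illustration, below is a minimal sketch of the interface change described in item 2: the helper gains InType/OutType template parameters plus defaulted Q/DQ scale arguments, so existing float-only call sites keep compiling. The class and method names mirror the description above, but the exact signatures and body are hypothetical, not the actual Paddle implementation.

// Hypothetical sketch of the templated helper interface (not the real code).
template <typename T, typename InType = T, typename OutType = T>
class FusedDropoutLayerNormHelper {
 public:
  void LayerNormResidualDropoutBias(
      const InType* input,                          // int32_t when the input needs dequant
      OutType* output,                              // int8_t when the output needs quant
      const float* quant_out_scale_data = nullptr,  // per-channel dequant scales
      const int quant_out_scale_offset = 0,         // offset of this layer's scales
      const float quant_in_scale_data = 1.0f) {     // activation (input) scale for quant
    // ... original layernorm + residual + dropout logic, with optional Q/DQ fused in ...
  }
};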

Some current limitations:

  1. An inference model generated directly by paddle.static.save_inference_model() (which is not fused and whose weights are not quantized) cannot be used by this op. We are working on this, and it will be supported soon.
  2. Two further optimizations could be applied to improve performance (though perhaps only by a small margin):
    a. Batched GEMMs are not quantized.
    b. Quant/dequant is explicitly called around the QKV GEMM; fusing it into the preceding/following kernels might help.


paddle-bot bot commented Aug 19, 2022

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

}

template <typename T>
void quantize_kernelLauncher(const T* input,
Contributor

Rename to quantize_kernel_launcher.

Contributor

DONE

tmp.w =
__float2int_rn(static_cast<float>(input[m_id * n + n_id + 3]) * scale);
output[(m_id * n + n_id) >> 2] = tmp;
}
Contributor

When m and n are large, having each thread compute only 4 elements may be inefficient; it would be better to make this more general.
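
For illustration, a more general variant along the lines of this suggestion could use a grid-stride loop so that the amount of work per thread no longer depends on m and n. This is only a sketch under the same assumption as the original kernel (n divisible by 4 for the char4 stores); the kernel name and launch configuration are hypothetical.

// Hypothetical grid-stride quantize kernel (sketch, not the PR's final code).
template <typename T>
__global__ void quantize_kernel_general(const T* input,
                                        char4* output,
                                        const float scale,
                                        const int m,
                                        const int n) {
  const int total = (m * n) >> 2;  // each char4 packs 4 int8 values
  for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < total;
       idx += gridDim.x * blockDim.x) {
    const int base = idx << 2;
    char4 tmp;
    tmp.x = static_cast<int8_t>(__float2int_rn(static_cast<float>(input[base + 0]) * scale));
    tmp.y = static_cast<int8_t>(__float2int_rn(static_cast<float>(input[base + 1]) * scale));
    tmp.z = static_cast<int8_t>(__float2int_rn(static_cast<float>(input[base + 2]) * scale));
    tmp.w = static_cast<int8_t>(__float2int_rn(static_cast<float>(input[base + 3]) * scale));
    output[idx] = tmp;
  }
}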

}

template <typename T>
void quantize_kernelLauncher(const T* input,
Contributor

The naming does not follow the convention.

Contributor

DONE

dim3 block(32, 32);

quantize_kernel<<<grid, block, 0, stream>>>(
input, (char4*)output, scale, m, n); // NOLINT
Contributor

Same as above: this is not efficient when m and n are large.

if (check) {
float out_scale = quant_out_scale_data[layer_offset + m_id];
output[n_id * m + m_id] =
static_cast<T>(static_cast<float>(input[n_id * m + m_id]) * out_scale);
Contributor

Same as above.

const int m, // hidden
const int n, // batch size
const float* quant_out_scale_data,
const int layer_offset) {
Contributor

The name layer_offset is not intuitive.

Contributor

rename to: quant_out_scale_offset

hidden_units,
batch_size,
quant_out_scale_data,
layer_offset);
Contributor

Same issue as above.

Contributor

DONE

auto helper = std::make_shared<CublasLtHelper>(m, k, n);
helpers_.emplace_back(helper);
}
~AttnMatmulINT8() {}
Contributor

Since INT8 is used in the name here, the Q-based names above should also be changed to INT8.

Contributor

DONE.

As for the fused layernorm-quantization kernel, I am still trying to get rid of the redundant code.


void ComputeForward(
const framework::Tensor*
weight, // [int8] which has been transformed in pass
Contributor

@qingqing01 qingqing01 Aug 26, 2022

The formatting here is a bit messy.

I have reviewed part of it; I will continue the review after the cleanup.

Contributor

Removed some useless comments.

@paddle-bot-old paddle-bot-old bot added the contributor External developers label Sep 13, 2022
@minghaoBD minghaoBD force-pushed the fused_multi_transformrt_int8 branch from 0c59ac2 to 2a967fd on September 14, 2022 02:05
namespace operators {

template <typename T>
__forceinline__ __device__ int8_t clip_round(const T input, const float scale) {
Contributor

Why not just call it "quant"? clip_round does not fully describe what this function does.

Contributor Author

Renamed to quant_helper.

float quant_value = 127.0f * (1.0f / scale) * static_cast<float>(input);
quant_value = static_cast<float>(round(quant_value));
quant_value = quant_value > 127.0f ? 127.0f : quant_value;
quant_value = quant_value < -127.0f ? -127.0f : quant_value;
Contributor

@wanghaoshuang wanghaoshuang Sep 15, 2022

Contributor Author

@RichardWooSJTU RichardWooSJTU Sep 15, 2022

This is aligned with fake_quant_abs_max_op; confirmed that it is -127.
Reference:

Also, for partial decoupling, the clip does not hard-code 127.0f; it uses max_bound/min_bound parameters instead, which are attributes of the op.
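
For reference, a minimal sketch of the quant helper as described above, with max_bound/min_bound taken from op attributes instead of a hard-coded 127.0f; the exact signature in the PR may differ.

// Sketch only: clip bounds come in as parameters (op attributes) rather than literals.
template <typename T>
__forceinline__ __device__ int8_t quant_helper(const T input,
                                               const float scale,
                                               const float max_bound,
                                               const float min_bound) {
  float quant_value = max_bound * (1.0f / scale) * static_cast<float>(input);
  quant_value = static_cast<float>(roundf(quant_value));
  quant_value = quant_value > max_bound ? max_bound : quant_value;
  quant_value = quant_value < min_bound ? min_bound : quant_value;
  return static_cast<int8_t>(quant_value);
}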

Comment on lines 71 to 72
const float quant_in_scale,
const float* quant_out_scale_data,
Contributor

In theory, dequant only needs to multiply by a single dequant_scale, where dequant_scale = input_scale * weight_scale.
What do quant_in_scale and quant_out_scale each mean?

Contributor Author

@RichardWooSJTU RichardWooSJTU Sep 15, 2022

quant_in_scale is the input scale among the op attributes, with the same meaning as the scale exported by PTQ.
quant_out_scale has been renamed to dequant_out_scale; it is the output scale among the op inputs, defined with the same meaning as the max_range attribute of the fake_dequant_range_abs_max op.
The op's attributes and inputs are documented in fused_multi_transformer_int8_op.cc.

float out_scale = quant_out_scale_data[quant_out_scale_offset + m_id];
output[n_id * m + m_id] =
static_cast<T>(static_cast<float>(input[n_id * m + m_id]) *
quant_in_scale / out_scale);
Contributor

This looks like dequant + quant?

Contributor Author

@RichardWooSJTU RichardWooSJTU Sep 15, 2022

This is a temporary measure to align with the fake_dequant_abs_max op and improve accuracy. Reference:

out[i] = in[i] * scale[0] / max_range;

const int hidden_units, // n
cudaStream_t stream,
const float quant_in_scale,
const float* quant_out_scale_data,
Contributor

Why does quant_out_scale_data contain multiple values? Channel-wise dequant?

Contributor Author

Yes, it is channel-wise dequant. It is also compatible with layer-wise dequant: every channel just needs to hold the same value.
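
As a small illustration of this point, if every channel entry of the scale array holds the same value, the channel-wise dequant path behaves exactly like layer-wise dequant. The helper name below is hypothetical.

// Sketch: build a per-channel scale array that also covers the layer-wise case.
#include <vector>

std::vector<float> MakeDequantScales(const std::vector<float>& weight_scales,
                                     int num_channels,
                                     bool layer_wise) {
  std::vector<float> scales(num_channels);
  for (int i = 0; i < num_channels; ++i) {
    // layer-wise: one shared scale broadcast to every channel
    scales[i] = layer_wise ? weight_scales[0] : weight_scales[i];
  }
  return scales;
}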

LayerNormParamType<T>* var_data,
const float* quant_out_scale_data = nullptr,
const int quant_out_scale_offset = 0,
const float quant_in_scale_data = 1.0) {
Contributor

Since it is not a pointer, why not simply name it quant_in_scale?

Contributor Author

Fixed.

auto ffn2_in_scale = ctx.Attr<std::vector<float>>("ffn2_in_scale");

// output scales, tensor, size = [num_layers, n], n is gemm output size
auto *qkv_out_scale = ctx.Input<Tensor>("QKVOutScale");
Contributor

@wanghaoshuang wanghaoshuang Sep 15, 2022

Is this scale 1/weights_scale?

  1. In the old quantization format, the weight scale was stored in the out scale, which is not very reasonable.
  2. The constraints of the quantization format should not leak into the inference implementation; in other words, the inference implementation should be independent of the quantized-model format.
  3. A pass should be used to decouple the quantization format from the inference implementation. In the pass, gather all the information needed for dequant and compute the dequant scales; only the dequant scales should be passed to the inference Dequant Operator (see the sketch after this list).
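
For illustration, a sketch of the decoupling suggested in item 3: a hypothetical fusion pass reads the quantized-model format, computes the dequant scales (dequant_scale = input_scale * weight_scale, as noted above), and passes only those scales on to the inference op. The function name is illustrative, not part of the PR.

// Sketch only: compute per-channel dequant scales inside a (hypothetical) pass.
#include <vector>

std::vector<float> ComputeDequantScales(const std::vector<float>& weight_scales,
                                        float input_scale) {
  std::vector<float> dequant_scales(weight_scales.size());
  for (size_t i = 0; i < weight_scales.size(); ++i) {
    dequant_scales[i] = input_scale * weight_scales[i];
  }
  return dequant_scales;
}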

Contributor Author

Discussed offline.

const float quant_in_scale_data,
const framework::Tensor* quant_out_scale,
const int quant_out_scale_offset) {
int m = m_, k = k_, n = n_;
Contributor

m, n, and k here do not seem to be used.

Contributor Author

Fixed.

const int8_t* B_dev,
int32_t* C_dev,
cudaStream_t stream) {
// PADDLE_ENFORCE_GPU_SUCCESS(cudaDeviceSynchronize());
Contributor

Please confirm whether this is needed; if not, it can be removed.

Contributor Author

Removed.

@paddle-bot-old paddle-bot-old bot removed the contributor External developers label Sep 15, 2022
Contributor

@Aurelius84 Aurelius84 left a comment

LGTM for data registration

Contributor

@XieYunshen XieYunshen left a comment

LGTM for the unit test time limit setting.

Contributor

@XiaoguangHu01 XiaoguangHu01 left a comment

LGTM
