[NEW Feature] Add hook-based refined_recompute support #9396
Conversation

Thanks for your contribution!
Codecov Report

Attention: Patch coverage is
Additional details and impacted files:

@@            Coverage Diff            @@
##           develop    #9396    +/-  ##
===========================================
+ Coverage    52.84%   52.93%   +0.09%
===========================================
  Files          688      689       +1
  Lines       109378   109796     +418
===========================================
+ Hits         57801    58121     +320
- Misses       51577    51675      +98

☔ View full report in Codecov by Sentry.
    pylayer_matmul = PyLayerMatmul.apply

    class BertConfig:

Reviewer: This is mainly for testing, since BERT does not have flash attn.
    except:
        flash_attention = None

    from paddlenlp.transformers.refined_recompute import no_recompute

Reviewer: recompute(func, xxxxx) vs no_recompute(func, xxxxxx)
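A minimal pure-Python sketch of the contrast the reviewer is drawing: both helpers share the recompute(func, *args) call shape. The bodies below are illustrative stand-ins only, not the actual paddle / paddlenlp implementations.

```python
def recompute(func, *args, **kwargs):
    # stand-in: the real recompute drops activations in the forward pass
    # and re-runs func during backward to rebuild them (saves memory)
    return func(*args, **kwargs)

def no_recompute(func, *args, **kwargs):
    # stand-in: refined recompute's escape hatch — keep this func's
    # activations even inside a recomputed region (trades memory for speed)
    return func(*args, **kwargs)

def mlp(x):
    return 2 * x + 1

# Both wrappers are called the same way, which is the reviewer's point.
print(recompute(mlp, 3), no_recompute(mlp, 3))  # 7 7
```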
Reviewer: Please also adapt the qwen models.

Author: @ZHUI qwen and qwen2 are now supported.
@@ -268,6 +268,14 @@ class LlmMetaConfig:
        "Recompute granularity, Choose among ['full', 'core_attn', 'full_attn']",
    ),
    ("recompute_use_reentrant", bool, False, "recompute_use_reentrant"),

Reviewer: Not needed; zhonghui knows the usage well, and having read the implementation I think it meets the requirements: (1) llmmetaclass is applied, and (2) LlmMetaConfig.set_llm_config(model_config, training_args) is called.

@dataclass
@llmmetaclass
@add_start_docstrings(TrainingArguments.__doc__)
class TrainingArguments(TrainingArguments):
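The two-step pattern the reviewer describes can be mocked as follows. The classes and the function body here are toy stand-ins; the real llmmetaclass decorator and LlmMetaConfig.set_llm_config live in paddlenlp and do considerably more.

```python
class ModelConfig:
    """Toy stand-in for a paddlenlp model config."""

class Args:
    """Toy stand-in for TrainingArguments after @llmmetaclass adds fields."""
    refined_recompute = "flash_attn:-1"
    recompute_use_reentrant = False

def set_llm_config(model_config, training_args):
    # mock of LlmMetaConfig.set_llm_config: forward the recompute-related
    # training arguments onto the model config so the model can see them
    for name in ("refined_recompute", "recompute_use_reentrant"):
        setattr(model_config, name, getattr(training_args, name))

cfg = ModelConfig()
set_llm_config(cfg, Args())
print(cfg.refined_recompute)  # flash_attn:-1
```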
    return output

    class RRRowSequenceParallelLinear(RowSequenceParallelLinear):

Reviewer: For RowParallelLinear no code rewrite was needed, but RRRowSequenceParallelLinear has to be rewritten?

Author: Non-SequenceParallel parallelism is not supported at the moment; of course we could look into supporting it as well.
PR types

New features

PR changes

APIs

Description

For the llama, qwen, and qwen2 models, refined_recompute supports the operators mlp_row_ln, attention_row_ln, attention_column_ln, mlp_column_ln, and flash_attn. During LoRA training the *_ln operators are not supported; only flash_attn is. llama model used: meta-llama/Meta-Llama-3-8B. refined_recompute only takes effect when recompute_use_reentrant=False; in all other cases it is a no-op.

1. Simple refined_recompute test code
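The operator names above and the "flash_attn:-1" spec used in the tests below suggest a comma-separated "op:count" configuration string. A hypothetical parser for such a spec (paddlenlp's actual parsing may differ; reading -1 as "skip recompute for this op in every layer" is an assumption):

```python
# Op names come from the PR description; the parsing rules are illustrative.
SUPPORTED_OPS = {
    "mlp_row_ln", "attention_row_ln", "attention_column_ln",
    "mlp_column_ln", "flash_attn",
}

def parse_refined_recompute(spec: str) -> dict:
    """Parse a spec like "mlp_row_ln:2,flash_attn:-1" into {op: layer_count}."""
    result = {}
    for item in spec.split(","):
        if not item.strip():
            continue
        op, _, num = item.partition(":")
        op = op.strip()
        if op not in SUPPORTED_OPS:
            raise ValueError(f"unknown refined_recompute op: {op}")
        # assumption: -1 means "apply to all layers"; missing count defaults to -1
        result[op] = int(num) if num else -1
    return result

print(parse_refined_recompute("flash_attn:-1"))  # {'flash_attn': -1}
```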
2. llama SFT 8k precision comparison (rr on vs rr off)

Conclusion: the loss is identical across all 10 steps, so precision matches and meets expectations.

2.1 [Precision] refined_recompute off

2.2 [Precision] refined_recompute on, "flash_attn:-1"

3. llama SFT 8k performance comparison (rr on vs rr off)

Conclusion: step-2 ips gives 1.1894 / 1.1636 = 1.022, i.e. about a 2.2% speedup.

3.1 [Performance] refined_recompute off

3.2 [Performance] refined_recompute on, "flash_attn:-1"

4. PP precision test: compare no recompute, standard recompute, and RR recompute

5. llama 16k PostPretrain performance comparison code

The speedup is about 6-7%. Since fused head loss has not been added yet, the 32k and 64k configurations cannot be trained; in theory the speedup could be even larger (over 10%).


6. Added a tp+sp vs tp comparison