
[AutoParallel] add sharding tensor_fusion save load switch #9810

Merged

ZHUI merged 6 commits into PaddlePaddle:develop from AndSonder:tensor_fusion_save_load on Feb 5, 2025

Conversation

@AndSonder
Contributor

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases into the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Others

Description

We need the following support: with tensor_fusion enabled, the trainer can load from either balanced (evenly sharded) or unbalanced parameters, and can likewise save either balanced or unbalanced parameters.

To meet this requirement, the switch needs to be decoupled from state_dict and set_state_dict and moved to the PaddleNLP side.
Two new arguments are added on the paddlenlp side: save_model_with_sharding_tensor_fusion and load_model_with_sharding_tensor_fusion.

When tensor_fusion is enabled:

  1. save_model_with_sharding_tensor_fusion set to True saves the unbalanced model; set to False, it saves the balanced model
  2. load_model_with_sharding_tensor_fusion set to True loads the unbalanced model; set to False, it loads the balanced model
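A minimal usage sketch under stated assumptions: the two field names follow this description, the AutoTrainingArguments class path follows the Codecov report below, and every other value is a placeholder:

from paddlenlp.trainer.auto_training_args import AutoTrainingArguments

args = AutoTrainingArguments(
    output_dir="./checkpoints",  # placeholder
    # True: save/load the unbalanced (per-rank fused) parameters;
    # False: save/load the balanced (evenly sharded) parameters.
    save_model_with_sharding_tensor_fusion=True,
    load_model_with_sharding_tensor_fusion=True,
)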

Related PR:

@codecov

codecov Bot commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 21.05263% with 30 lines in your changes missing coverage. Please review.

Project coverage is 52.21%. Comparing base (3967f76) to head (fee929a).
Report is 331 commits behind head on develop.

Files with missing lines                   Patch %   Lines
paddlenlp/trainer/auto_trainer.py            0.00%   29 Missing ⚠️
paddlenlp/trainer/auto_training_args.py     83.33%   1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9810      +/-   ##
===========================================
- Coverage    52.43%   52.21%   -0.22%     
===========================================
  Files          731      730       -1     
  Lines       116411   115828     -583     
===========================================
- Hits         61037    60482     -555     
+ Misses       55374    55346      -28     

☔ View full report in Codecov by Sentry.

Comment thread paddlenlp/trainer/training_args.py Outdated
)

load_model_with_sharding_tensor_fusion: bool = field(
    default=True,
Collaborator

Wouldn't it be more reasonable for this switch to default to False?

Contributor Author

Wouldn't it be more reasonable for this switch to default to False?

True selects the unbalanced save & load path, which introduces no extra GPU memory usage, so defaulting to True is better.

Comment thread paddlenlp/trainer/training_args.py Outdated

@property
def should_load_sharding_tensor_fusion_balanced_model(self):
    if not self.enable_auto_parallel:
Collaborator

Could this check simply be merged with the one below?

Contributor Author

@AndSonder AndSonder Jan 23, 2025

Could this check simply be merged with the one below?

Yes, I'll change it in the next commit.

Done 37b510c
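A minimal sketch of the suggested merge (illustrative only; the real change is in commit 37b510c, and `remaining_checks` stands in for the elided tensor_fusion conditions):

class TrainingArgumentsSketch:
    # Before: an early return, then the remaining checks.
    @property
    def should_load_balanced_before(self):
        if not self.enable_auto_parallel:
            return False
        return self.remaining_checks

    # After (as suggested): the guard folded into one expression.
    @property
    def should_load_balanced_after(self):
        return self.enable_auto_parallel and self.remaining_checks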

Comment thread paddlenlp/trainer/training_args.py Outdated
logger.debug("")

@property
def should_load_sharding_tensor_fusion_balanced_model(self):
Collaborator

Would it be better to keep this name consistent with the user-facing switch load_model_with_sharding_tensor_fusion?

Contributor Author

@AndSonder AndSonder Jan 23, 2025

Would it be better to keep this name consistent with the user-facing switch load_model_with_sharding_tensor_fusion?

Using only should_load_model_with_sharding_tensor_fusion cannot distinguish between loading the balanced and the unbalanced model.

should_load_model_with_sharding_tensor_fusion being False could also simply mean tensor_fusion is disabled, and no conversion is needed when tensor_fusion is off.

Only when tensor_fusion is enabled do we need to consider whether to convert the state_dict. The current naming feels clearer, and these properties are not exposed to users:

if self.args.should_load_sharding_tensor_fusion_unbalanced_model:
    xxx
if self.args.should_load_sharding_tensor_fusion_balanced_model:
    xxx
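A hedged sketch of how the two internal properties discussed here might look (property names are from this thread; `enable_tensor_fusion` is an assumed stand-in for the real tensor_fusion check, and the merged code may differ):

class AutoTrainingArgumentsSketch:
    # Illustrative only: `enable_tensor_fusion` is an assumed attribute.
    # When tensor_fusion is off, both properties are False, so no
    # state_dict conversion path runs at all.
    @property
    def should_load_sharding_tensor_fusion_unbalanced_model(self):
        return (
            self.enable_auto_parallel
            and self.enable_tensor_fusion
            and self.load_model_with_sharding_tensor_fusion
        )

    @property
    def should_load_sharding_tensor_fusion_balanced_model(self):
        return (
            self.enable_auto_parallel
            and self.enable_tensor_fusion
            and not self.load_model_with_sharding_tensor_fusion
        )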

From00 previously approved these changes Jan 24, 2025
Collaborator

@From00 From00 left a comment

LGTM

Comment thread paddlenlp/trainer/training_args.py Outdated
},
)

load_model_with_sharding_tensor_fusion: bool = field(
Contributor

As a rule, specialized arguments should go into the auto training args. If they must live in the paddlenlp/trainer/training_args.py base class, prefix their names with something like auto parallel; as it stands, this is very easy to mix up with auto parallel.

Contributor Author

As a rule, specialized arguments should go into the auto training args. If they must live in the paddlenlp/trainer/training_args.py base class, prefix their names with something like auto parallel; as it stands, this is very easy to mix up with auto parallel.

Done 0d72fea
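A sketch of the suggested relocation (the subclass shape and both defaults are assumptions; see commit 0d72fea for the actual change):

from dataclasses import dataclass, field

from paddlenlp.trainer import TrainingArguments


@dataclass
class AutoTrainingArguments(TrainingArguments):
    # Auto-parallel-only switches live on the subclass, keeping the
    # TrainingArguments base class free of auto-parallel-specific options.
    save_model_with_sharding_tensor_fusion: bool = field(
        default=True,  # assumed default; check the merged PR
        metadata={"help": "Save the unbalanced (fused) model when True."},
    )
    load_model_with_sharding_tensor_fusion: bool = field(
        default=True,  # default discussed in the thread above
        metadata={"help": "Load the unbalanced (fused) model when True."},
    )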

@AndSonder AndSonder requested a review from ZHUI January 25, 2025 12:39
@ZHUI ZHUI merged commit 54b8882 into PaddlePaddle:develop Feb 5, 2025
ckl117 pushed a commit to ckl117/PaddleNLP that referenced this pull request Feb 17, 2025
[AutoParallel] add sharding tensor_fusion save load switch (PaddlePaddle#9810)

* support tensor_fusion save load

* apply suggestions from code review

@ckl117 ckl117 mentioned this pull request Feb 17, 2025
