
[AutoParallel] add sharding tensor_fusion save load switch #9810

Merged

ZHUI merged 6 commits into PaddlePaddle:develop from AndSonder:tensor_fusion_save_load on Feb 5, 2025

Conversation

@AndSonder
Contributor

Before submitting

  • Lint code. If there are lint issues, please format the code first.
# Install and register `pre-commit` in the project folder
pip install pre-commit && pre-commit install

# Process previous code files separately
pre-commit run --file XXXX.py
  • Add test cases into the tests folder. If there are codecov issues, please add test cases first.

PR types

PR changes

Others

Description

We need the following support: with tensor_fusion enabled, the trainer can load from either balanced (evenly sharded) or unbalanced parameters, and can likewise save either balanced or unbalanced parameters.

To meet this requirement, the switch needs to be decoupled from state_dict and set_state_dict and moved to the PaddleNLP side.
Two new arguments are added on the paddlenlp side: save_model_with_sharding_tensor_fusion and load_model_with_sharding_tensor_fusion.

When tensor_fusion is enabled:

  1. save_model_with_sharding_tensor_fusion set to True saves the unbalanced model; set to False, it saves the balanced model
  2. load_model_with_sharding_tensor_fusion set to True loads the unbalanced model; set to False, it loads the balanced model
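A minimal usage sketch under stated assumptions: the two field names follow this description, the AutoTrainingArguments class path follows the Codecov report below, and every other value is a placeholder:

from paddlenlp.trainer.auto_training_args import AutoTrainingArguments

args = AutoTrainingArguments(
    output_dir="./checkpoints",  # placeholder
    # True: save/load the unbalanced (per-rank fused) parameters;
    # False: save/load the balanced (evenly sharded) parameters.
    save_model_with_sharding_tensor_fusion=True,
    load_model_with_sharding_tensor_fusion=True,
)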

Related PR:

@codecov

codecov Bot commented Jan 22, 2025

Codecov Report

Attention: Patch coverage is 21.05263% with 30 lines in your changes missing coverage. Please review.

Project coverage is 52.21%. Comparing base (3967f76) to head (fee929a).
Report is 331 commits behind head on develop.

Files with missing lines                   Patch %   Lines
paddlenlp/trainer/auto_trainer.py            0.00%   29 Missing ⚠️
paddlenlp/trainer/auto_training_args.py     83.33%   1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9810      +/-   ##
===========================================
- Coverage    52.43%   52.21%   -0.22%     
===========================================
  Files          731      730       -1     
  Lines       116411   115828     -583     
===========================================
- Hits         61037    60482     -555     
+ Misses       55374    55346      -28     

☔ View full report in Codecov by Sentry.

Comment thread paddlenlp/trainer/training_args.py Outdated
)

load_model_with_sharding_tensor_fusion: bool = field(
    default=True,
Collaborator

Wouldn't it be more reasonable for this switch to default to False?

Contributor Author

Wouldn't it be more reasonable for this switch to default to False?

True selects the unbalanced save & load path, which introduces no extra GPU memory usage, so defaulting to True is better.

Comment thread paddlenlp/trainer/training_args.py Outdated

@property
def should_load_sharding_tensor_fusion_balanced_model(self):
    if not self.enable_auto_parallel:
Collaborator

Could this check simply be merged with the one below?

Contributor Author

@AndSonder AndSonder Jan 23, 2025

Could this check simply be merged with the one below?

Yes, I'll change it in the next commit.

Done 37b510c
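A minimal sketch of the suggested merge (illustrative only; the real change is in commit 37b510c, and `remaining_checks` stands in for the elided tensor_fusion conditions):

class TrainingArgumentsSketch:
    # Before: an early return, then the remaining checks.
    @property
    def should_load_balanced_before(self):
        if not self.enable_auto_parallel:
            return False
        return self.remaining_checks

    # After (as suggested): the guard folded into one expression.
    @property
    def should_load_balanced_after(self):
        return self.enable_auto_parallel and self.remaining_checks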

Comment thread paddlenlp/trainer/training_args.py Outdated
logger.debug("")

@property
def should_load_sharding_tensor_fusion_balanced_model(self):
Collaborator

Would it be better to keep this name consistent with the user-facing switch load_model_with_sharding_tensor_fusion?

Contributor Author

@AndSonder AndSonder Jan 23, 2025

Would it be better to keep this name consistent with the user-facing switch load_model_with_sharding_tensor_fusion?

Using only should_load_model_with_sharding_tensor_fusion cannot distinguish between loading the balanced and the unbalanced model.

should_load_model_with_sharding_tensor_fusion being False could also simply mean tensor_fusion is disabled, and no conversion is needed when tensor_fusion is off.

Only when tensor_fusion is enabled do we need to consider whether to convert the state_dict. The current naming feels clearer, and these properties are not exposed to users:

if self.args.should_load_sharding_tensor_fusion_unbalanced_model:
    xxx
if self.args.should_load_sharding_tensor_fusion_balanced_model:
    xxx
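A hedged sketch of how the two internal properties discussed here might look (property names are from this thread; `enable_tensor_fusion` is an assumed stand-in for the real tensor_fusion check, and the merged code may differ):

class AutoTrainingArgumentsSketch:
    # Illustrative only: `enable_tensor_fusion` is an assumed attribute.
    # When tensor_fusion is off, both properties are False, so no
    # state_dict conversion path runs at all.
    @property
    def should_load_sharding_tensor_fusion_unbalanced_model(self):
        return (
            self.enable_auto_parallel
            and self.enable_tensor_fusion
            and self.load_model_with_sharding_tensor_fusion
        )

    @property
    def should_load_sharding_tensor_fusion_balanced_model(self):
        return (
            self.enable_auto_parallel
            and self.enable_tensor_fusion
            and not self.load_model_with_sharding_tensor_fusion
        )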

From00 previously approved these changes Jan 24, 2025
Collaborator

@From00 From00 left a comment

LGTM

Comment thread paddlenlp/trainer/training_args.py Outdated
},
)

load_model_with_sharding_tensor_fusion: bool = field(
Contributor

As a rule, specialized arguments should go into the auto training args. If they must live in the paddlenlp/trainer/training_args.py base class, prefix their names with something like auto parallel; as it stands, this is very easy to mix up with auto parallel.

Contributor Author

As a rule, specialized arguments should go into the auto training args. If they must live in the paddlenlp/trainer/training_args.py base class, prefix their names with something like auto parallel; as it stands, this is very easy to mix up with auto parallel.

Done 0d72fea
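A sketch of the suggested relocation (the subclass shape and both defaults are assumptions; see commit 0d72fea for the actual change):

from dataclasses import dataclass, field

from paddlenlp.trainer import TrainingArguments


@dataclass
class AutoTrainingArguments(TrainingArguments):
    # Auto-parallel-only switches live on the subclass, keeping the
    # TrainingArguments base class free of auto-parallel-specific options.
    save_model_with_sharding_tensor_fusion: bool = field(
        default=True,  # assumed default; check the merged PR
        metadata={"help": "Save the unbalanced (fused) model when True."},
    )
    load_model_with_sharding_tensor_fusion: bool = field(
        default=True,  # default discussed in the thread above
        metadata={"help": "Load the unbalanced (fused) model when True."},
    )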

@AndSonder AndSonder requested a review from ZHUI January 25, 2025 12:39
@ZHUI ZHUI merged commit 54b8882 into PaddlePaddle:develop Feb 5, 2025
ckl117 pushed a commit to ckl117/PaddleNLP that referenced this pull request Feb 17, 2025
[AutoParallel] add sharding tensor_fusion save load switch (PaddlePaddle#9810)

* support tensor_fusion save load

* apply suggestions from code review

@ckl117 ckl117 mentioned this pull request Feb 17, 2025
