Add single_model network and use intermediate API #9412
|
Thanks for your contribution! |
| level = "os_g" | ||
| elif ShardingOption.FULL_SHARD in self.args.sharding: | ||
| level = "p_g_os" | ||
| model, self.optimizer = sharded_data_parallel(model, self.optimizer, level) |
Could we build a dp_config and pass it to parallelize instead, so that a separate sharded_data_parallel call is no longer needed?
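The suggestion above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `ShardingOption` is a stand-in enum defined only for this example, `build_dp_config` is a hypothetical helper, and the level values for "no sharding" (0) and stage 1 (1) are assumptions. Only the `{"level": 2}` and `{"level": 3}` cases appear in the diff later in this conversation.

```python
from enum import Enum


class ShardingOption(Enum):
    # Stand-in for the trainer's enum; only the members needed here.
    SHARD_OP = "stage1"
    SHARD_GRAD_OP = "stage2"
    FULL_SHARD = "stage3"


def build_dp_config(sharding):
    """Map a list of sharding options to a hypothetical dp_config dict."""
    level = 0  # assumption: 0 stands for plain data parallelism
    if ShardingOption.SHARD_OP in sharding:
        level = 1  # assumption: stage 1 maps to level 1
    if ShardingOption.SHARD_GRAD_OP in sharding:
        level = 2
    if ShardingOption.FULL_SHARD in sharding:
        level = 3
    return {"level": level}


dp_config = build_dp_config([ShardingOption.FULL_SHARD])
```

With a config like this, the sharding level would travel together with the other parallel settings instead of requiring a separate wrapping step after the model is built.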
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #9412      +/-   ##
===========================================
- Coverage    53.17%   52.33%   -0.85%
===========================================
  Files          718      721       +3
  Lines       114694   113772     -922
===========================================
- Hits         60990    59540    -1450
- Misses       53704    54232     +528
force-pushed from 3f735c3 to 0632d75
| return logits | ||
|
|
||
|
|
||
| loss_cnt = 0 |
Legacy code; it has been removed.
| @@ -22,9 +22,14 @@ | |||
| import paddle.distributed as dist | |||
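What is the long-term goal: a full migration to auto_trainer? If so, shouldn't all of the trainer logic be rewritten in one place rather than coupled with the old code?
As discussed earlier, auto parallel is integrated into auto_trainer, while trainer mainly carries the original manual-parallel logic. The two are not coupled; they only share common infrastructure.
The concern is exactly the shared infrastructure: if a shared API changes, this code may break as well. From a developer-experience standpoint, the current auto-parallel tests and monitoring need to detect and locate such problems promptly.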
| f"{prefix}lm_head.weight": ColWiseParallel(), | ||
| } | ||
| }, | ||
| "pp_config": {"split_spec": f"{prefix}llama.layers"}, |
Please provide an example of how to split modules outside `layers`, and empty layers.
A related question: how are shared-weight parameters expected to be marked?
The framework automatically identifies and collects shared-weight parameters and handles them specially; users simply mark them as ordinary parameters.
| _keys_to_ignore_on_load_unexpected = [r"self_attn.rotary_emb.inv_freq"] | ||
|
|
||
| @classmethod | ||
| def _get_name_mappings(cls, config: LlamaConfig) -> list[StateDictNameMapping]: |
Since the sp and tp sharding information is already configured in auto_dist_config, I suggest removing the name_mapping configuration and doing the sharding inside the auto-parallel intermediate API.
| level = 2 | ||
| if ShardingOption.FULL_SHARD in sharding: | ||
| level = 3 | ||
| final_config["dp_config"] = {"level": level} |
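LoRA, DPO and KTO training need to be considered:
- LoRA training defines custom tensor-parallel LoRA layers; how should they be configured here?
- KTO and DPO involve two models, one whose parameters are updated and one whose parameters are not; how should the distributed strategy be configured for the two models?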
| warnings.warn( | ||
| f"enable_parallel_cross_entropy, the vocab_size should be splited: {prediction_scores.shape[-1]}, {self.config.vocab_size}" | ||
| ) | ||
| self.loss_func = paddle.nn.CrossEntropyLoss(reduction="none", ignore_index=self.ignore_index) |
Large-model training uses some custom PyLayer operators. Can such PyLayer operators be supported in a non-TP network? These PyLayer operators are also coupled with tensor parallelism; how should they be developed?
| level = 3 | ||
| final_config["dp_config"] = {"level": level} | ||
|
|
||
| return final_config |
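How well does the intermediate API support unified checkpoint? Can it be extended to adaptive distributed strategies?
Converting a checkpoint to single-card weights is already supported; converting single-card weights to unified checkpoint is in progress.
- Unified checkpoint currently supports switching between essentially arbitrary distributed strategies;
- Model weights saved by unified checkpoint can be used directly in inference frameworks or in other training pipelines.
Can the format saved by auto parallel support both of these points?
Also, it looks like a checkpoint must first be converted to single-card weights and then to unified checkpoint; isn't that too complicated? Could saving in unified checkpoint format be supported directly?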
| ) | ||
|
|
||
|
|
||
| class LlamaPretrainingCriterion3DNet(paddle.nn.Layer): |
Does the intermediate API design cover the criterion, for example ParallelCrossEntropy?
Yes. Just add the replace_with_parallel_cross_entropy option to tensor_parallel_config.
|
|
||
|
|
||
| class LlamaPretrainingCriterion3DNet(paddle.nn.Layer): | ||
| """ |
The intermediate API design mainly covers DP and TP; do PP scenarios need special compatibility handling?
| f"{prefix}llama.layers.*.mlp.up_proj": ColWiseParallel(), | ||
| f"{prefix}llama.layers.*.mlp.gate_up_fused_proj": ColWiseParallel(), | ||
| f"{prefix}llama.layers.*.mlp.down_proj": RowWiseParallel(), | ||
| f"{prefix}lm_head.weight": ColWiseParallel(), |
After ColWiseParallel(), is the original linear converted to ColumnParallelLinear or ColumnSequenceParallelLinear, or to some other new linear type?
ColWiseParallel rewrites the weights of the specified linear using the basic API; at run time, auto parallel then infers the distributed state and inserts the communication operators automatically. The linear type is not changed.
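To make the column-wise/row-wise pairing in the plan above concrete, here is a framework-free sketch of the underlying arithmetic. It does not use Paddle and says nothing about how ColWiseParallel is implemented; the helper names and the toy matrices are made up for illustration. It only shows that sharding the first weight by columns and the second by rows, then summing the per-rank partial results (the role an all-reduce would play), reproduces the unsharded result.

```python
def matmul(x, w):
    """Multiply an (n x k) matrix by a (k x m) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, parts):
    """Split the columns of w into `parts` equal shards."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def split_rows(w, parts):
    """Split the rows of w into `parts` equal shards."""
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]


x = [[1.0, 2.0], [3.0, 4.0]]
w_up = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]          # 2 x 4, sharded by columns
w_down = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0], [2.0, 1.0]]    # 4 x 2, sharded by rows

# Unsharded reference computation.
reference = matmul(matmul(x, w_up), w_down)

# Each simulated rank holds one column shard of w_up and the matching row shard of w_down.
partials = [
    matmul(matmul(x, up_shard), down_shard)
    for up_shard, down_shard in zip(split_columns(w_up, 2), split_rows(w_down, 2))
]

# Element-wise sum over ranks, standing in for the all-reduce.
combined = [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]
```

Because the intermediate activation stays sharded between the two matmuls, only one reduction is needed per up/down pair, which is why the plan marks `up_proj` column-wise and `down_proj` row-wise.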
| config.use_recompute = training_args.recompute | ||
| config.tensor_parallel_degree = training_args.tensor_parallel_degree | ||
| config.tensor_parallel_rank = training_args.tensor_parallel_rank | ||
| config.sharding_parallel_degree = training_args.sharding_parallel_degree |
I looked at the code: why is sharding_degree set to 1 by default where the Topology is created, around line 400?
auto_parallel's sharding is not orthogonal to dp, mp and pp
dp_degree already includes sharding_degree, so setting sharding_degree to 1 is sufficient.
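The arithmetic behind this answer can be sketched as below. This is a simplified illustration under stated assumptions: the mesh is assumed to have only dp, mp and pp axes, and `auto_parallel_mesh_degrees` is a hypothetical helper, not a function from the trainer.

```python
def auto_parallel_mesh_degrees(world_size, tensor_parallel_degree, pipeline_parallel_degree):
    """Return hypothetical mesh degrees when sharding reuses the dp axis."""
    model_parallel_size = tensor_parallel_degree * pipeline_parallel_degree
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return {
        "dp_degree": world_size // model_parallel_size,  # sharding groups live on this axis
        "mp_degree": tensor_parallel_degree,
        "pp_degree": pipeline_parallel_degree,
        "sharding_degree": 1,  # not a separate mesh axis in this sketch
    }


degrees = auto_parallel_mesh_degrees(8, 2, 2)
```

Since sharding is applied along the data-parallel axis here, giving it its own degree greater than 1 would count the same devices twice.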
force-pushed from f1f4e46 to 612237d
force-pushed from 5e24d14 to f35407b
force-pushed from 1720943 to 2ebb3dc
| normalized_shape=normalized_shape, epsilon=epsilon, weight_attr=weight_attr, bias_attr=bias_attr | ||
| ) | ||
| self.config = config | ||
| self.ipp = ipp |
The basic-API network needs ipp to be passed in; it indicates the position of this layer within the pipeline stages.
Can passing this parameter be avoided? Every layer has to accept such a parameter, which is cumbersome. Can't this be done automatically?
| flash_attention = None | ||
|
|
||
| __all__ = [ | ||
| "LlamaForCausalLM3DNet", |
| "LlamaForCausalLM3DNet", | |
| "LlamaForCausalLM3DNet", |
Why is it called 3DNet? Why add the special 3D prefix?
Because the basic-API network carries the 3D prefix, which indicates that the network supports pp, dp and tp 3D hybrid parallelism, it has been kept.
Are the single-card network and the 3D network different? If they are the same, can the 3D prefix simply be dropped?
|
|
||
|
|
||
| class LlamaMLPNet(nn.Layer): | ||
| def __init__(self, config, ipp: Optional[int] = None): |
Legacy code; it has been removed. Done.
| # output = (logits,) + outputs[1:] | ||
| # return (loss,) + output if loss is not None else output | ||
|
|
||
| # return CausalLMOutputWithCrossAttentions( |
Are these not supported? Can generation via model.generate be supported?
Dynamic-to-static conversion requires the loss function to be separated from the model, so these lines are commented out.
| # ) | ||
|
|
||
| def auto_dist_config(self, prefix=""): | ||
| if prefix != "": |
| ) | ||
|
|
||
| def merge_auto_dist_configs(self, configs): | ||
| """ |
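Does this function have to live in the model base class? Could it go in the auto trainer instead?
Scenarios like the following are possible: model A contains model B and model C, and A, B and C each have their own distributed config. Placing the function in the model base class makes this easy to handle: each model's own distributed config can be found conveniently, and the configs are merged in the auto trainer to obtain the final distributed config for model A.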
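The model A / model B / model C scenario can be sketched as below. This is a simplified reconstruction from the diff fragments shown in this conversation, not the PR's code: the `"mp_config"` key name, the use of plain strings in place of `ColWiseParallel()` objects, and the rule that only one sub-model may define `pp_config` are all assumptions made for the example.

```python
CONFIG_KEYS = ("mp_config", "sp_config")  # "mp_config" is an assumed key name


def merge_auto_dist_configs(configs):
    """Merge per-sub-model dist configs into one (illustrative only)."""
    assert isinstance(configs, (dict, list))
    if isinstance(configs, dict):
        return configs
    final_config = {"mp_config": None, "sp_config": None, "pp_config": None}
    for config in configs:
        for key in CONFIG_KEYS:
            part = config.get(key)
            if part is None:
                continue
            if final_config[key] is None:
                # Copy so the caller's dict is not mutated by later merges.
                final_config[key] = {"parallelize_plan": dict(part["parallelize_plan"])}
                continue
            for name, plan in part["parallelize_plan"].items():
                # The same parameter must not be planned twice.
                assert name not in final_config[key]["parallelize_plan"]
                final_config[key]["parallelize_plan"][name] = plan
        if config.get("pp_config") is not None:
            # Simplifying assumption: only one sub-model defines pp_config.
            assert final_config["pp_config"] is None
            final_config["pp_config"] = config["pp_config"]
    return final_config


# Hypothetical configs for sub-models B and C of a composite model A.
config_b = {
    "mp_config": {"parallelize_plan": {"b.lm_head.weight": "ColWiseParallel"}},
    "pp_config": {"split_spec": "b.llama.layers"},
}
config_c = {"mp_config": {"parallelize_plan": {"c.lm_head.weight": "ColWiseParallel"}}}
merged = merge_auto_dist_configs([config_b, config_c])
```

The prefix passed to each sub-model's `auto_dist_config` is what keeps the plan keys distinct, so the duplicate-key assertion only fires when two sub-models claim the same parameter.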
|
|
||
| return final_config | ||
|
|
||
| def _generate_auto_dist_config(self, auto_dist_degree): |
| attention_mask = paddle.where(attention_mask, zero, neg_inf) | ||
| attention_mask = dist.shard_tensor(attention_mask, get_mesh(), [dist.Replicate(), dist.Replicate()]) | ||
| hidden_states = self.drop(hidden_states) | ||
| hidden_states = dist.reshard(hidden_states, get_mesh(), [dist.Shard(0), dist.Replicate()]) |
modeling_3D_auto.py
modeling_auto.py
How about unifying these? Different models seem to be written in different ways.
force-pushed from ca32e1d to f00161a
| normalized_shape=normalized_shape, epsilon=epsilon, weight_attr=weight_attr, bias_attr=bias_attr | ||
| ) | ||
| self.config = config | ||
| self.ipp = ipp |
Can passing this parameter be avoided? Every layer has to accept such a parameter, which is cumbersome. Can't this be done automatically?
| "pp_config": None, | ||
| } | ||
| for name, layer in self.named_sublayers(include_self=True): | ||
| if hasattr(layer, "auto_dist_config"): |
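Question 1: for the basic-API network this cannot be avoided for now, because every layer needs to know its position within the pipeline stages.
Question 2: no, it does not have to. It is only there to handle the case where model A contains model B and model C; B and C are then sublayers of A, and each has its own auto_dist_config.
Let me add my understanding:
"Can passing this parameter be avoided? Every layer has to accept such a parameter, which is cumbersome. Can't this be done automatically?"
Networks built with the auto-parallel basic API (modeling_auto.py) need the ipp parameter, as before; these networks will be phased out gradually once the auto-parallel intermediate API matures.
Networks built with the auto-parallel intermediate API (modeling_network.py), i.e. the single-card networks, no longer need the ipp parameter, as can be seen from the code; this is the recommended approach going forward.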
| mem=-1 | ||
| echo "result: loss=$loss ips=$ips mem=$mem loss_md5=$loss_md5" | ||
| loss_base=10.59486389 # output of dropout is different after supporting spmd | ||
| loss_base=10.55848312 # output of dropout is different after supporting spmd |
The GPT weight initialization has changed.
| flash_attention = None | ||
|
|
||
| __all__ = [ | ||
| "LlamaForCausalLM3DNet", |
Are the single-card network and the 3D network different? If they are the same, can the 3D prefix simply be dropped?
| """ | ||
| Merged all auto dist configs into one config. | ||
| """ | ||
| assert isinstance(configs, (dict, list)) |
| final_config["sp_config"] = config["sp_config"] | ||
| else: | ||
| for k, v in config["sp_config"]["parallelize_plan"].items(): | ||
| assert k not in final_config["sp_config"]["parallelize_plan"].keys() |
| "sp_config": None, | ||
| "pp_config": None, | ||
| } | ||
| for config in configs: |
| assert model is not None | ||
| assert isinstance(model, PretrainedModel) |
| # # up | ||
| # a1 = self.w1(hidden_states) | ||
| # # gate | ||
| # a2 = self.w2(hidden_states) | ||
| # intermediate_parallel = a1 * F.silu(a2) | ||
| # down |
| # export PYTHONPATH=../../../:$PYTHONPATH | ||
|
|
||
| python -u -m paddle.distributed.launch \ | ||
| --gpus "4,5,6,7" \ |
| @@ -0,0 +1,113 @@ | |||
| # Copyright (c) 2024 PaddlePaddle Authors. All Rights Reserved. | |||
Could the required paddle version be stated explicitly here? It could be reflected in the README.
| f"{prefix}lm_head.weight": dist.ColWiseParallel(), | ||
| } | ||
| }, | ||
| "pp_config": {"split_spec": f"{prefix}llama.layers", "global_spec": "llama.global_layer"}, |
Is pipeline parallelism with shared weights supported now?
| if prefix != "": | ||
| assert prefix.endswith(".") | ||
| config = { | ||
| "sp_config": { |
I have a question about this config: sequence parallel does depend on tensor parallel, but most of this config duplicates the tp config. Can the duplication be reduced?
This can be optimized in a follow-up.
|
| @@ -38,13 +38,16 @@ | |||
| CosineAnnealingWithWarmupDecay, | |||
| GPTConfig, | |||
| GPTForCausalLMAuto, | |||
Could we consider a single unified run_pretrain_auto.py? Otherwise one script has to be maintained per model, which is costly.
Changing the launch scripts touches many CI/CE scripts, so the plan is to submit a separate PR with the unified change after this PR is merged.
| def _wrap_for_auto(self, model, train_dataloader): | ||
| logger.info("Wrapping model for auto paralle") | ||
| logger.info(f"Wrapping model for auto parallel using intermediate api {self.args.use_intermediate_api} ") | ||
| dist_loader = self._wrap_for_dist_loader(train_dataloader) |
How different is this dist_loader from PaddleNLP's current distributed_dataloader? It looks like quite a lot of functionality is missing.
Not very different; it is just a wrapper around the dataloader. Support for the remaining dataloader features is still being developed. Coming soon.
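1. Support for SFT/DPO/PPO is still being developed and validated. This PR only considers the pretraining scenario.
Unified checkpoint currently supports switching between essentially arbitrary distributed strategies; can the format saved by auto parallel support both of these points?
This is still under development.
For DPO/KTO, the main consideration is whether Criterion support can be made more flexible, rather than requiring the criterion to be written inside the network definition.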



PR types
New features
PR changes
Models
Description
Add single-card networks for llama, qwen and gpt.
auto_trainer supports the intermediate API, and the run scripts support the intermediate API.
Validation document: https://ku.baidu-int.com/knowledge/HFVrC7hq1Q/pKzJfZczuc/ESWJRriQZ-/sfEV74J-hHGXIR