[INFER][LLM] Add the AutoModel for inference mode #9416
Conversation
Thanks for your contribution!
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##           develop    #9416      +/-   ##
===========================================
- Coverage    53.01%   52.81%    -0.20%
===========================================
  Files          678      676        -2
  Lines       108787   107910      -877
===========================================
- Hits         57668    56997      -671
+ Misses       51119    50913      -206

☔ View full report in Codecov by Sentry.
return model_class.get_cache_kvs_shape(
    config, predictor_args.batch_size, predictor_args.total_max_length
)
Don't call it this way here. The semantics of from_pretrained is to return a Model, so move get_cache_kvs_shape outside and call it after from_pretrained returns.
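Roughly the call order being suggested, as a minimal sketch (argument names simply mirror the snippets in this PR; the surrounding predictor code is assumed to already have config and predictor_args in scope):

# Sketch only: from_pretrained returns the model; the cache shape is queried afterwards.
model = AutoModelForCausalLM.from_pretrained(
    predictor_args.model_name_or_path,
    inference_mode=True,
    config=config,
    dtype=predictor_args.dtype,
)
cache_kvs_shape = model.get_cache_kvs_shape(
    config, predictor_args.batch_size, predictor_args.total_max_length
)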
model = AutoModelForCausalLM.from_pretrained(
    predictor_args.model_name_or_path,
    inference_mode=True,
    config=config,
    predictor_args=predictor_args,
    model_args=model_args,
    dtype=predictor_args.dtype,
    tensor_parallel_degree=tensor_parallel_degree,
    tensor_parallel_rank=tensor_parallel_rank,
)
Let's not reuse AutoModelForCausalLM. Add a new AutoInferenceModelForCausalLM instead, so there is no need to pass inference_mode=True, and it stays more independent going forward.
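A sketch of what the call site could look like under that proposal; AutoInferenceModelForCausalLM is the class this PR is asked to add, and the keyword arguments simply mirror the snippet above with inference_mode dropped:

# Hypothetical call site for the proposed AutoInferenceModelForCausalLM.
model = AutoInferenceModelForCausalLM.from_pretrained(
    predictor_args.model_name_or_path,
    config=config,
    predictor_args=predictor_args,
    model_args=model_args,
    dtype=predictor_args.dtype,
    tensor_parallel_degree=tensor_parallel_degree,
    tensor_parallel_rank=tensor_parallel_rank,
)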
tensor_parallel_degree = kwargs.pop("tensor_parallel_degree", 1)
tensor_parallel_rank = kwargs.pop("tensor_parallel_rank", 0)
model_arg = kwargs.pop("model_args", None)
static_mode = predictor_args.mode == "static"
Use predictor_args.mode == "static" directly as the condition instead of declaring a new variable; otherwise it reads a bit unclear.
The same applies to dynamic_mode = predictor_args.mode == "dynamic" below.
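For instance, a small sketch of the inline form (assuming the surrounding branching inside from_pretrained):

# Compare inline instead of binding the result to a separate name.
if predictor_args.mode == "static":
    ...  # static-graph inference path
elif predictor_args.mode == "dynamic":
    ...  # dynamic-graph inference path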
new_model_class = model_class.set_inference_config(
    config=config,
    predictor_args=predictor_args,
    tensor_parallel_degree=tensor_parallel_degree,
    tensor_parallel_rank=tensor_parallel_rank,
)
# detect the cpu avx or xpu
if new_model_class is not None:
    model_class = getattr(import_class, f"{new_model_class}InferenceModel")
    model_class.set_inference_config(
        config=config,
        predictor_args=predictor_args,
        tensor_parallel_degree=tensor_parallel_degree,
        tensor_parallel_rank=tensor_parallel_rank,
    )
This part feels a bit odd: set_inference_config may return a new model_class, and then set_inference_config is called a second time, which is not very reasonable behavior. Please check whether there is a smoother way to do this.
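One possible smoother shape, sketched under the assumption that confirm_inference_model (the hook mentioned in the PR description) is what resolves the device-specific class; the exact signature is an assumption, not the merged code:

# Resolve the concrete inference class first (CPU AVX / XPU / default) ...
model_class = model_class.confirm_inference_model(predictor_args=predictor_args)
# ... then configure that single class exactly once.
model_class.set_inference_config(
    config=config,
    predictor_args=predictor_args,
    tensor_parallel_degree=tensor_parallel_degree,
    tensor_parallel_rank=tensor_parallel_rank,
)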
PR types
New features
PR changes
Others
Description
Currently, loading an Inference Model inside PaddleNLP relies on if-else branches for model selection. This PR implements AutoInferenceModelForCausalLM, modeled after the existing AutoModelForCausalLM, so that Inference Models can be loaded through it.
The flow of loading an Inference Model through AutoInferenceModelForCausalLM is shown in the figure below:
If different Inference Models need different Inference Config settings, simply override the set_inference_config classmethod in the corresponding Inference Model class. For the same model, different execution devices map to different Inference Models; simply override confirm_inference_model and add the replacement logic there (see the sketch after the example below).
For example, as shown in the figure below, the three InferenceModels LlamaForCausalLMInferenceModel, LlamaForCausalLMBlockInferenceModel, and LlamaForCausalLMAvxInferenceModel take different Inference parameters, and only the corresponding class's method needs to be overridden.
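A hedged illustration of the two extension points described above; the class names, config fields, and signatures below are illustrative assumptions, not the merged code:

# Hypothetical inference model showing the two override points.
class MyForCausalLMInferenceModel:
    @classmethod
    def set_inference_config(cls, config, predictor_args, **kwargs):
        # Per-model Inference Config tweaks go here.
        config.max_seq_len = predictor_args.total_max_length
        return config

    @classmethod
    def confirm_inference_model(cls, predictor_args, **kwargs):
        # Device-specific replacement logic, e.g. switch to an AVX variant on CPU.
        if getattr(predictor_args, "device", "gpu") == "cpu":
            return MyForCausalLMAvxInferenceModel
        return cls


class MyForCausalLMAvxInferenceModel(MyForCausalLMInferenceModel):
    @classmethod
    def set_inference_config(cls, config, predictor_args, **kwargs):
        # The AVX variant can configure itself differently from the default model.
        config.use_avx = True  # illustrative field
        return config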